aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
AgeCommit message (Collapse)Author
2018-04-12net/mlx5e: Sync netdev vxlan ports at openShahar Klein
[ Upstream commit a117f73dc2430443f23e18367fa545981129c1a6 ] When mlx5_core is loaded it is expected to sync ports with all vxlan devices so it can support vxlan encap/decap. This is done via udp_tunnel_get_rx_info(). Currently this call is set in mlx5e_nic_enable() and if the netdev is not in NETREG_REGISTERED state it will not be called. Normally on load the netdev state is not NETREG_REGISTERED so udp_tunnel_get_rx_info() will not be called. Moving udp_tunnel_get_rx_info() to mlx5e_open() so it will be called on netdev UP event and allow encap/decap. Fixes: 610e89e05c3f ("net/mlx5e: Don't sync netdev state when not registered") Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12net/mlx5e: Set EQE based as default TX interrupt moderation modeTal Gilboa
[ Upstream commit 48bfc39791b8b4a25f165e711f18b9c1617cefbc ] The default TX moderation mode was mistakenly set to CQE based. The intention was to add a control ability in order to improve some specific use-cases. In general, we prefer to use EQE based moderation as it gives much better numbers for the common cases. CQE based causes a degradation in the common case since it resets the moderation timer on CQE generation. This causes an issue when TSO is well utilized (large TSO sessions). The timer is set to 16us so traffic of ~64KB TSO sessions per second would mean timer reset (CQE per TSO session -> long time between CQEs). In this case we quickly reach the tcp_limit_output_bytes (256KB by default) and cause a halt in TX traffic. By setting EQE based moderation we make sure timer would expire after 16us regardless of the packet rate. This fixes an up to 40% packet rate and up to 23% bandwidth degradtions. Fixes: 0088cbbc4b66 ("net/mlx5e: Enable CQE based moderation on TX CQ") Signed-off-by: Tal Gilboa <talgi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-08net/mlx5e: Specify numa node when allocating drop rqGal Pressman
[ Upstream commit 2f0db87901698cd73d828cc6fb1957b8916fc911 ] When allocating a drop rq, no numa node is explicitly set which means allocations are done on node zero. This is not necessarily the nearest numa node to the HCA, and even worse, might even be a memoryless numa node. Choose the numa_node given to us by the pci device in order to properly allocate the coherent dma memory instead of assuming zero is valid. Fixes: 556dd1b9c313 ("net/mlx5e: Set drop RQ's necessary parameters only") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-12net/mlx5e: Remove timestamp set from netdevice open flowFeras Daoud
To avoid configuration override, timestamp set call will be moved from the netdevice open flow to the init flow. By this, a close-open procedure will not override the timestamp configuration. In addition, the change will rename mlx5e_timestamp_set function to be mlx5e_timestamp_init. Fixes: ef9814deafd0 ("net/mlx5e: Add HW timestamping (TS) support") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-12net/mlx5e: Don't override netdev features field unless in error flowGal Pressman
Set features function sets dev->features in order to keep track of which features were successfully changed and which weren't (in case the user asks for more than one change in a single command). This breaks the logic in __netdev_update_features which assumes that dev->features is not changed on success and checks for diffs between features and dev->features (diffs that might not exist at this point because of the driver override). The solution is to keep track of successful/failed feature changes and assign them to dev->features in case of failure only. Fixes: 0e405443e803 ("net/mlx5e: Improve set features ndo resiliency") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-12-19net/mlx5e: Fix defaulting RX ring size when not neededEugenia Emantayev
Fixes the bug when turning on/off CQE compression mechanism resets the RX rings size to default value when it is not needed. Fixes: 2fc4bfb7250d ("net/mlx5e: Dynamic RQ type infrastructure") Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-12-19net/mlx5e: Fix features check of IPv6 trafficGal Pressman
The assumption that the next header field contains the transport protocol is wrong for IPv6 packets with extension headers. Instead, we should look the inner-most next header field in the buffer. This will fix TSO offload for tunnels over IPv6 with extension headers. Performance testing: 19.25x improvement, cool! Measuring bandwidth of 16 threads TCP traffic over IPv6 GRE tap. CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] TSO: Enabled Before: 4,926.24 Mbps Now : 94,827.91 Mbps Fixes: b3f63c3d5e2c ("net/mlx5e: Add netdev support for VXLAN tunneling") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-12-19Revert "mlx5: move affinity hints assignments to generic code"Saeed Mahameed
Before the offending commit, mlx5 core did the IRQ affinity itself, and it seems that the new generic code have some drawbacks and one of them is the lack for user ability to modify irq affinity after the initial affinity values got assigned. The issue is still being discussed and a solution in the new generic code is required, until then we need to revert this patch. This fixes the following issue: echo <new affinity> > /proc/irq/<x>/smp_affinity fails with -EIO This reverts commit a435393acafbf0ecff4deb3e3cb554b34f0d0664. Note: kept mlx5_get_vector_affinity in include/linux/mlx5/driver.h since it is used in mlx5_ib driver. Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code") Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jes Sorensen <jsorensen@fb.com> Reported-by: Jes Sorensen <jsorensen@fb.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09net/mlx5e: Add VLAN offloads statisticsGal Pressman
The following counters are now exposed through ethtool -S: rx[i]_removed_vlan_packets (per channel) rx_removed_vlan_packets tx[i]_added_vlan_packets (per channel) tx_added_vlan_packets rx_removed_vlan_packets: The number of packets that had their outer VLAN header stripped to the CQE by the hardware. tx_added_vlan_packets: The number of packets that had their outer VLAN header inserted by the hardware. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09net/mlx5e: Add 802.1ad VLAN insertion supportGal Pressman
Report VLAN insertion support for S-tagged packets and add support by choosing the correct VLAN type in the WQE. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09net/mlx5e: Add 802.1ad VLAN filter steering rulesGal Pressman
When a user chooses to use 802.1ad VLAN the proper steering rules will be added to the VLAN flow table (matching the specific S-tag VID). Due to current hardware limitation, when using 802.1ad, we must disable C-tag VLAN stripping on the RQs. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-09net/mlx5e: Rename VLAN related variables and functionsGal Pressman
Rename VLAN related symbols to better reflect the fact that they are associated to C-tag VLAN. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-08net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIONogah Frankel
Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new convention. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05Merge tag 'mlx5-updates-2017-11-04' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-11-04 This series includes: From Huy: dscp to priority mapping for Ethernet packet. =================================================== First six patches enable differentiated services code point (dscp) to priority mapping for Ethernet packet. Once this feature is enabled, the packet is routed to the corresponding priority based on its dscp. User can combine this feature with priority flow control (pfc) feature to have priority flow control based on the dscp. Firmware interface: Mellanox firmware provides two control knobs for this feature: QPTS register allow changing the trust state between dscp and pcp mode. The default is pcp mode. Once in dscp mode, firmware will route the packet based on its dscp value if the dscp field exists. QPDPM register allow mapping a specific dscp (0 to 63) to a specific priority (0 to 7). By default, all the dscps are mapped to priority zero. Software interface: This feature is controlled via application priority TLV. IEEE specification P802.1Qcd/D2.1 defines priority selector id 5 for application priority TLV. This APP TLV selector defines DSCP to priority map. This APP TLV can be sent by the switch or can be set locally using software such as lldptool. In mlx5 drivers, we add the support for net dcb's getapp and setapp call back. Mlx5 driver only handles the selector id 5 application entry (dscp application priority application entry). If user sends multiple dscp to priority APP TLV entries on the same dscp, the last sent one will take effect. All the previous sent will be deleted. The firmware trust state (in QPTS register) is changed based on the number of dscp to priority application entries. When the first dscp to priority application entry is added by the user, the trust state is changed to dscp. When the last dscp to priority application entry is deleted by the user, the trust state is changed to pcp. When the port is in DSCP trust state, the transmit queue is selected based on the dscp of the skb. When the port is in DSCP trust state and vport inline mode is not NONE, firmware requires mlx5 driver to copy the IP header to the wqe ethernet segment inline header if the skb has it. This is done by changing the transmit queue sq's min inline mode to L3. Note that the min inline mode of sqs that belong to other features such as xdpsq, icosq are not modified. =================================================== Plus to the dscp series, some small misc changes are include as well: From Inbar, Ethtool msglvl support and some debug prints in DCBNL logic From Or Gerlitz, Enlarge the NIC TC offload table size From Rabie, Initialize destination_flow struct to 0 From Feras, Add inner TTC table to IPoIB flow steering From Tal, Enable CQE based moderation on TX CQ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05net: bpf: rename ndo_xdp to ndo_bpfJakub Kicinski
ndo_xdp is a control path callback for setting up XDP in the driver. We can reuse it for other forms of communication between the eBPF stack and the drivers. Rename the callback and associated structures and definitions. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04net/mlx5e: Enable CQE based moderation on TX CQTal Gilboa
By using CQE based moderation on TX CQ we can reduce the number of TX interrupt rate. Besides the benefit of less interrupts, this also allows the kernel to better utilize TSO. Since TSO has some CPU overhead, it might not aggregate when CPU is under high stress. By reducing the interrupt rate and the CPU utilization, we can get better aggregation and better overall throughput. The feature is enabled by default and has a private flag in ethtool for control. Throughput, interrupt rate and TSO utilization improvements: (ConnectX-4Lx 40GbE, unidirectional, 1/16 TCP streams, 64B packets) --------------------------------------------------------- Metric | Streams | CQE Based | EQE Based | improvement --------------------------------------------------------- BW | 1 | 2.4Gb/s | 2.15Gb/s | +11.6% IR | 1 | 27Kips | 50.6Kips | -46.7% TSO Util | 1 | 74.6% | 71% | +5% BW | 16 | 29Gb/s | 25.85Gb/s | +12.2% IR | 16 | 482Kips | 745Kips | -35.3% TSO Util | 16 | 69.1% | 49% | +41.1% *BW = Bandwidth, IR = Interrupt rate, ips = interrupt per second. TSO Util = bytes in TSO sessions / all bytes transferred Signed-off-by: Tal Gilboa <talgi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-04net/mlx5e: Add support for ethtool msglvl supportGal Pressman
Use ethtool -s <devname> msglvl <type> on/off to toggle debug messages. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-04net/mlx5e: Support DSCP trust state to Ethernet's IP packet on SQHuy Nguyen
If the port is in DSCP trust state, packets are placed in the right priority queue based on the dscp value. This is done by selecting the transmit queue based on the dscp of the skb. Until now select_queue honors priority only from the vlan header. However that is not sufficient in cases where port trust state is DSCP mode as packet might not even contain vlan header. Therefore if the port is in dscp trust state and vport's min inline mode is not NONE, copy the IP header to the eseg's inline header if the skb has it. This is done by changing the transmit queue sq's min inline mode to L3. Note that the min inline mode of sqs that belong to other features such as xdpsq, icosq are not modified. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-04net/mlx5e: Add dcbnl dscp to priority supportHuy Nguyen
This patch implements dcbnl hooks to set and delete DSCP to priority map as defined by the DCB subsystem. Device maintains internal trust state which needs to be set to DSCP state for performing DSCP to priority mapping. When the first dscp to priority APP entry is added by the user, the trust state is changed to dscp. When the last dscp to priority APP entry is deleted by the user, the trust state is changed to pcp. If user sends multiple dscp to priority APP entries on the same dscp, the last sent one will take effect. All the previous sent will be deleted. The dscp to priority APP entries are added and deleted in the net/dcb APP database using dcb_ieee_setapp/getapp. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-02net: sched: move the can_offload check from binding phase to rule insertion ↵Jiri Pirko
phase This restores the original behaviour before the block callbacks were introduced. Allow the drivers to do binding of block always, no matter if the NETIF_F_HW_TC feature is on or off. Move the check to the block callback which is called for rule insertion. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21net: sched: avoid ndo_setup_tc calls for TC_SETUP_CLS*Jiri Pirko
All drivers are converted to use block callbacks for TC_SETUP_CLS*. So it is now safe to remove the calls to ndo_setup_tc from cls_* Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21mlx5e: Convert ndo_setup_tc offloads to block callbacksJiri Pirko
Benefit from the newly introduced block callback infrastructure and convert ndo_setup_tc calls for flower offloads to block callbacks. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16Merge tag 'mlx5-updates-2017-10-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-updates-2017-10-11: IPoIB Multi Pkey support This series provides the support for IPoIB Multi Pkey. InfiniBand Pkeys are the equivalent of Ethernet vlans. Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child interfaces are not supported with IPoIB offloads mode, this series will add the support for that by allowing creating mlx5 multiple IPoIB netdevices with a non-default Pkey. mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares most of its resources with the parent IPoIB interface, namely RX steering and ring queue resources. The only mlx5 resources a child Pkey interface will be creating are the TX rings, since they should be assigned to a specific Pkey. mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in mlx5/core/ipoib/ipoib_vlan.c. The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to make mlx5 clock and PTP registration part of the core to be shared with mlx5e master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future. Add the support for attaching multiple underlay QPs for the different Pkeys in mlx5 core RX steering. Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower IPoIB offload netdev. Use hash-table to map between DQPN (Destination QP number) to child netdev for the IPoIB parent netdev to forward RX packets to the corresponding child Pkey netdev, since the RX rings are shared. The reset of the series adds the ipoib child Pkey: mlx5e netdev profile, netdev nods implementation and minimal set of ethtool callbacks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14net/mlx5: PTP code migration to driver core sectionFeras Daoud
PTP code is moved to core section of mlx5 driver in order to share it between ethernet and infiniband. This movement involves the following changes: - Change mlx5e_ prefix to be mlx5_ - Add clock structs to Core - Add clock object to mlx5_core_dev - Call Init/Uninit clock from core init/cleanup - Rename mlx5e_tstamp to be mlx5_clock Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Eitan Rabin <rabin@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-10-11net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infraJiri Pirko
The only user of cls_flower->egress_dev is mlx5. So do the conversion there alongside with the code originating the call in cls_flower function fl_hw_replace_filter to the newly introduced egress device callback infrastucture. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28net/mlx5e: Fix calculated checksum offloads countersGal Pressman
Instead of calculating the offloads counters, count them explicitly. The calculations done for these counters would result in bugs in some cases, for example: When running TCP traffic over a VXLAN tunnel with TSO enabled the following counters would increase: tx_csum_partial: 1,333,284 tx_csum_partial_inner: 29,286 tx4_csum_partial_inner: 384 tx7_csum_partial_inner: 8 tx9_csum_partial_inner: 34 tx10_csum_partial_inner: 26,807 tx11_csum_partial_inner: 287 tx12_csum_partial_inner: 27 tx16_csum_partial_inner: 6 tx25_csum_partial_inner: 1,733 Seems like tx_csum_partial increased out of nowhere. The issue is in the following calculation in mlx5e_update_sw_counters: s->tx_csum_partial = s->tx_packets - tx_offload_none - s->tx_csum_partial_inner; While tx_packets increases by the number of GSO segments for each SKB, tx_csum_partial_inner will only increase by one, resulting in wrong tx_csum_partial counter. Fixes: bfe6d8d1d433 ("net/mlx5e: Reorganize ethtool statistics") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-28net/mlx5e: Print netdev features correctly in error messageGal Pressman
Use the correct formatting for netdev features. Fixes: 0e405443e803 ("net/mlx5e: Improve set features ndo resiliency") Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Support ipv6 checksum offload in sunvnet driver, from Shannon Nelson. 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric Dumazet. 3) Allow generic XDP to work on virtual devices, from John Fastabend. 4) Add bpf device maps and XDP_REDIRECT, which can be used to build arbitrary switching frameworks using XDP. From John Fastabend. 5) Remove UFO offloads from the tree, gave us little other than bugs. 6) Remove the IPSEC flow cache, from Florian Westphal. 7) Support ipv6 route offload in mlxsw driver. 8) Support VF representors in bnxt_en, from Sathya Perla. 9) Add support for forward error correction modes to ethtool, from Vidya Sagar Ravipati. 10) Add time filter for packet scheduler action dumping, from Jamal Hadi Salim. 11) Extend the zerocopy sendmsg() used by virtio and tap to regular sockets via MSG_ZEROCOPY. From Willem de Bruijn. 12) Significantly rework value tracking in the BPF verifier, from Edward Cree. 13) Add new jump instructions to eBPF, from Daniel Borkmann. 14) Rework rtnetlink plumbing so that operations can be run without taking the RTNL semaphore. From Florian Westphal. 15) Support XDP in tap driver, from Jason Wang. 16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal. 17) Add Huawei hinic ethernet driver. 18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan Delalande. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits) i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq i40e: avoid NVM acquire deadlock during NVM update drivers: net: xgene: Remove return statement from void function drivers: net: xgene: Configure tx/rx delay for ACPI drivers: net: xgene: Read tx/rx delay for ACPI rocker: fix kcalloc parameter order rds: Fix non-atomic operation on shared flag variable net: sched: don't use GFP_KERNEL under spin lock vhost_net: correctly check tx avail during rx busy polling net: mdio-mux: add mdio_mux parameter to mdio_mux_init() rxrpc: Make service connection lookup always check for retry net: stmmac: Delete dead code for MDIO registration gianfar: Fix Tx flow control deactivation cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6 cxgb4: Fix pause frame count in t4_get_port_stats cxgb4: fix memory leak tun: rename generic_xdp to skb_xdp tun: reserve extra headroom only when XDP is set net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping net: dsa: bcm_sf2: Advertise number of egress queues ...
2017-09-03Merge tag 'for-linus-ioctl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is a big pull request. Of note is that I'm sending you the new ioctl API for the rdma subsystem. We put it up on linux-api@, but didn't get much response. The API is complex, but it solves two different problems in one go: 1) The bi-directional nature of the RDMA file write calls, which created the security hole we had to handle (and for which the fix is now causing problems for systems in production, we were a bit over zealous in the fix and the ability to open a device, then fork, then create new queue pairs on the device and use them is broken). 2) The bloat caused by different vendors implementing extensions to the base verbs API. Each vendor's hardware is slightly different, and the hardware might be suitable for one extension but not another. By the time we add generic extensions for all the different ways that the different hardware can offload things, the API becomes bloated. Things like our completion structs have started to exceed a cache line in size because of all the elements needed to support this. That in turn shows up heavily in the performance graphs with a noticable drop in performance on 100Gigabit links as our completion structs go from occupying one cache line to 1+. This API makes things like the completion structs modular in a very similar way to netlink so that your structs can only include the items needed for the offloads/features you are actually using on a given queue pair. In that way we support everything, but only use what we need, and our structs stay smaller. The ioctl API is better explained by the posting on linux-api@ than I can explain it here, so I'll just leave it at that. The rest of the pull request is typical stuff. Updates for 4.14 kernel merge window - Lots of hfi1 driver updates (mixed with a few qib and core updates as well) - rxe updates - various mlx updates - Set default roce type to RoCEv2 - Several larger fixes for bnxt_re that were too big for -rc - Several larger fixes for qedr that, likewise, were too big for -rc - Misc core changes - Make the hns_roce driver compilable on arches other than aarch64 so we can more easily debug build issues related to it - Add rdma-netlink infrastructure updates - Add automatic IRQ affinity infrastructure - Add 32bit lid support - Lots of misc fixes across the subsystem from random people - Autoloading of RDMA netlink modules - PCI pool cleanups from Romain Perier - mlx5 driver feature additions and fixes - Hardware tag matchine feature - Fix sleeping in atomic when resolving roce ah - Add experimental ioctl interface as posted to linux-api@" * tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits) IB/core: Expose ioctl interface through experimental Kconfig IB/core: Assign root to all drivers IB/core: Add completion queue (cq) object actions IB/core: Add legacy driver's user-data IB/core: Export ioctl enum types to user-space IB/core: Explicitly destroy an object while keeping uobject IB/core: Add macros for declaring methods and attributes IB/core: Add uverbs merge trees functionality IB/core: Add DEVICE object and root tree structure IB/core: Declare an object instead of declaring only type attributes IB/core: Add new ioctl interface RDMA/vmw_pvrdma: Fix a signedness RDMA/vmw_pvrdma: Report network header type in WC IB/core: Add might_sleep() annotation to ib_init_ah_from_wc() IB/cm: Fix sleeping in atomic when RoCE is used IB/core: Add support to finalize objects in one transaction IB/core: Add a generic way to execute an operation on a uobject Documentation: Hardware tag matching IB/mlx5: Support IB_SRQT_TM net/mlx5: Add XRQ support ...
2017-09-03net/mlx5e: Distribute RSS table among all RX ringsTariq Toukan
In default, uniformly distribute the RSS indirection table entries among all RX rings, rather than restricting this only to the rings on the close NUMA node. irqbalancer would anyway dynamically override the default affinities set to the RX rings. This gives better multi-stream performance and CPU util. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Stop NAPI when irq balancer changes affinityTariq Toukan
NAPI context keeps rescheduling on same CPU as long as it's busy. This doesn't give the oppurtunity for changes in irq affinities to take effect. Fix that by calling napi_complete_done() upon a change in affinity. This would stop the NAPI and reschedule it on the new CPU. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Use kernel's mechanism to avoid missing NAPIsTariq Toukan
We used a channel state bit MLX5E_CHANNEL_NAPI_SCHED to make sure no NAPI is missed when a channel's napi_schedule() is called for completion events of the different channel's resources/rings while NAPI is currently running. Now, as similar mechanism is implemented in kernel, ("39e6c8208d7b net: solve a NAPI race"), we obsolete our own implementation and rely on the return value of napi_complete_done(). This patch removes a redundant overhead of atomic bit operations. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Don't recycle page if moved to far NUMATariq Toukan
Avoid recycling an RX page if it moved to another NUMA node. Add an ethtool counter to count such events. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Remove unnecessary fields in ICO SQTariq Toukan
As of current design, in each NAPI, only a single UMR WQE completion could be available in the completion queue of the the internal control operations (ICO) send queue, in addition to nop operations that require no actions upon completion. This renders the consume index obsolete, as the wqe_counter field in CQE is sufficient. This helps removing a memory barrier, and obsoletes the need for tracking the num_wqebbs to update the consumer counter. In addition, remove other unused fields in icosq struct: pdev, dma_fifo_pc, and prev_cc. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Type-specific optimizations for RX post WQEs functionTariq Toukan
Separate the RX post WQEs function of the different RQ types. This enables RQ type-specific optimizations in data-path. Poll the ICOSQ completion queue only for Striding RQ, and only when a UMR post completion could be possibly available. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Non-atomic RQ state indicator for UMR WQE in progressTariq Toukan
The indication for a UMR WQE in progress is needed only within the NAPI context, and hence no races possible and no need for the use of atomic operations. The only place the flag is read outside of NAPI context is in closure flow, after RQ is disabled flag is no more accessed in NAPI. Use a boolean instead of a bit in ring state, so that its non-atomic set operations do not race with the atomic sets of the other bits. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Non-atomic indicator for ring enabled stateTariq Toukan
Rings enabled state change occurs in control path only, and is always followed by a napi_sychronize(), so that following NAPIs read the new value. This read does not need to be atomic. The RQ auto-moderation bit is not set/cleared in data-path. No need for atomic read, a regular read operation is sufficient. In RQ creation time as well, there's no multiple threads trying to access it yet, hence a regular read can be used. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Small enhancements for RX MPWQE allocation and freeTariq Toukan
The dma offset of a MPWQE (Multi-Packet WQE) in memory region is fixed for all rounds. Calculate it once on creation time, instead of in runtime. This also obsoletes the wqe argument in the function. In addition, optimize dma_info iterator calculation. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Remove unnecessary wqe_sz field from RQ bufferTariq Toukan
Field is used only locally within the RQ create function. The use of a local variable is sufficient. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Replace multiplication by stride size with a shiftTariq Toukan
In RX data-path, use shift operations instead of a regular multiplication by stride size, as it is a power of two. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03net/mlx5e: Reorganize struct mlx5e_rqTariq Toukan
Bring fast-path fields together, and combine RX WQE mutual exclusive fields into a union. Page-reuse and XDP are mutually exclusive and cannot be used at the same time. Use a union to combine their footprints. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31net/mlx5e: Support RSS for GRE tunneled packetsGal Pressman
Introduce a new flow table and indirect TIRs which are used to hash the inner packet headers of GRE tunneled packets. When a GRE tunneled packet is received, the TTC flow table will match the new IPv4/6->GRE rules which will forward it to the inner TTC table. The inner TTC is similar to its counterpart outer TTC table, but matching the inner packet headers instead of the outer ones (and does not include the new IPv4/6->GRE rules). The new rules will not add steering hops since they are added to an already existing flow group which will be matched regardless of this patch. Non GRE traffic will not be affected. The inner flow table will forward the packet to inner indirect TIRs which hash the inner packet and thus result in RSS for the tunneled packets. Testing 8 TCP streams bandwidth over GRE: System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] Before: 21.3 Gbps (Single RQ) Now : 90.5 Gbps (RSS spread on 8 RQs) Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-31net/mlx5e: Support TSO and TX checksum offloads for GRE tunnelsGal Pressman
Add TX offloads support for GRE tunneled packets by reporting the needed netdev features. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-30net/mlx5e: Fix CQ moderation mode not set properlyTal Gilboa
cq_period_mode assignment was mistakenly removed so it was always set to "0", which is EQE based moderation, regardless of the device CAPs and requested value in ethtool. Fixes: 6a9764efb255 ("net/mlx5e: Isolate open_channels from priv->params") Signed-off-by: Tal Gilboa <talgi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20net/mlx5e: Add RX buffer fullness countersGal Pressman
rx_buffer_passed_thres_phy - The number of events where the port RX buffer has passed a fullness threshold. rx_buffer_full_phy - The number of events where the port RX buffer has reached 100% fullness. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20net/mlx5e: Send PAOS command on interface up/downEran Ben Elisha
Upon interface up/down, driver will send PAOS (Ports Administrative and Operational Status Register) in order to inform the Firmware on the desired status of the port by the driver. Since now we might change physical link status on mlx5e_open/close, logical VF representor should not use mlx5e_open/close ndos as is, and should call the logical version mlx5e_open/closed_locked. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-18net/sched: Fix the logic error to decide the ingress qdiscChris Mi
The offending commit used a newly added helper function. But the logic is wrong. Without this fix, the affected NICs can't do HW offload. Error -EOPNOTSUPP will be returned directly. Fixes: a2e8da9378cc ("net/sched: use newly added classid identity helpers") Signed-off-by: Chris Mi <chrism@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net: sched: use newly added classid identity helpersJiri Pirko
Instead of checking handle, which does not have the inner class information and drivers wrongly assume clsact->egress as ingress, use the newly introduced classid identification helpers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08mlx5: move affinity hints assignments to generic codeSagi Grimberg
generic api takes care of spreading affinity similar to what mlx5 open coded (and even handles better asymmetric configurations). Ask the generic API to spread affinity for us, and feed him pre_vectors that do not participate in affinity settings (which is an improvement to what we had before). The affinity assignments should match what mlx5 tried to do earlier but now we do not set affinity to async, cmd and pages dedicated vectors. Also, remove mlx5e_get_cpu and introduce mlx5e_get_node (used for allocation purposes) and mlx5_get_vector_affinity (for indirection table construction) as they provide the needed information. Luckily, we have generic helpers to get cpumask and node given a irq vector. mlx5_get_vector_affinity will be used by mlx5_ib in a subsequent patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>