aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
AgeCommit message (Collapse)Author
2020-09-21net/mlx5e: Fix multicast counter not up-to-date in "ip -s"Ron Diskin
Currently the FW does not generate events for counters other than error counters. Unlike ".get_ethtool_stats", ".ndo_get_stats64" (which ip -s uses) might run in atomic context, while the FW interface is non atomic. Thus, 'ip' is not allowed to issue FW commands, so it will only display cached counters in the driver. Add a SW counter (mcast_packets) in the driver to count rx multicast packets. The counter also counts broadcast packets, as we consider it a special case of multicast. Use the counter value when calling "ip -s"/"ifconfig". Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality") Signed-off-by: Ron Diskin <rondi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Use synchronize_rcu to sync with NAPIMaxim Mikityanskiy
As described in the previous commit, napi_synchronize doesn't quite fit the purpose when we just need to wait until the currently running NAPI quits. Its implementation waits until NAPI is not running by polling and waiting for 1ms in between. In cases where we need to deactivate one queue (e.g., recovery flows) or where we deactivate them one-by-one (deactivate channel flow), we may get stuck in napi_synchronize forever if other queues keep NAPI active, causing a soft lockup. Depending on kernel configuration (CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC), it may result in a kernel panic. To fix the issue, use synchronize_rcu to wait for NAPI to quit, and wrap the whole NAPI in rcu_read_lock. Fixes: acc6c5953af1 ("net/mlx5e: Split open/close channels to stages") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-28net/mlx5e: Use indirect call wrappers for RX post WQEs functionsTariq Toukan
Use the indirect call wrapper API macros for declaration and scope of the RX post WQEs functions. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-28net/mlx5e: RX, Re-work initializaiton of RX function pointersTariq Toukan
Instead of exposing the RQ datapath handlers (from en_rx.c) so that they are set in the control path (in en_main.c), wrap this logic in a single function in en_rx.c and expose it alone. Every profile will now have a pointer to the new mlx5e_rx_handlers structure, instead of directly pointing to the previously-exposed RQ handlers. This significantly improves locality and modularity of the driver, and allows many functions in en_rx.c to become static. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-16net/mlx5e: RX, Avoid indirect call in representor CQE handlingTariq Toukan
Use INDIRECT_CALL_2() helper to avoid the cost of the indirect call when/if CONFIG_RETPOLINE=y. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-16net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offloadRaed Salem
On receive flow inspect received packets for IPsec offload indication using the cqe, for IPsec offloaded packets propagate offload status and stack handle to stack for further processing. Supported statuses: - Offload ok. - Authentication failure. - Bad trailer indication. Connect-X IPsec does not use mlx5e_ipsec_handle_rx_cqe. For RX only offload, we see the BW gain. Below is the iperf3 performance report on two server of 24 cores Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with ConnectX6-DX. We use one thread per IPsec tunnel. --------------------------------------------------------------------- Mode | Num tunnel | BW | Send CPU util | Recv CPU util | | (Gbps) | (Average %) | (Average %) --------------------------------------------------------------------- Cryto offload | 1 | 4.6 | 4.2 | 14.5 --------------------------------------------------------------------- Cryto offload | 24 | 38 | 73 | 63 --------------------------------------------------------------------- Non-offload | 1 | 4 | 4 | 13 --------------------------------------------------------------------- Non-offload | 24 | 23 | 52 | 67 Signed-off-by: Raed Salem <raeds@mellanox.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-02net/mlx5e: Remove redundant RQ state queryAya Levin
When received a CQE error, the driver inspect the syndrome given by the firmware. RQ recovery is initiated only as a result of a fatal syndrome; syndrome which set the RQ into an error state. Hence no need to query the RQ state at the beginning of the recovery process. Add additional debug prints before recovering. Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27net/mlx5e: kTLS, Improve rx handler function callTariq Toukan
Prior to this patch mlx5e tls rx handler was called unconditionally on all rx frames and the decision whether a frame is a valid tls record is done inside that function. A function call can be expensive especially for regular rx packet rate. To avoid this, check the tls validity before jumping into the tls rx handler. While at it, split between kTLS device offload rx handler and FPGA tls rx handler using a similar method. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
2020-06-27net/mlx5e: kTLS, Add kTLS RX resync supportTariq Toukan
Implement the RX resync procedure, using the TLS async resync API. The HW offload of TLS decryption in RX side might get out-of-sync due to out-of-order reception of packets. This requires SW intervention to update the HW context and get it back in-sync. Performance: CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off NIC: ConnectX-6 Dx 100GbE dual port Goodput (app-layer throughput) comparison: +---------------+-------+-------+---------+ | # connections | 1 | 4 | 8 | +---------------+-------+-------+---------+ | SW (Gbps) | 7.26 | 24.70 | 50.30 | +---------------+-------+-------+---------+ | HW (Gbps) | 18.50 | 64.30 | 92.90 | +---------------+-------+-------+---------+ | Speedup | 2.55x | 2.56x | 1.85x * | +---------------+-------+-------+---------+ * After linerate is reached, diff is observed in CPU util. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27net/mlx5e: kTLS, Add kTLS RX HW offload supportTariq Toukan
Implement driver support for the kTLS RX HW offload feature. Resync support is added in a downstream patch. New offload contexts post their static/progress params WQEs over the per-channel async ICOSQ, protected under a spin-lock. The Channel/RQ is selected according to the socket's rxq index. Feature is OFF by default. Can be turned on by: $ ethtool -K <if> tls-hw-rx-offload on A new TLS-RX workqueue is used to allow asynchronous addition of steering rules, out of the NAPI context. It will be also used in a downstream patch in the resync procedure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-29mlx5: fix xdp data_meta setup in mlx5e_fill_xdp_buffJesper Dangaard Brouer
The helper function xdp_set_data_meta_invalid() must be called after setting xdp->data as it depends on it. The bug was introduced in the cited patch below, and cause the kernel to crash when using BPF helper bpf_xdp_adjust_head() on mlx5 driver. Fixes: 39d6443c8daf ("mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOL") Reported-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-23Merge tag 'mlx5-updates-2020-05-22' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-05-22 This series includes two updates and one cleanup patch 1) Tang Bim, clean-up with IS_ERR() usage 2) Vlad introduces a new mlx5 kconfig flag for TC support This is required due to the high volume of current and upcoming development in the eswitch and representors areas where some of the feature are TC based such as the downstream patches of MPLSoUDP and the following representor bonding support for VF live migration and uplink representor dynamic loading. For this Vlad kept TC specific code in tc.c and rep/tc.c and organized non TC code in representors specific files. 3) Eli Cohen adds support for MPLS over UPD encap and decap TC offloads. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-05-23 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 8 day(s) which contain a total of 109 files changed, 2776 insertions(+), 2887 deletions(-). The main changes are: 1) Add a new AF_XDP buffer allocation API to the core in order to help lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe as well as mlx5 have been moved over to the new API and also gained a small improvement in performance, from Björn Töpel and Magnus Karlsson. 2) Add getpeername()/getsockname() attach types for BPF sock_addr programs in order to allow for e.g. reverse translation of load-balancer backend to service address/port tuple from a connected peer, from Daniel Borkmann. 3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers being non-NULL, e.g. if after an initial test another non-NULL test on that pointer follows in a given path, then it can be pruned right away, from John Fastabend. 4) Larger rework of BPF sockmap selftests to make output easier to understand and to reduce overall runtime as well as adding new BPF kTLS selftests that run in combination with sockmap, also from John Fastabend. 5) Batch of misc updates to BPF selftests including fixing up test_align to match verifier output again and moving it under test_progs, allowing bpf_iter selftest to compile on machines with older vmlinux.h, and updating config options for lirc and v6 segment routing helpers, from Stanislav Fomichev, Andrii Nakryiko and Alan Maguire. 6) Conversion of BPF tracing samples outdated internal BPF loader to use libbpf API instead, from Daniel T. Lee. 7) Follow-up to BPF kernel test infrastructure in order to fix a flake in the XDP selftests, from Jesper Dangaard Brouer. 8) Minor improvements to libbpf's internal hashmap implementation, from Ian Rogers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22net/mlx5e: Extract TC-specific code from en_rep.c to rep/tc.cVlad Buslov
As a preparation for introducing new kconfig option that controls compilation of all TC offloads code in mlx5, extract TC-specific code from en_rep.c to standalone file. This allows easily compiling out the code by only including new source in make file when corresponding kconfig is enabled instead of adding multiple ifdef blocks to en_rep. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-21mlx5, xsk: Migrate to new MEM_TYPE_XSK_BUFF_POOLBjörn Töpel
Use the new MEM_TYPE_XSK_BUFF_POOL API in lieu of MEM_TYPE_ZERO_COPY in mlx5e. It allows to drop a lot of code from the driver (which is now common in AF_XDP core and was related to XSK RX frame allocation, DMA mapping, etc.) and slightly improve performance (RX +0.8 Mpps, TX +0.4 Mpps). rfc->v1: Put back the sanity check for XSK params, use XSK API to get the total headroom size. (Maxim) v1->v2: Fix DMA address handling, set XDP metadata to invalid. (Maxim) v2->v3: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use xsk_buff API for DMA sync on TX, add performance numbers. (Maxim) v3->v4: Remove unused variable num_xsk_frames. (Jakub) Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-12-bjorn.topel@gmail.com
2020-05-15net/mlx5e: IPoIB, Drop multicast packets that this interface sentErez Shitrit
After enabled loopback packets for IPoIB, we need to drop these packets that this HCA has replicated and came back to the same interface that sent them. Fixes: 4c6c615e3f30 ("net/mlx5e: IPoIB, Add PKEY child interface nic profile") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-14mlx5: Rx queue setup time determine frame_sz for XDPJesper Dangaard Brouer
The mlx5 driver have multiple memory models, which are also changed according to whether a XDP bpf_prog is attached. The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.: # ethtool --set-priv-flags mlx5p2 rx_striding_rq off On the general case with 4K page_size and regular MTU packet, then the frame_sz is 2048 and 4096 when XDP is enabled, in both modes. The info on the given frame size is stored differently depending on the RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe. In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ). In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is what the XDP case cares about. To reduce effect on fast-path, this patch determine the frame_sz at setup time, to avoid determining the memory model runtime. Variable is named frame0_sz to make it clear that this is only the frame size of the first fragment. This mlx5 driver does a DMA-sync on XDP_TX action, but grow is safe as it have done a DMA-map on the entire PAGE_SIZE. The driver also already does a XDP length check against sq->hw_mtu on the possible XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe(). V3+4: Change variable name first_frame_sz to frame0_sz V2: Fix that frag_size need to be recalc before creating SKB. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul
2020-05-09net/mlx5e: Enhance ICOSQ WQE info fieldsTariq Toukan
The same WQE opcode might be used in different ICOSQ flows and WQE types. To have a better distinguishability, replace it with an enum that better indicates the WQE type and flow it is used for. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-09net/mlx5e: Use struct assignment for WQE info updatesTariq Toukan
Struct assignment looks more clean, and implies resetting the not assigned fields to zero, instead of holding values from older ring cycles. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-30net/mlx5e: Unify reserving space for WQEsMaxim Mikityanskiy
In our fast-path design, a WQE (Work Queue Element) must not cross the page boundary. To enforce that, for WQEs consisting of more than one BB (Basic Block), the driver checks the available contiguous space in the WQ in advance, and if it's not enough, it pads it with NOPs. This patch modifies the code that calculates the position of next WQE, considering the padding, and prepares the WQE. This code is common for all SQ types. In this patch it's reorganized in a way that makes the usage pattern unified for all SQ types, and makes the implementations self-contained and look almost the same, preparing the repeating code to further attempts to deduplicate it. One place is left as is: mlx5e_sq_xmit and mlx5e_fill_sq_frag_edge call inside, because it is special in a way that it may also copy WQE's cseg and eseg when reserving space. This will be eliminated in one of the following patches, and this place will be converted to the new approach, too. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-30net/mlx5e: Rename ICOSQ WQE info struct and fieldMaxim Mikityanskiy
Structs mlx5e_txqsq and mlx5e_xdpsq contain wqe_info arrays to store supplementary information corresponding to WQEs in the queue. Struct mlx5e_icosq also has such an array, but it's called differently - ico_wqe. This patch renames it to unify with the other SQs. In addition, rename the struct to emphasize its specific usage. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-30net/mlx5e: TX, Generalise code and usage of error CQE dumpTariq Toukan
Error CQE was dumped only for TXQ SQs. Generalise the function, and add usage for error completions on ICO SQs and XDP SQs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-20net/mlx5e: Don't trigger IRQ multiple times on XSK wakeup to avoid WQ overrunsMaxim Mikityanskiy
XSK wakeup function triggers NAPI by posting a NOP WQE to a special XSK ICOSQ. When the application floods the driver with wakeup requests by calling sendto() in a certain pattern that ends up in mlx5e_trigger_irq, the XSK ICOSQ may overflow. Multiple NOPs are not required and won't accelerate the process, so avoid posting a second NOP if there is one already on the way. This way we also avoid increasing the queue size (which might not help anyway). Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24net/mlx5e: Enhance ICOSQ WQE info fieldsAya Levin
Add number of WQEBBs (WQE's Basic Block) to WQE info struct. Set the number of WQEBBs on WQE post, and increment the consumer counter (cc) on completion. In case of error completions, the cc was mistakenly not incremented, keeping a gap between cc and pc (producer counter). This failed the recovery flow on the ICOSQ from a CQE error which timed-out waiting for the cc and pc to meet. Fixes: be5323c8379f ("net/mlx5e: Report and recover from CQE error on ICOSQ") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-12Merge branch 'ct-offload' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
2020-02-25net/mlx5e: RX, Use indirect calls wrapper for handling compressed completionsTariq Toukan
We can avoid an indirect call per compressed completion wrapping the completion handling call with the appropriate helper. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19net/mlx5e: Restore tunnel metadata on missPaul Blakey
In tunnel and chains setup, we decapsulate the packets on first chain hop, if we miss on later chains, the packet will comes up without tunnel header, so it won't be taken by the tunnel device automatically, which fills the tunnel metadata, and further tc tunnel matches won't work. On miss, we get the tunnel mapping id, which was set on the chain 0 rule that decapsulated the packet. This rule matched the tunnel outer headers. From the tunnel mapping id, we get to this tunnel matches and restore the equivalent tunnel info metadata dst on the skb. We also set the skb->dev to the relevant device (tunnel device). Now further tc processing can be done on the relevant device. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19net/mlx5: E-Switch, Restore chain id on missPaul Blakey
Chain ids are mapped to the lower part of reg C, and after loopback are copied to to CQE via a restore rule's flow_tag. To let tc continue in the correct chain, we find the corresponding chain id in the eswitch chain id <-> reg C mapping, and set the SKB's tc extension chain to it. That tells tc to continue processing from this set chain. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19net/mlx5e: Rx, Split rep rx mpwqe handler from nicPaul Blakey
Copy the current rep mpwqe rx handler which is also used by nic profile. In the next patch, we will add rep specific logic, just for the rep profile rx handler. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-06net/mlx5e: TX, Error completion is for last WQE in batchTariq Toukan
For a cyclic work queue, when not requesting a completion per WQE, a single CQE might indicate the completion of several WQEs. However, in case some WQE in the batch causes an error, then an error completion is issued, breaking the batch, and pointing to the offending WQE in the wqe_counter field. Hence, WQE-specific error CQE handling (like printing, breaking, etc...) should be performed only for the last WQE in batch. Fixes: 130c7b46c93d ("net/mlx5e: TX, Dump WQs wqe descriptors on CQE with error events") Fixes: fd9b4be8002c ("net/mlx5e: RX, Support multiple outstanding UMR posts") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-11-20net/mlx5e: Rx, Update page pool numa node when changedSaeed Mahameed
Once every napi poll cycle, check if numa node is different than the page pool's numa id, and update it using page_pool_update_nid(). Alternatively, we could have registered an irq affinity change handler, but page_pool_update_nid() must be called from napi context anyways, so the handler won't actually help. Performance testing: XDP drop/tx rate and TCP single/multi stream, on mlx5 driver while migrating rx ring irq from close to far numa: mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core: NUMA | XDP | Before | After --------------------------------------- Close | Drop | 11 Mpps | 10.9 Mpps Far | Drop | 4.4 Mpps | 5.8 Mpps Close | TX | 6.5 Mpps | 6.5 Mpps Far | TX | 3.5 Mpps | 4 Mpps Improvement is about 30% drop packet rate, 15% tx packet rate for numa far test. No degradation for numa close tests. TCP single/multi cpu/stream: NUMA | #cpu | Before | After -------------------------------------- Close | 1 | 18 Gbps | 18 Gbps Far | 1 | 15 Gbps | 18 Gbps Close | 12 | 80 Gbps | 80 Gbps Far | 12 | 68 Gbps | 80 Gbps In all test cases we see improvement for the far numa case, and no impact on the close numa case. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budgetMaxim Mikityanskiy
When CQE compression is enabled, compressed CQEs use the following structure: a title is followed by one or many blocks, each containing 8 mini CQEs (except the last, which may contain fewer mini CQEs). Due to NAPI budget restriction, a complete structure is not always parsed in one NAPI run, and some blocks with mini CQEs may be deferred to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call in the beginning of mlx5e_poll_rx_cq. However, if the budget is extremely low, some blocks may be left even after that, but the code that follows the mlx5e_decompress_cqes_cont call doesn't check it and assumes that a new CQE begins, which may not be the case. In such cases, random memory corruptions occur. An extremely low NAPI budget of 8 is used when busy_poll or busy_read is active. This commit adds a check to make sure that the previous compressed CQE has been completely parsed after mlx5e_decompress_cqes_cont, otherwise it prevents a new CQE from being fetched in the middle of a compressed CQE. This commit fixes random crashes in __build_skb, __page_pool_put_page and other not-related-directly places, that used to happen when both CQE compression and busy_poll/busy_read were enabled. Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-09-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add the ability to use unaligned chunks in the AF_XDP umem. By relaxing where the chunks can be placed, it allows to use an arbitrary buffer size and place whenever there is a free address in the umem. Helps more seamless DPDK AF_XDP driver integration. Support for i40e, ixgbe and mlx5e, from Kevin and Maxim. 2) Addition of a wakeup flag for AF_XDP tx and fill rings so the application can wake up the kernel for rx/tx processing which avoids busy-spinning of the latter, useful when app and driver is located on the same core. Support for i40e, ixgbe and mlx5e, from Magnus and Maxim. 3) bpftool fixes for printf()-like functions so compiler can actually enforce checks, bpftool build system improvements for custom output directories, and addition of 'bpftool map freeze' command, from Quentin. 4) Support attaching/detaching XDP programs from 'bpftool net' command, from Daniel. 5) Automatic xskmap cleanup when AF_XDP socket is released, and several barrier/{read,write}_once fixes in AF_XDP code, from Björn. 6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf inclusion as well as libbpf versioning improvements, from Andrii. 7) Several new BPF kselftests for verifier precision tracking, from Alexei. 8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya. 9) And more BPF kselftest improvements all over the place, from Stanislav. 10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub. 11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan. 12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni. 13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin. 14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar. 15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari, Peter, Wei, Yue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-20net/mlx5e: Report and recover from CQE with error on RQAya Levin
Add support for report and recovery from error on completion on RQ by setting the queue back to ready state. Handle only errors with a syndrome indicating the RQ might enter error state and could be recovered. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-20net/mlx5e: RX, Handle CQE with error at the earliest stageSaeed Mahameed
Just to be aligned with the MPWQE handlers, handle RX WQE with error for legacy RQs in the top RX handlers, just before calling skb_from_cqe(). CQE error handling will now be called at the same stage regardless of the RQ type or netdev mode NIC, Representor, IPoIB, etc .. This will be useful for down stream patch to improve error CQE handling. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-20net/mlx5e: Report and recover from CQE error on ICOSQAya Levin
Add support for report and recovery from error on completion on ICOSQ. Deactivate RQ and flush, then deactivate ICOSQ. Set the queue back to ready state (firmware) and reset the ICOSQ and the RQ (software resources). Finally, activate the ICOSQ and the RQ. Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-17net/mlx5e: Add AF_XDP need_wakeup supportMaxim Mikityanskiy
This commit adds support for the new need_wakeup feature of AF_XDP. The applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag. When this feature is enabled, some behavior changes: RX side: If the Fill Ring is empty, instead of busy-polling, set the flag to tell the application to kick the driver when it refills the Fill Ring. TX side: If there are pending completions or packets queued for transmission, set the flag to tell the application that it can skip the sendto() syscall and save time. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled: | without need_wakeup | with need_wakeup | |----------------------|----------------------| | one core | two cores | one core | two cores | -------|----------|-----------|----------|-----------| txonly | 20.1 | 33.5 | 29.0 | 34.2 | rxdrop | 0.065 | 14.1 | 12.0 | 14.1 | l2fwd | 0.032 | 7.3 | 6.6 | 7.2 | "One core" means the application and NAPI run on the same core. "Two cores" means they are pinned to different cores. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-01net/mlx5e: Rx, checksum handling refactoringSaeed Mahameed
Move vlan checksum fixup flow into mlx5e_skb_padding_csum(), which is supposed to fixup SKB checksum if needed. And rename mlx5e_skb_padding_csum() to mlx5e_skb_csum_fixup(). Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-11Merge tag 'mlx5-fixes-2019-07-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== Mellanox, mlx5 fixes 2019-07-11 This series introduces some fixes to mlx5 driver. Please pull and let me know if there is any problem. For -stable v4.15 ('net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn') For -stable v5.1 ('net/mlx5e: Fix port tunnel GRE entropy control') ('net/mlx5e: Rx, Fix checksum calculation for new hardware') ('net/mlx5e: Fix return value from timeout recover function') ('net/mlx5e: Fix error flow in tx reporter diagnose') For -stable v5.2 ('net/mlx5: E-Switch, Fix default encap mode') Conflict note: This pull request will produce a small conflict when merged with net-next. In drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c Take the hunk from net and replace: esw_offloads_steering_init(esw, vf_nvports, total_nvports); with: esw_offloads_steering_init(esw); ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11net/mlx5e: Rx, Fix checksum calculation for new hardwareSaeed Mahameed
CQE checksum full mode in new HW, provides a full checksum of rx frame. Covering bytes starting from eth protocol up to last byte in the received frame (frame_size - ETH_HLEN), as expected by the stack. Fixing up skb->csum by the driver is not required in such case. This fix is to avoid wrong checksum calculation in drivers which already support the new hardware with the new checksum mode. Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-27net/mlx5e: Add XSK zero-copy supportMaxim Mikityanskiy
This commit adds support for AF_XDP zero-copy RX and TX. We create a dedicated XSK RQ inside the channel, it means that two RQs are running simultaneously: one for non-XSK traffic and the other for XSK traffic. The regular and XSK RQs use a single ID namespace split into two halves: the lower half is regular RQs, and the upper half is XSK RQs. When any zero-copy AF_XDP socket is active, changing the number of channels is not allowed, because it would break to mapping between XSK RQ IDs and channels. XSK requires different page allocation and release routines. Such functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are generic enough to be used for both regular and XSK RQs, and they use the mlx5e_page_{alloc,release} wrappers around the real allocation functions. Function pointers are not used to avoid losing the performance with retpolines. Wherever it's certain that the regular (non-XSK) page release function should be used, it's called directly. Only the stats that could be meaningful for XSK are exposed to the userspace. Those that don't take part in the XSK flow are not considered. Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ), because the newer xdpsock sample doesn't provide any Fill Ring entries at the setup stage. We create a dedicated XSK SQ in the channel. This separation has its advantages: 1. When the UMEM is closed, the XSK SQ can also be closed and stop receiving completions. If an existing SQ was used for XSK, it would continue receiving completions for the packets of the closed socket. If a new UMEM was opened at that point, it would start getting completions that don't belong to it. 2. Calculating statistics separately. When the userspace kicks the TX, the driver triggers a hardware interrupt by posting a NOP to a dedicated XSK ICO (internal control operations) SQ, in order to trigger NAPI on the right CPU core. This XSK ICO SQ is protected by a spinlock, as the userspace application may kick the TX from any core. Store the pointers to the UMEMs in the net device private context, independently from the kernel. This way the driver can distinguish between the zero-copy and non-zero-copy UMEMs. The kernel function xdp_get_umem_from_qid does not care about this difference, but the driver is only interested in zero-copy UMEMs, particularly, on the cleanup it determines whether to close the XSK RQ and SQ or not by looking at the presence of the UMEM. Use state_lock to protect the access to this area of UMEM pointers. LRO isn't compatible with XDP, but there may be active UMEMs while XDP is off. If this is the case, don't allow LRO to ensure XDP can be reenabled at any time. The validation of XSK parameters typically happens when XSK queues open. However, when the interface is down or the XDP program isn't set, it's still possible to have active AF_XDP sockets and even to open new, but the XSK queues will be closed. To cover these cases, perform the validation also in these flows: 1. A new UMEM is registered, but the XSK queues aren't going to be created due to missing XDP program or interface being down. 2. MTU changes while there are UMEMs registered. Having this early check prevents mlx5e_open_channels from failing at a later stage, where recovery is impossible and the application has no chance to handle the error, because it got the successful return value for an MTU change or XSK open operation. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Allow ICO SQ to be used by multiple RQsMaxim Mikityanskiy
Prepare to creation of the XSK RQ, which will require posting UMRs, too. The same ICO SQ will be used for both RQs and also to trigger interrupts by posting NOPs. UMR WQEs can't be reused any more. Optimization introduced in commit ab966d7e4ff98 ("net/mlx5e: RX, Recycle buffer of UMR WQEs") is reverted. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19mlx5: more strict use of page_pool APIJesper Dangaard Brouer
The mlx5 driver is using page_pool, but not for DMA-mapping (currently), and is a little too relaxed about returning or releasing page resources, as it is not strictly necessary, when not using DMA-mappings. As this patchset is working towards tracking page_pool resources, to know about in-flight frames on shutdown. Then fix places where mlx5 leak page_pool resource. In case of dma_mapping_error, then recycle into page_pool. In mlx5e_free_rq() moved the page_pool_destroy() call to after the mlx5e_page_release() calls, as it is more correct. In mlx5e_page_release() when no recycle was requested, then release page from the page_pool, via page_pool_release_page(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/mlx5e: use indirect calls wrapper for the rx packet handlerPaolo Abeni
We can avoid another indirect call per packet wrapping the rx handler call with the proper helper. To ensure that even the last listed direct call experience measurable gain, despite the additional conditionals we must traverse before reaching it, I tested reversing the order of the listed options, with performance differences below noise level. Together with the previous indirect call patch, this gives ~6% performance improvement in raw UDP tput. v2 -> v3: - use only the direct calls always available regardless of the mlx5 build options - drop the direct call list macro, to keep the code as simple as possible for future rework v1 -> v2: - update the direct call list and use a macro to define it, as per Saeed suggestion. An intermediated additional macro is needed to allow arg list expansion Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net/mlx5e: use indirect calls wrapper for skb allocationPaolo Abeni
We can avoid an indirect call per packet wrapping the skb creation with the appropriate helper. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23net/mlx5e: RX, Support multiple outstanding UMR postsTariq Toukan
The buffers mapping of the Multi-Packet WQEs (of Striding RQ) is done via UMR posts, one UMR WQE per an RX MPWQE. A single MPWQE is capable of serving many incoming packets, usually larger than the budget of a single napi cycle. Hence, posting a single UMR WQE per napi cycle (and handling its completion in the next cycle) works fine in many common cases, but not always. When an XDP program is loaded, every MPWQE is capable of serving less packets, to satisfy the packet-per-page requirement. Thus, for the same number of packets more MPWQEs (and UMR posts) are needed (twice as much for the default MTU), giving less latency room for the UMR completions. In this patch, we add support for multiple outstanding UMR posts, to allow faster gap closure between consuming MPWQEs and reposting them back into the WQ. For better SW and HW locality, we combine the UMR posts in bulks of (at least) two. This is expected to improve packet rate in high CPU scale. Performance test: As expected, huge improvement in large-scale (48 cores). xdp_redirect_map, 64B UDP multi-stream. Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Before: Unstable, 7 to 30 Mpps After: Stable, at 70.5 Mpps No degradation in other tested scenarios. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09Revert "net/mlx5e: Enable reporting checksum unnecessary also for L3 packets"Or Gerlitz
This reverts commit b820e6fb0978f9c2ac438c199d2bb2f35950e9c9. Prior the commit we are reverting, checksum unnecessary was only set when both the L3 OK and L4 OK bits are set on the CQE. This caused packets of IP protocols such as SCTP which are not dealt by the current HW L4 parser (hence the L4 OK bit is not set, but the L4 header type none bit is set) to go through the checksum none code, where currently we wrongly report checksum unnecessary for them, a regression. Fix this by a revert. Note that on our usual track we report checksum complete, so the revert isn't expected to have any notable performance impact. Also, when we are not on the checksum complete track, the L4 protocols for which we report checksum none are not high performance ones, we will still report checksum unnecessary for UDP/TCP. Fixes: b820e6fb0978 ("net/mlx5e: Enable reporting checksum unnecessary also for L3 packets") Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Avi Urman <aviu@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09net/mlx5e: Rx, Check ip headers sanitySaeed Mahameed
In the two places is_last_ethertype_ip is being called, the caller will be looking inside the ip header, to be safe, add ip{4,6} header sanity check. And return true only on valid ip headers, i.e: the whole header is contained in the linear part of the skb. Note: Such situation is very rare and hard to reproduce, since mlx5e allocates a large enough headroom to contain the largest header one can imagine. Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09net/mlx5e: Rx, Fixup skb checksum for packets with tail paddingSaeed Mahameed
When an ethernet frame with ip payload is padded, the padding octets are not covered by the hardware checksum. Prior to the cited commit, skb checksum was forced to be CHECKSUM_NONE when padding is detected. After it, the kernel will try to trim the padding bytes and subtract their checksum from skb->csum. In this patch we fixup skb->csum for any ip packet with tail padding of any size, if any padding found. FCS case is just one special case of this general purpose patch, hence, it is removed. Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"), Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>