aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking/af_xdp.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking/af_xdp.rst')
-rw-r--r--Documentation/networking/af_xdp.rst262
1 files changed, 240 insertions, 22 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 2ccc5644cc98..dceeb0d763aa 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -243,8 +243,8 @@ Configuration Flags and Socket Options
These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.
-XDP_COPY and XDP_ZERO_COPY bind flags
--------------------------------------
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
When you bind to a socket, the kernel will first try to use zero-copy
copy. If zero-copy is not supported, it will fall back on using copy
@@ -252,7 +252,7 @@ mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
-fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.
XDP_SHARED_UMEM bind flag
@@ -290,19 +290,19 @@ round-robin example of distributing packets is shown below:
#define MAX_SOCKS 16
struct {
- __uint(type, BPF_MAP_TYPE_XSKMAP);
- __uint(max_entries, MAX_SOCKS);
- __uint(key_size, sizeof(int));
- __uint(value_size, sizeof(int));
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __uint(max_entries, MAX_SOCKS);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");
static unsigned int rr;
SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
- rr = (rr + 1) & (MAX_SOCKS - 1);
+ rr = (rr + 1) & (MAX_SOCKS - 1);
- return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}
Note, that since there is only a single set of FILL and COMPLETION
@@ -379,7 +379,7 @@ would look like this for the TX path:
.. code-block:: c
if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
- sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
I.e., only use the syscall if the flag is set.
@@ -419,7 +419,7 @@ XDP_UMEM_REG setsockopt
-----------------------
This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can recide in. The call takes a
+contain all the buffers that packet can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has parameter called chunk_size that is the size that the UMEM is
divided into. It can only be 2K or 4K at the moment. If you have an
@@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie AF_XDP
+socket to a particular network interface. It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, kernel will refuse attempts to bind that socket
+to a different interface. Updating the value requires CAP_NET_RAW.
+
XDP_STATISTICS getsockopt
-------------------------
@@ -442,9 +451,9 @@ purposes. The supported statistics are shown below:
.. code-block:: c
struct xdp_statistics {
- __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
- __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
- __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
XDP_OPTIONS getsockopt
@@ -453,8 +462,92 @@ XDP_OPTIONS getsockopt
Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+ frame. In the case the packet consists of a single frame, the
+ descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+ descriptors/frames of this packet are marked as invalid and not
+ completed. The next descriptor is treated as the start of a new
+ packet, even if this was not the intent (because we cannot guess
+ the intent). As before, if your program is producing invalid
+ descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+ equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+ descriptors accumulated so far are dropped and treated as
+ invalid. To produce an application that will work on any system
+ regardless of this config setting, limit the number of frags to 18,
+ as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+ supports. Usually at least five on the NICs we have checked. We
+ consciously chose to not enforce a rigid limit (such as
+ CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+ resulted in copy actions under the hood to fit into what limit the
+ NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+ probe for this limit is explained in the "probe for multi-buffer
+ support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
Usage
-=====
+-----
In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
@@ -483,15 +576,15 @@ like this:
.. code-block:: c
// struct xdp_rxtx_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // struct xdp_desc *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
// };
// struct xdp_umem_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // __u64 *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
// };
// typedef struct xdp_rxtx_ring RING;
@@ -532,6 +625,131 @@ like this:
But please use the libbpf functions as they are optimized and ready to
use. Will make your life easier.
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+ void rx_packets(struct xsk_socket_info *xsk)
+ {
+ static bool new_packet = true;
+ u32 idx_rx = 0, idx_fq = 0;
+ static char *pkt;
+
+ int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+ xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+ for (int i = 0; i < rcvd; i++) {
+ struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+ char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+ bool eop = !(desc->options & XDP_PKT_CONTD);
+
+ if (new_packet)
+ pkt = frag;
+ else
+ add_frag_to_pkt(pkt, frag);
+
+ if (eop)
+ process_pkt(pkt);
+
+ new_packet = eop;
+
+ *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+ }
+
+ xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+ xsk_ring_cons__release(&xsk->rx, rcvd);
+ }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+ void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+ int batch_size)
+ {
+ u32 idx, i, pkt_nb = 0;
+
+ xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+ for (i = 0; i < batch_size;) {
+ u64 addr = pkts[pkt_nb].addr;
+ u32 len = pkts[pkt_nb].size;
+
+ do {
+ struct xdp_desc *tx_desc;
+
+ tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+ tx_desc->addr = addr;
+
+ if (len > xsk_frame_size) {
+ tx_desc->len = xsk_frame_size;
+ tx_desc->options = XDP_PKT_CONTD;
+ } else {
+ tx_desc->len = len;
+ tx_desc->options = 0;
+ pkt_nb++;
+ }
+ len -= tx_desc->len;
+ addr += xsk_frame_size;
+
+ if (i == batch_size) {
+ /* Remember len, addr, pkt_nb for next iteration.
+ * Skipped for simplicity.
+ */
+ break;
+ }
+ } while (len);
+ }
+
+ xsk_ring_prod__submit(&xsk->tx, i);
+ }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+ one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+ returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.
+
Sample application
==================
@@ -592,7 +810,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
A number of other ways are possible all up to the capabilities of
the NIC you have.
-Q: Can I use the XSKMAP to implement a switch betwen different umems
+Q: Can I use the XSKMAP to implement a switch between different umems
in copy mode?
A: The short answer is no, that is not supported at the moment. The