aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/bpf
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/bpf')
-rw-r--r--Documentation/bpf/bpf_design_QA.rst53
-rw-r--r--Documentation/bpf/btf.rst907
-rw-r--r--Documentation/bpf/index.rst17
-rw-r--r--Documentation/bpf/prog_cgroup_sysctl.rst125
-rw-r--r--Documentation/bpf/prog_flow_dissector.rst126
5 files changed, 1214 insertions, 14 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index 7cc9e368c1e9..cb402c59eca5 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -36,27 +36,27 @@ consideration important quirks of other architectures) and
defines calling convention that is compatible with C calling
convention of the linux kernel on those architectures.
-Q: can multiple return values be supported in the future?
+Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as return value.
-Q: can more than 5 function arguments be supported in the future?
+Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike x64 ISA that allows msft, cdecl and other conventions)
-Q: can BPF programs access instruction pointer or return address?
+Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.
-Q: can BPF programs access stack pointer ?
+Q: Can BPF programs access stack pointer ?
------------------------------------------
A: NO.
Only frame pointer (register R10) is accessible.
From compiler point of view it's necessary to have stack pointer.
-For example LLVM defines register R11 as stack pointer in its
+For example, LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.
Q: Does C-calling convention diminishes possible use cases?
@@ -66,8 +66,8 @@ A: YES.
BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets kernel call into
-BPF programs and programs call kernel helpers with zero overhead.
-As all of them were native C code. That is particularly the case
+BPF programs and programs call kernel helpers with zero overhead,
+as all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.
@@ -75,9 +75,9 @@ Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.
-At least for now until BPF core has support for
+At least for now, until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
-jump tables, read only sections and all other normal constructs
+jump tables, read-only sections, and all other normal constructs
that C code can produce.
Q: Can loops be supported in a safe way?
@@ -85,8 +85,33 @@ Q: Can loops be supported in a safe way?
A: It's not clear yet.
BPF developers are trying to find a way to
-support bounded loops where the verifier can guarantee that
-the program terminates in less than 4096 instructions.
+support bounded loops.
+
+Q: What are the verifier limits?
+--------------------------------
+A: The only limit known to the user space is BPF_MAXINSNS (4096).
+It's the maximum number of instructions that the unprivileged bpf
+program can have. The verifier has various internal limits.
+Like the maximum number of instructions that can be explored during
+program analysis. Currently, that limit is set to 1 million.
+Which essentially means that the largest program can consist
+of 1 million NOP instructions. There is a limit to the maximum number
+of subsequent branches, a limit to the number of nested bpf-to-bpf
+calls, a limit to the number of the verifier states per instruction,
+a limit to the number of maps used by the program.
+All these limits can be hit with a sufficiently complex program.
+There are also non-numerical limits that can cause the program
+to be rejected. The verifier used to recognize only pointer + constant
+expressions. Now it can recognize pointer + bounded_register.
+bpf_lookup_map_elem(key) had a requirement that 'key' must be
+a pointer to the stack. Now, 'key' can be a pointer to map value.
+The verifier is steadily getting 'smarter'. The limits are
+being removed. The only way to know that the program is going to
+be accepted by the verifier is to try to load it.
+The bpf development process guarantees that the future kernel
+versions will accept all bpf programs that were accepted by
+the earlier versions.
+
Instruction level questions
---------------------------
@@ -109,16 +134,16 @@ For example why BPF_JNE and other compare and jumps are not cpu-like?
A: This was necessary to avoid introducing flags into ISA which are
impossible to make generic and efficient across CPU architectures.
-Q: why BPF_DIV instruction doesn't map to x64 div?
+Q: Why BPF_DIV instruction doesn't map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs div-by-zero runtime check.
-Q: why there is no BPF_SDIV for signed divide operation?
+Q: Why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. llvm errors in such case and
-prints a suggestion to use unsigned divide instead
+prints a suggestion to use unsigned divide instead.
Q: Why BPF has implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
new file mode 100644
index 000000000000..35d83e24dbdb
--- /dev/null
+++ b/Documentation/bpf/btf.rst
@@ -0,0 +1,907 @@
+=====================
+BPF Type Format (BTF)
+=====================
+
+1. Introduction
+***************
+
+BTF (BPF Type Format) is the metadata format which encodes the debug info
+related to BPF program/map. The name BTF was used initially to describe data
+types. The BTF was later extended to include function info for defined
+subroutines, and line info for source/line information.
+
+The debug info is used for map pretty print, function signature, etc. The
+function signature enables better bpf program/function kernel symbol. The line
+info helps generate source annotated translated byte code, jited code and
+verifier log.
+
+The BTF specification contains two parts,
+ * BTF kernel API
+ * BTF ELF file format
+
+The kernel API is the contract between user space and kernel. The kernel
+verifies the BTF info before using it. The ELF file format is a user space
+contract between ELF file and libbpf loader.
+
+The type and string sections are part of the BTF kernel API, describing the
+debug info (mostly types related) referenced by the bpf program. These two
+sections are discussed in details in :ref:`BTF_Type_String`.
+
+.. _BTF_Type_String:
+
+2. BTF Type and String Encoding
+*******************************
+
+The file ``include/uapi/linux/btf.h`` provides high-level definition of how
+types/strings are encoded.
+
+The beginning of data blob must be::
+
+ struct btf_header {
+ __u16 magic;
+ __u8 version;
+ __u8 flags;
+ __u32 hdr_len;
+
+ /* All offsets are in bytes relative to the end of this header */
+ __u32 type_off; /* offset of type section */
+ __u32 type_len; /* length of type section */
+ __u32 str_off; /* offset of string section */
+ __u32 str_len; /* length of string section */
+ };
+
+The magic is ``0xeB9F``, which has different encoding for big and little
+endian systems, and can be used to test whether BTF is generated for big- or
+little-endian target. The ``btf_header`` is designed to be extensible with
+``hdr_len`` equal to ``sizeof(struct btf_header)`` when a data blob is
+generated.
+
+2.1 String Encoding
+===================
+
+The first string in the string section must be a null string. The rest of
+string table is a concatenation of other null-terminated strings.
+
+2.2 Type Encoding
+=================
+
+The type id ``0`` is reserved for ``void`` type. The type section is parsed
+sequentially and type id is assigned to each recognized type starting from id
+``1``. Currently, the following types are supported::
+
+ #define BTF_KIND_INT 1 /* Integer */
+ #define BTF_KIND_PTR 2 /* Pointer */
+ #define BTF_KIND_ARRAY 3 /* Array */
+ #define BTF_KIND_STRUCT 4 /* Struct */
+ #define BTF_KIND_UNION 5 /* Union */
+ #define BTF_KIND_ENUM 6 /* Enumeration */
+ #define BTF_KIND_FWD 7 /* Forward */
+ #define BTF_KIND_TYPEDEF 8 /* Typedef */
+ #define BTF_KIND_VOLATILE 9 /* Volatile */
+ #define BTF_KIND_CONST 10 /* Const */
+ #define BTF_KIND_RESTRICT 11 /* Restrict */
+ #define BTF_KIND_FUNC 12 /* Function */
+ #define BTF_KIND_FUNC_PROTO 13 /* Function Proto */
+ #define BTF_KIND_VAR 14 /* Variable */
+ #define BTF_KIND_DATASEC 15 /* Section */
+
+Note that the type section encodes debug info, not just pure types.
+``BTF_KIND_FUNC`` is not a type, and it represents a defined subprogram.
+
+Each type contains the following common data::
+
+ struct btf_type {
+ __u32 name_off;
+ /* "info" bits arrangement
+ * bits 0-15: vlen (e.g. # of struct's members)
+ * bits 16-23: unused
+ * bits 24-27: kind (e.g. int, ptr, array...etc)
+ * bits 28-30: unused
+ * bit 31: kind_flag, currently used by
+ * struct, union and fwd
+ */
+ __u32 info;
+ /* "size" is used by INT, ENUM, STRUCT and UNION.
+ * "size" tells the size of the type it is describing.
+ *
+ * "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
+ * FUNC and FUNC_PROTO.
+ * "type" is a type_id referring to another type.
+ */
+ union {
+ __u32 size;
+ __u32 type;
+ };
+ };
+
+For certain kinds, the common data are followed by kind-specific data. The
+``name_off`` in ``struct btf_type`` specifies the offset in the string table.
+The following sections detail encoding of each kind.
+
+2.2.1 BTF_KIND_INT
+~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: any valid offset
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_INT
+ * ``info.vlen``: 0
+ * ``size``: the size of the int type in bytes.
+
+``btf_type`` is followed by a ``u32`` with the following bits arrangement::
+
+ #define BTF_INT_ENCODING(VAL) (((VAL) & 0x0f000000) >> 24)
+ #define BTF_INT_OFFSET(VAL) (((VAL) & 0x00ff0000) >> 16)
+ #define BTF_INT_BITS(VAL) ((VAL) & 0x000000ff)
+
+The ``BTF_INT_ENCODING`` has the following attributes::
+
+ #define BTF_INT_SIGNED (1 << 0)
+ #define BTF_INT_CHAR (1 << 1)
+ #define BTF_INT_BOOL (1 << 2)
+
+The ``BTF_INT_ENCODING()`` provides extra information: signedness, char, or
+bool, for the int type. The char and bool encoding are mostly useful for
+pretty print. At most one encoding can be specified for the int type.
+
+The ``BTF_INT_BITS()`` specifies the number of actual bits held by this int
+type. For example, a 4-bit bitfield encodes ``BTF_INT_BITS()`` equals to 4.
+The ``btf_type.size * 8`` must be equal to or greater than ``BTF_INT_BITS()``
+for the type. The maximum value of ``BTF_INT_BITS()`` is 128.
+
+The ``BTF_INT_OFFSET()`` specifies the starting bit offset to calculate values
+for this int. For example, a bitfield struct member has:
+ * btf member bit offset 100 from the start of the structure,
+ * btf member pointing to an int type,
+ * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
+
+Then in the struct memory layout, this member will occupy ``4`` bits starting
+from bits ``100 + 2 = 102``.
+
+Alternatively, the bitfield struct member can be the following to access the
+same bits as the above:
+ * btf member bit offset 102,
+ * btf member pointing to an int type,
+ * the int type has ``BTF_INT_OFFSET() = 0`` and ``BTF_INT_BITS() = 4``
+
+The original intention of ``BTF_INT_OFFSET()`` is to provide flexibility of
+bitfield encoding. Currently, both llvm and pahole generate
+``BTF_INT_OFFSET() = 0`` for all int types.
+
+2.2.2 BTF_KIND_PTR
+~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_PTR
+ * ``info.vlen``: 0
+ * ``type``: the pointee type of the pointer
+
+No additional type data follow ``btf_type``.
+
+2.2.3 BTF_KIND_ARRAY
+~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_ARRAY
+ * ``info.vlen``: 0
+ * ``size/type``: 0, not used
+
+``btf_type`` is followed by one ``struct btf_array``::
+
+ struct btf_array {
+ __u32 type;
+ __u32 index_type;
+ __u32 nelems;
+ };
+
+The ``struct btf_array`` encoding:
+ * ``type``: the element type
+ * ``index_type``: the index type
+ * ``nelems``: the number of elements for this array (``0`` is also allowed).
+
+The ``index_type`` can be any regular int type (``u8``, ``u16``, ``u32``,
+``u64``, ``unsigned __int128``). The original design of including
+``index_type`` follows DWARF, which has an ``index_type`` for its array type.
+Currently in BTF, beyond type verification, the ``index_type`` is not used.
+
+The ``struct btf_array`` allows chaining through element type to represent
+multidimensional arrays. For example, for ``int a[5][6]``, the following type
+information illustrates the chaining:
+
+ * [1]: int
+ * [2]: array, ``btf_array.type = [1]``, ``btf_array.nelems = 6``
+ * [3]: array, ``btf_array.type = [2]``, ``btf_array.nelems = 5``
+
+Currently, both pahole and llvm collapse multidimensional array into
+one-dimensional array, e.g., for ``a[5][6]``, the ``btf_array.nelems`` is
+equal to ``30``. This is because the original use case is map pretty print
+where the whole array is dumped out so one-dimensional array is enough. As
+more BTF usage is explored, pahole and llvm can be changed to generate proper
+chained representation for multidimensional arrays.
+
+2.2.4 BTF_KIND_STRUCT
+~~~~~~~~~~~~~~~~~~~~~
+2.2.5 BTF_KIND_UNION
+~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0 or offset to a valid C identifier
+ * ``info.kind_flag``: 0 or 1
+ * ``info.kind``: BTF_KIND_STRUCT or BTF_KIND_UNION
+ * ``info.vlen``: the number of struct/union members
+ * ``info.size``: the size of the struct/union in bytes
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_member``.::
+
+ struct btf_member {
+ __u32 name_off;
+ __u32 type;
+ __u32 offset;
+ };
+
+``struct btf_member`` encoding:
+ * ``name_off``: offset to a valid C identifier
+ * ``type``: the member type
+ * ``offset``: <see below>
+
+If the type info ``kind_flag`` is not set, the offset contains only bit offset
+of the member. Note that the base type of the bitfield can only be int or enum
+type. If the bitfield size is 32, the base type can be either int or enum
+type. If the bitfield size is not 32, the base type must be int, and int type
+``BTF_INT_BITS()`` encodes the bitfield size.
+
+If the ``kind_flag`` is set, the ``btf_member.offset`` contains both member
+bitfield size and bit offset. The bitfield size and bit offset are calculated
+as below.::
+
+ #define BTF_MEMBER_BITFIELD_SIZE(val) ((val) >> 24)
+ #define BTF_MEMBER_BIT_OFFSET(val) ((val) & 0xffffff)
+
+In this case, if the base type is an int type, it must be a regular int type:
+
+ * ``BTF_INT_OFFSET()`` must be 0.
+ * ``BTF_INT_BITS()`` must be equal to ``{1,2,4,8,16} * 8``.
+
+The following kernel patch introduced ``kind_flag`` and explained why both
+modes exist:
+
+ https://github.com/torvalds/linux/commit/9d5f9f701b1891466fb3dbb1806ad97716f95cc3#diff-fa650a64fdd3968396883d2fe8215ff3
+
+2.2.6 BTF_KIND_ENUM
+~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0 or offset to a valid C identifier
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_ENUM
+ * ``info.vlen``: number of enum values
+ * ``size``: 4
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_enum``.::
+
+ struct btf_enum {
+ __u32 name_off;
+ __s32 val;
+ };
+
+The ``btf_enum`` encoding:
+ * ``name_off``: offset to a valid C identifier
+ * ``val``: any value
+
+2.2.7 BTF_KIND_FWD
+~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a valid C identifier
+ * ``info.kind_flag``: 0 for struct, 1 for union
+ * ``info.kind``: BTF_KIND_FWD
+ * ``info.vlen``: 0
+ * ``type``: 0
+
+No additional type data follow ``btf_type``.
+
+2.2.8 BTF_KIND_TYPEDEF
+~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a valid C identifier
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_TYPEDEF
+ * ``info.vlen``: 0
+ * ``type``: the type which can be referred by name at ``name_off``
+
+No additional type data follow ``btf_type``.
+
+2.2.9 BTF_KIND_VOLATILE
+~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_VOLATILE
+ * ``info.vlen``: 0
+ * ``type``: the type with ``volatile`` qualifier
+
+No additional type data follow ``btf_type``.
+
+2.2.10 BTF_KIND_CONST
+~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_CONST
+ * ``info.vlen``: 0
+ * ``type``: the type with ``const`` qualifier
+
+No additional type data follow ``btf_type``.
+
+2.2.11 BTF_KIND_RESTRICT
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_RESTRICT
+ * ``info.vlen``: 0
+ * ``type``: the type with ``restrict`` qualifier
+
+No additional type data follow ``btf_type``.
+
+2.2.12 BTF_KIND_FUNC
+~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a valid C identifier
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_FUNC
+ * ``info.vlen``: 0
+ * ``type``: a BTF_KIND_FUNC_PROTO type
+
+No additional type data follow ``btf_type``.
+
+A BTF_KIND_FUNC defines not a type, but a subprogram (function) whose
+signature is defined by ``type``. The subprogram is thus an instance of that
+type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the
+:ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load`
+(ABI).
+
+2.2.13 BTF_KIND_FUNC_PROTO
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_FUNC_PROTO
+ * ``info.vlen``: # of parameters
+ * ``type``: the return type
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_param``.::
+
+ struct btf_param {
+ __u32 name_off;
+ __u32 type;
+ };
+
+If a BTF_KIND_FUNC_PROTO type is referred by a BTF_KIND_FUNC type, then
+``btf_param.name_off`` must point to a valid C identifier except for the
+possible last argument representing the variable argument. The btf_param.type
+refers to parameter type.
+
+If the function has variable arguments, the last parameter is encoded with
+``name_off = 0`` and ``type = 0``.
+
+2.2.14 BTF_KIND_VAR
+~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a valid C identifier
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_VAR
+ * ``info.vlen``: 0
+ * ``type``: the type of the variable
+
+``btf_type`` is followed by a single ``struct btf_variable`` with the
+following data::
+
+ struct btf_var {
+ __u32 linkage;
+ };
+
+``struct btf_var`` encoding:
+ * ``linkage``: currently only static variable 0, or globally allocated
+ variable in ELF sections 1
+
+Not all type of global variables are supported by LLVM at this point.
+The following is currently available:
+
+ * static variables with or without section attributes
+ * global variables with section attributes
+
+The latter is for future extraction of map key/value type id's from a
+map definition.
+
+2.2.15 BTF_KIND_DATASEC
+~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a valid name associated with a variable or
+ one of .data/.bss/.rodata
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_DATASEC
+ * ``info.vlen``: # of variables
+ * ``size``: total section size in bytes (0 at compilation time, patched
+ to actual size by BPF loaders such as libbpf)
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_var_secinfo``.::
+
+ struct btf_var_secinfo {
+ __u32 type;
+ __u32 offset;
+ __u32 size;
+ };
+
+``struct btf_var_secinfo`` encoding:
+ * ``type``: the type of the BTF_KIND_VAR variable
+ * ``offset``: the in-section offset of the variable
+ * ``size``: the size of the variable in bytes
+
+3. BTF Kernel API
+*****************
+
+The following bpf syscall command involves BTF:
+ * BPF_BTF_LOAD: load a blob of BTF data into kernel
+ * BPF_MAP_CREATE: map creation with btf key and value type info.
+ * BPF_PROG_LOAD: prog load with btf function and line info.
+ * BPF_BTF_GET_FD_BY_ID: get a btf fd
+ * BPF_OBJ_GET_INFO_BY_FD: btf, func_info, line_info
+ and other btf related info are returned.
+
+The workflow typically looks like:
+::
+
+ Application:
+ BPF_BTF_LOAD
+ |
+ v
+ BPF_MAP_CREATE and BPF_PROG_LOAD
+ |
+ V
+ ......
+
+ Introspection tool:
+ ......
+ BPF_{PROG,MAP}_GET_NEXT_ID (get prog/map id's)
+ |
+ V
+ BPF_{PROG,MAP}_GET_FD_BY_ID (get a prog/map fd)
+ |
+ V
+ BPF_OBJ_GET_INFO_BY_FD (get bpf_prog_info/bpf_map_info with btf_id)
+ | |
+ V |
+ BPF_BTF_GET_FD_BY_ID (get btf_fd) |
+ | |
+ V |
+ BPF_OBJ_GET_INFO_BY_FD (get btf) |
+ | |
+ V V
+ pretty print types, dump func signatures and line info, etc.
+
+
+3.1 BPF_BTF_LOAD
+================
+
+Load a blob of BTF data into kernel. A blob of data, described in
+:ref:`BTF_Type_String`, can be directly loaded into the kernel. A ``btf_fd``
+is returned to a userspace.
+
+3.2 BPF_MAP_CREATE
+==================
+
+A map can be created with ``btf_fd`` and specified key/value type id.::
+
+ __u32 btf_fd; /* fd pointing to a BTF type data */
+ __u32 btf_key_type_id; /* BTF type_id of the key */
+ __u32 btf_value_type_id; /* BTF type_id of the value */
+
+In libbpf, the map can be defined with extra annotation like below:
+::
+
+ struct bpf_map_def SEC("maps") btf_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(struct ipv_counts),
+ .max_entries = 4,
+ };
+ BPF_ANNOTATE_KV_PAIR(btf_map, int, struct ipv_counts);
+
+Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are map name, key and
+value types for the map. During ELF parsing, libbpf is able to extract
+key/value type_id's and assign them to BPF_MAP_CREATE attributes
+automatically.
+
+.. _BPF_Prog_Load:
+
+3.3 BPF_PROG_LOAD
+=================
+
+During prog_load, func_info and line_info can be passed to kernel with proper
+values for the following attributes:
+::
+
+ __u32 insn_cnt;
+ __aligned_u64 insns;
+ ......
+ __u32 prog_btf_fd; /* fd pointing to BTF type data */
+ __u32 func_info_rec_size; /* userspace bpf_func_info size */
+ __aligned_u64 func_info; /* func info */
+ __u32 func_info_cnt; /* number of bpf_func_info records */
+ __u32 line_info_rec_size; /* userspace bpf_line_info size */
+ __aligned_u64 line_info; /* line info */
+ __u32 line_info_cnt; /* number of bpf_line_info records */
+
+The func_info and line_info are an array of below, respectively.::
+
+ struct bpf_func_info {
+ __u32 insn_off; /* [0, insn_cnt - 1] */
+ __u32 type_id; /* pointing to a BTF_KIND_FUNC type */
+ };
+ struct bpf_line_info {
+ __u32 insn_off; /* [0, insn_cnt - 1] */
+ __u32 file_name_off; /* offset to string table for the filename */
+ __u32 line_off; /* offset to string table for the source line */
+ __u32 line_col; /* line number and column number */
+ };
+
+func_info_rec_size is the size of each func_info record, and
+line_info_rec_size is the size of each line_info record. Passing the record
+size to kernel make it possible to extend the record itself in the future.
+
+Below are requirements for func_info:
+ * func_info[0].insn_off must be 0.
+ * the func_info insn_off is in strictly increasing order and matches
+ bpf func boundaries.
+
+Below are requirements for line_info:
+ * the first insn in each func must have a line_info record pointing to it.
+ * the line_info insn_off is in strictly increasing order.
+
+For line_info, the line number and column number are defined as below:
+::
+
+ #define BPF_LINE_INFO_LINE_NUM(line_col) ((line_col) >> 10)
+ #define BPF_LINE_INFO_LINE_COL(line_col) ((line_col) & 0x3ff)
+
+3.4 BPF_{PROG,MAP}_GET_NEXT_ID
+==============================
+
+In kernel, every loaded program, map or btf has a unique id. The id won't
+change during the lifetime of a program, map, or btf.
+
+The bpf syscall command BPF_{PROG,MAP}_GET_NEXT_ID returns all id's, one for
+each command, to user space, for bpf program or maps, respectively, so an
+inspection tool can inspect all programs and maps.
+
+3.5 BPF_{PROG,MAP}_GET_FD_BY_ID
+===============================
+
+An introspection tool cannot use id to get details about program or maps.
+A file descriptor needs to be obtained first for reference-counting purpose.
+
+3.6 BPF_OBJ_GET_INFO_BY_FD
+==========================
+
+Once a program/map fd is acquired, an introspection tool can get the detailed
+information from kernel about this fd, some of which are BTF-related. For
+example, ``bpf_map_info`` returns ``btf_id`` and key/value type ids.
+``bpf_prog_info`` returns ``btf_id``, func_info, and line info for translated
+bpf byte codes, and jited_line_info.
+
+3.7 BPF_BTF_GET_FD_BY_ID
+========================
+
+With ``btf_id`` obtained in ``bpf_map_info`` and ``bpf_prog_info``, bpf
+syscall command BPF_BTF_GET_FD_BY_ID can retrieve a btf fd. Then, with
+command BPF_OBJ_GET_INFO_BY_FD, the btf blob, originally loaded into the
+kernel with BPF_BTF_LOAD, can be retrieved.
+
+With the btf blob, ``bpf_map_info``, and ``bpf_prog_info``, an introspection
+tool has full btf knowledge and is able to pretty print map key/values, dump
+func signatures and line info, along with byte/jit codes.
+
+4. ELF File Format Interface
+****************************
+
+4.1 .BTF section
+================
+
+The .BTF section contains type and string data. The format of this section is
+same as the one describe in :ref:`BTF_Type_String`.
+
+.. _BTF_Ext_Section:
+
+4.2 .BTF.ext section
+====================
+
+The .BTF.ext section encodes func_info and line_info which needs loader
+manipulation before loading into the kernel.
+
+The specification for .BTF.ext section is defined at ``tools/lib/bpf/btf.h``
+and ``tools/lib/bpf/btf.c``.
+
+The current header of .BTF.ext section::
+
+ struct btf_ext_header {
+ __u16 magic;
+ __u8 version;
+ __u8 flags;
+ __u32 hdr_len;
+
+ /* All offsets are in bytes relative to the end of this header */
+ __u32 func_info_off;
+ __u32 func_info_len;
+ __u32 line_info_off;
+ __u32 line_info_len;
+ };
+
+It is very similar to .BTF section. Instead of type/string section, it
+contains func_info and line_info section. See :ref:`BPF_Prog_Load` for details
+about func_info and line_info record format.
+
+The func_info is organized as below.::
+
+ func_info_rec_size
+ btf_ext_info_sec for section #1 /* func_info for section #1 */
+ btf_ext_info_sec for section #2 /* func_info for section #2 */
+ ...
+
+``func_info_rec_size`` specifies the size of ``bpf_func_info`` structure when
+.BTF.ext is generated. ``btf_ext_info_sec``, defined below, is a collection of
+func_info for each specific ELF section.::
+
+ struct btf_ext_info_sec {
+ __u32 sec_name_off; /* offset to section name */
+ __u32 num_info;
+ /* Followed by num_info * record_size number of bytes */
+ __u8 data[0];
+ };
+
+Here, num_info must be greater than 0.
+
+The line_info is organized as below.::
+
+ line_info_rec_size
+ btf_ext_info_sec for section #1 /* line_info for section #1 */
+ btf_ext_info_sec for section #2 /* line_info for section #2 */
+ ...
+
+``line_info_rec_size`` specifies the size of ``bpf_line_info`` structure when
+.BTF.ext is generated.
+
+The interpretation of ``bpf_func_info->insn_off`` and
+``bpf_line_info->insn_off`` is different between kernel API and ELF API. For
+kernel API, the ``insn_off`` is the instruction offset in the unit of ``struct
+bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the
+beginning of section (``btf_ext_info_sec->sec_name_off``).
+
+5. Using BTF
+************
+
+5.1 bpftool map pretty print
+============================
+
+With BTF, the map key/value can be printed based on fields rather than simply
+raw bytes. This is especially valuable for large structure or if your data
+structure has bitfields. For example, for the following map,::
+
+ enum A { A1, A2, A3, A4, A5 };
+ typedef enum A ___A;
+ struct tmp_t {
+ char a1:4;
+ int a2:4;
+ int :4;
+ __u32 a3:4;
+ int b;
+ ___A b1:4;
+ enum A b2:4;
+ };
+ struct bpf_map_def SEC("maps") tmpmap = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(__u32),
+ .value_size = sizeof(struct tmp_t),
+ .max_entries = 1,
+ };
+ BPF_ANNOTATE_KV_PAIR(tmpmap, int, struct tmp_t);
+
+bpftool is able to pretty print like below:
+::
+
+ [{
+ "key": 0,
+ "value": {
+ "a1": 0x2,
+ "a2": 0x4,
+ "a3": 0x6,
+ "b": 7,
+ "b1": 0x8,
+ "b2": 0xa
+ }
+ }
+ ]
+
+5.2 bpftool prog dump
+=====================
+
+The following is an example showing how func_info and line_info can help prog
+dump with better kernel symbol names, function prototypes and line
+information.::
+
+ $ bpftool prog dump jited pinned /sys/fs/bpf/test_btf_haskv
+ [...]
+ int test_long_fname_2(struct dummy_tracepoint_args * arg):
+ bpf_prog_44a040bf25481309_test_long_fname_2:
+ ; static int test_long_fname_2(struct dummy_tracepoint_args *arg)
+ 0: push %rbp
+ 1: mov %rsp,%rbp
+ 4: sub $0x30,%rsp
+ b: sub $0x28,%rbp
+ f: mov %rbx,0x0(%rbp)
+ 13: mov %r13,0x8(%rbp)
+ 17: mov %r14,0x10(%rbp)
+ 1b: mov %r15,0x18(%rbp)
+ 1f: xor %eax,%eax
+ 21: mov %rax,0x20(%rbp)
+ 25: xor %esi,%esi
+ ; int key = 0;
+ 27: mov %esi,-0x4(%rbp)
+ ; if (!arg->sock)
+ 2a: mov 0x8(%rdi),%rdi
+ ; if (!arg->sock)
+ 2e: cmp $0x0,%rdi
+ 32: je 0x0000000000000070
+ 34: mov %rbp,%rsi
+ ; counts = bpf_map_lookup_elem(&btf_map, &key);
+ [...]
+
+5.3 Verifier Log
+================
+
+The following is an example of how line_info can help debugging verification
+failure.::
+
+ /* The code at tools/testing/selftests/bpf/test_xdp_noinline.c
+ * is modified as below.
+ */
+ data = (void *)(long)xdp->data;
+ data_end = (void *)(long)xdp->data_end;
+ /*
+ if (data + 4 > data_end)
+ return XDP_DROP;
+ */
+ *(u32 *)data = dst->dst;
+
+ $ bpftool prog load ./test_xdp_noinline.o /sys/fs/bpf/test_xdp_noinline type xdp
+ ; data = (void *)(long)xdp->data;
+ 224: (79) r2 = *(u64 *)(r10 -112)
+ 225: (61) r2 = *(u32 *)(r2 +0)
+ ; *(u32 *)data = dst->dst;
+ 226: (63) *(u32 *)(r2 +0) = r1
+ invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)
+ R2 offset is outside of the packet
+
+6. BTF Generation
+*****************
+
+You need latest pahole
+
+ https://git.kernel.org/pub/scm/devel/pahole/pahole.git/
+
+or llvm (8.0 or later). The pahole acts as a dwarf2btf converter. It doesn't
+support .BTF.ext and btf BTF_KIND_FUNC type yet. For example,::
+
+ -bash-4.4$ cat t.c
+ struct t {
+ int a:2;
+ int b:3;
+ int c:2;
+ } g;
+ -bash-4.4$ gcc -c -O2 -g t.c
+ -bash-4.4$ pahole -JV t.o
+ File t.o:
+ [1] STRUCT t kind_flag=1 size=4 vlen=3
+ a type_id=2 bitfield_size=2 bits_offset=0
+ b type_id=2 bitfield_size=3 bits_offset=2
+ c type_id=2 bitfield_size=2 bits_offset=5
+ [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
+
+The llvm is able to generate .BTF and .BTF.ext directly with -g for bpf target
+only. The assembly code (-S) is able to show the BTF encoding in assembly
+format.::
+
+ -bash-4.4$ cat t2.c
+ typedef int __int32;
+ struct t2 {
+ int a2;
+ int (*f2)(char q1, __int32 q2, ...);
+ int (*f3)();
+ } g2;
+ int main() { return 0; }
+ int test() { return 0; }
+ -bash-4.4$ clang -c -g -O2 -target bpf t2.c
+ -bash-4.4$ readelf -S t2.o
+ ......
+ [ 8] .BTF PROGBITS 0000000000000000 00000247
+ 000000000000016e 0000000000000000 0 0 1
+ [ 9] .BTF.ext PROGBITS 0000000000000000 000003b5
+ 0000000000000060 0000000000000000 0 0 1
+ [10] .rel.BTF.ext REL 0000000000000000 000007e0
+ 0000000000000040 0000000000000010 16 9 8
+ ......
+ -bash-4.4$ clang -S -g -O2 -target bpf t2.c
+ -bash-4.4$ cat t2.s
+ ......
+ .section .BTF,"",@progbits
+ .short 60319 # 0xeb9f
+ .byte 1
+ .byte 0
+ .long 24
+ .long 0
+ .long 220
+ .long 220
+ .long 122
+ .long 0 # BTF_KIND_FUNC_PROTO(id = 1)
+ .long 218103808 # 0xd000000
+ .long 2
+ .long 83 # BTF_KIND_INT(id = 2)
+ .long 16777216 # 0x1000000
+ .long 4
+ .long 16777248 # 0x1000020
+ ......
+ .byte 0 # string offset=0
+ .ascii ".text" # string offset=1
+ .byte 0
+ .ascii "/home/yhs/tmp-pahole/t2.c" # string offset=7
+ .byte 0
+ .ascii "int main() { return 0; }" # string offset=33
+ .byte 0
+ .ascii "int test() { return 0; }" # string offset=58
+ .byte 0
+ .ascii "int" # string offset=83
+ ......
+ .section .BTF.ext,"",@progbits
+ .short 60319 # 0xeb9f
+ .byte 1
+ .byte 0
+ .long 24
+ .long 0
+ .long 28
+ .long 28
+ .long 44
+ .long 8 # FuncInfo
+ .long 1 # FuncInfo section string offset=1
+ .long 2
+ .long .Lfunc_begin0
+ .long 3
+ .long .Lfunc_begin1
+ .long 5
+ .long 16 # LineInfo
+ .long 1 # LineInfo section string offset=1
+ .long 2
+ .long .Ltmp0
+ .long 7
+ .long 33
+ .long 7182 # Line 7 Col 14
+ .long .Ltmp3
+ .long 7
+ .long 58
+ .long 8206 # Line 8 Col 14
+
+7. Testing
+**********
+
+Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests.
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 00a8450a602f..d3fe4cac0c90 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -15,6 +15,13 @@ that goes into great technical depth about the BPF Architecture.
The primary info for the bpf syscall is available in the `man-pages`_
for `bpf(2)`_.
+BPF Type Format (BTF)
+=====================
+
+.. toctree::
+ :maxdepth: 1
+
+ btf
Frequently asked questions (FAQ)
@@ -29,6 +36,16 @@ Two sets of Questions and Answers (Q&A) are maintained.
bpf_devel_QA
+Program types
+=============
+
+.. toctree::
+ :maxdepth: 1
+
+ prog_cgroup_sysctl
+ prog_flow_dissector
+
+
.. Links:
.. _Documentation/networking/filter.txt: ../networking/filter.txt
.. _man-pages: https://www.kernel.org/doc/man-pages/
diff --git a/Documentation/bpf/prog_cgroup_sysctl.rst b/Documentation/bpf/prog_cgroup_sysctl.rst
new file mode 100644
index 000000000000..677d6c637cf3
--- /dev/null
+++ b/Documentation/bpf/prog_cgroup_sysctl.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+
+===========================
+BPF_PROG_TYPE_CGROUP_SYSCTL
+===========================
+
+This document describes ``BPF_PROG_TYPE_CGROUP_SYSCTL`` program type that
+provides cgroup-bpf hook for sysctl.
+
+The hook has to be attached to a cgroup and will be called every time a
+process inside that cgroup tries to read from or write to sysctl knob in proc.
+
+1. Attach type
+**************
+
+``BPF_CGROUP_SYSCTL`` attach type has to be used to attach
+``BPF_PROG_TYPE_CGROUP_SYSCTL`` program to a cgroup.
+
+2. Context
+**********
+
+``BPF_PROG_TYPE_CGROUP_SYSCTL`` provides access to the following context from
+BPF program::
+
+ struct bpf_sysctl {
+ __u32 write;
+ __u32 file_pos;
+ };
+
+* ``write`` indicates whether sysctl value is being read (``0``) or written
+ (``1``). This field is read-only.
+
+* ``file_pos`` indicates file position sysctl is being accessed at, read
+ or written. This field is read-write. Writing to the field sets the starting
+ position in sysctl proc file ``read(2)`` will be reading from or ``write(2)``
+ will be writing to. Writing zero to the field can be used e.g. to override
+ whole sysctl value by ``bpf_sysctl_set_new_value()`` on ``write(2)`` even
+ when it's called by user space on ``file_pos > 0``. Writing non-zero
+ value to the field can be used to access part of sysctl value starting from
+ specified ``file_pos``. Not all sysctl support access with ``file_pos !=
+ 0``, e.g. writes to numeric sysctl entries must always be at file position
+ ``0``. See also ``kernel.sysctl_writes_strict`` sysctl.
+
+See `linux/bpf.h`_ for more details on how context field can be accessed.
+
+3. Return code
+**************
+
+``BPF_PROG_TYPE_CGROUP_SYSCTL`` program must return one of the following
+return codes:
+
+* ``0`` means "reject access to sysctl";
+* ``1`` means "proceed with access".
+
+If program returns ``0`` user space will get ``-1`` from ``read(2)`` or
+``write(2)`` and ``errno`` will be set to ``EPERM``.
+
+4. Helpers
+**********
+
+Since sysctl knob is represented by a name and a value, sysctl specific BPF
+helpers focus on providing access to these properties:
+
+* ``bpf_sysctl_get_name()`` to get sysctl name as it is visible in
+ ``/proc/sys`` into provided by BPF program buffer;
+
+* ``bpf_sysctl_get_current_value()`` to get string value currently held by
+ sysctl into provided by BPF program buffer. This helper is available on both
+ ``read(2)`` from and ``write(2)`` to sysctl;
+
+* ``bpf_sysctl_get_new_value()`` to get new string value currently being
+ written to sysctl before actual write happens. This helper can be used only
+ on ``ctx->write == 1``;
+
+* ``bpf_sysctl_set_new_value()`` to override new string value currently being
+ written to sysctl before actual write happens. Sysctl value will be
+ overridden starting from the current ``ctx->file_pos``. If the whole value
+ has to be overridden BPF program can set ``file_pos`` to zero before calling
+ to the helper. This helper can be used only on ``ctx->write == 1``. New
+ string value set by the helper is treated and verified by kernel same way as
+ an equivalent string passed by user space.
+
+BPF program sees sysctl value same way as user space does in proc filesystem,
+i.e. as a string. Since many sysctl values represent an integer or a vector
+of integers, the following helpers can be used to get numeric value from the
+string:
+
+* ``bpf_strtol()`` to convert initial part of the string to long integer
+ similar to user space `strtol(3)`_;
+* ``bpf_strtoul()`` to convert initial part of the string to unsigned long
+ integer similar to user space `strtoul(3)`_;
+
+See `linux/bpf.h`_ for more details on helpers described here.
+
+5. Examples
+***********
+
+See `test_sysctl_prog.c`_ for an example of BPF program in C that access
+sysctl name and value, parses string value to get vector of integers and uses
+the result to make decision whether to allow or deny access to sysctl.
+
+6. Notes
+********
+
+``BPF_PROG_TYPE_CGROUP_SYSCTL`` is intended to be used in **trusted** root
+environment, for example to monitor sysctl usage or catch unreasonable values
+an application, running as root in a separate cgroup, is trying to set.
+
+Since `task_dfl_cgroup(current)` is called at `sys_read` / `sys_write` time it
+may return results different from that at `sys_open` time, i.e. process that
+opened sysctl file in proc filesystem may differ from process that is trying
+to read from / write to it and two such processes may run in different
+cgroups, what means ``BPF_PROG_TYPE_CGROUP_SYSCTL`` should not be used as a
+security mechanism to limit sysctl usage.
+
+As with any cgroup-bpf program additional care should be taken if an
+application running as root in a cgroup should not be allowed to
+detach/replace BPF program attached by administrator.
+
+.. Links
+.. _linux/bpf.h: ../../include/uapi/linux/bpf.h
+.. _strtol(3): http://man7.org/linux/man-pages/man3/strtol.3p.html
+.. _strtoul(3): http://man7.org/linux/man-pages/man3/strtoul.3p.html
+.. _test_sysctl_prog.c:
+ ../../tools/testing/selftests/bpf/progs/test_sysctl_prog.c
diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
new file mode 100644
index 000000000000..ed343abe541e
--- /dev/null
+++ b/Documentation/bpf/prog_flow_dissector.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+BPF_PROG_TYPE_FLOW_DISSECTOR
+============================
+
+Overview
+========
+
+Flow dissector is a routine that parses metadata out of the packets. It's
+used in the various places in the networking subsystem (RFS, flow hash, etc).
+
+BPF flow dissector is an attempt to reimplement C-based flow dissector logic
+in BPF to gain all the benefits of BPF verifier (namely, limits on the
+number of instructions and tail calls).
+
+API
+===
+
+BPF flow dissector programs operate on an ``__sk_buff``. However, only the
+limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``.
+``flow_keys`` is ``struct bpf_flow_keys`` and contains flow dissector input
+and output arguments.
+
+The inputs are:
+ * ``nhoff`` - initial offset of the networking header
+ * ``thoff`` - initial offset of the transport header, initialized to nhoff
+ * ``n_proto`` - L3 protocol type, parsed out of L2 header
+
+Flow dissector BPF program should fill out the rest of the ``struct
+bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
+also adjusted accordingly.
+
+The return code of the BPF program is either BPF_OK to indicate successful
+dissection, or BPF_DROP to indicate parsing error.
+
+__sk_buff->data
+===============
+
+In the VLAN-less case, this is what the initial state of the BPF flow
+dissector looks like::
+
+ +------+------+------------+-----------+
+ | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
+ +------+------+------------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In case of VLAN, flow dissector can be called with the two different states.
+
+Pre-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point the to first byte of TCI
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = TPID
+
+Please note that TPID can be 802.1AD and, hence, BPF program would
+have to parse VLAN information twice for double tagged packets.
+
+
+Post-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point the to first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In this case VLAN information has been processed before the flow dissector
+and BPF flow dissector is not required to handle it.
+
+
+The takeaway here is as follows: BPF flow dissector program can be called with
+the optional VLAN header and should gracefully handle both cases: when single
+or double VLAN is present and when it is not present. The same program
+can be called for both cases and would have to be written carefully to
+handle both cases.
+
+
+Reference Implementation
+========================
+
+See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
+implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
+for the loader. bpftool can be used to load BPF flow dissector program as well.
+
+The reference implementation is organized as follows:
+ * ``jmp_table`` map that contains sub-programs for each supported L3 protocol
+ * ``_dissect`` routine - entry point; it does input ``n_proto`` parsing and
+ does ``bpf_tail_call`` to the appropriate L3 handler
+
+Since BPF at this point doesn't support looping (or any jumping back),
+jmp_table is used instead to handle multiple levels of encapsulation (and
+IPv6 options).
+
+
+Current Limitations
+===================
+BPF flow dissector doesn't support exporting all the metadata that in-kernel
+C-based implementation can export. Notable example is single VLAN (802.1Q)
+and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys``
+for a set of information that's currently can be exported from the BPF context.