aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking/devlink
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking/devlink')
-rw-r--r--Documentation/networking/devlink/am65-nuss-cpsw-switch.rst26
-rw-r--r--Documentation/networking/devlink/bnxt.rst2
-rw-r--r--Documentation/networking/devlink/devlink-dpipe.rst2
-rw-r--r--Documentation/networking/devlink/devlink-flash.rst28
-rw-r--r--Documentation/networking/devlink/devlink-health.rst40
-rw-r--r--Documentation/networking/devlink/devlink-info.rst17
-rw-r--r--Documentation/networking/devlink/devlink-linecard.rst122
-rw-r--r--Documentation/networking/devlink/devlink-params.rst33
-rw-r--r--Documentation/networking/devlink/devlink-port.rst443
-rw-r--r--Documentation/networking/devlink/devlink-region.rst30
-rw-r--r--Documentation/networking/devlink/devlink-reload.rst90
-rw-r--r--Documentation/networking/devlink/devlink-resource.rst14
-rw-r--r--Documentation/networking/devlink/devlink-selftests.rst38
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst324
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/hns3.rst25
-rw-r--r--Documentation/networking/devlink/i40e.rst59
-rw-r--r--Documentation/networking/devlink/ice.rst328
-rw-r--r--Documentation/networking/devlink/index.rst54
-rw-r--r--Documentation/networking/devlink/iosm.rst162
-rw-r--r--Documentation/networking/devlink/mlx5.rst223
-rw-r--r--Documentation/networking/devlink/mlxsw.rst24
-rw-r--r--Documentation/networking/devlink/netdevsim.rst31
-rw-r--r--Documentation/networking/devlink/octeontx2.rst42
-rw-r--r--Documentation/networking/devlink/prestera.rst141
-rw-r--r--Documentation/networking/devlink/sfc.rst57
26 files changed, 2351 insertions, 40 deletions
diff --git a/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
new file mode 100644
index 000000000000..1e589c26abff
--- /dev/null
+++ b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+am65-cpsw-nuss devlink support
+==============================
+
+This document describes the devlink features implemented by the ``am65-cpsw-nuss``
+device driver.
+
+Parameters
+==========
+
+The ``am65-cpsw-nuss`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
index 3dfd84ccb1c7..a4fb27663cd6 100644
--- a/Documentation/networking/devlink/bnxt.rst
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -22,6 +22,8 @@ Parameters
- Permanent
* - ``msix_vec_per_pf_min``
- Permanent
+ * - ``enable_remote_dev_reset``
+ - Runtime
The ``bnxt`` driver also implements the following driver-specific
parameters.
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
index 468fe1001b74..af37f250df43 100644
--- a/Documentation/networking/devlink/devlink-dpipe.rst
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -52,7 +52,7 @@ purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.
-For example, it’s quiet common to implement Access Control Lists (ACL)
+For example, it’s quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand hardware
diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst
index 40a87c0222cb..603e732f00cc 100644
--- a/Documentation/networking/devlink/devlink-flash.rst
+++ b/Documentation/networking/devlink/devlink-flash.rst
@@ -16,6 +16,34 @@ Note that the file name is a path relative to the firmware loading path
(usually ``/lib/firmware/``). Drivers may send status updates to inform
user space about the progress of the update operation.
+Overwrite Mask
+==============
+
+The ``devlink-flash`` command allows optionally specifying a mask indicating
+how the device should handle subsections of flash components when updating.
+This mask indicates the set of sections which are allowed to be overwritten.
+
+.. list-table:: List of overwrite mask bits
+ :widths: 5 95
+
+ * - Name
+ - Description
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Indicates that the device should overwrite settings in the components
+ being updated with the settings found in the provided image.
+ * - ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Indicates that the device should overwrite identifiers in the
+ components being updated with the identifiers found in the provided
+ image. This includes MAC addresses, serial IDs, and similar device
+ identifiers.
+
+Multiple overwrite bits may be combined and requested together. If no bits
+are provided, it is expected that the device only update firmware binaries
+in the components being updated. Settings and identifiers are expected to be
+preserved across the update. A device may not support every combination and
+the driver for such a device must reject any combination which cannot be
+faithfully implemented.
+
Firmware Loading
================
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 0c99b11f05f9..e0b8cfed610a 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -24,7 +24,7 @@ attributes of the health reporting and recovery procedures.
The ``devlink`` health reporter:
Device driver creates a "health reporter" per each error/health type.
-Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by ``devlink``.
@@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.:
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
- * OOB initial parameters
+ * Out Of Box initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
@@ -46,11 +46,31 @@ Once an error is reported, devlink health will perform the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and saved at the reporter instance (as long as
- there is no other dump which is already stored)
+ auto-dump is set and there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
+
- Auto-recovery configuration
- Grace period vs. time passed since last recover
+Devlink formatted message
+=========================
+
+To handle devlink health diagnose and health dump requests, devlink creates a
+formatted message structure ``devlink_fmsg`` and send it to the driver's callback
+to fill the data in using the devlink fmsg API.
+
+Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
+json-like format. The API allows the driver to add nested attributes such as
+object, object pair and value array, in addition to attributes such as name and
+value.
+
+Driver should use this API to fill the fmsg context in a format which will be
+translated by the devlink to the netlink message later. When it needs to send
+the data using SKBs to the netlink layer, it fragments the data between
+different SKBs. In order to do this fragmentation, it uses virtual nests
+attributes, to avoid actual nesting use which cannot be divided between
+different SKBs.
+
User Interface
==============
@@ -72,14 +92,18 @@ via ``devlink``, e.g per error type (per health reporter):
* - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
- Allows reporter-related configuration setting.
* - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
- - Triggers a reporter's recovery procedure.
+ - Triggers reporter's recovery procedure.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
+ - Triggers a fake health event on the reporter. The effects of the test
+ event in terms of recovery flow should follow closely that of a real
+ event.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
- - Retrieves diagnostics data from a reporter on a device.
+ - Retrieves current device state related to the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
- Retrieves the last stored dump. Devlink health
- saves a single dump. If an dump is not already stored by the devlink
+ saves a single dump. If an dump is not already stored by devlink
for this reporter, devlink generates a new dump.
- dump output is defined by the reporter.
+ Dump output is defined by the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
- Clears the last saved dump file for the specified reporter.
@@ -93,7 +117,7 @@ The following diagram provides a general overview of ``devlink-health``::
+--------------------------+
|request for ops
|(diagnose,
- mlx5_core devlink |recover,
+ driver devlink |recover,
|dump)
+--------+ +--------------------------+
| | | reporter| |
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 3fe11401b838..1242b0e6826b 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -44,9 +44,11 @@ versions is generally discouraged - here, and via any other Linux API.
reported for two ports of the same device or on two hosts of
a multi-host device should be identical.
- .. note:: ``devlink-info`` API should be extended with a new field
- if devices want to report board/product serial number (often
- reported in PCI *Vital Product Data* capability).
+ * - ``board.serial_number``
+ - Board serial number of the device.
+
+ This is usually the serial number of the board, often available in
+ PCI *Vital Product Data*.
* - ``fixed``
- Group for hardware identifiers, and versions of components
@@ -196,15 +198,16 @@ fw.bundle_id
Unique identifier of the entire firmware bundle.
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
Future work
===========
The following extensions could be useful:
- - product serial number - NIC boards often get labeled with a board serial
- number rather than ASIC serial number; it'd be useful to add board serial
- numbers to the API if they can be retrieved from the device;
-
- on-disk firmware file names - drivers list the file names of firmware they
may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These,
however, are per module, rather than per device. It'd be useful to list
diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst
new file mode 100644
index 000000000000..6c0b8928bc13
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-linecard.rst
@@ -0,0 +1,122 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Devlink Line card
+=================
+
+Background
+==========
+
+The ``devlink-linecard`` mechanism is targeted for manipulation of
+line cards that serve as a detachable PHY modules for modular switch
+system. Following operations are provided:
+
+ * Get a list of supported line card types.
+ * Provision of a slot with specific line card type.
+ * Get and monitor of line card state and its change.
+
+Line card according to the type may contain one or more gearboxes
+to mux the lanes with certain speed to multiple ports with lanes
+of different speed. Line card ensures N:M mapping between
+the switch ASIC modules and physical front panel ports.
+
+Overview
+========
+
+Each line card devlink object is created by device driver,
+according to the physical line card slots available on the device.
+
+Similar to splitter cable, where the device might have no way
+of detection of the splitter cable geometry, the device
+might not have a way to detect line card type. For that devices,
+concept of provisioning is introduced. It allows the user to:
+
+ * Provision a line card slot with certain line card type
+
+ - Device driver would instruct the ASIC to prepare all
+ resources accordingly. The device driver would
+ create all instances, namely devlink port and netdevices
+ that reside on the line card, according to the line card type
+ * Manipulate of line card entities even without line card
+ being physically connected or powered-up
+ * Setup splitter cable on line card ports
+
+ - As on the ordinary ports, user may provision a splitter
+ cable of a certain type, without the need to
+ be physically connected to the port
+ * Configure devlink ports and netdevices
+
+Netdevice carrier is decided as follows:
+
+ * Line card is not inserted or powered-down
+
+ - The carrier is always down
+ * Line card is inserted and powered up
+
+ - The carrier is decided as for ordinary port netdevice
+
+Line card state
+===============
+
+The ``devlink-linecard`` mechanism supports the following line card states:
+
+ * ``unprovisioned``: Line card is not provisioned on the slot.
+ * ``unprovisioning``: Line card slot is currently being unprovisioned.
+ * ``provisioning``: Line card slot is currently in a process of being provisioned
+ with a line card type.
+ * ``provisioning_failed``: Provisioning was not successful.
+ * ``provisioned``: Line card slot is provisioned with a type.
+ * ``active``: Line card is powered-up and active.
+
+The following diagram provides a general overview of ``devlink-linecard``
+state transitions::
+
+ +-------------------------+
+ | |
+ +----------------------------------> unprovisioned |
+ | | |
+ | +--------|-------^--------+
+ | | |
+ | | |
+ | +--------v-------|--------+
+ | | |
+ | | provisioning |
+ | | |
+ | +------------|------------+
+ | |
+ | +-----------------------------+
+ | | |
+ | +------------v------------+ +------------v------------+ +-------------------------+
+ | | | | ----> |
+ +----- provisioning_failed | | provisioned | | active |
+ | | | | <---- |
+ | +------------^------------+ +------------|------------+ +-------------------------+
+ | | |
+ | | |
+ | | +------------v------------+
+ | | | |
+ | | | unprovisioning |
+ | | | |
+ | | +------------|------------+
+ | | |
+ | +-----------------------------+
+ | |
+ +-----------------------------------------------+
+
+
+Example usage
+=============
+
+.. code:: shell
+
+ $ devlink lc show [ DEV [ lc LC_INDEX ] ]
+ $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ]
+
+ # Show current line card configuration and status for all slots:
+ $ devlink lc
+
+ # Set slot 8 to be provisioned with type "16x100G":
+ $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
+
+ # Set slot 8 to be unprovisioned:
+ $ devlink lc set pci/0000:01:00.0 lc 8 notype
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index d075fd090b3d..4e01dc32bc08 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -97,14 +97,43 @@ own name.
* - ``enable_roce``
- Boolean
- Enable handling of RoCE traffic in the device.
+ * - ``enable_eth``
+ - Boolean
+ - When enabled, the device driver will instantiate Ethernet specific
+ auxiliary device of the devlink device.
+ * - ``enable_rdma``
+ - Boolean
+ - When enabled, the device driver will instantiate RDMA specific
+ auxiliary device of the devlink device.
+ * - ``enable_vnet``
+ - Boolean
+ - When enabled, the device driver will instantiate VDPA networking
+ specific auxiliary device of the devlink device.
+ * - ``enable_iwarp``
+ - Boolean
+ - Enable handling of iWARP traffic in the device.
* - ``internal_err_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
* - ``max_macs``
- u32
- - Specifies the maximum number of MAC addresses per ethernet port of
- this device.
+ - Typically macvlan, vlan net devices mac are also programmed in their
+ parent netdevice's Function rx filter. This parameter limit the
+ maximum number of unicast mac address filters to receive traffic from
+ per ethernet port of this device.
* - ``region_snapshot_enable``
- Boolean
- Enable capture of ``devlink-region`` snapshots.
+ * - ``enable_remote_dev_reset``
+ - Boolean
+ - Enable device reset by remote host. When cleared, the device driver
+ will NACK any attempt of other host to reset the device. This parameter
+ is useful for setups where a device is shared by different hosts, such
+ as multi-host setup.
+ * - ``io_eq_size``
+ - u32
+ - Control the size of I/O completion EQs.
+ * - ``event_eq_size``
+ - u32
+ - Control the size of asynchronous control events EQ.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..562f46b41274
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It has a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour along with port attributes
+describe what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+ :widths: 33 90
+
+ * - Flavour
+ - Description
+ * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+ - Any kind of physical port. This can be an eswitch physical port or any
+ other physical port on the device.
+ * - ``DEVLINK_PORT_FLAVOUR_DSA``
+ - This indicates a DSA interconnect port.
+ * - ``DEVLINK_PORT_FLAVOUR_CPU``
+ - This indicates a CPU port applicable only to DSA.
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+ - This indicates an eswitch port representing a port of PCI
+ physical function (PF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+ - This indicates an eswitch port representing a port of PCI
+ virtual function (VF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+ - This indicates an eswitch port representing a port of PCI
+ subfunction (SF).
+ * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+ - This indicates a virtual port for the PCI virtual function.
+
+Devlink port can have a different type based on the link layer described below.
+
+.. list-table:: List of devlink port types
+ :widths: 23 90
+
+ * - Type
+ - Description
+ * - ``DEVLINK_PORT_TYPE_ETH``
+ - Driver should set this port type when a link layer of the port is
+ Ethernet.
+ * - ``DEVLINK_PORT_TYPE_IB``
+ - Driver should set this port type when a link layer of the port is
+ InfiniBand.
+ * - ``DEVLINK_PORT_TYPE_AUTO``
+ - This type is indicated by the user when driver should detect the port
+ type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical, virtual functions and subfunctions. A function
+consists of one or more ports. This port is represented by the devlink eswitch
+port.
+
+A PCI device connected to multiple CPUs or multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch is on the PCI device which supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+ ---------------------------------------------------------
+ | |
+ | --------- --------- ------- ------- |
+ ----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
+ | server | | ------- ----/---- ---/----- ------- ---/--- ---/--- |
+ | pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ |
+ | connect | | ------- ------- |
+ ----------- | | controller_num=1 (no eswitch) |
+ ------|--------------------------------------------------
+ (internal wire)
+ |
+ ---------------------------------------------------------
+ | devlink eswitch ports and reps |
+ | ----------------------------------------------------- |
+ | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+ | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
+ | ----------------------------------------------------- |
+ | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+ | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | |
+ | ----------------------------------------------------- |
+ | |
+ | |
+ ----------- | --------- --------- ------- ------- |
+ | smartNIC| | | vf(s) | | sf(s) | |vf(s)| |sf(s)| |
+ | pci rc |==| ------- ----/---- ---/----- ------- ---/--- ---/--- |
+ | connect | | | pf0 |______/________/ | pf1 |___/_______/ |
+ ----------- | ------- ------- |
+ | |
+ | local controller_num=0 (eswitch) |
+ ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller number = 1)
+doesn't have the eswitch. Local controller (identified by controller number = 0)
+has the eswitch. The Devlink instance on the local controller has eswitch
+devlink ports for both the controllers.
+
+Function configuration
+======================
+
+Users can configure one or more function attributes before enumerating the PCI
+function. Usually it means, user should configure function attribute
+before a bus specific device for the function is created. However, when
+SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, function attribute should be configured before binding virtual
+function device to the driver. For subfunctions, this means user should
+configure port function attribute before activating the port function.
+
+A user may set the hardware address of the function using
+`devlink port function set hw_addr` command. For Ethernet port function
+this means a MAC address.
+
+Users may also set the RoCE capability of the function using
+`devlink port function set roce` command.
+
+Users may also set the function as migratable using
+`devlink port function set migratable` command.
+
+Users may also set the IPsec crypto capability of the function using
+`devlink port function set ipsec_crypto` command.
+
+Users may also set the IPsec packet capability of the function using
+`devlink port function set ipsec_packet` command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves system memory per PCI VF/SF.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce disable
+
+migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+User who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
+with migration support, the user can migrate the VM with this VF from one HV to a
+different one.
+
+However, when migratable capability is enable, device will disable features which cannot
+be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
+
+Example of LM with migratable function configuration:
+- Get migratable capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+ $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
+IPsec crypto capability setup
+-----------------------------
+When user enables IPsec crypto capability for a VF, user application can offload
+XFRM state crypto operation (Encrypt/Decrypt) to this VF.
+
+When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
+processed in software by the kernel.
+
+- Get IPsec crypto capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
+
+- Set IPsec crypto capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
+
+IPsec packet capability setup
+-----------------------------
+When user enables IPsec packet capability for a VF, user application can offload
+XFRM state and policy crypto operation (Encrypt/Decrypt) to this VF, as well as
+IPsec encapsulation.
+
+When IPsec packet capability is disabled (default) for a VF, the XFRM state and
+policy is processed in software by the kernel.
+
+- Get IPsec packet capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled
+
+- Set IPsec packet capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+
+Subfunction
+============
+
+Subfunction is a lightweight function that has a parent PCI function on which
+it is deployed. Subfunction is created and deployed in unit of 1. Unlike
+SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
+A subfunction communicates with the hardware through the parent PCI function.
+
+To use a subfunction, 3 steps setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
+
+Subfunction management is done using devlink port user interface.
+User performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using a devlink port interface. A user adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to subfunction management driver (devlink ops) and asks
+it to create a subfunction devlink port. Driver then instantiates the
+subfunction port and any associated objects such as health reporters and
+representor netdevice.
+
+(2) Configure
+-------------
+A subfunction devlink port is created but it is not active yet. That means the
+entities are created on devlink side, the e-switch port representor is created,
+but the subfunction device itself is not created. A user might use e-switch port
+representor to do settings, putting it into bridge, adding TC rules, etc. A user
+might as well configure the hardware address (such as MAC address) of the
+subfunction while subfunction is inactive.
+
+(3) Deploy
+----------
+Once a subfunction is configured, user must activate it to use it. Upon
+activation, subfunction management driver asks the subfunction management
+device to instantiate the subfunction device on particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
+At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+
+Rate object management
+======================
+
+Devlink provides API to manage tx rates of single devlink port or a group.
+This is done through rate objects, which can be one of the two types:
+
+``leaf``
+ Represents a single devlink port; created/destroyed by the driver. Since leaf
+ have 1to1 mapping to its devlink port, in user space it is referred as
+ ``pci/<bus_addr>/<port_index>``;
+
+``node``
+ Represents a group of rate objects (leafs and/or nodes); created/deleted by
+ request from the userspace; initially empty (no rate objects added). In
+ userspace it is referred as ``pci/<bus_addr>/<node_name>``, where
+ ``node_name`` can be any identifier, except decimal number, to avoid
+ collisions with leafs.
+
+API allows to configure following rate object's parameters:
+
+``tx_share``
+ Minimum TX rate value shared among all other rate objects, or rate objects
+ that parts of the parent group, if it is a part of the same group.
+
+``tx_max``
+ Maximum TX rate value.
+
+``tx_priority``
+ Allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their priority
+ as long as the nodes remain within their bandwidth limit. The higher the
+ priority the higher the probability that the node will get selected for
+ scheduling.
+
+``tx_weight``
+ Allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with the
+ strict priority. As a node is configured with a higher rate it gets more
+ BW relative to its siblings. Values are relative like a percentage
+ points, they basically tell how much BW should node take relative to
+ its siblings.
+
+``parent``
+ Parent node name. Parent node rate limits are considered as additional limits
+ to all node children limits. ``tx_max`` is an upper limit for children.
+ ``tx_share`` is a total bandwidth distributed among children.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose a node, or group of nodes with the highest priority that stays
+ within the BW limit and are not blocked. Use ``tx_priority`` as a
+ parameter for this arbitration.
+
+#. If group of nodes have the same priority perform WFQ arbitration on
+ that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue arbitration flow among its children,
+ until leaf node is reached, and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+ overused their assigned BW, move to the lower priority nodes.
+
+Driver implementations are allowed to support both or either rate object types
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+ :widths: 22 90
+
+ * - Term
+ - Definitions
+ * - ``PCI device``
+ - A physical PCI device having one or more PCI buses consists of one or
+ more PCI controllers.
+ * - ``PCI controller``
+ - A controller consists of potentially multiple physical functions,
+ virtual functions and subfunctions.
+ * - ``Port function``
+ - An object to manage the function of a port.
+ * - ``Subfunction``
+ - A lightweight function that has parent PCI function on which it is
+ deployed.
+ * - ``Subfunction device``
+ - A bus device of the subfunction, usually on a auxiliary bus.
+ * - ``Subfunction driver``
+ - A device driver for the subfunction auxiliary device.
+ * - ``Subfunction management device``
+ - A PCI physical function that supports subfunction management.
+ * - ``Subfunction management driver``
+ - A device driver for PCI physical function that supports
+ subfunction management using devlink port interface.
+ * - ``Subfunction host driver``
+ - A device driver for PCI physical function that hosts subfunction
+ devices. In most cases it is same as subfunction management driver. When
+ subfunction is used on external controller, subfunction management and
+ host drivers are different.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 04e04d1ff627..9232cd7da301 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -14,16 +14,31 @@ Region snapshots are collected by the driver, and can be accessed via read
or dump commands. This allows future analysis on the created snapshots.
Regions may optionally support triggering snapshots on demand.
+Snapshot identifiers are scoped to the devlink instance, not a region.
+All snapshots with the same snapshot id within a devlink instance
+correspond to the same event.
+
The major benefit to creating a region is to provide access to internal
address regions that are otherwise inaccessible to the user.
Regions may also be used to provide an additional way to debug complex error
-states, but see also :doc:`devlink-health`
+states, but see also Documentation/networking/devlink/devlink-health.rst
Regions may optionally support capturing a snapshot on demand via the
``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
requested snapshots must implement the ``.snapshot`` callback for the region
-in its ``devlink_region_ops`` structure.
+in its ``devlink_region_ops`` structure. If snapshot id is not set in
+the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
+the snapshot information to user space.
+
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
example usage
-------------
@@ -38,14 +53,15 @@ example usage
# Show all of the exposed regions with region sizes:
$ devlink region show
- pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
- pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
+ pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] max 8
+ pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] max 8
# Delete a snapshot using:
$ devlink region del pci/0000:00:05.0/cr-space snapshot 1
# Request an immediate snapshot, if supported by the region
- $ devlink region new pci/0000:00:05.0/cr-space snapshot 5
+ $ devlink region new pci/0000:00:05.0/cr-space
+ pci/0000:00:05.0/cr-space: snapshot 5
# Dump a snapshot:
$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
@@ -58,6 +74,10 @@ example usage
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ # Read from the region without a snapshot
+ $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
As regions are likely very device or driver specific, no generic regions are
defined. See the driver-specific documentation files for information on the
specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-reload.rst b/Documentation/networking/devlink/devlink-reload.rst
new file mode 100644
index 000000000000..2fb0269b2054
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-reload.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Reload
+==============
+
+``devlink-reload`` provides mechanism to reinit driver entities, applying
+``devlink-params`` and ``devlink-resources`` new values. It also provides
+mechanism to activate firmware.
+
+Reload Actions
+==============
+
+User may select a reload action.
+By default ``driver_reinit`` action is selected.
+
+.. list-table:: Possible reload actions
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``driver-reinit``
+ - Devlink driver entities re-initialization, including applying
+ new values to devlink entities which are used during driver
+ load which are:
+
+ * ``devlink-params`` in configuration mode ``driverinit``
+ * ``devlink-resources``
+
+ Other devlink entities may stay over the re-initialization:
+
+ * ``devlink-health-reporter``
+ * ``devlink-region``
+
+ The rest of the devlink entities have to be removed and readded.
+ * - ``fw_activate``
+ - Firmware activate. Activates new firmware if such image is stored and
+ pending activation. If no limitation specified this action may involve
+ firmware reset. If no new image pending this action will reload current
+ firmware image.
+
+Note that even though user asks for a specific action, the driver
+implementation might require to perform another action alongside with
+it. For example, some driver do not support driver reinitialization
+being performed without fw activation. Therefore, the devlink reload
+command returns the list of actions which were actrually performed.
+
+Reload Limits
+=============
+
+By default reload actions are not limited and driver implementation may
+include reset or downtime as needed to perform the actions.
+
+However, some drivers support action limits, which limit the action
+implementation to specific constraints.
+
+.. list-table:: Possible reload limits
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``no_reset``
+ - No reset allowed, no down time allowed, no link flap and no
+ configuration is lost.
+
+Change Namespace
+================
+
+The netns option allows user to be able to move devlink instances into
+namespaces during devlink reload operation.
+By default all devlink instances are created in init_net and stay there.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink dev reload help
+ $ devlink dev reload DEV [ netns { PID | NAME | ID } ] [ action { driver_reinit | fw_activate } ] [ limit no_reset ]
+
+ # Run reload command for devlink driver entities re-initialization:
+ $ devlink dev reload pci/0000:82:00.0 action driver_reinit
+ reload_actions_performed:
+ driver_reinit
+
+ # Run reload command to activate firmware:
+ # Note that mlx5 driver reloads the driver while activating firmware
+ $ devlink dev reload pci/0000:82:00.0 action fw_activate
+ reload_actions_performed:
+ driver_reinit fw_activate
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 93e92d2f0752..3d5ae51e65a2 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -23,6 +23,20 @@ current size and related sub resources. To access a sub resource, you
specify the path of the resource. For example ``/IPv4/fib`` is the id for
the ``fib`` sub-resource under the ``IPv4`` resource.
+Generic Resources
+=================
+
+Generic resources are used to describe resources that can be shared by multiple
+device drivers and their description must be added to the following table:
+
+.. list-table:: List of Generic Resources
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``physical_ports``
+ - A limited capacity of physical ports that the switch ASIC can support
+
example usage
-------------
diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst
new file mode 100644
index 000000000000..c0aa1f3aef0d
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-selftests.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=================
+Devlink Selftests
+=================
+
+The ``devlink-selftests`` API allows executing selftests on the device.
+
+Tests Mask
+==========
+The ``devlink-selftests`` command should be run with a mask indicating
+the tests to be executed.
+
+Tests Description
+=================
+The following is a list of tests that drivers may execute.
+
+.. list-table:: List of tests
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``DEVLINK_SELFTEST_FLASH``
+ - Devices may have the firmware on non-volatile memory on the board, e.g.
+ flash. This particular test helps to run a flash selftest on the device.
+ Implementation of the test is left to the driver/firmware.
+
+example usage
+-------------
+
+.. code:: shell
+
+ # Query selftests supported on the devlink device
+ $ devlink dev selftests show DEV
+ # Query selftests supported on all devlink devices
+ $ devlink dev selftests show
+ # Executes selftests on the device
+ $ devlink dev selftests run DEV id flash
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index fe089acb7783..2c14dfe69b3a 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -55,7 +55,7 @@ The following diagram provides a general overview of ``devlink-trap``::
| |
+-------^--------+
|
- |
+ | Non-control traps
|
+----+----+
| | Kernel's Rx path
@@ -97,6 +97,12 @@ The ``devlink-trap`` mechanism supports the following packet trap types:
processed by ``devlink`` and injected to the kernel's Rx path. Changing the
action of such traps is not allowed, as it can easily break the control
plane.
+ * ``control``: Trapped packets were trapped by the device because these are
+ control packets required for the correct functioning of the control plane.
+ For example, ARP request and IGMP query packets. Packets are injected to
+ the kernel's Rx path, but not reported to the kernel's drop monitor.
+ Changing the action of such traps is not allowed, as it can easily break
+ the control plane.
.. _Trap-Actions:
@@ -108,6 +114,8 @@ The ``devlink-trap`` mechanism supports the following packet trap actions:
* ``trap``: The sole copy of the packet is sent to the CPU.
* ``drop``: The packet is dropped by the underlying device and a copy is not
sent to the CPU.
+ * ``mirror``: The packet is forwarded by the underlying device and a copy is
+ sent to the CPU.
Generic Packet Traps
====================
@@ -244,6 +252,249 @@ be added to the following table:
* - ``egress_flow_action_drop``
- ``drop``
- Traps packets dropped during processing of egress flow action drop
+ * - ``stp``
+ - ``control``
+ - Traps STP packets
+ * - ``lacp``
+ - ``control``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``control``
+ - Traps LLDP packets
+ * - ``igmp_query``
+ - ``control``
+ - Traps IGMP Membership Query packets
+ * - ``igmp_v1_report``
+ - ``control``
+ - Traps IGMP Version 1 Membership Report packets
+ * - ``igmp_v2_report``
+ - ``control``
+ - Traps IGMP Version 2 Membership Report packets
+ * - ``igmp_v3_report``
+ - ``control``
+ - Traps IGMP Version 3 Membership Report packets
+ * - ``igmp_v2_leave``
+ - ``control``
+ - Traps IGMP Version 2 Leave Group packets
+ * - ``mld_query``
+ - ``control``
+ - Traps MLD Multicast Listener Query packets
+ * - ``mld_v1_report``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Report packets
+ * - ``mld_v2_report``
+ - ``control``
+ - Traps MLD Version 2 Multicast Listener Report packets
+ * - ``mld_v1_done``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Done packets
+ * - ``ipv4_dhcp``
+ - ``control``
+ - Traps IPv4 DHCP packets
+ * - ``ipv6_dhcp``
+ - ``control``
+ - Traps IPv6 DHCP packets
+ * - ``arp_request``
+ - ``control``
+ - Traps ARP request packets
+ * - ``arp_response``
+ - ``control``
+ - Traps ARP response packets
+ * - ``arp_overlay``
+ - ``control``
+ - Traps NVE-decapsulated ARP packets that reached the overlay network.
+ This is required, for example, when the address that needs to be
+ resolved is a local address
+ * - ``ipv6_neigh_solicit``
+ - ``control``
+ - Traps IPv6 Neighbour Solicitation packets
+ * - ``ipv6_neigh_advert``
+ - ``control``
+ - Traps IPv6 Neighbour Advertisement packets
+ * - ``ipv4_bfd``
+ - ``control``
+ - Traps IPv4 BFD packets
+ * - ``ipv6_bfd``
+ - ``control``
+ - Traps IPv6 BFD packets
+ * - ``ipv4_ospf``
+ - ``control``
+ - Traps IPv4 OSPF packets
+ * - ``ipv6_ospf``
+ - ``control``
+ - Traps IPv6 OSPF packets
+ * - ``ipv4_bgp``
+ - ``control``
+ - Traps IPv4 BGP packets
+ * - ``ipv6_bgp``
+ - ``control``
+ - Traps IPv6 BGP packets
+ * - ``ipv4_vrrp``
+ - ``control``
+ - Traps IPv4 VRRP packets
+ * - ``ipv6_vrrp``
+ - ``control``
+ - Traps IPv6 VRRP packets
+ * - ``ipv4_pim``
+ - ``control``
+ - Traps IPv4 PIM packets
+ * - ``ipv6_pim``
+ - ``control``
+ - Traps IPv6 PIM packets
+ * - ``uc_loopback``
+ - ``control``
+ - Traps unicast packets that need to be routed through the same layer 3
+ interface from which they were received. Such packets are routed by the
+ kernel, but also cause it to potentially generate ICMP redirect packets
+ * - ``local_route``
+ - ``control``
+ - Traps unicast packets that hit a local route and need to be locally
+ delivered
+ * - ``external_route``
+ - ``control``
+ - Traps packets that should be routed through an external interface (e.g.,
+ management interface) that does not belong to the same device (e.g.,
+ switch ASIC) as the ingress interface
+ * - ``ipv6_uc_dip_link_local_scope``
+ - ``control``
+ - Traps unicast IPv6 packets that need to be routed and have a destination
+ IP address with a link-local scope (i.e., fe80::/10). The trap allows
+ device drivers to avoid programming link-local routes, but still receive
+ packets for local delivery
+ * - ``ipv6_dip_all_nodes``
+ - ``control``
+ - Traps IPv6 packets that their destination IP address is the "All Nodes
+ Address" (i.e., ff02::1)
+ * - ``ipv6_dip_all_routers``
+ - ``control``
+ - Traps IPv6 packets that their destination IP address is the "All Routers
+ Address" (i.e., ff02::2)
+ * - ``ipv6_router_solicit``
+ - ``control``
+ - Traps IPv6 Router Solicitation packets
+ * - ``ipv6_router_advert``
+ - ``control``
+ - Traps IPv6 Router Advertisement packets
+ * - ``ipv6_redirect``
+ - ``control``
+ - Traps IPv6 Redirect Message packets
+ * - ``ipv4_router_alert``
+ - ``control``
+ - Traps IPv4 packets that need to be routed and include the Router Alert
+ option. Such packets need to be locally delivered to raw sockets that
+ have the IP_ROUTER_ALERT socket option set
+ * - ``ipv6_router_alert``
+ - ``control``
+ - Traps IPv6 packets that need to be routed and include the Router Alert
+ option in their Hop-by-Hop extension header. Such packets need to be
+ locally delivered to raw sockets that have the IPV6_ROUTER_ALERT socket
+ option set
+ * - ``ptp_event``
+ - ``control``
+ - Traps PTP time-critical event messages (Sync, Delay_req, Pdelay_Req and
+ Pdelay_Resp)
+ * - ``ptp_general``
+ - ``control``
+ - Traps PTP general messages (Announce, Follow_Up, Delay_Resp,
+ Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``flow_action_sample``
+ - ``control``
+ - Traps packets sampled during processing of flow action sample (e.g., via
+ tc's sample action)
+ * - ``flow_action_trap``
+ - ``control``
+ - Traps packets logged during processing of flow action trap (e.g., via
+ tc's trap action)
+ * - ``early_drop``
+ - ``drop``
+ - Traps packets dropped due to the RED (Random Early Detection) algorithm
+ (i.e., early drops)
+ * - ``vxlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VXLAN header parsing which
+ might be because of packet truncation or the I flag is not set.
+ * - ``llc_snap_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the LLC+SNAP header parsing
+ * - ``vlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VLAN header parsing. Could
+ include unexpected packet truncation.
+ * - ``pppoe_ppp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the PPPoE+PPP header parsing.
+ This could include finding a session ID of 0xFFFF (which is reserved and
+ not for use), a PPPoE length which is larger than the frame received or
+ any common error on this type of header
+ * - ``mpls_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the MPLS header parsing which
+ could include unexpected header truncation
+ * - ``arp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ARP header parsing
+ * - ``ip_1_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the first IP header parsing.
+ This packet trap could include packets which do not pass an IP checksum
+ check, a header length check (a minimum of 20 bytes), which might suffer
+ from packet truncation thus the total length field exceeds the received
+ packet length etc
+ * - ``ip_n_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the parsing of the last IP
+ header (the inner one in case of an IP over IP tunnel). The same common
+ error checking is performed here as for the ip_1_parsing trap
+ * - ``gre_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GRE header parsing
+ * - ``udp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the UDP header parsing.
+ This packet trap could include checksum errorrs, an improper UDP
+ length detected (smaller than 8 bytes) or detection of header
+ truncation.
+ * - ``tcp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the TCP header parsing.
+ This could include TCP checksum errors, improper combination of SYN, FIN
+ and/or RESET etc.
+ * - ``ipsec_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the IPSEC header parsing
+ * - ``sctp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the SCTP header parsing.
+ This would mean that port number 0 was used or that the header is
+ truncated.
+ * - ``dccp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the DCCP header parsing
+ * - ``gtp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GTP header parsing
+ * - ``esp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ESP header parsing
+ * - ``blackhole_nexthop``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole nexthop
+ * - ``dmac_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because
+ the destination MAC is not configured in the MAC table and
+ the interface is not in promiscuous mode
+ * - ``eapol``
+ - ``control``
+ - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+ specified in IEEE 802.1X
+ * - ``locked_port``
+ - ``drop``
+ - Traps packets that the device decided to drop because they failed the
+ locked bridge port check. That is, packets that were received via a
+ locked port and whose {SMAC, VID} does not correspond to an FDB entry
+ pointing to the port
Driver-specific Packet Traps
============================
@@ -254,8 +505,9 @@ help debug packet drops caused by these exceptions. The following list includes
links to the description of driver-specific traps registered by various device
drivers:
- * :doc:`netdevsim`
- * :doc:`mlxsw`
+ * Documentation/networking/devlink/netdevsim.rst
+ * Documentation/networking/devlink/mlxsw.rst
+ * Documentation/networking/devlink/prestera.rst
.. _Generic-Packet-Trap-Groups:
@@ -277,8 +529,11 @@ narrow. The description of these groups must be added to the following table:
- Contains packet traps for packets that were dropped by the device during
layer 2 forwarding (i.e., bridge)
* - ``l3_drops``
- - Contains packet traps for packets that were dropped by the device or hit
- an exception (e.g., TTL error) during layer 3 forwarding
+ - Contains packet traps for packets that were dropped by the device during
+ layer 3 forwarding
+ * - ``l3_exceptions``
+ - Contains packet traps for packets that hit an exception (e.g., TTL
+ error) during layer 3 forwarding
* - ``buffer_drops``
- Contains packet traps for packets that were dropped by the device due to
an enqueue decision
@@ -288,6 +543,65 @@ narrow. The description of these groups must be added to the following table:
* - ``acl_drops``
- Contains packet traps for packets that were dropped by the device during
ACL processing
+ * - ``stp``
+ - Contains packet traps for STP packets
+ * - ``lacp``
+ - Contains packet traps for LACP packets
+ * - ``lldp``
+ - Contains packet traps for LLDP packets
+ * - ``mc_snooping``
+ - Contains packet traps for IGMP and MLD packets required for multicast
+ snooping
+ * - ``dhcp``
+ - Contains packet traps for DHCP packets
+ * - ``neigh_discovery``
+ - Contains packet traps for neighbour discovery packets (e.g., ARP, IPv6
+ ND)
+ * - ``bfd``
+ - Contains packet traps for BFD packets
+ * - ``ospf``
+ - Contains packet traps for OSPF packets
+ * - ``bgp``
+ - Contains packet traps for BGP packets
+ * - ``vrrp``
+ - Contains packet traps for VRRP packets
+ * - ``pim``
+ - Contains packet traps for PIM packets
+ * - ``uc_loopback``
+ - Contains a packet trap for unicast loopback packets (i.e.,
+ ``uc_loopback``). This trap is singled-out because in cases such as
+ one-armed router it will be constantly triggered. To limit the impact on
+ the CPU usage, a packet trap policer with a low rate can be bound to the
+ group without affecting other traps
+ * - ``local_delivery``
+ - Contains packet traps for packets that should be locally delivered after
+ routing, but do not match more specific packet traps (e.g.,
+ ``ipv4_bgp``)
+ * - ``external_delivery``
+ - Contains packet traps for packets that should be routed through an
+ external interface (e.g., management interface) that does not belong to
+ the same device (e.g., switch ASIC) as the ingress interface
+ * - ``ipv6``
+ - Contains packet traps for various IPv6 control packets (e.g., Router
+ Advertisements)
+ * - ``ptp_event``
+ - Contains packet traps for PTP time-critical event messages (Sync,
+ Delay_req, Pdelay_Req and Pdelay_Resp)
+ * - ``ptp_general``
+ - Contains packet traps for PTP general messages (Announce, Follow_Up,
+ Delay_Resp, Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``acl_sample``
+ - Contains packet traps for packets that were sampled by the device during
+ ACL processing
+ * - ``acl_trap``
+ - Contains packet traps for packets that were trapped (logged) by the
+ device during ACL processing
+ * - ``parser_error_drops``
+ - Contains packet traps for packets that were marked by the device during
+ parsing as erroneous
+ * - ``eapol``
+ - Contains packet traps for "Extensible Authentication Protocol over LAN"
+ (EAPOL) packets specified in IEEE 802.1X
Packet Trap Policers
====================
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 000000000000..3b857d82a44c
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as the first member of the
+ ``firmware-version``.
+ * - ``fw.bootloader``
+ - running
+ - Version of the bootloader running on the device. Also available
+ through ``ethtool -i`` as the second member of the
+ ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The hardware revision of the device.
+ * - ``serial_number``
+ - fixed
+ - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
new file mode 100644
index 000000000000..4562a6e4782f
--- /dev/null
+++ b/Documentation/networking/devlink/hns3.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+hns3 devlink support
+====================
+
+This document describes the devlink features implemented by the ``hns3``
+device driver.
+
+The ``hns3`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
+
+Info versions
+=============
+
+The ``hns3`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 10 10 80
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Used to represent the firmware version.
diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
new file mode 100644
index 000000000000..d3cb5bb5197e
--- /dev/null
+++ b/Documentation/networking/devlink/i40e.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+i40e devlink support
+====================
+
+This document describes the devlink features implemented by the ``i40e``
+device driver.
+
+Info versions
+=============
+
+The ``i40e`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``board.id``
+ - fixed
+ - K15190-000
+ - The Product Board Assembly (PBA) identifier of the board.
+ * - ``fw.mgmt``
+ - running
+ - 9.130
+ - 2-digit version number of the management firmware that controls the
+ PHY, link, etc.
+ * - ``fw.mgmt.api``
+ - running
+ - 1.15
+ - 2-digit version number of the API exported over the AdminQ by the
+ management firmware. Used by the driver to identify what commands
+ are supported.
+ * - ``fw.mgmt.build``
+ - running
+ - 73618
+ - Build number of the source for the management firmware.
+ * - ``fw.undi``
+ - running
+ - 1.3429.0
+ - Version of the Option ROM containing the UEFI driver. The version is
+ reported in ``major.minor.patch`` format. The major version is
+ incremented whenever a major breaking change occurs, or when the
+ minor version would overflow. The minor version is incremented for
+ non-breaking changes and reset to 1 when the major version is
+ incremented. The patch version is normally 0 but is incremented when
+ a fix is delivered as a patch against an older base Option ROM.
+ * - ``fw.psid.api``
+ - running
+ - 9.30
+ - Version defining the format of the flash contents.
+ * - ``fw.bundle_id``
+ - running
+ - 0x8000e5f3
+ - Unique identifier of the firmware image file that was loaded onto
+ the device. Also referred to as the EETRACK identifier of the NVM.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 4574352d6ff4..7f30ebd5debb 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -7,6 +7,21 @@ ice devlink support
This document describes the devlink features implemented by the ``ice``
device driver.
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ - Notes
+ * - ``enable_roce``
+ - runtime
+ - mutually exclusive with ``enable_iwarp``
+ * - ``enable_iwarp``
+ - runtime
+ - mutually exclusive with ``enable_roce``
+
Info versions
=============
@@ -23,17 +38,24 @@ The ``ice`` driver reports the following versions
- fixed
- K65390-000
- The Product Board Assembly (PBA) identifier of the board.
+ * - ``cgu.id``
+ - fixed
+ - 36
+ - The Clock Generation Unit (CGU) hardware revision identifier.
* - ``fw.mgmt``
- running
- 2.1.7
- - 3-digit version number of the management firmware that controls the
- PHY, link, etc.
+ - 3-digit version number of the management firmware running on the
+ Embedded Management Processor of the device. It controls the PHY,
+ link, access to device resources, etc. Intel documentation refers to
+ this as the EMP firmware.
* - ``fw.mgmt.api``
- running
- - 1.5
- - 2-digit version number of the API exported over the AdminQ by the
- management firmware. Used by the driver to identify what commands
- are supported.
+ - 1.5.1
+ - 3-digit version number (major.minor.patch) of the API exported over
+ the AdminQ by the management firmware. Used by the driver to
+ identify what commands are supported. Historical versions of the
+ kernel only displayed a 2-digit version number (major.minor).
* - ``fw.mgmt.build``
- running
- 0x305d955f
@@ -69,18 +91,150 @@ The ``ice`` driver reports the following versions
- The version of the DDP package that is active in the device. Note
that both the name (as reported by ``fw.app.name``) and version are
required to uniquely identify the package.
+ * - ``fw.app.bundle_id``
+ - running
+ - 0xc0000001
+ - Unique identifier for the DDP package loaded in the device. Also
+ referred to as the DDP Track ID. Can be used to uniquely identify
+ the specific DDP package.
+ * - ``fw.netlist``
+ - running
+ - 1.1.2000-6.7.0
+ - The version of the netlist module. This module defines the device's
+ Ethernet capabilities and default settings, and is used by the
+ management firmware as part of managing link and device
+ connectivity.
+ * - ``fw.netlist.build``
+ - running
+ - 0xee16ced7
+ - The first 4 bytes of the hash of the netlist module contents.
+ * - ``fw.cgu``
+ - running
+ - 8032.16973825.6021
+ - The version of Clock Generation Unit (CGU). Format:
+ <CGU type>.<configuration version>.<firmware version>.
+
+Flash Update
+============
+
+The ``ice`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+ :widths: 5 95
+
+ * - Bits
+ - Behavior
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Do not preserve settings stored in the flash components being
+ updated. This includes overwriting the port configuration that
+ determines the number of physical functions the device will
+ initialize with.
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Do not preserve either settings or identifiers. Overwrite everything
+ in the flash with the contents from the provided image, without
+ performing any preservation. This includes overwriting device
+ identifying fields such as the MAC address, VPD area, and device
+ serial number. It is expected that this combination be used with an
+ image customized for the specific device.
+
+The ice hardware does not support overwriting only identifiers while
+preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
+own will be rejected. If no overwrite mask is provided, the firmware will be
+instructed to preserve all settings and identifying fields when updating.
+
+Reload
+======
+
+The ``ice`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+ $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Port split
+==========
+
+The ``ice`` driver supports port splitting only for port 0, as the FW has
+a predefined set of available port split options for the whole device.
+
+A system reboot is required for port split to be applied.
+
+The following command will select the port split option with 4 ports:
+
+.. code:: shell
+
+ $ devlink port split pci/0000:16:00.0/0 count 4
+
+The list of all available port options will be printed to dynamic debug after
+each ``split`` and ``unsplit`` command. The first option is the default.
+
+.. code:: shell
+
+ ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
+ ice 0000:16:00.0: Status Split Quad 0 Quad 1
+ ice 0000:16:00.0: count L0 L1 L2 L3 L4 L5 L6 L7
+ ice 0000:16:00.0: Active 2 100 - - - 100 - - -
+ ice 0000:16:00.0: 2 50 - 50 - - - - -
+ ice 0000:16:00.0: Pending 4 25 25 25 25 - - - -
+ ice 0000:16:00.0: 4 25 25 - - 25 25 - -
+ ice 0000:16:00.0: 8 10 10 10 10 10 10 10 10
+ ice 0000:16:00.0: 1 100 - - - - - - -
+
+There could be multiple FW port options with the same port split count. When
+the same port split count request is issued again, the next FW port option with
+the same port split count will be selected.
+
+``devlink port unsplit`` will select the option with a split count of 1. If
+there is no FW option available with split count 1, you will receive an error.
Regions
=======
-The ``ice`` driver enables access to the contents of the Non Volatile Memory
-flash chip via the ``nvm-flash`` region.
+The ``ice`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``nvm-flash``
+ - The contents of the entire flash chip, sometimes referred to as
+ the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
+ * - ``device-caps``
+ - The contents of the device firmware's capabilities buffer. Useful to
+ determine the current state and configuration of the device.
-Users can request an immediate capture of a snapshot via the
-``DEVLINK_CMD_REGION_NEW``
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
.. code:: shell
+ $ devlink region show
+ pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+ pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
$ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
$ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
@@ -94,3 +248,157 @@ Users can request an immediate capture of a snapshot via the
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
$ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
+ $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
+ 0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
+ 0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
+ 0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
+ 00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
+ 00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
+ 0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
+ 00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+ $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements devlink-rate API. It allows for offload of
+the Hierarchical QoS to the hardware. It enables user to group Virtual
+Functions in a tree structure and assign supported parameters: tx_share,
+tx_max, tx_priority and tx_weight to each node in a tree. So effectively
+user gains an ability to control how much bandwidth is allocated for each
+VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly if ADQ or DCB will get configured the driver won't export
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required because devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to the switchdev mode, it will export internal
+hierarchy the moment VF's are created. Root of the tree is always
+represented by the node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``tx_max``
+ - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+ an absolute number specifying a maximum amount of bytes a Node may
+ consume during the course of one second. Rate limit guarantees
+ that a link will not oversaturate the receiver on the remote end
+ and also enforces an SLA between the subscriber and network
+ provider.
+ * - ``tx_share``
+ - minimum bandwidth allocated to a tree node when it is not blocked.
+ It specifies an absolute BW. While tx_max defines the maximum
+ bandwidth the node may consume, the tx_share marks committed BW
+ for the Node.
+ * - ``tx_priority``
+ - allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their
+ priority as long as the nodes remain within their bandwidth limit.
+ Range 0-7. Nodes with priority 7 have the highest priority and are
+ selected first, while nodes with priority 0 have the lowest
+ priority. Nodes that have the same priority are treated equally.
+ * - ``tx_weight``
+ - allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with
+ the strict priority. Range 1-200. Only relative values matter for
+ arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+ # enable switchdev
+ $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+ # at this point driver should export internal hierarchy
+ $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+ $ devlink port function rate show
+ pci/0000:4b:00.0/node_25: type node parent node_24
+ pci/0000:4b:00.0/node_24: type node parent node_0
+ pci/0000:4b:00.0/node_32: type node parent node_31
+ pci/0000:4b:00.0/node_31: type node parent node_30
+ pci/0000:4b:00.0/node_30: type node parent node_16
+ pci/0000:4b:00.0/node_19: type node parent node_18
+ pci/0000:4b:00.0/node_18: type node parent node_17
+ pci/0000:4b:00.0/node_17: type node parent node_16
+ pci/0000:4b:00.0/node_14: type node parent node_5
+ pci/0000:4b:00.0/node_5: type node parent node_3
+ pci/0000:4b:00.0/node_13: type node parent node_4
+ pci/0000:4b:00.0/node_12: type node parent node_4
+ pci/0000:4b:00.0/node_11: type node parent node_4
+ pci/0000:4b:00.0/node_10: type node parent node_4
+ pci/0000:4b:00.0/node_9: type node parent node_4
+ pci/0000:4b:00.0/node_8: type node parent node_4
+ pci/0000:4b:00.0/node_7: type node parent node_4
+ pci/0000:4b:00.0/node_6: type node parent node_4
+ pci/0000:4b:00.0/node_4: type node parent node_3
+ pci/0000:4b:00.0/node_3: type node parent node_16
+ pci/0000:4b:00.0/node_16: type node parent node_15
+ pci/0000:4b:00.0/node_15: type node parent node_0
+ pci/0000:4b:00.0/node_2: type node parent node_1
+ pci/0000:4b:00.0/node_1: type node parent node_0
+ pci/0000:4b:00.0/node_0: type node
+ pci/0000:4b:00.0/1: type leaf parent node_25
+ pci/0000:4b:00.0/2: type leaf parent node_25
+
+ # let's create some custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+ # second custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+ # reassign second VF to newly created branch
+ $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+ # assign tx_weight to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+ # assign tx_share to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index c536db2cc0f9..e14d7a701b72 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -4,6 +4,48 @@ Linux Devlink Documentation
devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held all callbacks of devlink
+netlink commands.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
+Drivers need to be cautious when taking devlink instance lock and
+taking RTNL lock at the same time. Devlink instance lock needs to be taken
+first, only after that RTNL lock could be taken.
+
+Nested instances
+----------------
+
+Some objects, like linecards or port functions, could have another
+devlink instances created underneath. In that case, drivers should make
+sure to respect following rules:
+
+ - Lock ordering should be maintained. If driver needs to take instance
+ lock of both nested and parent instances at the same time, devlink
+ instance lock of the parent instance should be taken first, only then
+ instance lock of the nested instance could be taken.
+ - Driver should use object-specific helpers to setup the
+ nested relationship:
+
+ - ``devl_nested_devlink_set()`` - called to setup devlink -> nested
+ devlink relationship (could be user for multiple nested instances.
+ - ``devl_port_fn_devlink_set()`` - called to setup port function ->
+ nested devlink relationship.
+ - ``devlink_linecard_nested_dl_set()`` - called to setup linecard ->
+ nested devlink relationship.
+
+The nested devlink info is exposed to the userspace over object-specific
+attributes of devlink netlink.
+
Interface documentation
-----------------------
@@ -18,9 +60,13 @@ general.
devlink-info
devlink-flash
devlink-params
+ devlink-port
devlink-region
devlink-resource
+ devlink-reload
+ devlink-selftests
devlink-trap
+ devlink-linecard
Driver-specific documentation
-----------------------------
@@ -32,6 +78,9 @@ parameters, info versions, and other features it supports.
:maxdepth: 1
bnxt
+ etas_es58x
+ hns3
+ i40e
ionic
ice
mlx4
@@ -42,3 +91,8 @@ parameters, info versions, and other features it supports.
nfp
qed
ti-cpsw-switch
+ am65-nuss-cpsw-switch
+ prestera
+ iosm
+ octeontx2
+ sfc
diff --git a/Documentation/networking/devlink/iosm.rst b/Documentation/networking/devlink/iosm.rst
new file mode 100644
index 000000000000..6136181339aa
--- /dev/null
+++ b/Documentation/networking/devlink/iosm.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+iosm devlink support
+====================
+
+This document describes the devlink features implemented by the ``iosm``
+device driver.
+
+Parameters
+==========
+
+The ``iosm`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``erase_full_flash``
+ - u8
+ - runtime
+ - erase_full_flash parameter is used to check if full erase is required for
+ the device during firmware flashing.
+ If set, Full nand erase command will be sent to the device. By default,
+ only conditional erase support is enabled.
+
+
+Flash Update
+============
+
+The ``iosm`` driver implements support for flash update using the
+``devlink-flash`` interface.
+
+It supports updating the device flash using a combined flash image which contains
+the Bootloader images and other modem software images.
+
+The driver uses DEVLINK_SUPPORT_FLASH_UPDATE_COMPONENT to identify type of
+firmware image that need to be flashed as requested by user space application.
+Supported firmware image types.
+
+.. list-table:: Firmware Image types
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``PSI RAM``
+ - Primary Signed Image
+ * - ``EBL``
+ - External Bootloader
+ * - ``FLS``
+ - Modem Software Image
+
+PSI RAM and EBL are the RAM images which are injected to the device when the
+device is in BOOT ROM stage. Once this is successful, the actual modem firmware
+image is flashed to the device. The modem software image contains multiple files
+each having one secure bin file and at least one Loadmap/Region file. For flashing
+these files, appropriate commands are sent to the modem device along with the
+data required for flashing. The data like region count and address of each region
+has to be passed to the driver using the devlink param command.
+
+If the device has to be fully erased before firmware flashing, user application
+need to set the erase_full_flash parameter using devlink param command.
+By default, conditional erase feature is supported.
+
+Flash Commands:
+===============
+1) When modem is in Boot ROM stage, user can use below command to inject PSI RAM
+image using devlink flash command.
+
+$ devlink dev flash pci/0000:02:00.0 file <PSI_RAM_File_name>
+
+2) If user want to do a full erase, below command need to be issued to set the
+erase full flash param (To be set only if full erase required).
+
+$ devlink dev param set pci/0000:02:00.0 name erase_full_flash value true cmode runtime
+
+3) Inject EBL after the modem is in PSI stage.
+
+$ devlink dev flash pci/0000:02:00.0 file <EBL_File_name>
+
+4) Once EBL is injected successfully, then the actual firmware flashing takes
+place. Below is the sequence of commands used for each of the firmware images.
+
+a) Flash secure bin file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Secure_bin_file_name>
+
+b) Flashing the Loadmap/Region file
+
+$ devlink dev flash pci/0000:02:00.0 file <Load_map_file_name>
+
+Regions
+=======
+
+The ``iosm`` driver supports dumping the coredump logs.
+
+In case a firmware encounters an exception, a snapshot will be taken by the
+driver. Following regions are accessed for device internal data.
+
+.. list-table:: Regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``report.json``
+ - The summary of exception details logged as part of this region.
+ * - ``coredump.fcd``
+ - This region contains the details related to the exception occurred in the
+ device (RAM dump).
+ * - ``cdd.log``
+ - This region contains the logs related to the modem CDD driver.
+ * - ``eeprom.bin``
+ - This region contains the eeprom logs.
+ * - ``bootcore_trace.bin``
+ - This region contains the current instance of bootloader logs.
+ * - ``bootcore_prev_trace.bin``
+ - This region contains the previous instance of bootloader logs.
+
+
+Region commands
+===============
+
+$ devlink region show
+
+$ devlink region new pci/0000:02:00.0/report.json
+
+$ devlink region dump pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region del pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region new pci/0000:02:00.0/coredump.fcd
+
+$ devlink region dump pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region del pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region new pci/0000:02:00.0/cdd.log
+
+$ devlink region dump pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region del pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region new pci/0000:02:00.0/eeprom.bin
+
+$ devlink region dump pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region del pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region new pci/0000:02:00.0/bootcore_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region del pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region new pci/0000:02:00.0/bootcore_prev_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
+
+$ devlink region del pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4e4b97f7971a..702f204a3dbd 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -14,8 +14,24 @@ Parameters
* - Name
- Mode
+ - Validation
* - ``enable_roce``
- driverinit
+ - Type: Boolean
+
+ If the device supports RoCE disablement, RoCE enablement state controls
+ device support for RoCE capability. Otherwise, the control occurs in the
+ driver stack. When RoCE is disabled at the driver level, only raw
+ ethernet QPs are supported.
+ * - ``io_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``event_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``max_macs``
+ - driverinit
+ - The range is between 1 and 2^31. Only power of 2 values are supported.
The ``mlx5`` driver also implements the following driver-specific
parameters.
@@ -37,12 +53,62 @@ parameters.
* ``smfs`` Software managed flow steering. In SMFS mode, the HW
steering entities are created and manage through the driver without
firmware intervention.
+
+ SMFS mode is faster and provides better rule insertion rate compared to
+ default DMFS mode.
* - ``fdb_large_groups``
- u32
- driverinit
- Control the number of large groups (size > 1) in the FDB table.
* The default value is 15, and the range is between 1 and 1024.
+ * - ``esw_multiport``
+ - Boolean
+ - runtime
+ - Control MultiPort E-Switch shared fdb mode.
+
+ An experimental mode where a single E-Switch is used and all the vports
+ and physical ports on the NIC are connected to it.
+
+ An example is to send traffic from a VF that is created on PF0 to an
+ uplink that is natively associated with the uplink of PF1
+
+ Note: Future devices, ConnectX-8 and onward, will eventually have this
+ as the default to allow forwarding between all NIC ports in a single
+ E-switch environment and the dual E-switch mode will likely get
+ deprecated.
+
+ Default: disabled
+ * - ``esw_port_metadata``
+ - Boolean
+ - runtime
+ - When applicable, disabling eswitch metadata can increase packet rate up
+ to 20% depending on the use case and packet sizes.
+
+ Eswitch port metadata state controls whether to internally tag packets
+ with metadata. Metadata tagging must be enabled for multi-port RoCE,
+ failover between representors and stacked devices. By default metadata is
+ enabled on the supported devices in E-switch. Metadata is applicable only
+ for E-switch in switchdev mode and users may disable it when NONE of the
+ below use cases will be in use:
+ 1. HCA is in Dual/multi-port RoCE mode.
+ 2. VF/SF representor bonding (Usually used for Live migration)
+ 3. Stacked devices
+
+ When metadata is disabled, the above use cases will fail to initialize if
+ users try to enable them.
+ * - ``hairpin_num_queues``
+ - u32
+ - driverinit
+ - We refer to a TC NIC rule that involves forwarding as "hairpin".
+ Hairpin queues are mlx5 hardware specific implementation for hardware
+ forwarding of such packets.
+
+ Control the number of hairpin queues.
+ * - ``hairpin_queue_size``
+ - u32
+ - driverinit
+ - Control the size (in packets) of the hairpin queues.
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
@@ -63,3 +129,160 @@ The ``mlx5`` driver reports the following versions
* - ``fw.version``
- stored, running
- Three digit major.minor.subminor firmware version number.
+
+Health reporters
+================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering of the following three error scenarios:
+
+- tx timeout
+ Report on kernel tx timeout detection.
+ Recover by searching lost interrupts.
+- tx error completion
+ Report on error tx completion.
+ Recover by flushing the tx queue and reset it.
+- tx PTP port timestamping CQ unhealthy
+ Report too many CQEs never delivered on port ts CQ.
+ Recover by flushing and re-creating all PTP channels.
+
+tx reporter also support on demand diagnose callback, on which it provides
+real time information of its send queues status.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+.. note::
+ This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show number of tx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering of the following two error scenarios:
+
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and reset it.
+
+rx reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues' status.
+
+- Diagnose rx queues' status and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+.. note::
+ This command has valid output only when interface is up. Otherwise, the command has empty output.
+
+- Show number of rx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled, and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements `diagnose` and `dump` callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw heath status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+.. note::
+ This command can run only on the PF which has fw tracer ownership,
+ running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements `dump` and `recover` callbacks.
+It follows fatal errors indications by CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+.. note::
+ This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from fw and displaying
+them in realtime.
+
+Description of the vnic counters:
+
+- total_q_under_processor_handle
+ number of queues in an error state due to
+ an async error or errored command.
+- send_queue_priority_update_flow
+ number of QP/SQ priority/SL update events.
+- cq_overrun
+ number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+ number of times an EQ mapped to async events was overrun.
+ comp_eq_overrun number of times an EQ mapped to completion events was
+ overrun.
+- quota_exceeded_command
+ number of commands issued and failed due to quota exceeded.
+- invalid_command
+ number of commands issued and failed dues to any reason other than quota
+ exceeded.
+- nic_receive_steering_discard
+ number of packets that completed RX flow
+ steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+ number of packets generated by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow).
+- handled_pkt_steering_fail
+ number of packets handled by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow owned by the VNIC, including the FDB
+ for the eswitch owner).
+
+User commands examples:
+
+- Diagnose PF/VF vnic counters::
+
+ $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+ representor, which can be obtained via devlink port command)::
+
+ $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+.. note::
+ This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
index cf857cb4ba8f..433962225bd4 100644
--- a/Documentation/networking/devlink/mlxsw.rst
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -58,6 +58,30 @@ The ``mlxsw`` driver reports the following versions
- running
- Three digit firmware version
+Line card auxiliary device info versions
+========================================
+
+The ``mlxsw`` driver reports the following versions for line card auxiliary device
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this line card
+ * - ``ini.version``
+ - running
+ - Version of line card INI loaded
+ * - ``fw.psid``
+ - fixed
+ - Line card device PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version of line card device
+
Driver-specific Traps
=====================
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 2a266b7e7b38..88482725422c 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -46,7 +46,7 @@ Resources
=========
The ``netdevsim`` driver exposes resources to control the number of FIB
-entries and FIB rule entries that the driver will allow.
+entries, FIB rule entries and nexthops that the driver will allow.
.. code:: shell
@@ -54,8 +54,35 @@ entries and FIB rule entries that the driver will allow.
$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
$ devlink dev reload netdevsim/netdevsim0
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registerging/unregistering leaf rate objects per VF devlink port;
+- creation/deletion node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode.
+For example created rate node with name ``some_group``:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+ rate_parent tx_max tx_share
+
+Same parameters are exposed for leaf objects in corresponding ports directories.
+For ex.:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+ dev ethtool rate_parent tx_max tx_share
+
Driver-specific Traps
=====================
@@ -68,5 +95,5 @@ Driver-specific Traps
* - ``fid_miss``
- ``exception``
- When a packet enters the device it is classified to a filtering
- indentifier (FID) based on the ingress port and VLAN. This trap is used
+ identifier (FID) based on the ingress port and VLAN. This trap is used
to trap packets for which a FID could not be found
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
new file mode 100644
index 000000000000..610de99b728a
--- /dev/null
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+octeontx2 devlink support
+=========================
+
+This document describes the devlink features implemented by the ``octeontx2 AF, PF and VF``
+device drivers.
+
+Parameters
+==========
+
+The ``octeontx2 PF and VF`` drivers implement the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``mcam_count``
+ - u16
+ - runtime
+ - Select number of match CAM entries to be allocated for an interface.
+ The same is used for ntuple filters of the interface. Supported by
+ PF and VF drivers.
+
+The ``octeontx2 AF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``dwrr_mtu``
+ - u32
+ - runtime
+ - Use to set the quantum which hardware uses for scheduling among transmit queues.
+ Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 000000000000..96b1124e614b
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``arp_bc``
+ - ``trap``
+ - Traps ARP broadcast packets (both requests/responses)
+ * - ``is_is``
+ - ``trap``
+ - Traps IS-IS packets
+ * - ``ospf``
+ - ``trap``
+ - Traps OSPF packets
+ * - ``ip_bc_mac``
+ - ``trap``
+ - Traps IPv4 packets with broadcast DA Mac address
+ * - ``stp``
+ - ``trap``
+ - Traps STP BPDU
+ * - ``lacp``
+ - ``trap``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``trap``
+ - Traps LLDP packets
+ * - ``router_mc``
+ - ``trap``
+ - Traps multicast packets
+ * - ``vrrp``
+ - ``trap``
+ - Traps VRRP packets
+ * - ``dhcp``
+ - ``trap``
+ - Traps DHCP packets
+ * - ``mtu_error``
+ - ``trap``
+ - Traps (exception) packets that exceeded port's MTU
+ * - ``mac_to_me``
+ - ``trap``
+ - Traps packets with switch-port's DA Mac address
+ * - ``ttl_error``
+ - ``trap``
+ - Traps (exception) IPv4 packets whose TTL exceeded
+ * - ``ipv4_options``
+ - ``trap``
+ - Traps (exception) packets due to the malformed IPV4 header options
+ * - ``ip_default_route``
+ - ``trap``
+ - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+ * - ``local_route``
+ - ``trap``
+ - Traps packets that have been send to one of switch IP interfaces addresses
+ * - ``ipv4_icmp_redirect``
+ - ``trap``
+ - Traps (exception) IPV4 ICMP redirect packets
+ * - ``arp_response``
+ - ``trap``
+ - Traps ARP replies packets that have switch-port's DA Mac address
+ * - ``acl_code_0``
+ - ``trap``
+ - Traps packets that have ACL priority set to 0 (tc pref 0)
+ * - ``acl_code_1``
+ - ``trap``
+ - Traps packets that have ACL priority set to 1 (tc pref 1)
+ * - ``acl_code_2``
+ - ``trap``
+ - Traps packets that have ACL priority set to 2 (tc pref 2)
+ * - ``acl_code_3``
+ - ``trap``
+ - Traps packets that have ACL priority set to 3 (tc pref 3)
+ * - ``acl_code_4``
+ - ``trap``
+ - Traps packets that have ACL priority set to 4 (tc pref 4)
+ * - ``acl_code_5``
+ - ``trap``
+ - Traps packets that have ACL priority set to 5 (tc pref 5)
+ * - ``acl_code_6``
+ - ``trap``
+ - Traps packets that have ACL priority set to 6 (tc pref 6)
+ * - ``acl_code_7``
+ - ``trap``
+ - Traps packets that have ACL priority set to 7 (tc pref 7)
+ * - ``ipv4_bgp``
+ - ``trap``
+ - Traps IPv4 BGP packets
+ * - ``ssh``
+ - ``trap``
+ - Traps SSH packets
+ * - ``telnet``
+ - ``trap``
+ - Traps Telnet packets
+ * - ``icmp``
+ - ``trap``
+ - Traps ICMP packets
+ * - ``rxdma_drop``
+ - ``drop``
+ - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+ * - ``port_no_vlan``
+ - ``drop``
+ - Drops packets due to faulty-configured network or due to internal bug (config issue).
+ * - ``local_port``
+ - ``drop``
+ - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+ * - ``invalid_sa``
+ - ``drop``
+ - Drops packets with multicast source MAC address.
+ * - ``illegal_ip_addr``
+ - ``drop``
+ - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+ * - ``illegal_ipv4_hdr``
+ - ``drop``
+ - Drops packets with illegal IPV4 header.
+ * - ``ip_uc_dip_da_mismatch``
+ - ``drop``
+ - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+ * - ``ip_sip_is_zero``
+ - ``drop``
+ - Drops packets with zero (0) IPV4 source address.
+ * - ``met_red``
+ - ``drop``
+ - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/devlink/sfc.rst b/Documentation/networking/devlink/sfc.rst
new file mode 100644
index 000000000000..db64a1bd9733
--- /dev/null
+++ b/Documentation/networking/devlink/sfc.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+sfc devlink support
+===================
+
+This document describes the devlink features implemented by the ``sfc``
+device driver for the ef100 device.
+
+Info versions
+=============
+
+The ``sfc`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw.mgmt.suc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the SUC control unit's firmware version.
+ * - ``fw.mgmt.cmc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the CMC control unit's firmware version.
+ * - ``fpga.rev``
+ - running
+ - FPGA design revision.
+ * - ``fpga.app``
+ - running
+ - Datapath programmable logic version.
+ * - ``fw.app``
+ - running
+ - Datapath software/microcode/firmware version.
+ * - ``coproc.boot``
+ - running
+ - SmartNIC application co-processor (APU) first stage boot loader version.
+ * - ``coproc.uboot``
+ - running
+ - SmartNIC application co-processor (APU) co-operating system loader version.
+ * - ``coproc.main``
+ - running
+ - SmartNIC application co-processor (APU) main operating system version.
+ * - ``coproc.recovery``
+ - running
+ - SmartNIC application co-processor (APU) recovery operating system version.
+ * - ``fw.exprom``
+ - running
+ - Expansion ROM version. For boards where the expansion ROM is split between
+ multiple images (e.g. PXE and UEFI), this is the specifically the PXE boot
+ ROM version.
+ * - ``fw.uefi``
+ - running
+ - UEFI driver version (No UNDI support).