mirror of
https://github.com/edk2-porting/linux-next.git
synced 2025-01-02 02:34:05 +08:00
devlink: introduce devlink-dpipe.rst documentation file
Primarily based on the DPIPE netdev conference paper, introduce a new file to document the dpipe interface. This likely needs further improvement, but is at least a good overall start. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
parent
a1af8e9a80
commit
b7ec1eca71
252
Documentation/networking/devlink/devlink-dpipe.rst
Normal file
252
Documentation/networking/devlink/devlink-dpipe.rst
Normal file
@ -0,0 +1,252 @@
|
|||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=============
|
||||||
|
Devlink DPIPE
|
||||||
|
=============
|
||||||
|
|
||||||
|
Background
|
||||||
|
==========
|
||||||
|
|
||||||
|
While performing the hardware offloading process, much of the hardware
|
||||||
|
specifics cannot be presented. These details are useful for debugging, and
|
||||||
|
``devlink-dpipe`` provides a standardized way to provide visibility into the
|
||||||
|
offloading process.
|
||||||
|
|
||||||
|
For example, the routing longest prefix match (LPM) algorithm used by the
|
||||||
|
Linux kernel may differ from the hardware implementation. The pipeline debug
|
||||||
|
API (DPIPE) is aimed at providing the user visibility into the ASIC's
|
||||||
|
pipeline in a generic way.
|
||||||
|
|
||||||
|
The hardware offload process is expected to be done in a way that the user
|
||||||
|
should not be able to distinguish between the hardware vs. software
|
||||||
|
implementation. In this process, hardware specifics are neglected. In
|
||||||
|
reality those details can have lots of meaning and should be exposed in some
|
||||||
|
standard way.
|
||||||
|
|
||||||
|
This problem is made even more complex when one wishes to offload the
|
||||||
|
control path of the whole networking stack to a switch ASIC. Due to
|
||||||
|
differences in the hardware and software models some processes cannot be
|
||||||
|
represented correctly.
|
||||||
|
|
||||||
|
One example is the kernel's LPM algorithm which in many cases differs
|
||||||
|
greatly to the hardware implementation. The configuration API is the same,
|
||||||
|
but one cannot rely on the Forward Information Base (FIB) to look like the
|
||||||
|
Level Path Compression trie (LPC-trie) in hardware.
|
||||||
|
|
||||||
|
In many situations trying to analyze systems failure solely based on the
|
||||||
|
kernel's dump may not be enough. By combining this data with complementary
|
||||||
|
information about the underlying hardware, this debugging can be made
|
||||||
|
easier; additionally, the information can be useful when debugging
|
||||||
|
performance issues.
|
||||||
|
|
||||||
|
Overview
|
||||||
|
========
|
||||||
|
|
||||||
|
The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
|
||||||
|
modeled as a graph of match/action tables. Each table represents a specific
|
||||||
|
hardware block. This model is not new, first being used by the P4 language.
|
||||||
|
|
||||||
|
Traditionally it has been used as an alternative model for hardware
|
||||||
|
configuration, but the ``devlink-dpipe`` interface uses it for visibility
|
||||||
|
purposes as a standard complementary tool. The system's view from
|
||||||
|
``devlink-dpipe`` should change according to the changes done by the
|
||||||
|
standard configuration tools.
|
||||||
|
|
||||||
|
For example, it’s quiet common to implement Access Control Lists (ACL)
|
||||||
|
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
|
||||||
|
divided into TCAM regions. Complex TC filters can have multiple rules with
|
||||||
|
different priorities and different lookup keys. On the other hand hardware
|
||||||
|
TCAM regions have a predefined lookup key. Offloading the TC filter rules
|
||||||
|
using TCAM engine can result in multiple TCAM regions being interconnected
|
||||||
|
in a chain (which may affect the data path latency). In response to a new TC
|
||||||
|
filter new tables should be created describing those regions.
|
||||||
|
|
||||||
|
Model
|
||||||
|
=====
|
||||||
|
|
||||||
|
The ``DPIPE`` model introduces several objects:
|
||||||
|
|
||||||
|
* headers
|
||||||
|
* tables
|
||||||
|
* entries
|
||||||
|
|
||||||
|
A ``header`` describes packet formats and provides names for fields within
|
||||||
|
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
|
||||||
|
the actual content of a specific table.
|
||||||
|
|
||||||
|
The hardware pipeline is not port specific, but rather describes the whole
|
||||||
|
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.
|
||||||
|
|
||||||
|
Drivers can register and unregister tables at run time, in order to support
|
||||||
|
dynamic behavior. This dynamic behavior is mandatory for describing hardware
|
||||||
|
blocks like TCAM regions which can be allocated and freed dynamically.
|
||||||
|
|
||||||
|
``devlink-dpipe`` generally is not intended for configuration. The exception
|
||||||
|
is hardware counting for a specific table.
|
||||||
|
|
||||||
|
The following commands are used to obtain the ``dpipe`` objects from
|
||||||
|
userspace:
|
||||||
|
|
||||||
|
* ``table_get``: Receive a table's description.
|
||||||
|
* ``headers_get``: Receive a device's supported headers.
|
||||||
|
* ``entries_get``: Receive a table's current entries.
|
||||||
|
* ``counters_set``: Enable or disable counters on a table.
|
||||||
|
|
||||||
|
Table
|
||||||
|
-----
|
||||||
|
|
||||||
|
The driver should implement the following operations for each table:
|
||||||
|
|
||||||
|
* ``matches_dump``: Dump the supported matches.
|
||||||
|
* ``actions_dump``: Dump the supported actions.
|
||||||
|
* ``entries_dump``: Dump the actual content of the table.
|
||||||
|
* ``counters_set_update``: Synchronize hardware with counters enabled or
|
||||||
|
disabled.
|
||||||
|
|
||||||
|
Header/Field
|
||||||
|
------------
|
||||||
|
|
||||||
|
In a similar way to P4 headers and fields are used to describe a table's
|
||||||
|
behavior. There is a slight difference between the standard protocol headers
|
||||||
|
and specific ASIC metadata. The protocol headers should be declared in the
|
||||||
|
``devlink`` core API. On the other hand ASIC meta data is driver specific
|
||||||
|
and should be defined in the driver. Additionally, each driver-specific
|
||||||
|
devlink documentation file should document the driver-specific ``dpipe``
|
||||||
|
headers it implements. The headers and fields are identified by enumeration.
|
||||||
|
|
||||||
|
In order to provide further visibility some ASIC metadata fields could be
|
||||||
|
mapped to kernel objects. For example, internal router interface indexes can
|
||||||
|
be directly mapped to the net device ifindex. FIB table indexes used by
|
||||||
|
different Virtual Routing and Forwarding (VRF) tables can be mapped to
|
||||||
|
internal routing table indexes.
|
||||||
|
|
||||||
|
Match
|
||||||
|
-----
|
||||||
|
|
||||||
|
Matches are kept primitive and close to hardware operation. Match types like
|
||||||
|
LPM are not supported due to the fact that this is exactly a process we wish
|
||||||
|
to describe in full detail. Example of matches:
|
||||||
|
|
||||||
|
* ``field_exact``: Exact match on a specific field.
|
||||||
|
* ``field_exact_mask``: Exact match on a specific field after masking.
|
||||||
|
* ``field_range``: Match on a specific range.
|
||||||
|
|
||||||
|
The id's of the header and the field should be specified in order to
|
||||||
|
identify the specific field. Furthermore, the header index should be
|
||||||
|
specified in order to distinguish multiple headers of the same type in a
|
||||||
|
packet (tunneling).
|
||||||
|
|
||||||
|
Action
|
||||||
|
------
|
||||||
|
|
||||||
|
Similar to match, the actions are kept primitive and close to hardware
|
||||||
|
operation. For example:
|
||||||
|
|
||||||
|
* ``field_modify``: Modify the field value.
|
||||||
|
* ``field_inc``: Increment the field value.
|
||||||
|
* ``push_header``: Add a header.
|
||||||
|
* ``pop_header``: Remove a header.
|
||||||
|
|
||||||
|
Entry
|
||||||
|
-----
|
||||||
|
|
||||||
|
Entries of a specific table can be dumped on demand. Each eentry is
|
||||||
|
identified with an index and its properties are described by a list of
|
||||||
|
match/action values and specific counter. By dumping the tables content the
|
||||||
|
interactions between tables can be resolved.
|
||||||
|
|
||||||
|
Abstraction Example
|
||||||
|
===================
|
||||||
|
|
||||||
|
The following is an example of the abstraction model of the L3 part of
|
||||||
|
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
|
||||||
|
the pipeline. The table sizes in the following examples are not real
|
||||||
|
hardware sizes and are provided for demonstration purposes.
|
||||||
|
|
||||||
|
LPM
|
||||||
|
---
|
||||||
|
|
||||||
|
The LPM algorithm can be implemented as a list of hash tables. Each hash
|
||||||
|
table contains routes with the same prefix length. The root of the list is
|
||||||
|
/32, and in case of a miss the hardware will continue to the next hash
|
||||||
|
table. The depth of the search will affect the data path latency.
|
||||||
|
|
||||||
|
In case of a hit the entry contains information about the next stage of the
|
||||||
|
pipeline which resolves the MAC address. The next stage can be either local
|
||||||
|
host table for directly connected routes, or adjacency table for next-hops.
|
||||||
|
The ``meta.lpm_prefix`` field is used to connect two LPM tables.
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
table lpm_prefix_16 {
|
||||||
|
size: 4096,
|
||||||
|
counters_enabled: true,
|
||||||
|
match: { meta.vr_id: exact,
|
||||||
|
ipv4.dst_addr: exact_mask,
|
||||||
|
ipv6.dst_addr: exact_mask,
|
||||||
|
meta.lpm_prefix: exact },
|
||||||
|
action: { meta.adj_index: set,
|
||||||
|
meta.adj_group_size: set,
|
||||||
|
meta.rif_port: set,
|
||||||
|
meta.lpm_prefix: set },
|
||||||
|
}
|
||||||
|
|
||||||
|
Local Host
|
||||||
|
----------
|
||||||
|
|
||||||
|
In the case of local routes the LPM lookup already resolves the egress
|
||||||
|
router interface (RIF), yet the exact MAC address is not known. The local
|
||||||
|
host table is a hash table combining the output interface id with
|
||||||
|
destination IP address as a key. The result is the MAC address.
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
table local_host {
|
||||||
|
size: 4096,
|
||||||
|
counters_enabled: true,
|
||||||
|
match: { meta.rif_port: exact,
|
||||||
|
ipv4.dst_addr: exact},
|
||||||
|
action: { ethernet.daddr: set }
|
||||||
|
}
|
||||||
|
|
||||||
|
Adjacency
|
||||||
|
---------
|
||||||
|
|
||||||
|
In case of remote routes this table does the ECMP. The LPM lookup results in
|
||||||
|
ECMP group size and index that serves as a global offset into this table.
|
||||||
|
Concurrently a hash of the packet is generated. Based on the ECMP group size
|
||||||
|
and the packet's hash a local offset is generated. Multiple LPM entries can
|
||||||
|
point to the same adjacency group.
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
table adjacency {
|
||||||
|
size: 4096,
|
||||||
|
counters_enabled: true,
|
||||||
|
match: { meta.adj_index: exact,
|
||||||
|
meta.adj_group_size: exact,
|
||||||
|
meta.packet_hash_index: exact },
|
||||||
|
action: { ethernet.daddr: set,
|
||||||
|
meta.erif: set }
|
||||||
|
}
|
||||||
|
|
||||||
|
ERIF
|
||||||
|
----
|
||||||
|
|
||||||
|
In case the egress RIF and destination MAC have been resolved by previous
|
||||||
|
tables this table does multiple operations like TTL decrease and MTU check.
|
||||||
|
Then the decision of forward/drop is taken and the port L3 statistics are
|
||||||
|
updated based on the packet's type (broadcast, unicast, multicast).
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
table erif {
|
||||||
|
size: 800,
|
||||||
|
counters_enabled: true,
|
||||||
|
match: { meta.rif_port: exact,
|
||||||
|
meta.is_l3_unicast: exact,
|
||||||
|
meta.is_l3_broadcast: exact,
|
||||||
|
meta.is_l3_multicast, exact },
|
||||||
|
action: { meta.l3_drop: set,
|
||||||
|
meta.l3_forward: set }
|
||||||
|
}
|
@ -13,6 +13,7 @@ general.
|
|||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 1
|
:maxdepth: 1
|
||||||
|
|
||||||
|
devlink-dpipe
|
||||||
devlink-health
|
devlink-health
|
||||||
devlink-info
|
devlink-info
|
||||||
devlink-params
|
devlink-params
|
||||||
|
Loading…
Reference in New Issue
Block a user