2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-19 11:04:00 +08:00

Networking changes for 6.2.

Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
This commit is contained in:
Linus Torvalds 2022-12-13 15:47:48 -08:00
commit 7e68dd7d07
2013 changed files with 166136 additions and 34555 deletions

View File

@ -298,3 +298,48 @@ A: NO.
The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.
Q: What is the compatibility story for special BPF types in map values?
-----------------------------------------------------------------------
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows to use helpers for
such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the
kernel preserve backwards compatibility for these features?
A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
NO, but see below.
For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI.
For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr and __kptr_ref tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr
fields and bpf_kptr_xchg() helper will continue to be supported across kernel
releases for the supported types.
For any other supported struct type, unless explicitly stated in this document
and added to bpf.h UAPI header, such types can and will arbitrarily change their
size, type, and alignment, or any other user visible API or ABI detail across
kernel releases. The users must adapt their BPF programs to the new changes and
update them to make sure their programs continue to work correctly.
NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
order to introduce more special fields in the future. Hence, user programs must
avoid defining types with 'bpf\_' prefix to not be broken in future releases.
In other words, no backwards compatibility is guaranteed if one using a type
in BTF with 'bpf\_' prefix.
Q: What is the compatibility story for special BPF types in allocated objects?
------------------------------------------------------------------------------
Q: Same as above, but for allocated objects (i.e. objects allocated using
bpf_obj_new for user defined types). Will the kernel preserve backwards
compatibility for these features?
A: NO.
Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).

View File

@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**
Submitting patches
==================
Q: How do I run BPF CI on my changes before sending them out for review?
------------------------------------------------------------------------
A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
While GitHub also provides a CLI that can be used to accomplish the same
results, here we focus on the UI based workflow.
The following steps lay out how to start a CI run for your patches:
- Create a fork of the aforementioned repository in your own account (one time
action)
- Clone the fork locally, check out a new branch tracking either the bpf-next
or bpf branch, and apply your to-be-tested patches on top of it
- Push the local branch to your fork and create a pull request against
kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively
Shortly after the pull request has been created, the CI workflow will run. Note
that capacity is shared with patches submitted upstream being checked and so
depending on utilization the run can take a while to finish.
Note furthermore that both base branches (bpf-next_base and bpf_base) will be
updated as patches are pushed to the respective upstream branches they track. As
such, your patch set will automatically (be attempted to) be rebased as well.
This behavior can result in a CI run being aborted and restarted with the new
base line.
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the bpf kernel mailing list:

View File

@ -0,0 +1,485 @@
=============
BPF Iterators
=============
----------
Motivation
----------
There are a few existing ways to dump kernel data into user space. The most
popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
sockets in the system. However, their output format tends to be fixed, and if
users want more information about these sockets, they have to patch the kernel,
which often takes time to publish upstream and release. The same is true for popular
tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
additional information needs a kernel patch.
To solve this problem, the `drgn
<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
dig out the kernel data with no kernel change. However, the main drawback for
drgn is performance, as it cannot do pointer tracing inside the kernel. In
addition, drgn cannot validate a pointer value and may read invalid data if the
pointer becomes invalid inside the kernel.
The BPF iterator solves the above problem by providing flexibility on what data
(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
data object.
----------------------
How BPF Iterators Work
----------------------
A BPF iterator is a type of BPF program that allows users to iterate over
specific types of kernel objects. Unlike traditional BPF tracing programs that
allow users to define callbacks that are invoked at particular points of
execution in the kernel, BPF iterators allow users to define callbacks that
should be executed for every entry in a variety of kernel data structures.
For example, users can define a BPF iterator that iterates over every task on
the system and dumps the total amount of CPU runtime currently used by each of
them. Another BPF task iterator may instead dump the cgroup information for each
task. Such flexibility is the core value of BPF iterators.
A BPF program is always loaded into the kernel at the behest of a user space
process. A user space process loads a BPF program by opening and initializing
the program skeleton as required and then invoking a syscall to have the BPF
program verified and loaded by the kernel.
In traditional tracing programs, a program is activated by having user space
obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
activated, the program callback will be invoked whenever the tracepoint is
triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
program is obtained using ``bpf_link_create()``, and the program callback is
invoked by issuing system calls from user space.
Next, let us see how you can use the iterators to iterate on kernel objects and
read data.
------------------------
How to Use BPF iterators
------------------------
BPF selftests are a great resource to illustrate how to use the iterators. In
this section, well walk through a BPF selftest which shows how to load and use
a BPF iterator program. To begin, well look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, well look at a BPF program that runs in kernel space.
Loading a BPF iterator in the kernel from user space typically involves the
following steps:
* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
has verified and loaded the program, it returns a file descriptor (fd) to user
space.
* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
specified with the BPF program file descriptor received from the kernel.
* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
available.
* Close the iterator fd using ``close(bpf_iter_fd)``.
* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
The following are a few examples of selftest BPF iterator programs:
* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
iterator.
::
struct bpf_iter__task_file {
union {
struct bpf_iter_meta *meta;
};
union {
struct task_struct *task;
};
u32 fd;
union {
struct file *file;
};
};
In the above code, the field 'meta' contains the metadata, which is the same for
all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away when the BPF program runs.
Here is a snippet from the ``bpf_iter_task_file.c`` file:
::
SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
struct seq_file *seq = ctx->meta->seq;
struct task_struct *task = ctx->task;
struct file *file = ctx->file;
__u32 fd = ctx->fd;
if (task == NULL || file == NULL)
return 0;
if (ctx->meta->seq_num == 0) {
count = 0;
BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
}
if (tgid == task->tgid && task->tgid != task->pid)
count++;
if (last_tgid != task->tgid) {
last_tgid = task->tgid;
unique_tgid_count++;
}
BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
(long)file->f_op);
return 0;
}
In the above example, the section name ``SEC(iter/task_file)``, indicates that
the program is a BPF iterator program to iterate all files from all tasks. The
context of the program is ``bpf_iter__task_file`` struct.
The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF
program can export data to user space using a variety of BPF helper functions.
You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
``bpf_seq_write()`` function based on whether you need formatted output or just
binary data, respectively. For binary-encoded data, the user space applications
can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
you can use ``cat <path>`` to print the results similar to ``cat
/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
use ``rm -f <path>`` to remove the pinned iterator.
For example, you can use the following command to create a BPF iterator from the
``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
path:
::
$ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
And then print out the results using the following command:
::
$ cat /sys/fs/bpf/my_route
-------------------------------------------------------
Implement Kernel Support for BPF Iterator Program Types
-------------------------------------------------------
To implement a BPF iterator in the kernel, the developer must make a one-time
change to the following key data structure defined in the `bpf.h
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
file.
::
struct bpf_iter_reg {
const char *target;
bpf_iter_attach_target_t attach_target;
bpf_iter_detach_target_t detach_target;
bpf_iter_show_fdinfo_t show_fdinfo;
bpf_iter_fill_link_info_t fill_link_info;
bpf_iter_get_func_proto_t get_func_proto;
u32 ctx_arg_info_size;
u32 feature;
struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
const struct bpf_iter_seq_info *seq_info;
};
After filling the data structure fields, call ``bpf_iter_reg_target()`` to
register the iterator to the main BPF iterator subsystem.
The following is the breakdown for each field in struct ``bpf_iter_reg``.
.. list-table::
:widths: 25 50
:header-rows: 1
* - Fields
- Description
* - target
- Specifies the name of the BPF iterator. For example: ``bpf_map``,
``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
* - attach_target and detach_target
- Allows for target specific ``link_create`` action since some targets
may need special processing. Called during the user space link_create stage.
* - show_fdinfo and fill_link_info
- Called to fill target specific information when user tries to get link
info associated with the iterator.
* - get_func_proto
- Permits a BPF iterator to access BPF helpers specific to the iterator.
* - ctx_arg_info_size and ctx_arg_info
- Specifies the verifier states for BPF program arguments associated with
the bpf iterator.
* - feature
- Specifies certain action requests in the kernel BPF iterator
infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
that the kernel function cond_resched() is called to avoid other kernel
subsystem (e.g., rcu) misbehaving.
* - seq_info
- Specifies certain action requests in the kernel BPF iterator
infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
that the kernel function cond_resched() is called to avoid other kernel
subsystem (e.g., rcu) misbehaving.
`Click here
<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
to see an implementation of the ``task_vma`` BPF iterator in the kernel.
---------------------------------
Parameterizing BPF Task Iterators
---------------------------------
By default, BPF iterators walk through all the objects of the specified types
(processes, cgroups, maps, etc.) across the entire system to read relevant
kernel data. But often, there are cases where we only care about a much smaller
subset of iterable kernel objects, such as only iterating tasks within a
specific process. Therefore, BPF iterator programs support filtering out objects
from iteration by allowing user space to configure the iterator program when it
is attached.
--------------------------
BPF Task Iterator Program
--------------------------
The following code is a BPF iterator program to print files and task information
through the ``seq_file`` of the iterator. It is a standard BPF iterator program
that visits every file of an iterator. We will use this BPF program in our
example later.
::
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
char _license[] SEC("license") = "GPL";
SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
struct seq_file *seq = ctx->meta->seq;
struct task_struct *task = ctx->task;
struct file *file = ctx->file;
__u32 fd = ctx->fd;
if (task == NULL || file == NULL)
return 0;
if (ctx->meta->seq_num == 0) {
BPF_SEQ_PRINTF(seq, " tgid pid fd file\n");
}
BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
(long)file->f_op);
return 0;
}
----------------------------------------
Creating a File Iterator with Parameters
----------------------------------------
Now, let us look at how to create an iterator that includes only files of a
process.
First, fill the ``bpf_iter_attach_opts`` struct as shown below:
::
LIBBPF_OPTS(bpf_iter_attach_opts, opts);
union bpf_iter_link_info linfo;
memset(&linfo, 0, sizeof(linfo));
linfo.task.pid = getpid();
opts.link_info = &linfo;
opts.link_info_len = sizeof(linfo);
``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. In this example,
``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
separate file descriptor table. In most circumstances, all process threads share
a single file descriptor table.
Now, in the userspace program, pass the pointer of struct to the
``bpf_program__attach_iter()``.
::
link = bpf_program__attach_iter(prog, &opts); iter_fd =
bpf_iter_create(bpf_link__fd(link));
If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (in the namespace, actually.) It is the same as passing a NULL as the
second argument to ``bpf_program__attach_iter()``.
The whole program looks like the following code:
::
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "bpf_iter_task_ex.skel.h"
static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
{
struct bpf_link *link;
char buf[16] = {};
int iter_fd = -1, len;
int ret = 0;
link = bpf_program__attach_iter(prog, opts);
if (!link) {
fprintf(stderr, "bpf_program__attach_iter() fails\n");
return -1;
}
iter_fd = bpf_iter_create(bpf_link__fd(link));
if (iter_fd < 0) {
fprintf(stderr, "bpf_iter_create() fails\n");
ret = -1;
goto free_link;
}
/* not check contents, but ensure read() ends without error */
while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
buf[len] = 0;
printf("%s", buf);
}
printf("\n");
free_link:
if (iter_fd >= 0)
close(iter_fd);
bpf_link__destroy(link);
return 0;
}
static void test_task_file(void)
{
LIBBPF_OPTS(bpf_iter_attach_opts, opts);
struct bpf_iter_task_ex *skel;
union bpf_iter_link_info linfo;
skel = bpf_iter_task_ex__open_and_load();
if (skel == NULL)
return;
memset(&linfo, 0, sizeof(linfo));
linfo.task.pid = getpid();
opts.link_info = &linfo;
opts.link_info_len = sizeof(linfo);
printf("PID %d\n", getpid());
do_read_opts(skel->progs.dump_task_file, &opts);
bpf_iter_task_ex__destroy(skel);
}
int main(int argc, const char * const * argv)
{
test_task_file();
return 0;
}
The following lines are the output of the program.
::
PID 1859
tgid pid fd file
1859 1859 0 ffffffff82270aa0
1859 1859 1 ffffffff82270aa0
1859 1859 2 ffffffff82270aa0
1859 1859 3 ffffffff82272980
1859 1859 4 ffffffff8225e120
1859 1859 5 ffffffff82255120
1859 1859 6 ffffffff82254f00
1859 1859 7 ffffffff82254d80
1859 1859 8 ffffffff8225abe0
------------------
Without Parameters
------------------
Let us look at how a BPF iterator without parameters skips files of other
processes in the system. In this case, the BPF program has to check the pid or
the tid of tasks, or it will receive every opened file in the system (in the
current *pid* namespace, actually). So, we usually add a global variable in the
BPF program to pass a *pid* to the BPF program.
The BPF program would look like the following block.
::
......
int target_pid = 0;
SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
......
if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
return 0;
BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
(long)file->f_op);
return 0;
}
The user space program would look like the following block:
::
......
static void test_task_file(void)
{
......
skel = bpf_iter_task_ex__open_and_load();
if (skel == NULL)
return;
skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
memset(&linfo, 0, sizeof(linfo));
linfo.task.pid = getpid();
opts.link_info = &linfo;
opts.link_info_len = sizeof(linfo);
......
}
``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID to skip opened files of other
processes in the BPF program. When you parametrize a BPF iterator, the iterator
calls the BPF program fewer times which can save significant resources.
---------------------------
Parametrizing VMA Iterators
---------------------------
By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread can not have a separate address space (since Linux 2.6.0-test6).
Here, using *tid* makes no difference from using *pid*.
----------------------------
Parametrizing Task Iterators
----------------------------
A BPF task iterator with *pid* includes all tasks (threads) of a process. The
BPF program receives these tasks one after another. You can specify a BPF task
iterator with *tid* parameter to include only the tasks that match the given
*tid*.

View File

@ -1062,4 +1062,9 @@ format.::
7. Testing
==========
Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests.
The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
provides an extensive set of BTF-related tests.
.. Links
.. _tools/testing/selftests/bpf/prog_tests/btf.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c

View File

@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.
maps
bpf_prog_run
classic_vs_extended.rst
bpf_iterators
bpf_licensing
test_debug
clang-notes
linux-notes
other
redirect
.. only:: subproject and html

View File

@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
``BPF_XOR | BPF_K | BPF_ALU`` means::
src_reg = (u32) src_reg ^ (u32) imm32
dst_reg = (u32) dst_reg ^ (u32) imm32
``BPF_XOR | BPF_K | BPF_ALU64`` means::
src_reg = src_reg ^ imm32
dst_reg = dst_reg ^ imm32
Byte swap instructions

View File

@ -72,6 +72,30 @@ argument as its size. By default, without __sz annotation, the size of the type
of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
pointer.
2.2.2 __k Annotation
--------------------
This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.
An example is given below::
void *bpf_obj_new(u32 local_type_id__k, ...)
{
...
}
Here, bpf_obj_new uses local_type_id argument to find out the size of that type
ID in program's BTF and return a sized pointer to it. Each type ID will have a
distinct size, hence it is crucial to treat each such call as distinct when
values don't match during verifier state pruning checks.
Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, __k
suffix should be used.
.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function
@ -137,22 +161,20 @@ KF_ACQUIRE and KF_RET_NULL flags.
--------------------------
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that the all pointer arguments will always have a guaranteed lifetime,
and pointers to kernel objects are always passed to helpers in their unmodified
form (as obtained from acquire kfuncs).
indicates that the all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer).
It can be used to enforce that a pointer to a refcounted object acquired from a
kfunc or BPF helper is passed as an argument to this kfunc without any
modifications (e.g. pointer arithmetic) such that it is trusted and points to
the original object.
There are two types of pointers to kernel objects which are considered "valid":
Meanwhile, it is also allowed pass pointers to normal memory to such kfuncs,
but those can have a non-zero offset.
1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.
This flag is often used for kfuncs that operate (change some property, perform
some operation) on an object that was obtained using an acquire kfunc. Such
kfuncs need an unchanged pointer to ensure the integrity of the operation being
performed on the expected object.
Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.
2.4.6 KF_SLEEPABLE flag
-----------------------
@ -169,6 +191,15 @@ rebooting or panicking. Due to this additional restrictions apply to these
calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
added later.
2.4.8 KF_RCU flag
-----------------
The KF_RCU flag is used for kfuncs which have a rcu ptr as its argument.
When used together with KF_ACQUIRE, it indicates the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have reference count of 0 and the kfunc must take this
into consideration.
2.5 Registering the kfuncs
--------------------------
@ -191,3 +222,201 @@ type. An example is shown below::
return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
}
late_initcall(init_subsystem);
3. Core kfuncs
==============
The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.
3.1 struct task_struct * kfuncs
-------------------------------
There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_task_acquire bpf_task_release
These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:
.. code-block:: c
/**
* A trivial example tracepoint program that shows how to
* acquire and release a struct task_struct * pointer.
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
{
struct task_struct *acquired;
acquired = bpf_task_acquire(task);
/*
* In a typical program you'd do something like store
* the task in a map, and the map will automatically
* release it later. Here, we release it manually.
*/
bpf_task_release(acquired);
return 0;
}
----
A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_task_from_pid
Here is an example of it being used:
.. code-block:: c
SEC("tp_btf/task_newtask")
int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
{
struct task_struct *lookup;
lookup = bpf_task_from_pid(task->pid);
if (!lookup)
/* A task should always be found, as %task is a tracepoint arg. */
return -ENOENT;
if (lookup->pid != task->pid) {
/* bpf_task_from_pid() looks up the task via its
* globally-unique pid from the init_pid_ns. Thus,
* the pid of the lookup task should always be the
* same as the input task.
*/
bpf_task_release(lookup);
return -EINVAL;
}
/* bpf_task_from_pid() returns an acquired reference,
* so it must be dropped before returning from the
* tracepoint handler.
*/
bpf_task_release(lookup);
return 0;
}
3.2 struct cgroup * kfuncs
--------------------------
``struct cgroup *`` objects also have acquire and release functions:
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_acquire bpf_cgroup_release
These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.
----
You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_kptr_get
Here's an example of how it can be used:
.. code-block:: c
/* struct containing the struct task_struct kptr which is actually stored in the map. */
struct __cgroups_kfunc_map_value {
struct cgroup __kptr_ref * cgroup;
};
/* The map containing struct __cgroups_kfunc_map_value entries. */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, int);
__type(value, struct __cgroups_kfunc_map_value);
__uint(max_entries, 1);
} __cgroups_kfunc_map SEC(".maps");
/* ... */
/**
* A simple example tracepoint program showing how a
* struct cgroup kptr that is stored in a map can
* be acquired using the bpf_cgroup_kptr_get() kfunc.
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
{
struct cgroup *kptr;
struct __cgroups_kfunc_map_value *v;
s32 id = cgrp->self.id;
/* Assume a cgroup kptr was previously stored in the map. */
v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
if (!v)
return -ENOENT;
/* Acquire a reference to the cgroup kptr that's already stored in the map. */
kptr = bpf_cgroup_kptr_get(&v->cgroup);
if (!kptr)
/* If no cgroup was present in the map, it's because
* we're racing with another CPU that removed it with
* bpf_kptr_xchg() between the bpf_map_lookup_elem()
* above, and our call to bpf_cgroup_kptr_get().
* bpf_cgroup_kptr_get() internally safely handles this
* race, and will return NULL if the task is no longer
* present in the map by the time we invoke the kfunc.
*/
return -EBUSY;
/* Free the reference we just took above. Note that the
* original struct cgroup kptr is still in the map. It will
* be freed either at a later time if another context deletes
* it from the map, or automatically by the BPF subsystem if
* it's still present when the map is destroyed.
*/
bpf_cgroup_release(kptr);
return 0;
}
----
Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.
.. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_ancestor
Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:
.. code-block:: c
/**
* Simple tracepoint example that illustrates how a cgroup's
* ancestor can be accessed using bpf_cgroup_ancestor().
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
{
struct cgroup *parent;
/* The parent cgroup resides at the level before the current cgroup's level. */
parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
if (!parent)
return -ENOENT;
bpf_printk("Parent id is %d", parent->self.id);
/* Return the parent cgroup that was acquired above. */
bpf_cgroup_release(parent);
return 0;
}

View File

@ -1,5 +1,7 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
.. _libbpf:
libbpf
======
@ -7,6 +9,7 @@ libbpf
:maxdepth: 1
API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
program_types
libbpf_naming_convention
libbpf_build

View File

@ -0,0 +1,203 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
.. _program_types_and_elf:
Program Types and ELF Sections
==============================
The table below lists the program types, their attach types where relevant and the ELF section
names supported by libbpf for them. The ELF section names follow these rules:
- ``type`` is an exact match, e.g. ``SEC("socket")``
- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")``
with a '``/``' separator between ``type`` and ``extras``.
When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The
format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")``
for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are
described in more detail in the footnotes.
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| Program Type | Attach Type | ELF Section Name | Sleepable |
+===========================================+========================================+==================================+===========+
| ``BPF_PROG_TYPE_CGROUP_DEVICE`` | ``BPF_CGROUP_DEVICE`` | ``cgroup/dev`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SKB`` | | ``cgroup/skb`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_EGRESS`` | ``cgroup_skb/egress`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_INGRESS`` | ``cgroup_skb/ingress`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` | ``BPF_CGROUP_GETSOCKOPT`` | ``cgroup/getsockopt`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_SETSOCKOPT`` | ``cgroup/setsockopt`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR`` | ``BPF_CGROUP_INET4_BIND`` | ``cgroup/bind4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_CONNECT`` | ``cgroup/connect4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_GETPEERNAME`` | ``cgroup/getpeername4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET4_GETSOCKNAME`` | ``cgroup/getsockname4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_BIND`` | ``cgroup/bind6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_CONNECT`` | ``cgroup/connect6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_GETPEERNAME`` | ``cgroup/getpeername6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_GETSOCKNAME`` | ``cgroup/getsockname6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP4_RECVMSG`` | ``cgroup/recvmsg4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP4_SENDMSG`` | ``cgroup/sendmsg4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP6_RECVMSG`` | ``cgroup/recvmsg6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_UDP6_SENDMSG`` | ``cgroup/sendmsg6`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SOCK`` | ``BPF_CGROUP_INET4_POST_BIND`` | ``cgroup/post_bind4`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET6_POST_BIND`` | ``cgroup/post_bind6`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_SOCK_CREATE`` | ``cgroup/sock_create`` | |
+ + +----------------------------------+-----------+
| | | ``cgroup/sock`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_CGROUP_INET_SOCK_RELEASE`` | ``cgroup/sock_release`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_CGROUP_SYSCTL`` | ``BPF_CGROUP_SYSCTL`` | ``cgroup/sysctl`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_EXT`` | | ``freplace+`` [#fentry]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_FLOW_DISSECTOR`` | ``BPF_FLOW_DISSECTOR`` | ``flow_dissector`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_KPROBE`` | | ``kprobe+`` [#kprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``kretprobe+`` [#kprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``ksyscall+`` [#ksyscall]_ | |
+ + +----------------------------------+-----------+
| | | ``kretsyscall+`` [#ksyscall]_ | |
+ + +----------------------------------+-----------+
| | | ``uprobe+`` [#uprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``uprobe.s+`` [#uprobe]_ | Yes |
+ + +----------------------------------+-----------+
| | | ``uretprobe+`` [#uprobe]_ | |
+ + +----------------------------------+-----------+
| | | ``uretprobe.s+`` [#uprobe]_ | Yes |
+ + +----------------------------------+-----------+
| | | ``usdt+`` [#usdt]_ | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_KPROBE_MULTI`` | ``kprobe.multi+`` [#kpmulti]_ | |
+ + +----------------------------------+-----------+
| | | ``kretprobe.multi+`` [#kpmulti]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LIRC_MODE2`` | ``BPF_LIRC_MODE2`` | ``lirc_mode2`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LSM`` | ``BPF_LSM_CGROUP`` | ``lsm_cgroup+`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_LSM_MAC`` | ``lsm+`` [#lsm]_ | |
+ + +----------------------------------+-----------+
| | | ``lsm.s+`` [#lsm]_ | Yes |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_IN`` | | ``lwt_in`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_OUT`` | | ``lwt_out`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` | | ``lwt_seg6local`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_LWT_XMIT`` | | ``lwt_xmit`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_PERF_EVENT`` | | ``perf_event`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` | | ``raw_tp.w+`` [#rawtp]_ | |
+ + +----------------------------------+-----------+
| | | ``raw_tracepoint.w+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_RAW_TRACEPOINT`` | | ``raw_tp+`` [#rawtp]_ | |
+ + +----------------------------------+-----------+
| | | ``raw_tracepoint+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_ACT`` | | ``action`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` | |
+ + +----------------------------------+-----------+
| | | ``tc`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_LOOKUP`` | ``BPF_SK_LOOKUP`` | ``sk_lookup`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_MSG`` | ``BPF_SK_MSG_VERDICT`` | ``sk_msg`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_REUSEPORT`` | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_REUSEPORT_SELECT`` | ``sk_reuseport`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SK_SKB`` | | ``sk_skb`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_SKB_STREAM_PARSER`` | ``sk_skb/stream_parser`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_SK_SKB_STREAM_VERDICT`` | ``sk_skb/stream_verdict`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCKET_FILTER`` | | ``socket`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SOCK_OPS`` | ``BPF_CGROUP_SOCK_OPS`` | ``sockops`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_STRUCT_OPS`` | | ``struct_ops+`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_SYSCALL`` | | ``syscall`` | Yes |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACEPOINT`` | | ``tp+`` [#tp]_ | |
+ + +----------------------------------+-----------+
| | | ``tracepoint+`` [#tp]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_TRACING`` | ``BPF_MODIFY_RETURN`` | ``fmod_ret+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fmod_ret.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_FENTRY`` | ``fentry+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fentry.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_FEXIT`` | ``fexit+`` [#fentry]_ | |
+ + +----------------------------------+-----------+
| | | ``fexit.s+`` [#fentry]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_ITER`` | ``iter+`` [#iter]_ | |
+ + +----------------------------------+-----------+
| | | ``iter.s+`` [#iter]_ | Yes |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | |
+ + +----------------------------------+-----------+
| | | ``xdp/cpumap`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_XDP_DEVMAP`` | ``xdp.frags/devmap`` | |
+ + +----------------------------------+-----------+
| | | ``xdp/devmap`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_XDP`` | ``xdp.frags`` | |
+ + +----------------------------------+-----------+
| | | ``xdp`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
.. rubric:: Footnotes
.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``.
.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid
characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid
non-negative integer.
.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``.
.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``.
.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``.
.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern``
supports ``*`` and ``?`` wildcards. Valid characters for pattern are
``a-zA-Z0-9_.*?``.
.. [#lsm] The ``lsm`` attachment format is ``lsm[.s]/<hook>``.
.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``.
.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``.
.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``.

View File

@ -0,0 +1,262 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
================================================
BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY
================================================
.. note::
- ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19
- ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6
``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array
storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is
of constant size. The size of the array is defined in ``max_entries`` at
creation time. All array elements are pre-allocated and zero initialized when
created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each
CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value
stored can be of any size, however, all array elements are aligned to 8
bytes.
Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by
setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and
starts on the first page. Sufficient page-sized and page-aligned blocks of
memory are allocated to store all array values, starting on the second page,
which in some cases will result in over-allocation of memory. The benefit of
using this is increased performance and ease of use since userspace programs
would not be required to use helper functions to access and mutate data.
Usage
=====
Kernel BPF
----------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper.
This helper returns a pointer into the array element, so to avoid data races
with userspace reading the value, the user must use primitives like
``__sync_fetch_and_add()`` when updating the value in-place.
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Array elements can be updated using the ``bpf_map_update_elem()`` helper.
``bpf_map_update_elem()`` returns 0 on success, or negative error in case of
failure.
Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported.
To clear an array element, you may use ``bpf_map_update_elem()`` to insert a
zero value to that index.
Per CPU Array
-------------
Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs
across different CPUs. To restrict storage to a single CPU, you may use a
``BPF_MAP_TYPE_PERCPU_ARRAY``.
When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and
``bpf_map_lookup_elem()`` helpers automatically access the slot for the current
CPU.
bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the array
value for a specific CPU. Returns value on success , or ``NULL`` if no entry was
found or ``cpu`` is invalid.
Concurrency
-----------
Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock``
to synchronize access.
Userspace
---------
Access from userspace uses libbpf APIs with the same names as above, with
the map identified by its ``fd``.
Examples
========
Please see the ``tools/testing/selftests/bpf`` directory for functional
examples. The code samples below demonstrate API usage.
Kernel BPF
----------
This snippet shows how to declare an array in a BPF program.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, u32);
__type(value, long);
__uint(max_entries, 256);
} my_map SEC(".maps");
This example BPF program shows how to access an array element.
.. code-block:: c
int bpf_prog(struct __sk_buff *skb)
{
struct iphdr ip;
int index;
long *value;
if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0)
return 0;
index = ip.protocol;
value = bpf_map_lookup_elem(&my_map, &index);
if (value)
__sync_fetch_and_add(value, skb->len);
return 0;
}
Userspace
---------
BPF_MAP_TYPE_ARRAY
~~~~~~~~~~~~~~~~~~
This snippet shows how to create an array, using ``bpf_map_create_opts`` to
set flags.
.. code-block:: c
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
int create_array()
{
int fd;
LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
fd = bpf_map_create(BPF_MAP_TYPE_ARRAY,
"example_array", /* name */
sizeof(__u32), /* key size */
sizeof(long), /* value size */
256, /* max entries */
&opts); /* create opts */
return fd;
}
This snippet shows how to initialize the elements of an array.
.. code-block:: c
int initialize_array(int fd)
{
__u32 i;
long value;
int ret;
for (i = 0; i < 256; i++) {
value = i;
ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY);
if (ret < 0)
return ret;
}
return ret;
}
This snippet shows how to retrieve an element value from an array.
.. code-block:: c
int lookup(int fd)
{
__u32 index = 42;
long value;
int ret;
ret = bpf_map_lookup_elem(fd, &index, &value);
if (ret < 0)
return ret;
/* use value here */
assert(value == 42);
return ret;
}
BPF_MAP_TYPE_PERCPU_ARRAY
~~~~~~~~~~~~~~~~~~~~~~~~~
This snippet shows how to initialize the elements of a per CPU array.
.. code-block:: c
int initialize_array(int fd)
{
int ncpus = libbpf_num_possible_cpus();
long values[ncpus];
__u32 i, j;
int ret;
for (i = 0; i < 256 ; i++) {
for (j = 0; j < ncpus; j++)
values[j] = i;
ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY);
if (ret < 0)
return ret;
}
return ret;
}
This snippet shows how to access the per CPU elements of an array value.
.. code-block:: c
int lookup(int fd)
{
int ncpus = libbpf_num_possible_cpus();
__u32 index = 42, j;
long values[ncpus];
int ret;
ret = bpf_map_lookup_elem(fd, &index, &values);
if (ret < 0)
return ret;
for (j = 0; j < ncpus; j++) {
/* Use per CPU value here */
assert(values[j] == 42);
}
return ret;
}
Semantics
=========
As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY``
in userspace, each value is an array with ``ncpus`` elements.
When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` can not be used
for these maps.

View File

@ -0,0 +1,174 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=========================
BPF_MAP_TYPE_BLOOM_FILTER
=========================
.. note::
- ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16
``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom
filters are a space-efficient probabilistic data structure used to
quickly test whether an element exists in a set. In a bloom filter,
false positives are possible whereas false negatives are not.
The bloom filter map does not have keys, only values. When the bloom
filter map is created, it must be created with a ``key_size`` of 0. The
bloom filter map supports two operations:
- push: adding an element to the map
- peek: determining whether an element is present in the map
BPF programs must use ``bpf_map_push_elem`` to add an element to the
bloom filter map and ``bpf_map_peek_elem`` to query the map. These
operations are exposed to userspace applications using the existing
``bpf`` syscall in the following way:
- ``BPF_MAP_UPDATE_ELEM`` -> push
- ``BPF_MAP_LOOKUP_ELEM`` -> peek
The ``max_entries`` size that is specified at map creation time is used
to approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than ``max_entries``, this may lead to a higher
false positive rate.
The number of hashes to use for the bloom filter is configurable using
the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation
time. If no number is specified, the default used will be 5 hash
functions. In general, using more hashes decreases both the false
positive rate and the speed of a lookup.
It is not possible to delete elements from a bloom filter map. A bloom
filter map may be used as an inner map. The user is responsible for
synchronising concurrent updates and lookups to ensure no false negative
lookups occur.
Usage
=====
Kernel BPF
----------
bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
A ``value`` can be added to a bloom filter using the
``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` when adding an entry to the bloom filter. This helper
returns ``0`` on success, or negative error in case of failure.
bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_peek_elem(struct bpf_map *map, void *value)
The ``bpf_map_peek_elem()`` helper is used to determine whether
``value`` is present in the bloom filter map. This helper returns ``0``
if ``value`` is probably present in the map, or ``-ENOENT`` if ``value``
is definitely not present in the map.
Userspace
---------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags)
A userspace program can add a ``value`` to a bloom filter using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on
success, or negative error in case of failure.
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_elem (int fd, const void *key, void *value)
A userspace program can determine the presence of ``value`` in a bloom
filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key``
parameter must be set to ``NULL``. Returns ``0`` if ``value`` is
probably present in the map, or ``-ENOENT`` if ``value`` is definitely
not present in the map.
Examples
========
Kernel BPF
----------
This snippet shows how to declare a bloom filter in a BPF program:
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
__type(value, __u32);
__uint(max_entries, 1000);
__uint(map_extra, 3);
} bloom_filter SEC(".maps");
This snippet shows how to determine presence of a value in a bloom
filter in a BPF program:
.. code-block:: c
void *lookup(__u32 key)
{
if (bpf_map_peek_elem(&bloom_filter, &key) == 0) {
/* Verify not a false positive and fetch an associated
* value using a secondary lookup, e.g. in a hash table
*/
return bpf_map_lookup_elem(&hash_table, &key);
}
return 0;
}
Userspace
---------
This snippet shows how to use libbpf to create a bloom filter map from
userspace:
.. code-block:: c
int create_bloom()
{
LIBBPF_OPTS(bpf_map_create_opts, opts,
.map_extra = 3); /* number of hashes */
return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
"ipv6_bloom", /* name */
0, /* key size, must be zero */
sizeof(ipv6_addr), /* value size */
10000, /* max entries */
&opts); /* create options */
}
This snippet shows how to add an element to a bloom filter from
userspace:
.. code-block:: c
int add_element(struct bpf_map *bloom_map, __u32 value)
{
int bloom_fd = bpf_map__fd(bloom_map);
return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY);
}
References
==========
https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/

View File

@ -0,0 +1,109 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
=========================
BPF_MAP_TYPE_CGRP_STORAGE
=========================
The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fix-sized
storage for cgroups. It is only available with ``CONFIG_CGROUPS``.
The programs are made available by the same Kconfig. The
data for a particular cgroup can be retrieved by looking up the map
with that cgroup.
This document describes the usage and semantics of the
``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
Usage
=====
The map key must be ``sizeof(int)`` representing a cgroup fd.
To access the storage in a program, use ``bpf_cgrp_storage_get``::
void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
``flags`` could be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE`` which indicates that
a new local storage will be created if one does not exist.
The local storage can be removed with ``bpf_cgrp_storage_delete``::
long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
The map is available to all program types.
Examples
========
A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE::
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
struct {
__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, long);
} cgrp_storage SEC(".maps");
SEC("tp_btf/sys_enter")
int BPF_PROG(on_enter, struct pt_regs *regs, long id)
{
struct task_struct *task = bpf_get_current_task_btf();
long *ptr;
ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
BPF_LOCAL_STORAGE_GET_F_CREATE);
if (ptr)
__sync_fetch_and_add(ptr, 1);
return 0;
}
Userspace accessing map declared above::
#include <linux/bpf.h>
#include <linux/libbpf.h>
__u32 map_lookup(struct bpf_map *map, int cgrp_fd)
{
__u32 *value;
value = bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd);
if (value)
return *value;
return 0;
}
Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
============================================================================
The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
illusates the main difference between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc.
(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup
which is attached by a BPF program.
(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
``bpf_get_local_storage()`` always returns non-NULL local storage.
``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
it is possible that ``bpf_cgrp_storage_get()`` may return null local storage.
To avoid such null local storage issue, user space can do
``bpf_map_update_elem()`` to pre-allocate local storage before a BPF program
is attached.
(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a BPF program
while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage during
prog detach time.
So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.

View File

@ -0,0 +1,177 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
===================
BPF_MAP_TYPE_CPUMAP
===================
.. note::
- ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15
.. kernel-doc:: kernel/bpf/cpumap.c
:doc: cpu map
An example use-case for this map type is software based Receive Side Scaling (RSS).
The CPUMAP represents the CPUs in the system indexed as the map-key, and the
map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated
kernel thread bound to the given CPU to represent the remote CPU execution unit.
Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program
on the remote CPU. This allows an XDP program to split its processing across
multiple CPUs. For example, a scenario where the initial CPU (that sees/receives
the packets) needs to do minimal packet processing and the remote CPU (to which
the packet is directed) can afford to spend more cycles processing the frame. The
initial CPU is where the XDP redirect program is executed. The remote CPU
receives raw ``xdp_frame`` objects.
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs.
The lower two bits of ``flags`` are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller.
User space
----------
.. note::
CPUMAP entries can only be updated/looked up/deleted from user space and not
from an eBPF program. Trying to call these functions from a kernel eBPF
program will result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
CPU entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_cpumap_val``.
.. code-block:: c
struct bpf_cpumap_val {
__u32 qsize; /* queue size to remote target CPU */
union {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
};
The flags argument can be one of the following:
- BPF_ANY: Create a new element or update an existing element.
- BPF_NOEXIST: Create a new element only if it did not exist.
- BPF_EXIST: Update an existing element.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value);
CPU entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_delete_elem(int fd, const void *key);
CPU entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
Examples
========
Kernel
------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called
``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_CPUMAP);
__type(key, __u32);
__type(value, struct bpf_cpumap_val);
__uint(max_entries, 12);
} cpu_map SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 12);
} cpus_available SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 1);
} cpus_iterator SEC(".maps");
SEC("xdp")
int xdp_redir_cpu_round_robin(struct xdp_md *ctx)
{
__u32 key = 0;
__u32 cpu_dest = 0;
__u32 *cpu_selected, *cpu_iterator;
__u32 cpu_idx;
cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key);
if (!cpu_iterator)
return XDP_ABORTED;
cpu_idx = *cpu_iterator;
*cpu_iterator += 1;
if (*cpu_iterator == bpf_num_possible_cpus())
*cpu_iterator = 0;
cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
if (!cpu_selected)
return XDP_ABORTED;
cpu_dest = *cpu_selected;
if (cpu_dest >= bpf_num_possible_cpus())
return XDP_ABORTED;
return bpf_redirect_map(&cpu_map, cpu_dest, 0);
}
User space
----------
The following code snippet shows how to dynamically set the max_entries for a
CPUMAP to the max number of cpus available on the system.
.. code-block:: c
int set_max_cpu_entries(struct bpf_map *cpu_map)
{
if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) {
fprintf(stderr, "Failed to set max entries for cpu_map map: %s",
strerror(errno));
return -1;
}
return 0;
}
References
===========
- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap

View File

@ -0,0 +1,238 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=================================================
BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
=================================================
.. note::
- ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
- ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
the index to lookup a reference to a net device. While ``BPF_MAP_TYPE_DEVMAP_HASH``
is backed by a hash table that uses a key to lookup a reference to a net device.
The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
pairs to update the maps with new net devices.
.. note::
- The key to a hash map doesn't have to be an ``ifindex``.
- While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices
it comes at the cost of a hash of the key when performing a look up.
The setup and packet enqueue/send code is shared between the two types of
devmap; only the lookup and insertion is different.
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
references to net devices (for forwarding packets through other ports).
The lower two bits of *flags* are used as the return code if the map lookup
fails. This is so that the return value can be one of the XDP program return
codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
below.
With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
in the map, with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
from the broadcast.
.. note::
- The key is ignored if BPF_F_BROADCAST is set.
- The broadcast feature can also be used to implement multicast forwarding:
simply create multiple DEVMAPs, each one corresponding to a single multicast group.
This helper will return ``XDP_REDIRECT`` on success, or the value of the two
lower bits of the ``flags`` argument if the map lookup fails.
More information about redirection can be found :doc:`redirect`
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
User space
----------
.. note::
DEVMAP entries can only be updated/deleted from user space and not
from an eBPF program. Trying to call these functions from a kernel eBPF
program will result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
Net device entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically. The ``value`` parameter
can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
compatibility.
.. code-block:: c
struct bpf_devmap_val {
__u32 ifindex; /* device index */
union {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
};
The ``flags`` argument can be one of the following:
- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.
DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
access to both Rx device and Tx device. The program associated with the ``fd``
must have type XDP with expected attach type ``xdp_devmap``.
When a program is associated with a device index, the program is run on an
``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
of how to attach/use xdp_devmap progs can be found in the kernel selftests:
- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
.. c:function::
int bpf_map_lookup_elem(int fd, const void *key, void *value);
Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
helper.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
.. c:function::
int bpf_map_delete_elem(int fd, const void *key);
Net device entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
Examples
========
Kernel BPF
----------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
called tx_port.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_DEVMAP);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 256);
} tx_port SEC(".maps");
The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
called forward_map.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
__type(key, __u32);
__type(value, struct bpf_devmap_val);
__uint(max_entries, 32);
} forward_map SEC(".maps");
.. note::
The value type in the DEVMAP above is a ``struct bpf_devmap_val``
The following code snippet shows a simple xdp_redirect_map program. This program
would work with a user space program that populates the devmap ``forward_map`` based
on ingress ifindexes. The BPF program (below) is redirecting packets using the
ingress ``ifindex`` as the ``key``.
.. code-block:: c
SEC("xdp")
int xdp_redirect_map_func(struct xdp_md *ctx)
{
int index = ctx->ingress_ifindex;
return bpf_redirect_map(&forward_map, index, 0);
}
The following code snippet shows a BPF program that is broadcasting packets to
all the interfaces in the ``tx_port`` devmap.
.. code-block:: c
SEC("xdp")
int xdp_redirect_map_func(struct xdp_md *ctx)
{
return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
}
User space
----------
The following code snippet shows how to update a devmap called ``tx_port``.
.. code-block:: c
int update_devmap(int ifindex, int redirect_ifindex)
{
int ret;
ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
if (ret < 0) {
fprintf(stderr, "Failed to update devmap_ value: %s\n",
strerror(errno));
}
return ret;
}
The following code snippet shows how to update a hash_devmap called ``forward_map``.
.. code-block:: c
int update_devmap(int ifindex, int redirect_ifindex)
{
struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
int ret;
ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
if (ret < 0) {
fprintf(stderr, "Failed to update devmap_ value: %s\n",
strerror(errno));
}
return ret;
}
References
===========
- https://lwn.net/Articles/728146/
- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106

View File

@ -34,7 +34,14 @@ the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
Usage
=====
.. c:function::
Kernel BPF
----------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Hash entries can be added or updated using the ``bpf_map_update_elem()``
@ -49,14 +56,22 @@ parameter can be used to control the update behaviour:
``bpf_map_update_elem()`` returns 0 on success, or negative error in
case of failure.
.. c:function::
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
helper. This helper returns a pointer to the value associated with
``key``, or ``NULL`` if no entry was found.
.. c:function::
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Hash entries can be deleted using the ``bpf_map_delete_elem()``
@ -70,7 +85,11 @@ For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
automatically access the hash slot for the current CPU.
.. c:function::
bpf_map_lookup_percpu_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the
@ -89,7 +108,11 @@ See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
Userspace
---------
.. c:function::
bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
In userspace, it is possible to iterate through the keys of a hash using

View File

@ -0,0 +1,197 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=====================
BPF_MAP_TYPE_LPM_TRIE
=====================
.. note::
- ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11
``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
can be used to match IP addresses to a stored set of prefixes.
Internally, data is stored in an unbalanced trie of nodes that uses
``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
network byte order, i.e. big endian, so ``data[0]`` stores the most
significant byte.
LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
``max_prefixlen/8`` bytes.
- For IPv4 addresses the data length is 4 bytes
- For IPv6 addresses the data length is 16 bytes
The value type stored in the LPM trie can be any user defined type.
.. note::
When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
``BPF_F_NO_PREALLOC`` flag.
Usage
=====
Kernel BPF
----------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
The longest prefix entry for a given data value can be found using the
``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
value associated with the longest matching ``key``, or ``NULL`` if no
entry was found.
The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
performing longest prefix lookups. For example, when searching for the
longest prefix match for an IPv4 address, ``prefixlen`` should be set to
``32``.
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Prefix entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically.
``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
case of failure.
.. note::
The flags parameter must be one of BPF_ANY, BPF_NOEXIST or BPF_EXIST,
but the value is ignored, giving BPF_ANY semantics.
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Prefix entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case
of failure.
Userspace
---------
Access from userspace uses libbpf APIs with the same names as above, with
the map identified by ``fd``.
bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_get_next_key (int fd, const void *cur_key, void *next_key)
A userspace program can iterate through the entries in an LPM trie using
libbpf's ``bpf_map_get_next_key()`` function. The first key can be
fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
``NULL``. Subsequent calls will fetch the next key that follows the
current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
error in case of failure.
``bpf_map_get_next_key()`` will iterate through the LPM trie elements
from leftmost leaf first. This means that iteration will return more
specific keys before less specific ones.
Examples
========
Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
of LPM trie usage from userspace. The code snippets below demonstrate
API usage.
Kernel BPF
----------
The following BPF code snippet shows how to declare a new LPM trie for IPv4
address prefixes:
.. code-block:: c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
struct ipv4_lpm_key {
__u32 prefixlen;
__u32 data;
};
struct {
__uint(type, BPF_MAP_TYPE_LPM_TRIE);
__type(key, struct ipv4_lpm_key);
__type(value, __u32);
__uint(map_flags, BPF_F_NO_PREALLOC);
__uint(max_entries, 255);
} ipv4_lpm_map SEC(".maps");
The following BPF code snippet shows how to lookup by IPv4 address:
.. code-block:: c
void *lookup(__u32 ipaddr)
{
struct ipv4_lpm_key key = {
.prefixlen = 32,
.data = ipaddr
};
return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
}
Userspace
---------
The following snippet shows how to insert an IPv4 prefix entry into an
LPM trie:
.. code-block:: c
int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
{
struct ipv4_lpm_key ipv4_key = {
.prefixlen = prefixlen,
.data = addr
};
return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
}
The following snippet shows a userspace program walking through the entries
of an LPM trie:
.. code-block:: c
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
void iterate_lpm_trie(int map_fd)
{
struct ipv4_lpm_key *cur_key = NULL;
struct ipv4_lpm_key next_key;
struct value value;
int err;
for (;;) {
err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
if (err)
break;
bpf_map_lookup_elem(map_fd, &next_key, &value);
/* Use key and value here */
cur_key = &next_key;
}
}

View File

@ -0,0 +1,130 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
========================================================
BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
========================================================
.. note::
- ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were
introduced in kernel version 4.12
``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general
purpose support for map in map storage. One level of nesting is supported, where
an outer map contains instances of a single type of inner map, for example
``array_of_maps->sock_map``.
When creating an outer map, an inner map instance is used to initialize the
metadata that the outer map holds about its inner maps. This inner map has a
separate lifetime from the outer map and can be deleted after the outer map has
been created.
The outer map supports element lookup, update and delete from user space using
the syscall API. A BPF program is only allowed to do element lookup in the outer
map.
.. note::
- Multi-level nesting is not supported.
- Any BPF map type can be used as an inner map, except for
``BPF_MAP_TYPE_PROG_ARRAY``.
- A BPF program cannot update or delete outer map entries.
For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index
into the array. The array is a fixed size with ``max_entries`` elements that are
zero initialized when created.
For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the
map. The kernel is responsible for allocating and freeing key/value pairs, up to
the max_entries limit that you specify. Hash maps use pre-allocation of hash
table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable
pre-allocation when it is too memory expensive.
Usage
=====
Kernel BPF Helper
-----------------
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This
helper returns a pointer to the inner map, or ``NULL`` if no entry was found.
Examples
========
Kernel BPF Example
------------------
This snippet shows how to create and initialise an array of devmaps in a BPF
program. Note that the outer array can only be modified from user space using
the syscall API.
.. code-block:: c
struct inner_map {
__uint(type, BPF_MAP_TYPE_DEVMAP);
__uint(max_entries, 10);
__type(key, __u32);
__type(value, __u32);
} inner_map1 SEC(".maps"), inner_map2 SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
__uint(max_entries, 2);
__type(key, __u32);
__array(values, struct inner_map);
} outer_map SEC(".maps") = {
.values = { &inner_map1,
&inner_map2 }
};
See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more
examples of declarative initialisation of outer maps.
User Space
----------
This snippet shows how to create an array based outer map:
.. code-block:: c
int create_outer_array(int inner_fd) {
LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
int fd;
fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS,
"example_array", /* name */
sizeof(__u32), /* key size */
sizeof(__u32), /* value size */
256, /* max entries */
&opts); /* create opts */
return fd;
}
This snippet shows how to add an inner map to an outer map:
.. code-block:: c
int add_devmap(int outer_fd, int index, const char *name) {
int fd;
fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name,
sizeof(__u32), sizeof(__u32), 256, NULL);
if (fd < 0)
return fd;
return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY);
}
References
==========
- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/
- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/

View File

@ -0,0 +1,146 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=========================================
BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
=========================================
.. note::
- ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced
in kernel version 4.20
``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK``
provides LIFO storage for BPF programs. These maps support peek, pop and
push operations that are exposed to BPF programs through the respective
helpers. These operations are exposed to userspace applications using
the existing ``bpf`` syscall in the following way:
- ``BPF_MAP_LOOKUP_ELEM`` -> peek
- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop
- ``BPF_MAP_UPDATE_ELEM`` -> push
``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support
``BPF_F_NO_PREALLOC``.
Usage
=====
Kernel BPF
----------
bpf_map_push_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
An element ``value`` can be added to a queue or stack using the
``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to
``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then,
when the queue or stack is full, the oldest element will be removed to
make room for ``value`` to be added. Returns ``0`` on success, or
negative error in case of failure.
bpf_map_peek_elem()
~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_peek_elem(struct bpf_map *map, void *value)
This helper fetches an element ``value`` from a queue or stack without
removing it. Returns ``0`` on success, or negative error in case of
failure.
bpf_map_pop_elem()
~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_map_pop_elem(struct bpf_map *map, void *value)
This helper removes an element into ``value`` from a queue or
stack. Returns ``0`` on success, or negative error in case of failure.
Userspace
---------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags)
A userspace program can push ``value`` onto a queue or stack using libbpf's
``bpf_map_update_elem`` function. The ``key`` parameter must be set to
``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the
same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on
success, or negative error in case of failure.
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_elem (int fd, const void *key, void *value)
A userspace program can peek at the ``value`` at the head of a queue or stack
using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be
set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.
bpf_map_lookup_and_delete_elem()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_and_delete_elem (int fd, const void *key, void *value)
A userspace program can pop a ``value`` from the head of a queue or stack using
the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter
must be set to ``NULL``. Returns ``0`` on success, or negative error in case of
failure.
Examples
========
Kernel BPF
----------
This snippet shows how to declare a queue in a BPF program:
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_QUEUE);
__type(value, __u32);
__uint(max_entries, 10);
} queue SEC(".maps");
Userspace
---------
This snippet shows how to use libbpf's low-level API to create a queue from
userspace:
.. code-block:: c
int create_queue()
{
return bpf_map_create(BPF_MAP_TYPE_QUEUE,
"sample_queue", /* name */
0, /* key size, must be zero */
sizeof(__u32), /* value size */
10, /* max entries */
NULL); /* create options */
}
References
==========
https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/

View File

@ -0,0 +1,155 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
=======================
BPF_MAP_TYPE_SK_STORAGE
=======================
.. note::
- ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2
``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
to be provided and acts as the handle for accessing the socket-local
storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
locally with each socket instead of with the map. The kernel is responsible for
allocating storage for a socket when requested and for freeing the storage when
either the map or the socket is deleted.
.. note::
- The key type must be ``int`` and ``max_entries`` must be set to ``0``.
- The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
socket-local storage.
Usage
=====
Kernel BPF
----------
bpf_sk_storage_get()
~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)
Socket-local storage can be retrieved using the ``bpf_sk_storage_get()``
helper. The helper gets the storage from ``sk`` that is associated with ``map``.
If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then
``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not
already exist. ``value`` can be used together with
``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it
will be zero initialized. Returns a pointer to the storage on success, or
``NULL`` in case of failure.
.. note::
- ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
- ``sk`` is a ``struct bpf_sock`` pointer for other program types.
bpf_sk_storage_delete()
~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
long bpf_sk_storage_delete(struct bpf_map *map, void *sk)
Socket-local storage can be deleted using the ``bpf_sk_storage_delete()``
helper. The helper deletes the storage from ``sk`` that is identified by
``map``. Returns ``0`` on success, or negative error in case of failure.
User space
----------
bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)
Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. The ``flags`` parameter can be used to control the update behaviour:
- ``BPF_ANY`` will create storage for ``fd`` or update existing storage.
- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already
exist, otherwise the call will fail with ``-EEXIST``.
- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists,
otherwise the call will fail with ``-ENOENT``.
Returns ``0`` on success, or negative error in case of failure.
bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_lookup_elem(int map_fd, const void *key, void *value)
Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf
function. ``key`` must be a pointer to a valid ``fd`` in the user space
program. Returns ``0`` on success, or negative error in case of failure.
bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~
.. code-block:: c
int bpf_map_delete_elem(int map_fd, const void *key)
Socket-local storage for the socket identified by ``key`` belonging to
``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf
function. Returns ``0`` on success, or negative error in case of failure.
Examples
========
Kernel BPF
----------
This snippet shows how to declare socket-local storage in a BPF program:
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_SK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct my_storage);
} socket_storage SEC(".maps");
This snippet shows how to retrieve socket-local storage in a BPF program:
.. code-block:: c
SEC("sockops")
int _sockops(struct bpf_sock_ops *ctx)
{
struct my_storage *storage;
struct bpf_sock *sk;
sk = ctx->sk;
if (!sk)
return 1;
storage = bpf_sk_storage_get(&socket_storage, sk, 0,
BPF_LOCAL_STORAGE_GET_F_CREATE);
if (!storage)
return 1;
/* Use 'storage' here */
return 1;
}
Please see the ``tools/testing/selftests/bpf`` directory for functional
examples.
References
==========
https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/

View File

@ -0,0 +1,192 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
===================
BPF_MAP_TYPE_XSKMAP
===================
.. note::
- ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18
The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for XDP BPF helper
call ``bpf_redirect_map()`` and ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'.
This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of
address family in the kernel that allows redirection of frames from a driver to
user space without having to traverse the full network stack. An AF_XDP socket
binds to a single netdev queue. A mapping of XSKs to queues is shown below:
.. code-block:: none
+---------------------------------------------------+
| xsk A | xsk B | xsk C |<---+ User space
=========================================================|==========
| Queue 0 | Queue 1 | Queue 2 | | Kernel
+---------------------------------------------------+ |
| Netdev eth0 | |
+---------------------------------------------------+ |
| +=============+ | |
| | key | xsk | | |
| +---------+ +=============+ | |
| | | | 0 | xsk A | | |
| | | +-------------+ | |
| | | | 1 | xsk B | | |
| | BPF |-- redirect -->+-------------+-------------+
| | prog | | 2 | xsk C | |
| | | +-------------+ |
| | | |
| | | |
| +---------+ |
| |
+---------------------------------------------------+
.. note::
An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only*
accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect
from a <netdev/queue_id> other than what the socket is bound to, the frame will
not be received on the socket.
Typically an XSKMAP is created per netdev. This map contains an array of XSK File
Descriptors (FDs). The number of array elements is typically set or adjusted using
the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number
of queues supported by the netdev.
.. note::
Both the map key and map value size must be 4 bytes.
Usage
=====
Kernel BPF
----------
bpf_redirect_map()
^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs
for sockets attached to a netdev's queues.
.. note::
If the map is empty at an index, the packet is dropped. This means that it is
necessary to have an XDP program loaded with at least one XSK in the
XSKMAP to be able to get any traffic to user space through the socket.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
XSK entry references of type ``struct xdp_sock *`` can be retrieved using the
``bpf_map_lookup_elem()`` helper.
User space
----------
.. note::
XSK entries can only be updated/deleted from user space and not from
a BPF program. Trying to call these functions from a kernel BPF program will
result in the program failing to load and a verifier warning.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
XSK entries can be added or updated using the ``bpf_map_update_elem()``
helper. The ``key`` parameter is equal to the queue_id of the queue the XSK
is attaching to. And the ``value`` parameter is the FD value of that socket.
Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the
associated ``struct xdp_sock`` instance.
The flags argument can be one of the following:
- BPF_ANY: Create a new element or update an existing element.
- BPF_NOEXIST: Create a new element only if it did not exist.
- BPF_EXIST: Update an existing element.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value)
Returns ``struct xdp_sock *`` or negative error in case of failure.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_delete_elem(int fd, const void *key)
XSK entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper will return 0 on success, or negative error in case of
failure.
.. note::
When `libxdp`_ deletes an XSK it also removes the associated socket
entry from the XSKMAP.
Examples
========
Kernel
------
The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called
``xsks_map`` and how to redirect packets to an XSK.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_XSKMAP);
__type(key, __u32);
__type(value, __u32);
__uint(max_entries, 64);
} xsks_map SEC(".maps");
SEC("xdp")
int xsk_redir_prog(struct xdp_md *ctx)
{
__u32 index = ctx->rx_queue_index;
if (bpf_map_lookup_elem(&xsks_map, &index))
return bpf_redirect_map(&xsks_map, index, 0);
return XDP_PASS;
}
User space
----------
The following code snippet shows how to update an XSKMAP with an XSK entry.
.. code-block:: c
int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd)
{
int ret;
ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0);
if (ret < 0)
fprintf(stderr, "Failed to update xsks_map: %s\n", strerror(errno));
return ret;
}
For an example on how create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
For a detailed explaination of the AF_XDP interface please see:
- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.
.. note::
The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_.
.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html
.. _bpf-examples: https://github.com/xdp-project/bpf-examples
.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets

View File

@ -1,46 +1,19 @@
=========
eBPF maps
=========
========
BPF maps
========
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
BPF 'maps' provide generic storage of different types for sharing data between
kernel and user space. There are several storage types available, including
hash, array, bloom filter and radix-tree. Several of the map types exist to
support specific BPF helpers that perform actions based on the map contents. The
maps are accessed from BPF programs via BPF helpers which are documented in the
`man-pages`_ for `bpf-helpers(7)`_.
The maps are accessed from user space via BPF syscall, which has commands:
- create a map with given type and attributes
``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
returns process-local file descriptor or negative error
- lookup key in a given map
``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key, attr->value
returns zero and stores found elem into value or negative error
- create or update key/value pair in a given map
``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key, attr->value
returns zero or negative error
- find and delete element by key in a given map
``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
using attr->map_fd, attr->key
- to delete map: close(fd)
Exiting process will delete maps automatically
userspace programs use this syscall to create/access maps that eBPF programs
are concurrently updating.
maps can have different types: hash, array, bloom filter, radix-tree, etc.
The map is defined by:
- type
- max number of elements
- key size in bytes
- value size in bytes
BPF maps are accessed from user space via the ``bpf`` syscall, which provides
commands to create maps, lookup elements, update elements and delete
elements. More details of the BPF syscall are available in
:doc:`/userspace-api/ebpf/syscall` and in the `man-pages`_ for `bpf(2)`_.
Map Types
=========
@ -49,4 +22,60 @@ Map Types
:maxdepth: 1
:glob:
map_*
map_*
Usage Notes
===========
.. c:function::
int bpf(int command, union bpf_attr *attr, u32 size)
Use the ``bpf()`` system call to perform the operation specified by
``command``. The operation takes parameters provided in ``attr``. The ``size``
argument is the size of the ``union bpf_attr`` in ``attr``.
**BPF_MAP_CREATE**
Create a map with the desired type and attributes in ``attr``:
.. code-block:: c
int fd;
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_ARRAY; /* mandatory */
.key_size = sizeof(__u32); /* mandatory */
.value_size = sizeof(__u32); /* mandatory */
.max_entries = 256; /* mandatory */
.map_flags = BPF_F_MMAPABLE;
.map_name = "example_array";
};
fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
Returns a process-local file descriptor on success, or negative error in case of
failure. The map can be deleted by calling ``close(fd)``. Maps held by open
file descriptors will be deleted automatically when a process exits.
.. note:: Valid characters for ``map_name`` are ``A-Z``, ``a-z``, ``0-9``,
``'_'`` and ``'.'``.
**BPF_MAP_LOOKUP_ELEM**
Lookup key in a given map using ``attr->map_fd``, ``attr->key``,
``attr->value``. Returns zero and stores found elem into ``attr->value`` on
success, or negative error on failure.
**BPF_MAP_UPDATE_ELEM**
Create or update key/value pair in a given map using ``attr->map_fd``, ``attr->key``,
``attr->value``. Returns zero on success or negative error on failure.
**BPF_MAP_DELETE_ELEM**
Find and delete element by key in a given map using ``attr->map_fd``,
``attr->key``. Returns zero on success or negative error on failure.
.. Links:
.. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

View File

@ -7,3 +7,6 @@ Program Types
:glob:
prog_*
For a list of all program types, see :ref:`program_types_and_elf` in
the :ref:`libbpf` documentation.

View File

@ -0,0 +1,81 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
========
Redirect
========
XDP_REDIRECT
############
Supported maps
--------------
XDP_REDIRECT works with the following map types:
- ``BPF_MAP_TYPE_DEVMAP``
- ``BPF_MAP_TYPE_DEVMAP_HASH``
- ``BPF_MAP_TYPE_CPUMAP``
- ``BPF_MAP_TYPE_XSKMAP``
For more information on these maps, please see the specific map documentation.
Process
-------
.. kernel-doc:: net/core/filter.c
:doc: xdp redirect
.. note::
Not all drivers support transmitting frames after a redirect, and for
those that do, not all of them support non-linear frames. Non-linear xdp
bufs/frames are bufs/frames that contain more than one fragment.
Debugging packet drops
----------------------
Silent packet drops for XDP_REDIRECT can be debugged using:
- bpf_trace
- perf_record
bpf_trace
^^^^^^^^^
The following bpftrace command can be used to capture and count all XDP tracepoints:
.. code-block:: none
sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }'
Attaching 12 probes...
^C
@cnt[tracepoint:xdp:mem_connect]: 18
@cnt[tracepoint:xdp:mem_disconnect]: 18
@cnt[tracepoint:xdp:xdp_exception]: 19605
@cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604
@cnt[tracepoint:xdp:xdp_redirect]: 22292200
.. note::
The various xdp tracepoints can be found in ``source/include/trace/events/xdp.h``
The following bpftrace command can be used to extract the ``ERRNO`` being returned as
part of the err parameter:
.. code-block:: none
sudo bpftrace -e \
'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();}
tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}'
perf record
^^^^^^^^^^^
The perf tool also supports recording tracepoints:
.. code-block:: none
perf record -a -e xdp:xdp_redirect_err \
-e xdp:xdp_redirect_map_err \
-e xdp:xdp_exception \
-e xdp:xdp_devmap_xmit
References
===========
- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor

View File

@ -29,6 +29,38 @@ properties:
interrupts:
maxItems: 1
memory-region:
items:
- description: firmware EMI region
- description: firmware ILM region
- description: firmware DLM region
- description: firmware CPU DATA region
- description: firmware BOOT region
memory-region-names:
items:
- const: wo-emi
- const: wo-ilm
- const: wo-dlm
- const: wo-data
- const: wo-boot
mediatek,wo-ccif:
$ref: /schemas/types.yaml#/definitions/phandle
description: mediatek wed-wo controller interface.
allOf:
- if:
properties:
compatible:
contains:
const: mediatek,mt7622-wed
then:
properties:
memory-region-names: false
memory-region: false
mediatek,wo-ccif: false
required:
- compatible
- reg
@ -49,3 +81,23 @@ examples:
interrupts = <GIC_SPI 214 IRQ_TYPE_LEVEL_LOW>;
};
};
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/interrupt-controller/irq.h>
soc {
#address-cells = <2>;
#size-cells = <2>;
wed@15010000 {
compatible = "mediatek,mt7986-wed", "syscon";
reg = <0 0x15010000 0 0x1000>;
interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
memory-region = <&wo_emi>, <&wo_ilm>, <&wo_dlm>,
<&wo_data>, <&wo_boot>;
memory-region-names = "wo-emi", "wo-ilm", "wo-dlm",
"wo-data", "wo-boot";
mediatek,wo-ccif = <&wo_ccif0>;
};
};

View File

@ -7,7 +7,8 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Freescale INTMUX interrupt multiplexer
maintainers:
- Joakim Zhang <qiangqing.zhang@nxp.com>
- Shawn Guo <shawnguo@kernel.org>
- NXP Linux Team <linux-imx@nxp.com>
properties:
compatible:

View File

@ -46,6 +46,10 @@ properties:
interrupts:
maxItems: 1
reset-gpios:
maxItems: 1
description: GPIO connected to active low reset
required:
- compatible
- reg

View File

@ -27,7 +27,9 @@ properties:
- usbb95,772b # ASIX AX88772B
- usbb95,7e2b # ASIX AX88772B
reg: true
reg:
maxItems: 1
local-mac-address: true
mac-address: true

View File

@ -1,5 +0,0 @@
The following properties are common to the Bluetooth controllers:
- local-bd-address: array of 6 bytes, specifies the BD address that was
uniquely assigned to the Bluetooth device, formatted with least significant
byte first (little-endian).

View File

@ -0,0 +1,29 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/bluetooth/bluetooth-controller.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Bluetooth Controller Generic Binding
maintainers:
- Marcel Holtmann <marcel@holtmann.org>
- Johan Hedberg <johan.hedberg@gmail.com>
- Luiz Augusto von Dentz <luiz.dentz@gmail.com>
properties:
$nodename:
pattern: "^bluetooth(@.*)?$"
local-bd-address:
$ref: /schemas/types.yaml#/definitions/uint8-array
maxItems: 6
description:
Specifies the BD address that was uniquely assigned to the Bluetooth
device. Formatted with least significant byte first (little-endian), e.g.
in order to assign the address 00:11:22:33:44:55 this property must have
the value [55 44 33 22 11 00].
additionalProperties: true
...

View File

@ -0,0 +1,81 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/bluetooth/brcm,bcm4377-bluetooth.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Broadcom BCM4377 family PCIe Bluetooth Chips
maintainers:
- Sven Peter <sven@svenpeter.dev>
description:
This binding describes Broadcom BCM4377 family PCIe-attached bluetooth chips
usually found in Apple machines. The Wi-Fi part of the chip is described in
bindings/net/wireless/brcm,bcm4329-fmac.yaml.
allOf:
- $ref: bluetooth-controller.yaml#
properties:
compatible:
enum:
- pci14e4,5fa0 # BCM4377
- pci14e4,5f69 # BCM4378
- pci14e4,5f71 # BCM4387
reg:
maxItems: 1
brcm,board-type:
$ref: /schemas/types.yaml#/definitions/string
description: Board type of the Bluetooth chip. This is used to decouple
the overall system board from the Bluetooth module and used to construct
firmware and calibration data filenames.
On Apple platforms, this should be the Apple module-instance codename
prefixed by "apple,", e.g. "apple,atlantisb".
pattern: '^apple,.*'
brcm,taurus-cal-blob:
$ref: /schemas/types.yaml#/definitions/uint8-array
description: A per-device calibration blob for the Bluetooth radio. This
should be filled in by the bootloader from platform configuration
data, if necessary, and will be uploaded to the device.
This blob is used if the chip stepping of the Bluetooth module does not
support beamforming.
brcm,taurus-bf-cal-blob:
$ref: /schemas/types.yaml#/definitions/uint8-array
description: A per-device calibration blob for the Bluetooth radio. This
should be filled in by the bootloader from platform configuration
data, if necessary, and will be uploaded to the device.
This blob is used if the chip stepping of the Bluetooth module supports
beamforming.
local-bd-address: true
required:
- compatible
- reg
- local-bd-address
- brcm,board-type
additionalProperties: false
examples:
- |
pcie@a0000000 {
#address-cells = <3>;
#size-cells = <2>;
reg = <0xa0000000 0x1000000>;
device_type = "pci";
ranges = <0x43000000 0x6 0xa0000000 0xa0000000 0x0 0x20000000>;
bluetooth@0,1 {
compatible = "pci14e4,5f69";
reg = <0x100 0x0 0x0 0x0 0x0>;
brcm,board-type = "apple,honshu";
/* To be filled by the bootloader */
local-bd-address = [00 00 00 00 00 00];
};
};

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/qualcomm-bluetooth.yaml#
$id: http://devicetree.org/schemas/net/bluetooth/qualcomm-bluetooth.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Bluetooth Chips
@ -79,8 +79,7 @@ properties:
firmware-name:
description: specify the name of nvm firmware to load
local-bd-address:
description: see Documentation/devicetree/bindings/net/bluetooth.txt
local-bd-address: true
required:
@ -89,6 +88,7 @@ required:
additionalProperties: false
allOf:
- $ref: bluetooth-controller.yaml#
- if:
properties:
compatible:

View File

@ -19,11 +19,14 @@ properties:
- brcm,bcm4329-bt
- brcm,bcm4330-bt
- brcm,bcm4334-bt
- brcm,bcm43430a0-bt
- brcm,bcm43430a1-bt
- brcm,bcm43438-bt
- brcm,bcm4345c5
- brcm,bcm43540-bt
- brcm,bcm4335a0
- brcm,bcm4349-bt
- cypress,cyw4373a0-bt
- infineon,cyw55572-bt
shutdown-gpios:

View File

@ -17,6 +17,7 @@ properties:
compatible:
oneOf:
- enum:
- fsl,imx93-flexcan
- fsl,imx8qm-flexcan
- fsl,imx8mp-flexcan
- fsl,imx6q-flexcan

View File

@ -9,9 +9,6 @@ title: Renesas R-Car CAN FD Controller
maintainers:
- Fabrizio Castro <fabrizio.castro.jz@renesas.com>
allOf:
- $ref: can-controller.yaml#
properties:
compatible:
oneOf:
@ -33,7 +30,7 @@ properties:
- items:
- enum:
- renesas,r9a07g043-canfd # RZ/G2UL
- renesas,r9a07g043-canfd # RZ/G2UL and RZ/Five
- renesas,r9a07g044-canfd # RZ/G2{L,LC}
- renesas,r9a07g054-canfd # RZ/V2L
- const: renesas,rzg2l-canfd # RZ/G2L family
@ -77,12 +74,13 @@ properties:
description: Maximum frequency of the CANFD clock.
patternProperties:
"^channel[01]$":
"^channel[0-7]$":
type: object
description:
The controller supports two channels and each is represented as a child
node. Each child node supports the "status" property only, which
is used to enable/disable the respective channel.
The controller supports multiple channels and each is represented as a
child node. Each channel can be enabled/disabled individually.
additionalProperties: false
required:
- compatible
@ -98,60 +96,73 @@ required:
- channel0
- channel1
if:
properties:
compatible:
contains:
enum:
- renesas,rzg2l-canfd
then:
properties:
interrupts:
items:
- description: CAN global error interrupt
- description: CAN receive FIFO interrupt
- description: CAN0 error interrupt
- description: CAN0 transmit interrupt
- description: CAN0 transmit/receive FIFO receive completion interrupt
- description: CAN1 error interrupt
- description: CAN1 transmit interrupt
- description: CAN1 transmit/receive FIFO receive completion interrupt
allOf:
- $ref: can-controller.yaml#
interrupt-names:
items:
- const: g_err
- const: g_recc
- const: ch0_err
- const: ch0_rec
- const: ch0_trx
- const: ch1_err
- const: ch1_rec
- const: ch1_trx
- if:
properties:
compatible:
contains:
enum:
- renesas,rzg2l-canfd
then:
properties:
interrupts:
items:
- description: CAN global error interrupt
- description: CAN receive FIFO interrupt
- description: CAN0 error interrupt
- description: CAN0 transmit interrupt
- description: CAN0 transmit/receive FIFO receive completion interrupt
- description: CAN1 error interrupt
- description: CAN1 transmit interrupt
- description: CAN1 transmit/receive FIFO receive completion interrupt
resets:
maxItems: 2
interrupt-names:
items:
- const: g_err
- const: g_recc
- const: ch0_err
- const: ch0_rec
- const: ch0_trx
- const: ch1_err
- const: ch1_rec
- const: ch1_trx
reset-names:
items:
- const: rstp_n
- const: rstc_n
resets:
maxItems: 2
required:
- reset-names
else:
properties:
interrupts:
items:
- description: Channel interrupt
- description: Global interrupt
reset-names:
items:
- const: rstp_n
- const: rstc_n
interrupt-names:
items:
- const: ch_int
- const: g_int
required:
- reset-names
else:
properties:
interrupts:
items:
- description: Channel interrupt
- description: Global interrupt
resets:
maxItems: 1
interrupt-names:
items:
- const: ch_int
- const: g_int
resets:
maxItems: 1
- if:
not:
properties:
compatible:
contains:
const: renesas,r8a779a0-canfd
then:
patternProperties:
"^channel[2-7]$": false
unevaluatedProperties: false

View File

@ -19,7 +19,8 @@ allOf:
properties:
reg:
description: Port number
items:
- description: Port number
label:
description:

View File

@ -12,7 +12,7 @@ allOf:
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- Vivien Didelot <vivien.didelot@gmail.com>
- Vladimir Oltean <olteanv@gmail.com>
- Kurt Kanzenbach <kurt@linutronix.de>
description:

View File

@ -74,10 +74,10 @@ properties:
properties:
pcs-handle:
maxItems: 1
description:
phandle pointing to a PCS sub-node compatible with
renesas,rzn1-miic.yaml#
$ref: /schemas/types.yaml#/definitions/phandle
unevaluatedProperties: false

View File

@ -108,11 +108,17 @@ properties:
$ref: "#/properties/phy-connection-type"
pcs-handle:
$ref: /schemas/types.yaml#/definitions/phandle
$ref: /schemas/types.yaml#/definitions/phandle-array
items:
maxItems: 1
description:
Specifies a reference to a node representing a PCS PHY device on a MDIO
bus to link with an external PHY (phy-handle) if exists.
pcs-handle-names:
description:
The name of each PCS in pcs-handle.
phy-handle:
$ref: /schemas/types.yaml#/definitions/phandle
description:
@ -216,6 +222,9 @@ properties:
required:
- speed
dependencies:
pcs-handle-names: [pcs-handle]
allOf:
- if:
properties:

View File

@ -7,7 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Freescale Fast Ethernet Controller (FEC)
maintainers:
- Joakim Zhang <qiangqing.zhang@nxp.com>
- Shawn Guo <shawnguo@kernel.org>
- Wei Fang <wei.fang@nxp.com>
- NXP Linux Team <linux-imx@nxp.com>
allOf:
- $ref: ethernet-controller.yaml#

View File

@ -85,9 +85,39 @@ properties:
$ref: /schemas/types.yaml#/definitions/phandle
description: A reference to the IEEE1588 timer
phys:
description: A reference to the SerDes lane(s)
maxItems: 1
phy-names:
items:
- const: serdes
pcsphy-handle:
$ref: /schemas/types.yaml#/definitions/phandle
description: A reference to the PCS (typically found on the SerDes)
$ref: /schemas/types.yaml#/definitions/phandle-array
minItems: 1
maxItems: 3
deprecated: true
description: See pcs-handle.
pcs-handle:
minItems: 1
maxItems: 3
description: |
A reference to the various PCSs (typically found on the SerDes). If
pcs-handle-names is absent, and phy-connection-type is "xgmii", then the first
reference will be assumed to be for "xfi". Otherwise, if pcs-handle-names is
absent, then the first reference will be assumed to be for "sgmii".
pcs-handle-names:
minItems: 1
maxItems: 3
items:
enum:
- sgmii
- qsgmii
- xfi
description: The type of each PCS in pcsphy-handle.
tbi-handle:
$ref: /schemas/types.yaml#/definitions/phandle
@ -100,6 +130,10 @@ required:
- fsl,fman-ports
- ptp-timer
dependencies:
pcs-handle-names:
- pcs-handle
allOf:
- $ref: ethernet-controller.yaml#
- if:
@ -110,14 +144,6 @@ allOf:
then:
required:
- tbi-handle
- if:
properties:
compatible:
contains:
const: fsl,fman-memac
then:
required:
- pcsphy-handle
unevaluatedProperties: false
@ -138,8 +164,9 @@ examples:
reg = <0xe8000 0x1000>;
fsl,fman-ports = <&fman0_rx_0x0c &fman0_tx_0x2c>;
ptp-timer = <&ptp_timer0>;
pcsphy-handle = <&pcsphy4>;
phy-handle = <&sgmii_phy1>;
phy-connection-type = "sgmii";
pcs-handle = <&pcsphy4>, <&qsgmiib_pcs1>;
pcs-handle-names = "sgmii", "qsgmii";
phys = <&serdes1 1>;
phy-names = "serdes";
};
...

View File

@ -31,7 +31,7 @@ properties:
phy-mode: true
pcs-handle:
$ref: /schemas/types.yaml#/definitions/phandle
maxItems: 1
description:
A reference to a node representing a PCS PHY device found on
the internal MDIO bus.

View File

@ -320,8 +320,9 @@ For internal PHY device on internal mdio bus, a PHY node should be created.
See the definition of the PHY node in booting-without-of.txt for an
example of how to define a PHY (Internal PHY has no interrupt line).
- For "fsl,fman-mdio" compatible internal mdio bus, the PHY is TBI PHY.
- For "fsl,fman-memac-mdio" compatible internal mdio bus, the PHY is PCS PHY,
PCS PHY addr must be '0'.
- For "fsl,fman-memac-mdio" compatible internal mdio bus, the PHY is PCS PHY.
The PCS PHY address should correspond to the value of the appropriate
MDEV_PORT.
EXAMPLE

View File

@ -0,0 +1,62 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/marvell,dfx-server.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Marvell Prestera DFX server
maintainers:
- Miquel Raynal <miquel.raynal@bootlin.com>
select:
properties:
compatible:
contains:
const: marvell,dfx-server
required:
- compatible
properties:
compatible:
items:
- const: marvell,dfx-server
- const: simple-bus
reg:
maxItems: 1
ranges: true
'#address-cells':
const: 1
'#size-cells':
const: 1
required:
- compatible
- reg
- ranges
# The DFX server may expose clocks described as subnodes
additionalProperties:
type: object
examples:
- |
#define MBUS_ID(target,attributes) (((target) << 24) | ((attributes) << 16))
bus@0 {
reg = <0 0>;
#address-cells = <2>;
#size-cells = <1>;
dfx-bus@ac000000 {
compatible = "marvell,dfx-server", "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
ranges = <0 MBUS_ID(0x08, 0x00) 0 0x100000>;
reg = <MBUS_ID(0x08, 0x00) 0 0x100000>;
};
};

View File

@ -0,0 +1,305 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/marvell,pp2.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Marvell CN913X / Marvell Armada 375, 7K, 8K Ethernet Controller
maintainers:
- Marcin Wojtas <mw@semihalf.com>
- Russell King <linux@armlinux.org>
description: |
Marvell Armada 375 Ethernet Controller (PPv2.1)
Marvell Armada 7K/8K Ethernet Controller (PPv2.2)
Marvell CN913X Ethernet Controller (PPv2.3)
properties:
compatible:
enum:
- marvell,armada-375-pp2
- marvell,armada-7k-pp22
reg:
minItems: 3
maxItems: 4
"#address-cells":
const: 1
"#size-cells":
const: 0
clocks:
minItems: 2
items:
- description: main controller clock
- description: GOP clock
- description: MG clock
- description: MG Core clock
- description: AXI clock
clock-names:
minItems: 2
items:
- const: pp_clk
- const: gop_clk
- const: mg_clk
- const: mg_core_clk
- const: axi_clk
dma-coherent: true
marvell,system-controller:
$ref: /schemas/types.yaml#/definitions/phandle
description: a phandle to the system controller.
patternProperties:
'^(ethernet-)?port@[0-2]$':
type: object
description: subnode for each ethernet port.
$ref: ethernet-controller.yaml#
unevaluatedProperties: false
properties:
reg:
description: ID of the port from the MAC point of view.
maximum: 2
interrupts:
minItems: 1
maxItems: 10
description: interrupt(s) for the port
interrupt-names:
minItems: 1
items:
- const: hif0
- const: hif1
- const: hif2
- const: hif3
- const: hif4
- const: hif5
- const: hif6
- const: hif7
- const: hif8
- const: link
description: >
if more than a single interrupt for is given, must be the
name associated to the interrupts listed. Valid names are:
"hifX", with X in [0..8], and "link". The names "tx-cpu0",
"tx-cpu1", "tx-cpu2", "tx-cpu3" and "rx-shared" are supported
for backward compatibility but shouldn't be used for new
additions.
phys:
minItems: 1
maxItems: 2
description: >
Generic PHY, providing SerDes connectivity. For most modes,
one lane is sufficient, but some (e.g. RXAUI) may require two.
phy-mode:
enum:
- gmii
- sgmii
- rgmii-id
- 1000base-x
- 2500base-x
- 5gbase-r
- rxaui
- 10gbase-r
port-id:
$ref: /schemas/types.yaml#/definitions/uint32
deprecated: true
description: >
ID of the port from the MAC point of view.
Legacy binding for backward compatibility.
marvell,loopback:
$ref: /schemas/types.yaml#/definitions/flag
description: port is loopback mode.
gop-port-id:
$ref: /schemas/types.yaml#/definitions/uint32
description: >
only for marvell,armada-7k-pp22, ID of the port from the
GOP (Group Of Ports) point of view. This ID is used to index the
per-port registers in the second register area.
required:
- reg
- interrupts
- phy-mode
- port-id
required:
- compatible
- reg
- clocks
- clock-names
allOf:
- if:
properties:
compatible:
const: marvell,armada-7k-pp22
then:
properties:
reg:
items:
- description: Packet Processor registers
- description: Networking interfaces registers
- description: CM3 address space used for TX Flow Control
clocks:
minItems: 5
clock-names:
minItems: 5
patternProperties:
'^(ethernet-)?port@[0-2]$':
required:
- gop-port-id
required:
- marvell,system-controller
else:
properties:
reg:
items:
- description: Packet Processor registers
- description: LMS registers
- description: Register area per eth0
- description: Register area per eth1
clocks:
maxItems: 2
clock-names:
maxItems: 2
patternProperties:
'^(ethernet-)?port@[0-1]$':
properties:
reg:
maximum: 1
gop-port-id: false
additionalProperties: false
examples:
- |
// For Armada 375 variant
#include <dt-bindings/interrupt-controller/mvebu-icu.h>
#include <dt-bindings/interrupt-controller/arm-gic.h>
ethernet@f0000 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "marvell,armada-375-pp2";
reg = <0xf0000 0xa000>,
<0xc0000 0x3060>,
<0xc4000 0x100>,
<0xc5000 0x100>;
clocks = <&gateclk 3>, <&gateclk 19>;
clock-names = "pp_clk", "gop_clk";
ethernet-port@0 {
interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
reg = <0>;
port-id = <0>; /* For backward compatibility. */
phy = <&phy0>;
phy-mode = "rgmii-id";
};
ethernet-port@1 {
interrupts = <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>;
reg = <1>;
port-id = <1>; /* For backward compatibility. */
phy = <&phy3>;
phy-mode = "gmii";
};
};
- |
// For Armada 7k/8k and Cn913x variants
#include <dt-bindings/interrupt-controller/mvebu-icu.h>
#include <dt-bindings/interrupt-controller/arm-gic.h>
ethernet@0 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "marvell,armada-7k-pp22";
reg = <0x0 0x100000>, <0x129000 0xb000>, <0x220000 0x800>;
clocks = <&cp0_clk 1 3>, <&cp0_clk 1 9>,
<&cp0_clk 1 5>, <&cp0_clk 1 6>, <&cp0_clk 1 18>;
clock-names = "pp_clk", "gop_clk", "mg_clk", "mg_core_clk", "axi_clk";
marvell,system-controller = <&cp0_syscon0>;
ethernet-port@0 {
interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 59 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 63 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 67 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 71 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 129 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
phy-mode = "10gbase-r";
phys = <&cp0_comphy4 0>;
reg = <0>;
port-id = <0>; /* For backward compatibility. */
gop-port-id = <0>;
};
ethernet-port@1 {
interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 60 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 64 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 68 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 72 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 128 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
phy-mode = "rgmii-id";
reg = <1>;
port-id = <1>; /* For backward compatibility. */
gop-port-id = <2>;
};
ethernet-port@2 {
interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 61 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 65 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 69 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 73 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 127 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
phy-mode = "2500base-x";
managed = "in-band-status";
phys = <&cp0_comphy5 2>;
sfp = <&sfp_eth3>;
reg = <2>;
port-id = <2>; /* For backward compatibility. */
gop-port-id = <3>;
};
};

View File

@ -1,81 +0,0 @@
Marvell Prestera Switch Chip bindings
-------------------------------------
Required properties:
- compatible: must be "marvell,prestera" and one of the following
"marvell,prestera-98dx3236",
"marvell,prestera-98dx3336",
"marvell,prestera-98dx4251",
- reg: address and length of the register set for the device.
- interrupts: interrupt for the device
Optional properties:
- dfx: phandle reference to the "DFX Server" node
Example:
switch {
compatible = "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
ranges = <0 MBUS_ID(0x03, 0x00) 0 0x100000>;
packet-processor@0 {
compatible = "marvell,prestera-98dx3236", "marvell,prestera";
reg = <0 0x4000000>;
interrupts = <33>, <34>, <35>;
dfx = <&dfx>;
};
};
DFX Server bindings
-------------------
Required properties:
- compatible: must be "marvell,dfx-server", "simple-bus"
- ranges: describes the address mapping of a memory-mapped bus.
- reg: address and length of the register set for the device.
Example:
dfx-server {
compatible = "marvell,dfx-server", "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
ranges = <0 MBUS_ID(0x08, 0x00) 0 0x100000>;
reg = <MBUS_ID(0x08, 0x00) 0 0x100000>;
};
Marvell Prestera SwitchDev bindings
-----------------------------------
Optional properties:
- compatible: must be "marvell,prestera"
- base-mac-provider: describes handle to node which provides base mac address,
might be a static base mac address or nvme cell provider.
Example:
eeprom_mac_addr: eeprom-mac-addr {
compatible = "eeprom,mac-addr-cell";
status = "okay";
nvmem = <&eeprom_at24>;
};
prestera {
compatible = "marvell,prestera";
status = "okay";
base-mac-provider = <&eeprom_mac_addr>;
};
The current implementation of Prestera Switchdev PCI interface driver requires
that BAR2 is assigned to 0xf6000000 as base address from the PCI IO range:
&cp0_pcie0 {
ranges = <0x81000000 0x0 0xfb000000 0x0 0xfb000000 0x0 0xf0000
0x82000000 0x0 0xf6000000 0x0 0xf6000000 0x0 0x2000000
0x82000000 0x0 0xf9000000 0x0 0xf9000000 0x0 0x100000>;
phys = <&cp0_comphy0 0>;
status = "okay";
};

View File

@ -0,0 +1,91 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/marvell,prestera.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Marvell Prestera switch family
maintainers:
- Miquel Raynal <miquel.raynal@bootlin.com>
properties:
compatible:
oneOf:
- items:
- enum:
- marvell,prestera-98dx3236
- marvell,prestera-98dx3336
- marvell,prestera-98dx4251
- const: marvell,prestera
- enum:
- pci11ab,c804
- pci11ab,c80c
- pci11ab,cc1e
reg:
maxItems: 1
interrupts:
maxItems: 3
dfx:
description: Reference to the DFX Server bus node.
$ref: /schemas/types.yaml#/definitions/phandle
nvmem-cells: true
nvmem-cell-names: true
if:
properties:
compatible:
contains:
const: marvell,prestera
# Memory mapped AlleyCat3 family
then:
properties:
nvmem-cells: false
nvmem-cell-names: false
required:
- interrupts
# PCI Aldrin family
else:
properties:
interrupts: false
dfx: false
required:
- compatible
- reg
# Ports can also be described
additionalProperties:
type: object
examples:
- |
packet-processor@0 {
compatible = "marvell,prestera-98dx3236", "marvell,prestera";
reg = <0 0x4000000>;
interrupts = <33>, <34>, <35>;
dfx = <&dfx>;
};
- |
pcie@0 {
#address-cells = <3>;
#size-cells = <2>;
ranges = <0x0 0x0 0x0 0x0 0x0 0x0>;
reg = <0x0 0x0 0x0 0x0 0x0 0x0>;
device_type = "pci";
switch@0,0 {
reg = <0x0 0x0 0x0 0x0 0x0>;
compatible = "pci11ab,c80c";
nvmem-cells = <&mac_address 0>;
nvmem-cell-names = "mac-address";
};
};

View File

@ -1,141 +0,0 @@
* Marvell Armada 375 Ethernet Controller (PPv2.1)
Marvell Armada 7K/8K Ethernet Controller (PPv2.2)
Marvell CN913X Ethernet Controller (PPv2.3)
Required properties:
- compatible: should be one of:
"marvell,armada-375-pp2"
"marvell,armada-7k-pp2"
- reg: addresses and length of the register sets for the device.
For "marvell,armada-375-pp2", must contain the following register
sets:
- common controller registers
- LMS registers
- one register area per Ethernet port
For "marvell,armada-7k-pp2" used by 7K/8K and CN913X, must contain the following register
sets:
- packet processor registers
- networking interfaces registers
- CM3 address space used for TX Flow Control
- clocks: pointers to the reference clocks for this device, consequently:
- main controller clock (for both armada-375-pp2 and armada-7k-pp2)
- GOP clock (for both armada-375-pp2 and armada-7k-pp2)
- MG clock (only for armada-7k-pp2)
- MG Core clock (only for armada-7k-pp2)
- AXI clock (only for armada-7k-pp2)
- clock-names: names of used clocks, must be "pp_clk", "gop_clk", "mg_clk",
"mg_core_clk" and "axi_clk" (the 3 latter only for armada-7k-pp2).
The ethernet ports are represented by subnodes. At least one port is
required.
Required properties (port):
- interrupts: interrupt(s) for the port
- port-id: ID of the port from the MAC point of view
- gop-port-id: only for marvell,armada-7k-pp2, ID of the port from the
GOP (Group Of Ports) point of view. This ID is used to index the
per-port registers in the second register area.
- phy-mode: See ethernet.txt file in the same directory
Optional properties (port):
- marvell,loopback: port is loopback mode
- phy: a phandle to a phy node defining the PHY address (as the reg
property, a single integer).
- interrupt-names: if more than a single interrupt for is given, must be the
name associated to the interrupts listed. Valid names are:
"hifX", with X in [0..8], and "link". The names "tx-cpu0",
"tx-cpu1", "tx-cpu2", "tx-cpu3" and "rx-shared" are supported
for backward compatibility but shouldn't be used for new
additions.
- marvell,system-controller: a phandle to the system controller.
Example for marvell,armada-375-pp2:
ethernet@f0000 {
compatible = "marvell,armada-375-pp2";
reg = <0xf0000 0xa000>,
<0xc0000 0x3060>,
<0xc4000 0x100>,
<0xc5000 0x100>;
clocks = <&gateclk 3>, <&gateclk 19>;
clock-names = "pp_clk", "gop_clk";
eth0: eth0@c4000 {
interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
port-id = <0>;
phy = <&phy0>;
phy-mode = "gmii";
};
eth1: eth1@c5000 {
interrupts = <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>;
port-id = <1>;
phy = <&phy3>;
phy-mode = "gmii";
};
};
Example for marvell,armada-7k-pp2:
cpm_ethernet: ethernet@0 {
compatible = "marvell,armada-7k-pp22";
reg = <0x0 0x100000>, <0x129000 0xb000>, <0x220000 0x800>;
clocks = <&cpm_syscon0 1 3>, <&cpm_syscon0 1 9>,
<&cpm_syscon0 1 5>, <&cpm_syscon0 1 6>, <&cpm_syscon0 1 18>;
clock-names = "pp_clk", "gop_clk", "mg_clk", "mg_core_clk", "axi_clk";
eth0: eth0 {
interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 59 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 63 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 67 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 71 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 129 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
port-id = <0>;
gop-port-id = <0>;
};
eth1: eth1 {
interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 60 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 64 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 68 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 72 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 128 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
port-id = <1>;
gop-port-id = <2>;
};
eth2: eth2 {
interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 61 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 65 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 69 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 73 IRQ_TYPE_LEVEL_HIGH>,
<ICU_GRP_NSR 127 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
"hif5", "hif6", "hif7", "hif8", "link";
port-id = <2>;
gop-port-id = <3>;
};
};

View File

@ -39,7 +39,9 @@ properties:
- usb424,9e08 # SMSC LAN89530 USB Ethernet Device
- usb424,ec00 # SMSC9512/9514 USB Hub & Ethernet Device
reg: true
reg:
maxItems: 1
local-mac-address: true
mac-address: true

View File

@ -14,7 +14,9 @@ properties:
oneOf:
- const: nxp,nxp-nci-i2c
- items:
- const: nxp,pn547
- enum:
- nxp,nq310
- nxp,pn547
- const: nxp,nxp-nci-i2c
enable-gpios:

View File

@ -7,7 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: NXP i.MX8 DWMAC glue layer
maintainers:
- Joakim Zhang <qiangqing.zhang@nxp.com>
- Clark Wang <xiaoning.wang@nxp.com>
- Shawn Guo <shawnguo@kernel.org>
- NXP Linux Team <linux-imx@nxp.com>
# We need a select here so we don't match all nodes with 'snps,dwmac'
select:

View File

@ -0,0 +1,40 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/pcs/fsl,lynx-pcs.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: NXP Lynx PCS
maintainers:
- Ioana Ciornei <ioana.ciornei@nxp.com>
description: |
NXP Lynx 10G and 28G SerDes have Ethernet PCS devices which can be used as
protocol controllers. They are accessible over the Ethernet interface's MDIO
bus.
properties:
compatible:
const: fsl,lynx-pcs
reg:
maxItems: 1
required:
- compatible
- reg
additionalProperties: false
examples:
- |
mdio-bus {
#address-cells = <1>;
#size-cells = <0>;
qsgmii_pcs1: ethernet-pcs@1 {
compatible = "fsl,lynx-pcs";
reg = <1>;
};
};

View File

@ -123,7 +123,6 @@ examples:
switch_port0: port@0 {
reg = <0x0>;
label = "cpu";
ethernet = <&eth1>;
phy-mode = "gmii";

View File

@ -49,6 +49,7 @@ properties:
- qcom,sc7280-ipa
- qcom,sdm845-ipa
- qcom,sdx55-ipa
- qcom,sm6350-ipa
- qcom,sm8350-ipa
reg:
@ -124,19 +125,31 @@ properties:
- const: ipa-clock-enabled-valid
- const: ipa-clock-enabled
qcom,gsi-loader:
enum:
- self
- modem
- skip
description:
Indicates how GSI firmware should be loaded. If the AP loads
and validates GSI firmware, this property has value "self".
If the modem does this, this property has value "modem".
Otherwise, "skip" means GSI firmware loading is not required.
modem-init:
deprecated: true
type: boolean
description:
If present, it indicates that the modem is responsible for
performing early IPA initialization, including loading and
validating firwmare used by the GSI.
This is the older (deprecated) way of indicating how GSI firmware
should be loaded. If present, the modem loads GSI firmware; if
absent, the AP loads GSI firmware.
memory-region:
maxItems: 1
description:
If present, a phandle for a reserved memory area that holds
the firmware passed to Trust Zone for authentication. Required
when Trust Zone (not the modem) performs early initialization.
when the AP (not the modem) performs early initialization.
firmware-name:
$ref: /schemas/types.yaml#/definitions/string
@ -155,22 +168,36 @@ required:
- interconnects
- qcom,smem-states
# Either modem-init is present, or memory-region must be present.
oneOf:
- required:
- modem-init
- required:
- memory-region
allOf:
# If qcom,gsi-loader is present, modem-init must not be present
- if:
required:
- qcom,gsi-loader
then:
properties:
modem-init: false
# If memory-region is present, firmware-name may optionally be present.
# But if modem-init is present, firmware-name must not be present.
if:
required:
- modem-init
then:
not:
required:
- firmware-name
# If qcom,gsi-loader is "self", the AP loads GSI firmware, and
# memory-region must be specified
if:
properties:
qcom,gsi-loader:
contains:
const: self
then:
required:
- memory-region
else:
# If qcom,gsi-loader is not present, we use deprecated behavior.
# If modem-init is not present, the AP loads GSI firmware, and
# memory-region must be specified.
if:
not:
required:
- modem-init
then:
required:
- memory-region
additionalProperties: false
@ -201,14 +228,17 @@ examples:
};
ipa@1e40000 {
compatible = "qcom,sdm845-ipa";
compatible = "qcom,sc7180-ipa";
modem-init;
qcom,gsi-loader = "self";
memory-region = <&ipa_fw_mem>;
firmware-name = "qcom/sc7180-trogdor/modem/modem.mdt";
iommus = <&apps_smmu 0x720 0x3>;
iommus = <&apps_smmu 0x440 0x0>,
<&apps_smmu 0x442 0x0>;
reg = <0x1e40000 0x7000>,
<0x1e47000 0x2000>,
<0x1e04000 0x2c000>;
<0x1e47000 0x2000>,
<0x1e04000 0x2c000>;
reg-names = "ipa-reg",
"ipa-shared",
"gsi";
@ -226,9 +256,9 @@ examples:
clock-names = "core";
interconnects =
<&rsc_hlos MASTER_IPA &rsc_hlos SLAVE_EBI1>,
<&rsc_hlos MASTER_IPA &rsc_hlos SLAVE_IMEM>,
<&rsc_hlos MASTER_APPSS_PROC &rsc_hlos SLAVE_IPA_CFG>;
<&aggre2_noc MASTER_IPA 0 &mc_virt SLAVE_EBI1 0>,
<&aggre2_noc MASTER_IPA 0 &system_noc SLAVE_IMEM 0>,
<&gem_noc MASTER_APPSS_PROC 0 &config_noc SLAVE_IPA_CFG 0>;
interconnect-names = "memory",
"imem",
"config";

View File

@ -9,14 +9,18 @@ title: Qualcomm IPQ40xx MDIO Controller
maintainers:
- Robert Marko <robert.marko@sartura.hr>
allOf:
- $ref: "mdio.yaml#"
properties:
compatible:
enum:
- qcom,ipq4019-mdio
- qcom,ipq5018-mdio
oneOf:
- enum:
- qcom,ipq4019-mdio
- qcom,ipq5018-mdio
- items:
- enum:
- qcom,ipq6018-mdio
- qcom,ipq8074-mdio
- const: qcom,ipq4019-mdio
"#address-cells":
const: 1
@ -33,10 +37,12 @@ properties:
address range is only required by the platform IPQ50xx.
clocks:
maxItems: 1
description: |
MDIO clock source frequency fixed to 100MHZ, this clock should be specified
by the platform IPQ807x, IPQ60xx and IPQ50xx.
items:
- description: MDIO clock source frequency fixed to 100MHZ
clock-names:
items:
- const: gcc_mdio_ahb_clk
required:
- compatible
@ -44,6 +50,26 @@ required:
- "#address-cells"
- "#size-cells"
allOf:
- $ref: "mdio.yaml#"
- if:
properties:
compatible:
contains:
enum:
- qcom,ipq5018-mdio
- qcom,ipq6018-mdio
- qcom,ipq8074-mdio
then:
required:
- clocks
- clock-names
else:
properties:
clocks: false
clock-names: false
unevaluatedProperties: false
examples:

View File

@ -20,6 +20,7 @@ properties:
enum:
- realtek,rtl8723bs-bt
- realtek,rtl8723cs-bt
- realtek,rtl8723ds-bt
- realtek,rtl8822cs-bt
device-wake-gpios:

View File

@ -0,0 +1,262 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/renesas,r8a779f0-ether-switch.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas Ethernet Switch
maintainers:
- Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
properties:
compatible:
const: renesas,r8a779f0-ether-switch
reg:
maxItems: 2
reg-names:
items:
- const: base
- const: secure_base
interrupts:
maxItems: 47
interrupt-names:
items:
- const: mfwd_error
- const: race_error
- const: coma_error
- const: gwca0_error
- const: gwca1_error
- const: etha0_error
- const: etha1_error
- const: etha2_error
- const: gptp0_status
- const: gptp1_status
- const: mfwd_status
- const: race_status
- const: coma_status
- const: gwca0_status
- const: gwca1_status
- const: etha0_status
- const: etha1_status
- const: etha2_status
- const: rmac0_status
- const: rmac1_status
- const: rmac2_status
- const: gwca0_rxtx0
- const: gwca0_rxtx1
- const: gwca0_rxtx2
- const: gwca0_rxtx3
- const: gwca0_rxtx4
- const: gwca0_rxtx5
- const: gwca0_rxtx6
- const: gwca0_rxtx7
- const: gwca1_rxtx0
- const: gwca1_rxtx1
- const: gwca1_rxtx2
- const: gwca1_rxtx3
- const: gwca1_rxtx4
- const: gwca1_rxtx5
- const: gwca1_rxtx6
- const: gwca1_rxtx7
- const: gwca0_rxts0
- const: gwca0_rxts1
- const: gwca1_rxts0
- const: gwca1_rxts1
- const: rmac0_mdio
- const: rmac1_mdio
- const: rmac2_mdio
- const: rmac0_phy
- const: rmac1_phy
- const: rmac2_phy
clocks:
maxItems: 1
resets:
maxItems: 1
iommus:
maxItems: 16
power-domains:
maxItems: 1
ethernet-ports:
type: object
additionalProperties: false
properties:
'#address-cells':
description: Port number of ETHA (TSNA).
const: 1
'#size-cells':
const: 0
patternProperties:
"^port@[0-9a-f]+$":
type: object
$ref: /schemas/net/ethernet-controller.yaml#
unevaluatedProperties: false
properties:
reg:
maxItems: 1
description:
Port number of ETHA (TSNA).
phys:
maxItems: 1
description:
Phandle of an Ethernet SERDES.
mdio:
$ref: /schemas/net/mdio.yaml#
unevaluatedProperties: false
required:
- reg
- phy-handle
- phy-mode
- phys
- mdio
required:
- compatible
- reg
- reg-names
- interrupts
- interrupt-names
- clocks
- resets
- power-domains
- ethernet-ports
additionalProperties: false
examples:
- |
#include <dt-bindings/clock/r8a779f0-cpg-mssr.h>
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/power/r8a779f0-sysc.h>
ethernet@e6880000 {
compatible = "renesas,r8a779f0-ether-switch";
reg = <0xe6880000 0x20000>, <0xe68c0000 0x20000>;
reg-names = "base", "secure_base";
interrupts = <GIC_SPI 256 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 257 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 258 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 259 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 260 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 261 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 262 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 263 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 265 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 266 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 267 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 268 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 269 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 270 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 271 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 272 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 273 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 274 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 276 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 277 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 278 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 280 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 281 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 282 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 283 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 284 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 285 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 286 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 287 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 288 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 289 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 290 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 291 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 292 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 293 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 294 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 295 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 296 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 297 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 298 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 299 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 301 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 302 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 304 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 305 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 306 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "mfwd_error", "race_error",
"coma_error", "gwca0_error",
"gwca1_error", "etha0_error",
"etha1_error", "etha2_error",
"gptp0_status", "gptp1_status",
"mfwd_status", "race_status",
"coma_status", "gwca0_status",
"gwca1_status", "etha0_status",
"etha1_status", "etha2_status",
"rmac0_status", "rmac1_status",
"rmac2_status",
"gwca0_rxtx0", "gwca0_rxtx1",
"gwca0_rxtx2", "gwca0_rxtx3",
"gwca0_rxtx4", "gwca0_rxtx5",
"gwca0_rxtx6", "gwca0_rxtx7",
"gwca1_rxtx0", "gwca1_rxtx1",
"gwca1_rxtx2", "gwca1_rxtx3",
"gwca1_rxtx4", "gwca1_rxtx5",
"gwca1_rxtx6", "gwca1_rxtx7",
"gwca0_rxts0", "gwca0_rxts1",
"gwca1_rxts0", "gwca1_rxts1",
"rmac0_mdio", "rmac1_mdio",
"rmac2_mdio",
"rmac0_phy", "rmac1_phy",
"rmac2_phy";
clocks = <&cpg CPG_MOD 1505>;
power-domains = <&sysc R8A779F0_PD_ALWAYS_ON>;
resets = <&cpg 1505>;
ethernet-ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0>;
phy-handle = <&eth_phy0>;
phy-mode = "sgmii";
phys = <&eth_serdes 0>;
mdio {
#address-cells = <1>;
#size-cells = <0>;
};
};
port@1 {
reg = <1>;
phy-handle = <&eth_phy1>;
phy-mode = "sgmii";
phys = <&eth_serdes 1>;
mdio {
#address-cells = <1>;
#size-cells = <0>;
};
};
port@2 {
reg = <2>;
phy-handle = <&eth_phy2>;
phy-mode = "sgmii";
phys = <&eth_serdes 2>;
mdio {
#address-cells = <1>;
#size-cells = <0>;
};
};
};
};

View File

@ -22,7 +22,8 @@ properties:
phandle of an I2C bus controller for the SFP two wire serial
maximum-power-milliwatt:
maxItems: 1
minimum: 1000
default: 1000
description:
Maximum module power consumption Specifies the maximum power consumption
allowable by a module in the slot, in milli-Watts. Presently, modules can

View File

@ -167,56 +167,238 @@ properties:
snps,mtl-rx-config:
$ref: /schemas/types.yaml#/definitions/phandle
description:
Multiple RX Queues parameters. Phandle to a node that can
contain the following properties
* snps,rx-queues-to-use, number of RX queues to be used in the
driver
* Choose one of these RX scheduling algorithms
* snps,rx-sched-sp, Strict priority
* snps,rx-sched-wsp, Weighted Strict priority
* For each RX queue
* Choose one of these modes
* snps,dcb-algorithm, Queue to be enabled as DCB
* snps,avb-algorithm, Queue to be enabled as AVB
* snps,map-to-dma-channel, Channel to map
* Specifiy specific packet routing
* snps,route-avcp, AV Untagged Control packets
* snps,route-ptp, PTP Packets
* snps,route-dcbcp, DCB Control Packets
* snps,route-up, Untagged Packets
* snps,route-multi-broad, Multicast & Broadcast Packets
* snps,priority, bitmask of the tagged frames priorities assigned to
the queue
Multiple RX Queues parameters. Phandle to a node that
implements the 'rx-queues-config' object described in
this binding.
rx-queues-config:
type: object
properties:
snps,rx-queues-to-use:
$ref: /schemas/types.yaml#/definitions/uint32
description: number of RX queues to be used in the driver
snps,rx-sched-sp:
type: boolean
description: Strict priority
snps,rx-sched-wsp:
type: boolean
description: Weighted Strict priority
allOf:
- if:
required:
- snps,rx-sched-sp
then:
properties:
snps,rx-sched-wsp: false
- if:
required:
- snps,rx-sched-wsp
then:
properties:
snps,rx-sched-sp: false
patternProperties:
"^queue[0-9]$":
description: Each subnode represents a queue.
type: object
properties:
snps,dcb-algorithm:
type: boolean
description: Queue to be enabled as DCB
snps,avb-algorithm:
type: boolean
description: Queue to be enabled as AVB
snps,map-to-dma-channel:
$ref: /schemas/types.yaml#/definitions/uint32
description: DMA channel id to map
snps,route-avcp:
type: boolean
description: AV Untagged Control packets
snps,route-ptp:
type: boolean
description: PTP Packets
snps,route-dcbcp:
type: boolean
description: DCB Control Packets
snps,route-up:
type: boolean
description: Untagged Packets
snps,route-multi-broad:
type: boolean
description: Multicast & Broadcast Packets
snps,priority:
$ref: /schemas/types.yaml#/definitions/uint32
description: Bitmask of the tagged frames priorities assigned to the queue
allOf:
- if:
required:
- snps,dcb-algorithm
then:
properties:
snps,avb-algorithm: false
- if:
required:
- snps,avb-algorithm
then:
properties:
snps,dcb-algorithm: false
- if:
required:
- snps,route-avcp
then:
properties:
snps,route-ptp: false
snps,route-dcbcp: false
snps,route-up: false
snps,route-multi-broad: false
- if:
required:
- snps,route-ptp
then:
properties:
snps,route-avcp: false
snps,route-dcbcp: false
snps,route-up: false
snps,route-multi-broad: false
- if:
required:
- snps,route-dcbcp
then:
properties:
snps,route-avcp: false
snps,route-ptp: false
snps,route-up: false
snps,route-multi-broad: false
- if:
required:
- snps,route-up
then:
properties:
snps,route-avcp: false
snps,route-ptp: false
snps,route-dcbcp: false
snps,route-multi-broad: false
- if:
required:
- snps,route-multi-broad
then:
properties:
snps,route-avcp: false
snps,route-ptp: false
snps,route-dcbcp: false
snps,route-up: false
additionalProperties: false
additionalProperties: false
snps,mtl-tx-config:
$ref: /schemas/types.yaml#/definitions/phandle
description:
Multiple TX Queues parameters. Phandle to a node that can
contain the following properties
* snps,tx-queues-to-use, number of TX queues to be used in the
driver
* Choose one of these TX scheduling algorithms
* snps,tx-sched-wrr, Weighted Round Robin
* snps,tx-sched-wfq, Weighted Fair Queuing
* snps,tx-sched-dwrr, Deficit Weighted Round Robin
* snps,tx-sched-sp, Strict priority
* For each TX queue
* snps,weight, TX queue weight (if using a DCB weight
algorithm)
* Choose one of these modes
* snps,dcb-algorithm, TX queue will be working in DCB
* snps,avb-algorithm, TX queue will be working in AVB
[Attention] Queue 0 is reserved for legacy traffic
and so no AVB is available in this queue.
* Configure Credit Base Shaper (if AVB Mode selected)
* snps,send_slope, enable Low Power Interface
* snps,idle_slope, unlock on WoL
* snps,high_credit, max write outstanding req. limit
* snps,low_credit, max read outstanding req. limit
* snps,priority, bitmask of the priorities assigned to the queue.
When a PFC frame is received with priorities matching the bitmask,
the queue is blocked from transmitting for the pause time specified
in the PFC frame.
Multiple TX Queues parameters. Phandle to a node that
implements the 'tx-queues-config' object described in
this binding.
tx-queues-config:
type: object
properties:
snps,tx-queues-to-use:
$ref: /schemas/types.yaml#/definitions/uint32
description: number of TX queues to be used in the driver
snps,tx-sched-wrr:
type: boolean
description: Weighted Round Robin
snps,tx-sched-wfq:
type: boolean
description: Weighted Fair Queuing
snps,tx-sched-dwrr:
type: boolean
description: Deficit Weighted Round Robin
snps,tx-sched-sp:
type: boolean
description: Strict priority
allOf:
- if:
required:
- snps,tx-sched-wrr
then:
properties:
snps,tx-sched-wfq: false
snps,tx-sched-dwrr: false
snps,tx-sched-sp: false
- if:
required:
- snps,tx-sched-wfq
then:
properties:
snps,tx-sched-wrr: false
snps,tx-sched-dwrr: false
snps,tx-sched-sp: false
- if:
required:
- snps,tx-sched-dwrr
then:
properties:
snps,tx-sched-wrr: false
snps,tx-sched-wfq: false
snps,tx-sched-sp: false
- if:
required:
- snps,tx-sched-sp
then:
properties:
snps,tx-sched-wrr: false
snps,tx-sched-wfq: false
snps,tx-sched-dwrr: false
patternProperties:
"^queue[0-9]$":
description: Each subnode represents a queue.
type: object
properties:
snps,weight:
$ref: /schemas/types.yaml#/definitions/uint32
description: TX queue weight (if using a DCB weight algorithm)
snps,dcb-algorithm:
type: boolean
description: TX queue will be working in DCB
snps,avb-algorithm:
type: boolean
description:
TX queue will be working in AVB.
Queue 0 is reserved for legacy traffic and so no AVB is
available in this queue.
snps,send_slope:
$ref: /schemas/types.yaml#/definitions/uint32
description: enable Low Power Interface
snps,idle_slope:
$ref: /schemas/types.yaml#/definitions/uint32
description: unlock on WoL
snps,high_credit:
$ref: /schemas/types.yaml#/definitions/uint32
description: max write outstanding req. limit
snps,low_credit:
$ref: /schemas/types.yaml#/definitions/uint32
description: max read outstanding req. limit
snps,priority:
$ref: /schemas/types.yaml#/definitions/uint32
description:
Bitmask of the tagged frames priorities assigned to the queue.
When a PFC frame is received with priorities matching the bitmask,
the queue is blocked from transmitting for the pause time specified
in the PFC frame.
allOf:
- if:
required:
- snps,dcb-algorithm
then:
properties:
snps,avb-algorithm: false
- if:
required:
- snps,avb-algorithm
then:
properties:
snps,dcb-algorithm: false
snps,weight: false
additionalProperties: false
additionalProperties: false
snps,reset-gpio:
deprecated: true
@ -463,41 +645,6 @@ additionalProperties: true
examples:
- |
stmmac_axi_setup: stmmac-axi-config {
snps,wr_osr_lmt = <0xf>;
snps,rd_osr_lmt = <0xf>;
snps,blen = <256 128 64 32 0 0 0>;
};
mtl_rx_setup: rx-queues-config {
snps,rx-queues-to-use = <1>;
snps,rx-sched-sp;
queue0 {
snps,dcb-algorithm;
snps,map-to-dma-channel = <0x0>;
snps,priority = <0x0>;
};
};
mtl_tx_setup: tx-queues-config {
snps,tx-queues-to-use = <2>;
snps,tx-sched-wrr;
queue0 {
snps,weight = <0x10>;
snps,dcb-algorithm;
snps,priority = <0x0>;
};
queue1 {
snps,avb-algorithm;
snps,send_slope = <0x1000>;
snps,idle_slope = <0x1000>;
snps,high_credit = <0x3E800>;
snps,low_credit = <0xFFC18000>;
snps,priority = <0x1>;
};
};
gmac0: ethernet@e0800000 {
compatible = "snps,dwxgmac-2.10", "snps,dwxgmac";
reg = <0xe0800000 0x8000>;
@ -516,6 +663,42 @@ examples:
snps,axi-config = <&stmmac_axi_setup>;
snps,mtl-rx-config = <&mtl_rx_setup>;
snps,mtl-tx-config = <&mtl_tx_setup>;
stmmac_axi_setup: stmmac-axi-config {
snps,wr_osr_lmt = <0xf>;
snps,rd_osr_lmt = <0xf>;
snps,blen = <256 128 64 32 0 0 0>;
};
mtl_rx_setup: rx-queues-config {
snps,rx-queues-to-use = <1>;
snps,rx-sched-sp;
queue0 {
snps,dcb-algorithm;
snps,map-to-dma-channel = <0x0>;
snps,priority = <0x0>;
};
};
mtl_tx_setup: tx-queues-config {
snps,tx-queues-to-use = <2>;
snps,tx-sched-wrr;
queue0 {
snps,weight = <0x10>;
snps,dcb-algorithm;
snps,priority = <0x0>;
};
queue1 {
snps,avb-algorithm;
snps,send_slope = <0x1000>;
snps,idle_slope = <0x1000>;
snps,high_credit = <0x3E800>;
snps,low_credit = <0xFFC18000>;
snps,priority = <0x1>;
};
};
mdio0 {
#address-cells = <1>;
#size-cells = <0>;

View File

@ -0,0 +1,73 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/socionext,synquacer-netsec.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Socionext NetSec Ethernet Controller IP
maintainers:
- Jassi Brar <jaswinder.singh@linaro.org>
- Ilias Apalodimas <ilias.apalodimas@linaro.org>
allOf:
- $ref: ethernet-controller.yaml#
properties:
compatible:
const: socionext,synquacer-netsec
reg:
items:
- description: control register area
- description: EEPROM holding the MAC address and microengine firmware
clocks:
maxItems: 1
clock-names:
const: phy_ref_clk
dma-coherent: true
interrupts:
maxItems: 1
mdio:
$ref: mdio.yaml#
required:
- compatible
- reg
- clocks
- clock-names
- interrupts
- mdio
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
ethernet@522d0000 {
compatible = "socionext,synquacer-netsec";
reg = <0x522d0000 0x10000>, <0x10000000 0x10000>;
interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&clk_netsec>;
clock-names = "phy_ref_clk";
phy-mode = "rgmii";
max-speed = <1000>;
max-frame-size = <9000>;
phy-handle = <&phy1>;
mdio {
#address-cells = <1>;
#size-cells = <0>;
phy1: ethernet-phy@1 {
compatible = "ethernet-phy-ieee802.3-c22";
reg = <1>;
};
};
};
...

View File

@ -1,56 +0,0 @@
* Socionext NetSec Ethernet Controller IP
Required properties:
- compatible: Should be "socionext,synquacer-netsec"
- reg: Address and length of the control register area, followed by the
address and length of the EEPROM holding the MAC address and
microengine firmware
- interrupts: Should contain ethernet controller interrupt
- clocks: phandle to the PHY reference clock
- clock-names: Should be "phy_ref_clk"
- phy-mode: See ethernet.txt file in the same directory
- phy-handle: See ethernet.txt in the same directory.
- mdio device tree subnode: When the Netsec has a phy connected to its local
mdio, there must be device tree subnode with the following
required properties:
- #address-cells: Must be <1>.
- #size-cells: Must be <0>.
For each phy on the mdio bus, there must be a node with the following
fields:
- compatible: Refer to phy.txt
- reg: phy id used to communicate to phy.
Optional properties: (See ethernet.txt file in the same directory)
- dma-coherent: Boolean property, must only be present if memory
accesses performed by the device are cache coherent.
- max-speed: See ethernet.txt in the same directory.
- max-frame-size: See ethernet.txt in the same directory.
The MAC address will be determined using the optional properties
defined in ethernet.txt. The 'phy-mode' property is required, but may
be set to the empty string if the PHY configuration is programmed by
the firmware or set by hardware straps, and needs to be preserved.
Example:
eth0: ethernet@522d0000 {
compatible = "socionext,synquacer-netsec";
reg = <0 0x522d0000 0x0 0x10000>, <0 0x10000000 0x0 0x10000>;
interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&clk_netsec>;
clock-names = "phy_ref_clk";
phy-mode = "rgmii";
max-speed = <1000>;
max-frame-size = <9000>;
phy-handle = <&phy1>;
mdio {
#address-cells = <1>;
#size-cells = <0>;
phy1: ethernet-phy@1 {
compatible = "ethernet-phy-ieee802.3-c22";
reg = <1>;
};
};

View File

@ -68,6 +68,8 @@ Optional properties:
- mdio : Child node for MDIO bus. Must be defined if PHY access is
required through the core's MDIO interface (i.e. always,
unless the PHY is accessed through a different bus).
Non-standard MDIO bus frequency is supported via
"clock-frequency", see mdio.yaml.
- pcs-handle: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X
modes, where "pcs-handle" should be used to point

View File

@ -0,0 +1,51 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/soc/mediatek/mediatek,mt7986-wo-ccif.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MediaTek Wireless Ethernet Dispatch (WED) WO controller interface for MT7986
maintainers:
- Lorenzo Bianconi <lorenzo@kernel.org>
- Felix Fietkau <nbd@nbd.name>
description:
The MediaTek wo-ccif provides a configuration interface for WED WO
controller used to perfrom offload rx packet processing (e.g. 802.11
aggregation packet reordering or rx header translation) on MT7986 soc.
properties:
compatible:
items:
- enum:
- mediatek,mt7986-wo-ccif
- const: syscon
reg:
maxItems: 1
interrupts:
maxItems: 1
required:
- compatible
- reg
- interrupts
additionalProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/interrupt-controller/irq.h>
soc {
#address-cells = <2>;
#size-cells = <2>;
syscon@151a5000 {
compatible = "mediatek,mt7986-wo-ccif", "syscon";
reg = <0 0x151a5000 0 0x1000>;
interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
};
};

View File

@ -42,15 +42,13 @@ properties:
bluetooth:
type: object
additionalProperties: false
allOf:
- $ref: /schemas/net/bluetooth/bluetooth-controller.yaml#
properties:
compatible:
const: qcom,wcnss-bt
local-bd-address:
$ref: /schemas/types.yaml#/definitions/uint8-array
maxItems: 6
description:
See Documentation/devicetree/bindings/net/bluetooth.txt
local-bd-address: true
required:
- compatible

View File

@ -566,7 +566,8 @@ miimon
link monitoring. A value of 100 is a good starting point.
The use_carrier option, below, affects how the link state is
determined. See the High Availability section for additional
information. The default value is 0.
information. The default value is 100 if arp_interval is not
set.
min_links
@ -956,6 +957,7 @@ xmit_hash_policy
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
hash = hash RSHIFT 1
And then hash is reduced modulo slave count.
If the protocol is IPv6 then the source and destination

View File

@ -1148,6 +1148,39 @@ tuning on deep embedded systems'. The author is running a MPC603e
load without any problems ...
Switchable Termination Resistors
--------------------------------
CAN bus requires a specific impedance across the differential pair,
typically provided by two 120Ohm resistors on the farthest nodes of
the bus. Some CAN controllers support activating / deactivating a
termination resistor(s) to provide the correct impedance.
Query the available resistances::
$ ip -details link show can0
...
termination 120 [ 0, 120 ]
Activate the terminating resistor::
$ ip link set dev can0 type can termination 120
Deactivate the terminating resistor::
$ ip link set dev can0 type can termination 0
To enable termination resistor support to a can-controller, either
implement in the controller's struct can-priv::
termination_const
termination_const_cnt
do_set_termination
or add gpio control with the device tree entries from
Documentation/devicetree/bindings/net/can/can-controller.yaml
The Virtual CAN Driver (vcan)
-----------------------------

View File

@ -181,10 +181,13 @@ when necessary using the below listed API::
- int dpaa2_mac_connect(struct dpaa2_mac *mac);
- void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
One can check for this condition using the below API::
A phylink integration is necessary only when the partner DPMAC is not of
``TYPE_FIXED``. This means it is either of ``TYPE_PHY``, or of
``TYPE_BACKPLANE`` (the difference being the two that in the ``TYPE_BACKPLANE``
mode, the MC firmware does not access the PCS registers). One can check for
this condition using the following helper::
- bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,struct fsl_mc_io *mc_io);
- static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
Before connection to a MAC, the caller must allocate and populate the
dpaa2_mac structure with the associated net_device, a pointer to the MC portal

View File

@ -23,6 +23,7 @@ Supported Devices
=================
Currently, this driver support following devices:
* Network controller: Cavium, Inc. Device b200
* Network controller: Cavium, Inc. Device b400
Interface Control
=================

View File

@ -25,7 +25,7 @@ Enabling the driver and kconfig options
| at build time via kernel Kconfig flags.
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
| For the list of advanced features please see below.
| For the list of advanced features, please see below.
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
@ -89,11 +89,11 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_EN_IPSEC=(y/n)**
| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
**CONFIG_MLX5_EN_TLS=(y/n)**
| TLS cryptography-offload accelaration.
| TLS cryptography-offload acceleration.
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
@ -139,14 +139,14 @@ flow_steering_mode: Device flow steering mode
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
1. 'dmfs' - Device managed flow steering.
2. 'smfs - Software/Driver managed flow steering.
2. 'smfs' - Software/Driver managed flow steering.
In DMFS mode, the HW steering entities are created and managed through the
Firmware.
In SMFS mode, the HW steering entities are created and managed though by
the driver directly into Hardware without firmware intervention.
the driver directly into hardware without firmware intervention.
SMFS mode is faster and provides better rule inserstion rate compared to default DMFS mode.
SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
User command examples:
@ -165,9 +165,9 @@ User command examples:
enable_roce: RoCE enablement state
----------------------------------
RoCE enablement state controls driver support for RoCE traffic.
When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well-known UDP RoCE port is handled as raw ethernet traffic.
To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload.
User command examples:
@ -186,7 +186,7 @@ User command examples:
esw_port_metadata: Eswitch port metadata state
----------------------------------------------
When applicable, disabling Eswitch metadata can increase packet rate
When applicable, disabling eswitch metadata can increase packet rate
up to 20% depending on the use case and packet sizes.
Eswitch port metadata state controls whether to internally tag packets with
@ -253,26 +253,26 @@ mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
A Subfunction has its own function capabilities and its own resources. This
A subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.
When a subfunction is RDMA capable, it has its own QP1, GID table and rdma
When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
resources neither shared nor stolen from the parent PCI function.
A subfunction has a dedicated window in PCI BAR space that is not shared
with ther other subfunctions or the parent PCI function. This ensures that all
devices (netdev, rdma, vdpa etc.) of the subfunction accesses only assigned
with the other subfunctions or the parent PCI function. This ensures that all
devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
PCI BAR space.
A Subfunction supports eswitch representation through which it supports tc
A subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with its parent PCI function.
Example mlx5 software, system and device view::
Example mlx5 software, system, and device view::
_______
| admin |
@ -310,7 +310,7 @@ Example mlx5 software, system and device view::
| (device add/del)
_____|____ ____|________
| | | subfunction |
| PCI NIC |---- activate/deactive events---->| host driver |
| PCI NIC |--- activate/deactivate events--->| host driver |
|__________| | (mlx5_core) |
|_____________|
@ -320,7 +320,7 @@ Subfunction is created using devlink port interface.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
- Add a devlink port of subfunction flaovur::
- Add a devlink port of subfunction flavour::
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
@ -351,46 +351,30 @@ driver.
MAC address setup
-----------------
mlx5 driver provides mechanism to setup the MAC address of the PCI VF/SF.
mlx5 driver support devlink port function attr mechanism to setup MAC
address. (refer to Documentation/networking/devlink/devlink-port.rst)
The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
device created for the PCI VF/SF.
RoCE capability setup
---------------------
Not all mlx5 PCI devices/SFs require RoCE capability.
- Get the MAC address of the VF identified by its unique devlink port index::
When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
PCI devices/SF.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00
mlx5 driver support devlink port function attr mechanism to setup RoCE
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
- Set the MAC address of the VF identified by its unique devlink port index::
migratable capability setup
---------------------------
User who wants mlx5 PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.
$ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:55
- Get the MAC address of the SF identified by its unique devlink port index::
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:00:00
- Set the MAC address of the VF identified by its unique devlink port index::
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcivf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:88:88
mlx5 driver support devlink port function attr mechanism to setup migratable
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
SF state setup
--------------
To use the SF, the user must active the SF using the SF function state
To use the SF, the user must activate the SF using the SF function state
attribute.
- Get the state of the SF identified by its unique devlink port index::
@ -447,7 +431,7 @@ for it.
Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
state of the function. This provides visiblity to the user to decide when is it
state of the function. This provides visibility to the user to decide when is it
safe to delete the SF port for graceful termination of the subfunction.
- Show the SF port operational state::
@ -464,14 +448,14 @@ tx reporter
-----------
The tx reporter is responsible for reporting and recovering of the following two error scenarios:
- TX timeout
- tx timeout
Report on kernel tx timeout detection.
Recover by searching lost interrupts.
- TX error completion
- tx error completion
Report on error tx completion.
Recover by flushing the TX queue and reset it.
Recover by flushing the tx queue and reset it.
TX reporter also support on demand diagnose callback, on which it provides
tx reporter also support on demand diagnose callback, on which it provides
real time information of its send queues status.
User commands examples:
@ -491,32 +475,32 @@ rx reporter
-----------
The rx reporter is responsible for reporting and recovering of the following two error scenarios:
- RX queues initialization (population) timeout
RX queues descriptors population on ring initialization is done in
napi context via triggering an irq, in case of a failure to get
the minimum amount of descriptors, a timeout would occur and it
could be recoverable by polling the EQ (Event Queue).
- RX completions with errors (reported by HW on interrupt context)
- rx queues' initialization (population) timeout
Population of rx queues' descriptors on ring initialization is done
in napi context via triggering an irq. In case of a failure to get
the minimum amount of descriptors, a timeout would occur, and
descriptors could be recovered by polling the EQ (Event Queue).
- rx completions with errors (reported by HW on interrupt context)
Report on rx completion error.
Recover (if needed) by flushing the related queue and reset it.
RX reporter also supports on demand diagnose callback, on which it
provides real time information of its receive queues status.
rx reporter also supports on demand diagnose callback, on which it
provides real time information of its receive queues' status.
- Diagnose rx queues status, and corresponding completion queue::
- Diagnose rx queues' status and corresponding completion queue::
$ devlink health diagnose pci/0000:82:00.0 reporter rx
NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
- Show number of rx errors indicated, number of recover flows ended successfully,
is autorecover enabled and graceful period from last recover::
is autorecover enabled, and graceful period from last recover::
$ devlink health show pci/0000:82:00.0 reporter rx
fw reporter
-----------
The fw reporter implements diagnose and dump callbacks.
The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
@ -537,7 +521,7 @@ running it on other PF or any VF will return "Operation not permitted".
fw fatal reporter
-----------------
The fw fatal reporter implements dump and recover callbacks.
The fw fatal reporter implements `dump` and `recover` callbacks.
It follows fatal errors indications by CR-space dump and recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
@ -552,7 +536,7 @@ User commands examples:
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
- Read FW CR-space dump if already strored or trigger new one::
- Read FW CR-space dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
@ -561,10 +545,10 @@ NOTE: This command can run only on PF.
mlx5 tracepoints
================
mlx5 driver provides internal trace points for tracking and debugging using
mlx5 driver provides internal tracepoints for tracking and debugging using
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
For the list of support mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
For the list of support mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
tc and eswitch offloads tracepoints:

View File

@ -1,50 +1,57 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
.. include:: <isonum.txt>
=============================================
Netronome Flow Processor (NFP) Kernel Drivers
=============================================
===========================================
Network Flow Processor (NFP) Kernel Drivers
===========================================
Copyright (c) 2019, Netronome Systems, Inc.
:Copyright: |copy| 2019, Netronome Systems, Inc.
:Copyright: |copy| 2022, Corigine, Inc.
Contents
========
- `Overview`_
- `Acquiring Firmware`_
- `Devlink Info`_
- `Configure Device`_
- `Statistics`_
Overview
========
This driver supports Netronome's line of Flow Processor devices,
including the NFP4000, NFP5000, and NFP6000 models, which are also
incorporated in the company's family of Agilio SmartNICs. The SR-IOV
physical and virtual functions for these devices are supported by
the driver.
This driver supports Netronome and Corigine's line of Network Flow Processor
devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
physical and virtual functions for these devices are supported by the driver.
Acquiring Firmware
==================
The NFP4000 and NFP6000 devices require application specific firmware
to function. Application firmware can be located either on the host file system
The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
to function. Application firmware can be located either on the host file system
or in the device flash (if supported by management firmware).
Firmware files on the host filesystem contain card type (`AMDA-*` string), media
config etc. They should be placed in `/lib/firmware/netronome` directory to
config etc. They should be placed in `/lib/firmware/netronome` directory to
load firmware from the host file system.
Firmware for basic NIC operation is available in the upstream
`linux-firmware.git` repository.
A more comprehensive list of firmware can be downloaded from the
`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
Firmware in NVRAM
-----------------
Recent versions of management firmware supports loading application
firmware from flash when the host driver gets probed. The firmware loading
firmware from flash when the host driver gets probed. The firmware loading
policy configuration may be used to configure this feature appropriately.
Devlink or ethtool can be used to update the application firmware on the device
flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
command. Users need to take care to write the correct firmware image for the
command. Users need to take care to write the correct firmware image for the
card and media configuration to flash.
Available storage space in flash depends on the card being used.
@ -79,9 +86,9 @@ You may need to use hard instead of symbolic links on distributions
which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
After changing firmware files you may need to regenerate the initramfs
image. Initramfs contains drivers and firmware files your system may
need to boot. Refer to the documentation of your distribution to find
out how to update initramfs. Good indication of stale initramfs
image. Initramfs contains drivers and firmware files your system may
need to boot. Refer to the documentation of your distribution to find
out how to update initramfs. Good indication of stale initramfs
is system loading wrong driver or firmware on boot, but when driver is
later reloaded manually everything works correctly.
@ -89,9 +96,9 @@ Selecting firmware per device
-----------------------------
Most commonly all cards on the system use the same type of firmware.
If you want to load specific firmware image for a specific card, you
can use either the PCI bus address or serial number. Driver will print
which files it's looking for when it recognizes a NFP device::
If you want to load a specific firmware image for a specific card, you
can use either the PCI bus address or serial number. The driver will
print which files it's looking for when it recognizes a NFP device::
nfp: Looking for firmware file in order of priority:
nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
@ -106,6 +113,15 @@ Note that `serial-*` and `pci-*` files are **not** automatically included
in initramfs, you will have to refer to documentation of appropriate tools
to find out how to include them.
Running firmware version
------------------------
The version of the loaded firmware for a particular <netdev> interface,
(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0) can
be displayed with the ethtool command::
$ ethtool -i <netdev>
Firmware loading policy
-----------------------
@ -132,6 +148,115 @@ abi_drv_load_ifc
Defines a list of PF devices allowed to load FW on the device.
This variable is not currently user configurable.
Devlink Info
============
The devlink info command displays the running and stored firmware versions
on the device, serial number and board information.
Devlink info command example (replace PCI address)::
$ devlink dev info pci/0000:03:00.0
pci/0000:03:00.0:
driver nfp
serial_number CSAAMDA2001-1003000111
versions:
fixed:
board.id AMDA2001-1003
board.rev 01
board.manufacture CSA
board.model mozart
running:
fw.mgmt 22.10.0-rc3
fw.cpld 0x1000003
fw.app nic-22.09.0
chip.init AMDA-2001-1003 1003000111
stored:
fw.bundle_id bspbundle_1003000111
fw.mgmt 22.10.0-rc3
fw.cpld 0x0
chip.init AMDA-2001-1003 1003000111
Configure Device
================
This section explains how to use Agilio SmartNICs running basic NIC firmware.
Configure interface link-speed
------------------------------
The following steps explains how to change between 10G mode and 25G mode on
Agilio CX 2x25GbE cards. The changing of port speed must be done in order,
port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
Down the respective interface(s)::
$ ip link set dev <netdev port 0> down
$ ip link set dev <netdev port 1> down
Set interface link-speed to 10G::
$ ethtool -s <netdev port 0> speed 10000
$ ethtool -s <netdev port 1> speed 10000
Set interface link-speed to 25G::
$ ethtool -s <netdev port 0> speed 25000
$ ethtool -s <netdev port 1> speed 25000
Reload driver for changes to take effect::
$ rmmod nfp; modprobe nfp
Configure interface Maximum Transmission Unit (MTU)
---------------------------------------------------
The MTU of interfaces can temporarily be set using the iproute2, ip link or
ifconfig tools. Note that this change will not persist. Setting this via
Network Manager, or another appropriate OS configuration tool, is
recommended as changes to the MTU using Network Manager can be made to
persist.
Set interface MTU to 9000 bytes::
$ ip link set dev <netdev port> mtu 9000
It is the responsibility of the user or the orchestration layer to set
appropriate MTU values when handling jumbo frames or utilizing tunnels. For
example, if packets sent from a VM are to be encapsulated on the card and
egress a physical port, then the MTU of the VF should be set to lower than
that of the physical port to account for the extra bytes added by the
additional header. If a setup is expected to see fallback traffic between
the SmartNIC and the kernel then the user should also ensure that the PF MTU
is appropriately set to avoid unexpected drops on this path.
Configure Forward Error Correction (FEC) modes
----------------------------------------------
Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
ReedSolomon and Off modes. Each physical port's FEC mode can be set
independently using ethtool. The supported FEC modes for an interface can
be viewed using::
$ ethtool <netdev>
The currently configured FEC mode can be viewed using::
$ ethtool --show-fec <netdev>
To force the FEC mode for a particular port, auto-negotiation must be disabled
(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
to Reed-Solomon is::
$ ethtool --set-fec <netdev> encoding rs
Auto-negotiation
----------------
To change auto-negotiation settings, the link must first be put down. After the
link is down, auto-negotiation can be enabled or disabled using::
ethtool -s <netdev> autoneg <on|off>
Statistics
==========

View File

@ -198,6 +198,11 @@ fw.bundle_id
Unique identifier of the entire firmware bundle.
fw.bootloader
-------------
Version of the bootloader.
Future work
===========

View File

@ -110,7 +110,7 @@ devlink ports for both the controllers.
Function configuration
======================
A user can configure the function attribute before enumerating the PCI
Users can configure one or more function attributes before enumerating the PCI
function. Usually it means, user should configure function attribute
before a bus specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
@ -119,9 +119,127 @@ function device to the driver. For subfunctions, this means user should
configure port function attribute before activating the port function.
A user may set the hardware address of the function using
'devlink port function set hw_addr' command. For Ethernet port function
`devlink port function set hw_addr` command. For Ethernet port function
this means a MAC address.
Users may also set the RoCE capability of the function using
`devlink port function set roce` command.
Users may also set the function as migratable using
'devlink port function set migratable' command.
Function attributes
===================
MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
device created for the PCI VF/SF.
- Get the MAC address of the VF identified by its unique devlink port index::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00
- Set the MAC address of the VF identified by its unique devlink port index::
$ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:11:22:33:44:55
- Get the MAC address of the SF identified by its unique devlink port index::
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:00:00
- Set the MAC address of the SF identified by its unique devlink port index::
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:88:88
RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.
When RoCE capability is disabled, it saves system memory per PCI VF/SF.
When user disables RoCE capability for a VF/SF, user application cannot send or
receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
will be empty.
When RoCE capability is disabled in the device using port function attribute,
VF/SF driver cannot override it.
- Get RoCE capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce enable
- Set RoCE capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 roce disable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce disable
migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.
User who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.
When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
with migration support, the user can migrate the VM with this VF from one HV to a
different one.
However, when migratable capability is enable, device will disable features which cannot
be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
Example of LM with migratable function configuration:
- Get migratable capability of the VF device::
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable disable
- Set migratable capability of the VF device::
$ devlink port function set pci/0000:06:00.0/2 migratable enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable enable
- Bind VF to VFIO driver with migration support::
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
Attach VF to the VM.
Start the VM.
Perform live migration.
Subfunction
============
@ -130,10 +248,11 @@ it is deployed. Subfunction is created and deployed in unit of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.
To use a subfunction, 3 steps setup sequence is followed.
(1) create - create a subfunction;
(2) configure - configure subfunction attributes;
(3) deploy - deploy the subfunction;
To use a subfunction, 3 steps setup sequence is followed:
1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;
Subfunction management is done using devlink port user interface.
User performs setup on the subfunction management device.
@ -191,13 +310,48 @@ API allows to configure following rate object's parameters:
``tx_max``
Maximum TX rate value.
``tx_priority``
Allows for usage of strict priority arbiter among siblings. This
arbitration scheme attempts to schedule nodes based on their priority
as long as the nodes remain within their bandwidth limit. The higher the
priority the higher the probability that the node will get selected for
scheduling.
``tx_weight``
Allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with the
strict priority. As a node is configured with a higher rate it gets more
BW relative to it's siblings. Values are relative like a percentage
points, they basically tell how much BW should node take relative to
it's siblings.
``parent``
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
Arbitration flow from the high level:
#. Choose a node, or group of nodes with the highest priority that stays
within the BW limit and are not blocked. Use ``tx_priority`` as a
parameter for this arbitration.
#. If group of nodes have the same priority perform WFQ arbitration on
that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
#. Select the winner node, and continue arbitration flow among it's children,
until leaf node is reached, and the winner is established.
#. If all the nodes from the highest priority sub-group are satisfied, or
overused their assigned BW, move to the lower priority nodes.
Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters.
and setting methods of their parameters. Additionally driver implementation
may export nodes/leafs and their child-parent relationships.
Terms and Definitions
=====================

View File

@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in
the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
the snapshot information to user space.
Regions may optionally allow directly reading from their contents without a
snapshot. Direct read requests are not atomic. In particular a read request
of size 256 bytes or larger will be split into multiple chunks. If atomic
access is required, use a snapshot. A driver wishing to enable this for a
region should implement the ``.read`` callback in the ``devlink_region_ops``
structure. User space can request a direct read by using the
``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
id.
example usage
-------------
@ -65,6 +74,10 @@ example usage
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
# Read from the region without a snapshot
$ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
As regions are likely very device or driver specific, no generic regions are
defined. See the driver-specific documentation files for information on the
specific regions a driver supports.

View File

@ -485,6 +485,16 @@ be added to the following table:
- Traps incoming packets that the device decided to drop because
the destination MAC is not configured in the MAC table and
the interface is not in promiscuous mode
* - ``eapol``
- ``control``
- Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
specified in IEEE 802.1X
* - ``locked_port``
- ``drop``
- Traps packets that the device decided to drop because they failed the
locked bridge port check. That is, packets that were received via a
locked port and whose {SMAC, VID} does not correspond to an FDB entry
pointing to the port
Driver-specific Packet Traps
============================
@ -589,6 +599,9 @@ narrow. The description of these groups must be added to the following table:
* - ``parser_error_drops``
- Contains packet traps for packets that were marked by the device during
parsing as erroneous
* - ``eapol``
- Contains packet traps for "Extensible Authentication Protocol over LAN"
(EAPOL) packets specified in IEEE 802.1X
Packet Trap Policers
====================

View File

@ -0,0 +1,36 @@
.. SPDX-License-Identifier: GPL-2.0
==========================
etas_es58x devlink support
==========================
This document describes the devlink features implemented by the
``etas_es58x`` device driver.
Info versions
=============
The ``etas_es58x`` driver reports the following versions
.. list-table:: devlink info versions implemented
:widths: 5 5 90
* - Name
- Type
- Description
* - ``fw``
- running
- Version of the firmware running on the device. Also available
through ``ethtool -i`` as the first member of the
``firmware-version``.
* - ``fw.bootloader``
- running
- Version of the bootloader running on the device. Also available
through ``ethtool -i`` as the second member of the
``firmware-version``.
* - ``board.rev``
- fixed
- The hardware revision of the device.
* - ``serial_number``
- fixed
- The USB serial number. Also available through ``lsusb -v``.

View File

@ -189,12 +189,21 @@ device data.
* - ``nvm-flash``
- The contents of the entire flash chip, sometimes referred to as
the device's Non Volatile Memory.
* - ``shadow-ram``
- The contents of the Shadow RAM, which is loaded from the beginning
of the flash. Although the contents are primarily from the flash,
this area also contains data generated during device boot which is
not stored in flash.
* - ``device-caps``
- The contents of the device firmware's capabilities buffer. Useful to
determine the current state and configuration of the device.
Users can request an immediate capture of a snapshot via the
``DEVLINK_CMD_REGION_NEW``
Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.
Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.
.. code:: shell
@ -254,3 +263,118 @@ Users can request an immediate capture of a snapshot via the
0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
Devlink Rate
============
The ``ice`` driver implements devlink-rate API. It allows for offload of
the Hierarchical QoS to the hardware. It enables user to group Virtual
Functions in a tree structure and assign supported parameters: tx_share,
tx_max, tx_priority and tx_weight to each node in a tree. So effectively
user gains an ability to control how much bandwidth is allocated for each
VF group. This is later enforced by the HW.
It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of the new traffic class. The driver will prevent DCB
or ADQ configuration if user started making any changes to the nodes using
devlink-rate API. To configure those features a driver reload is necessary.
Correspondingly if ADQ or DCB will get configured the driver won't export
hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported, but before any
changes are made.
This feature is also dependent on switchdev being enabled in the system.
It's required bacause devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.
If the driver is set to the switchdev mode, it will export internal
hierarchy the moment VF's are created. Root of the tree is always
represented by the node_0. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.
.. list-table:: Attributes supported
:widths: 15 85
* - Name
- Description
* - ``tx_max``
- maximum bandwidth to be consumed by the tree Node. Rate Limit is
an absolute number specifying a maximum amount of bytes a Node may
consume during the course of one second. Rate limit guarantees
that a link will not oversaturate the receiver on the remote end
and also enforces an SLA between the subscriber and network
provider.
* - ``tx_share``
- minimum bandwidth allocated to a tree node when it is not blocked.
It specifies an absolute BW. While tx_max defines the maximum
bandwidth the node may consume, the tx_share marks committed BW
for the Node.
* - ``tx_priority``
- allows for usage of strict priority arbiter among siblings. This
arbitration scheme attempts to schedule nodes based on their
priority as long as the nodes remain within their bandwidth limit.
Range 0-7. Nodes with priority 7 have the highest priority and are
selected first, while nodes with priority 0 have the lowest
priority. Nodes that have the same priority are treated equally.
* - ``tx_weight``
- allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with
the strict priority. Range 1-200. Only relative values mater for
arbitration.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
.. code:: shell
# enable switchdev
$ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
# at this point driver should export internal hierarchy
$ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
$ devlink port function rate show
pci/0000:4b:00.0/node_25: type node parent node_24
pci/0000:4b:00.0/node_24: type node parent node_0
pci/0000:4b:00.0/node_32: type node parent node_31
pci/0000:4b:00.0/node_31: type node parent node_30
pci/0000:4b:00.0/node_30: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_25
pci/0000:4b:00.0/2: type leaf parent node_25
# let's create some custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
# second custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
# reassign second VF to newly created branch
$ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
# assign tx_weight to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
# assign tx_share to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps

View File

@ -222,6 +222,7 @@ Userspace to kernel:
``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
``ETHTOOL_MSG_PSE_SET`` set PSE parameters
``ETHTOOL_MSG_PSE_GET`` get PSE parameters
``ETHTOOL_MSG_RSS_GET`` get RSS settings
===================================== =================================
Kernel to userspace:
@ -263,6 +264,7 @@ Kernel to userspace:
``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@ -491,6 +493,7 @@ Kernel response contents:
``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max support SQI value
``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT`` u32 count of link down events
==================================== ====== ============================
For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
@ -1686,6 +1689,33 @@ to control PoDL PSE Admin functions. This option is implementing
``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
RSS_GET
=======
Get indirection table, hash key and hash function info associated with a
RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
Request contents:
===================================== ====== ==========================
``ETHTOOL_A_RSS_HEADER`` nested request header
``ETHTOOL_A_RSS_CONTEXT`` u32 context number
===================================== ====== ==========================
Kernel response contents:
===================================== ====== ==========================
``ETHTOOL_A_RSS_HEADER`` nested reply header
``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
===================================== ====== ==========================
ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
being used. Current supported options are toeplitz, xor or crc32.
ETHTOOL_A_RSS_INDIR attribute returns RSS indrection table where each byte
indicates queue number.
Request translation
===================
@ -1767,7 +1797,7 @@ are netlink only.
``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
``ETHTOOL_GRSSH`` n/a
``ETHTOOL_GRSSH`` ``ETHTOOL_MSG_RSS_GET``
``ETHTOOL_SRSSH`` n/a
``ETHTOOL_GTUNABLE`` n/a
``ETHTOOL_STUNABLE`` n/a

View File

@ -104,6 +104,7 @@ Contents:
switchdev
sysfs-tagging
tc-actions-env-rules
tc-queue-filters
tcp-thin
team
timestamping

View File

@ -1069,6 +1069,81 @@ tcp_child_ehash_entries - INTEGER
Default: 0
tcp_plb_enabled - BOOLEAN
If set and the underlying congestion control (e.g. DCTCP) supports
and enables PLB feature, TCP PLB (Protective Load Balancing) is
enabled. PLB is described in the following paper:
https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
upon sensing sustained congestion, TCP triggers a change in
flow label field for outgoing IPv6 packets. A change in flow label
field potentially changes the path of outgoing packets for switches
that use ECMP/WCMP for routing.
PLB changes socket txhash which results in a change in IPv6 Flow Label
field, and currently no-op for IPv4 headers. It is possible
to apply PLB for IPv4 with other network header fields (e.g. TCP
or IPv4 options) or using encapsulation where outer header is used
by switches to determine next hop. In either case, further host
and switch side changes will be needed.
When set, PLB assumes that congestion signal (e.g. ECN) is made
available and used by congestion control module to estimate a
congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
make repathing decisions.
Default: FALSE
tcp_plb_idle_rehash_rounds - INTEGER
Number of consecutive congested rounds (RTT) seen after which
a rehash can be performed, given there are no packets in flight.
This is referred to as M in PLB paper:
https://doi.org/10.1145/3544216.3544226.
Possible Values: 0 - 31
Default: 3
tcp_plb_rehash_rounds - INTEGER
Number of consecutive congested rounds (RTT) seen after which
a forced rehash can be performed. Be careful when setting this
parameter, as a small value increases the risk of retransmissions.
This is referred to as N in PLB paper:
https://doi.org/10.1145/3544216.3544226.
Possible Values: 0 - 31
Default: 12
tcp_plb_suspend_rto_sec - INTEGER
Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
having PLB repath onto a connectivity "black hole", after an RTO a TCP
connection suspends PLB repathing for a random duration between 1x and
2x of this parameter. Randomness is added to avoid concurrent rehashing
of multiple TCP connections. This should be set corresponding to the
amount of time it takes to repair a failed link.
Possible Values: 0 - 255
Default: 60
tcp_plb_cong_thresh - INTEGER
Fraction of packets marked with congestion over a round (RTT) to
tag that round as congested. This is referred to as K in the PLB paper:
https://doi.org/10.1145/3544216.3544226.
The 0-1 fraction range is mapped to 0-256 range to avoid floating
point operations. For example, 128 means that if at least 50% of
the packets in a round were marked as congested then the round
will be tagged as congested.
Setting threshold to 0 means that PLB repaths every RTT regardless
of congestion. This is not intended behavior for PLB and should be
used only for experimentation purpose.
Possible Values: 0 - 256
Default: 128
UDP variables
=============
@ -1102,6 +1177,33 @@ udp_rmem_min - INTEGER
udp_wmem_min - INTEGER
UDP does not have tx memory accounting and this tunable has no effect.
udp_hash_entries - INTEGER
Show the number of hash buckets for UDP sockets in the current
networking namespace.
A negative value means the networking namespace does not own its
hash buckets and shares the initial networking namespace's one.
udp_child_ehash_entries - INTEGER
Control the number of hash buckets for UDP sockets in the child
networking namespace, which must be set before clone() or unshare().
If the value is not 0, the kernel uses a value rounded up to 2^n
as the actual hash bucket size. 0 is a special value, meaning
the child networking namespace will share the initial networking
namespace's hash buckets.
Note that the child will use the global one in case the kernel
fails to allocate enough memory. In addition, the global hash
buckets are spread over available NUMA nodes, but the allocation
of the child hash table depends on the current process's NUMA
policy, which could result in performance differences.
Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
Default: 0
RAW variables
=============
@ -3025,6 +3127,15 @@ ecn_enable - BOOLEAN
Default: 1
l3mdev_accept - BOOLEAN
Enabling this option allows a "global" bound socket to work
across L3 master domains (e.g., VRFs) with packets capable of
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
Default: 1 (enabled)
``/proc/sys/net/core/*``
========================

View File

@ -129,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
est_cpulist - CPULIST
Allowed CPUs for estimation kthreads
Syntax: standard cpulist format
empty list - stop kthread tasks and estimation
default - the system's housekeeping CPUs for kthreads
Example:
"all": all possible CPUs
"0-N": all possible CPUs, N denotes last CPU number
"0,1-N:1/2": first and all CPUs with odd number
"": empty list
est_nice - INTEGER
default 0
Valid range: -20 (more favorable) .. 19 (less favorable)
Niceness value to use for the estimation kthreads (scheduling
priority)
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
@ -304,8 +324,8 @@ run_estimation - BOOLEAN
0 - disabled
not 0 - enabled (default)
If disabled, the estimation will be stop, and you can't see
any update on speed estimation data.
If disabled, the estimation will be suspended and kthread tasks
stopped.
You can always re-enable estimation by setting this value to 1.
But be careful, the first estimation after re-enable is not

View File

@ -0,0 +1,37 @@
.. SPDX-License-Identifier: GPL-2.0
=========================
TC queue based filtering
=========================
TC can be used for directing traffic to either a set of queues or
to a single queue on both the transmit and receive side.
On the transmit side:
1) TC filter directing traffic to a set of queues is achieved
using the action skbedit priority for Tx priority selection,
the priority maps to a traffic class (set of queues) when
the queue-sets are configured using mqprio.
2) TC filter directs traffic to a transmit queue with the action
skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
for transmit queue is executed in software only and cannot be
offloaded.
Likewise, on the receive side, the two filters for selecting set of
queues and/or a single queue are supported as below:
1) TC flower filter directs incoming traffic to a set of queues using
the 'hw_tc' option.
hw_tc $TCID - Specify a hardware traffic class to pass matching
packets on to. TCID is in the range 0 through 15.
2) TC filter with action skbedit queue_mapping $rx_qid selects a
receive queue. The action skbedit queue_mapping for receive queue
is supported only in hardware. Multiple filters may compete in
the hardware for queue selection. In such case, the hardware
pipeline resolves conflicts based on priority. On Intel E810
devices, TC filter directing traffic to a queue have higher
priority over flow director filter assigning a queue. The hash
filter has lowest priority.

View File

@ -179,7 +179,8 @@ SOF_TIMESTAMPING_OPT_ID:
identifier and returns that along with the timestamp. The identifier
is derived from a per-socket u32 counter (that wraps). For datagram
sockets, the counter increments with each sent packet. For stream
sockets, it increments with every byte.
sockets, it increments with every byte. For stream sockets, also set
SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
The counter starts at zero. It is initialized the first time that
the socket option is enabled. It is reset each time the option is
@ -192,6 +193,35 @@ SOF_TIMESTAMPING_OPT_ID:
among all possibly concurrently outstanding timestamp requests for
that socket.
SOF_TIMESTAMPING_OPT_ID_TCP:
Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
counter increments for stream sockets, but its starting point is
not entirely trivial. This option fixes that.
For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
always be set too. On datagram sockets the option has no effect.
A reasonable expectation is that the counter is reset to zero with
the system call, so that a subsequent write() of N bytes generates
a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
implements this behavior under all conditions.
SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
especially when the socket option is set when no data is in
transmission. If data is being transmitted, it may be off by the
length of the output queue (SIOCOUTQ).
The difference is due to being based on snd_una versus write_seq.
snd_una is the offset in the stream acknowledged by the peer. This
depends on factors outside of process control, such as network RTT.
write_seq is the last byte written by the process. This offset is
not affected by external inputs.
The difference is subtle and unlikely to be noticed when configured
at initial socket creation, when no data is queued or sent. But
SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
when the socket option is set.
SOF_TIMESTAMPING_OPT_CMSG:
Support recv() cmsg for all timestamped packets. Control messages

View File

@ -5,6 +5,7 @@ XFRM device - offloading the IPsec computations
===============================================
Shannon Nelson <shannon.nelson@oracle.com>
Leon Romanovsky <leonro@nvidia.com>
Overview
@ -18,10 +19,21 @@ can radically increase throughput and decrease CPU utilization. The XFRM
Device interface allows NIC drivers to offer to the stack access to the
hardware offload.
Right now, there are two types of hardware offload that kernel supports.
* IPsec crypto offload:
* NIC performs encrypt/decrypt
* Kernel does everything else
* IPsec packet offload:
* NIC performs encrypt/decrypt
* NIC does encapsulation
* Kernel and NIC have SA and policy in-sync
* NIC handles the SA and policies states
* The Kernel talks to the keymanager
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
like this::
like this for crypto offload:
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@ -29,6 +41,17 @@ like this::
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload dev eth4 dir in
and for packet offload
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload packet dev eth4 dir in
ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in
tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
@ -40,17 +63,24 @@ Callbacks to implement
/* from include/linux/netdevice.h */
struct xfrmdev_ops {
/* Crypto and Packet offload callbacks */
int (*xdo_dev_state_add) (struct xfrm_state *x);
void (*xdo_dev_state_delete) (struct xfrm_state *x);
void (*xdo_dev_state_free) (struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
/* Solely packet offload callbacks */
void (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
int (*xdo_dev_policy_add) (struct xfrm_policy *x);
void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
void (*xdo_dev_policy_free) (struct xfrm_policy *x);
};
The NIC driver offering ipsec offload will need to implement these
callbacks to make the offload available to the network stack's
XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
The NIC driver offering ipsec offload will need to implement callbacks
relevant to supported offload to make the offload available to the network
stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
@ -79,7 +109,8 @@ and an indication of whether it is for Rx or Tx. The driver should
=========== ===================================
0 success
-EOPNETSUPP offload not supported, try SW IPsec
-EOPNETSUPP offload not supported, try SW IPsec,
not applicable for packet offload mode
other fail the request
=========== ===================================
@ -96,6 +127,7 @@ will serviceable. This can check the packet information to be sure the
offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
return true of false to signify its support.
Crypto offload mode:
When ready to send, the driver needs to inspect the Tx packet for the
offload information, including the opaque context, and set up the packet
send accordingly::
@ -139,13 +171,25 @@ the stack in xfrm_input().
In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
Driver will check packet seq number and update HW ESN state machine if needed.
Packet offload mode:
HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW
reported success. In TX path, the packet lefts kernel without extra header
and not encrypted, the HW is responsible to perform it.
When the SA is removed by the user, the driver's xdo_dev_state_delete()
is asked to disable the offload. Later, xdo_dev_state_free() is called
from a garbage collection routine after all reference counts to the state
and xdo_dev_policy_delete() are asked to disable the offload. Later,
xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
collection routine after all reference counts to the state and policy
have been removed and any remaining resources can be cleared for the
offload state. How these are used by the driver will depend on specific
hardware needs.
As a netdev is set to DOWN the XFRM stack's netdev listener will call
xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
states.
xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
xdo_dev_policy_free() on any remaining offloaded states.
Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
The HW/driver are responsible to perform it and provide accurate data when
xdo_dev_state_update_curlft() is called. In case of one of these limits
occuried, the driver needs to call to xfrm_state_check_expire() to make sure
that XFRM performs rekeying sequence.

View File

@ -1932,6 +1932,7 @@ F: Documentation/devicetree/bindings/interrupt-controller/apple,*
F: Documentation/devicetree/bindings/iommu/apple,dart.yaml
F: Documentation/devicetree/bindings/iommu/apple,sart.yaml
F: Documentation/devicetree/bindings/mailbox/apple,mailbox.yaml
F: Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml
F: Documentation/devicetree/bindings/nvme/apple,nvme-ans.yaml
F: Documentation/devicetree/bindings/nvmem/apple,efuses.yaml
F: Documentation/devicetree/bindings/pci/apple,pcie.yaml
@ -1939,6 +1940,7 @@ F: Documentation/devicetree/bindings/pinctrl/apple,pinctrl.yaml
F: Documentation/devicetree/bindings/power/apple*
F: Documentation/devicetree/bindings/watchdog/apple,wdt.yaml
F: arch/arm64/boot/dts/apple/
F: drivers/bluetooth/hci_bcm4377.c
F: drivers/clk/clk-apple-nco.c
F: drivers/cpufreq/apple-soc-cpufreq.c
F: drivers/dma/apple-admac.c
@ -2470,6 +2472,7 @@ L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Supported
T: git git://github.com/microchip-ung/linux-upstream.git
F: arch/arm64/boot/dts/microchip/
F: drivers/net/ethernet/microchip/vcap/
F: drivers/pinctrl/pinctrl-microchip-sgpio.c
N: sparx5
@ -6362,6 +6365,7 @@ F: drivers/net/ethernet/freescale/dpaa2/Kconfig
F: drivers/net/ethernet/freescale/dpaa2/Makefile
F: drivers/net/ethernet/freescale/dpaa2/dpaa2-eth*
F: drivers/net/ethernet/freescale/dpaa2/dpaa2-mac*
F: drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk*
F: drivers/net/ethernet/freescale/dpaa2/dpkg.h
F: drivers/net/ethernet/freescale/dpaa2/dpmac*
F: drivers/net/ethernet/freescale/dpaa2/dpni*
@ -7734,6 +7738,7 @@ ETAS ES58X CAN/USB DRIVER
M: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
L: linux-can@vger.kernel.org
S: Maintained
F: Documentation/networking/devlink/etas_es58x.rst
F: drivers/net/can/usb/etas_es58x/
ETHERNET BRIDGE
@ -8236,7 +8241,10 @@ S: Maintained
F: drivers/i2c/busses/i2c-cpm.c
FREESCALE IMX / MXC FEC DRIVER
M: Joakim Zhang <qiangqing.zhang@nxp.com>
M: Wei Fang <wei.fang@nxp.com>
R: Shenwei Wang <shenwei.wang@nxp.com>
R: Clark Wang <xiaoning.wang@nxp.com>
R: NXP Linux Team <linux-imx@nxp.com>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/fsl,fec.yaml
@ -9493,8 +9501,9 @@ F: Documentation/devicetree/bindings/iio/humidity/st,hts221.yaml
F: drivers/iio/humidity/hts221*
HUAWEI ETHERNET DRIVER
M: Cai Huoqing <cai.huoqing@linux.dev>
L: netdev@vger.kernel.org
S: Orphan
S: Maintained
F: Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
F: drivers/net/ethernet/huawei/hinic/
@ -9597,6 +9606,7 @@ F: include/asm-generic/hyperv-tlfs.h
F: include/asm-generic/mshyperv.h
F: include/clocksource/hyperv_timer.h
F: include/linux/hyperv.h
F: include/net/mana
F: include/uapi/linux/hyperv.h
F: net/vmw_vsock/hyperv_transport.c
F: tools/hv/
@ -12410,7 +12420,7 @@ M: Marcin Wojtas <mw@semihalf.com>
M: Russell King <linux@armlinux.org.uk>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/marvell-pp2.txt
F: Documentation/devicetree/bindings/net/marvell,pp2.yaml
F: drivers/net/ethernet/marvell/mvpp2/
MARVELL MWIFIEX WIRELESS DRIVER
@ -12458,7 +12468,7 @@ F: Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
F: drivers/net/ethernet/marvell/octeontx2/af/
MARVELL PRESTERA ETHERNET SWITCH DRIVER
M: Taras Chornyi <tchornyi@marvell.com>
M: Taras Chornyi <taras.chornyi@plvision.eu>
S: Supported
W: https://github.com/Marvell-switching/switchdev-prestera
F: drivers/net/ethernet/marvell/prestera/
@ -13017,6 +13027,7 @@ M: Felix Fietkau <nbd@nbd.name>
M: John Crispin <john@phrozen.org>
M: Sean Wang <sean.wang@mediatek.com>
M: Mark Lee <Mark-MC.Lee@mediatek.com>
M: Lorenzo Bianconi <lorenzo@kernel.org>
L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/ethernet/mediatek/
@ -14048,6 +14059,7 @@ F: include/uapi/linux/meye.h
MOTORCOMM PHY DRIVER
M: Peter Geis <pgwipeout@gmail.com>
M: Frank <Frank.Sae@motor-comm.com>
L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/phy/motorcomm.c
@ -19175,7 +19187,7 @@ M: Jassi Brar <jaswinder.singh@linaro.org>
M: Ilias Apalodimas <ilias.apalodimas@linaro.org>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/socionext-netsec.txt
F: Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml
F: drivers/net/ethernet/socionext/netsec.c
SOCIONEXT (SNI) Synquacer SPI DRIVER
@ -20828,7 +20840,6 @@ W: https://wireless.wiki.kernel.org/en/users/Drivers/wl12xx
W: https://wireless.wiki.kernel.org/en/users/Drivers/wl1251
T: git git://git.kernel.org/pub/scm/linux/kernel/git/luca/wl12xx.git
F: drivers/net/wireless/ti/
F: include/linux/wl12xx.h
TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER
M: John Stultz <jstultz@google.com>

View File

@ -178,6 +178,8 @@
/* Network controller */
ethernet: ethernet@f0000 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "marvell,armada-375-pp2";
reg = <0xf0000 0xa000>, /* Packet Processor regs */
<0xc0000 0x3060>, /* LMS regs */
@ -187,15 +189,17 @@
clock-names = "pp_clk", "gop_clk";
status = "disabled";
eth0: eth0 {
eth0: ethernet-port@0 {
interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
port-id = <0>;
reg = <0>;
port-id = <0>; /* For backward compatibility. */
status = "disabled";
};
eth1: eth1 {
eth1: ethernet-port@1 {
interrupts = <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>;
port-id = <1>;
reg = <1>;
port-id = <1>; /* For backward compatibility. */
status = "disabled";
};
};

View File

@ -10,7 +10,6 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of_platform.h>
#include <linux/wl12xx.h>
#include <linux/mmc/card.h>
#include <linux/mmc/host.h>
#include <linux/power/smartreflex.h>

View File

@ -21,6 +21,10 @@
};
};
&bluetooth0 {
brcm,board-type = "apple,atlantisb";
};
&wifi0 {
brcm,board-type = "apple,atlantisb";
};

View File

@ -17,6 +17,10 @@
model = "Apple MacBook Pro (13-inch, M1, 2020)";
};
&bluetooth0 {
brcm,board-type = "apple,honshu";
};
&wifi0 {
brcm,board-type = "apple,honshu";
};

View File

@ -17,6 +17,10 @@
model = "Apple MacBook Air (M1, 2020)";
};
&bluetooth0 {
brcm,board-type = "apple,shikoku";
};
&wifi0 {
brcm,board-type = "apple,shikoku";
};

View File

@ -21,6 +21,10 @@
};
};
&bluetooth0 {
brcm,board-type = "apple,capri";
};
&wifi0 {
brcm,board-type = "apple,capri";
};

View File

@ -21,6 +21,10 @@
};
};
&bluetooth0 {
brcm,board-type = "apple,santorini";
};
&wifi0 {
brcm,board-type = "apple,santorini";
};

View File

@ -11,6 +11,7 @@
/ {
aliases {
bluetooth0 = &bluetooth0;
serial0 = &serial0;
serial2 = &serial2;
wifi0 = &wifi0;
@ -77,6 +78,13 @@
local-mac-address = [00 00 00 00 00 00];
apple,antenna-sku = "XX";
};
bluetooth0: bluetooth@0,1 {
compatible = "pci14e4,5f69";
reg = <0x10100 0x0 0x0 0x0 0x0>;
/* To be filled by the loader */
local-bd-address = [00 00 00 00 00 00];
};
};
&nco_clkref {

View File

@ -24,9 +24,12 @@
/* these aliases provide the FMan ports mapping */
enet0: ethernet@e0000 {
pcs-handle-names = "qsgmii";
};
enet1: ethernet@e2000 {
pcsphy-handle = <&pcsphy1>, <&qsgmiib_pcs1>;
pcs-handle-names = "sgmii", "qsgmii";
};
enet2: ethernet@e4000 {
@ -36,11 +39,32 @@
};
enet4: ethernet@e8000 {
pcsphy-handle = <&pcsphy4>, <&qsgmiib_pcs2>;
pcs-handle-names = "sgmii", "qsgmii";
};
enet5: ethernet@ea000 {
pcsphy-handle = <&pcsphy5>, <&qsgmiib_pcs3>;
pcs-handle-names = "sgmii", "qsgmii";
};
enet6: ethernet@f0000 {
};
mdio@e1000 {
qsgmiib_pcs1: ethernet-pcs@1 {
compatible = "fsl,lynx-pcs";
reg = <0x1>;
};
qsgmiib_pcs2: ethernet-pcs@2 {
compatible = "fsl,lynx-pcs";
reg = <0x2>;
};
qsgmiib_pcs3: ethernet-pcs@3 {
compatible = "fsl,lynx-pcs";
reg = <0x3>;
};
};
};

View File

@ -23,6 +23,8 @@
&fman0 {
/* these aliases provide the FMan ports mapping */
enet0: ethernet@e0000 {
pcsphy-handle = <&qsgmiib_pcs3>;
pcs-handle-names = "qsgmii";
};
enet1: ethernet@e2000 {
@ -35,14 +37,37 @@
};
enet4: ethernet@e8000 {
pcsphy-handle = <&pcsphy4>, <&qsgmiib_pcs1>;
pcs-handle-names = "sgmii", "qsgmii";
};
enet5: ethernet@ea000 {
pcsphy-handle = <&pcsphy5>, <&pcsphy5>;
pcs-handle-names = "sgmii", "qsgmii";
};
enet6: ethernet@f0000 {
};
enet7: ethernet@f2000 {
pcsphy-handle = <&pcsphy7>, <&qsgmiib_pcs2>, <&pcsphy7>;
pcs-handle-names = "sgmii", "qsgmii", "xfi";
};
mdio@eb000 {
qsgmiib_pcs1: ethernet-pcs@1 {
compatible = "fsl,lynx-pcs";
reg = <0x1>;
};
qsgmiib_pcs2: ethernet-pcs@2 {
compatible = "fsl,lynx-pcs";
reg = <0x2>;
};
qsgmiib_pcs3: ethernet-pcs@3 {
compatible = "fsl,lynx-pcs";
reg = <0x3>;
};
};
};

View File

@ -58,6 +58,8 @@
ranges = <0x0 0x0 ADDRESSIFY(CP11X_BASE) 0x2000000>;
CP11X_LABEL(ethernet): ethernet@0 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "marvell,armada-7k-pp22";
reg = <0x0 0x100000>, <0x129000 0xb000>, <0x220000 0x800>;
clocks = <&CP11X_LABEL(clk) 1 3>, <&CP11X_LABEL(clk) 1 9>,
@ -69,7 +71,7 @@
status = "disabled";
dma-coherent;
CP11X_LABEL(eth0): eth0 {
CP11X_LABEL(eth0): ethernet-port@0 {
interrupts = <39 IRQ_TYPE_LEVEL_HIGH>,
<43 IRQ_TYPE_LEVEL_HIGH>,
<47 IRQ_TYPE_LEVEL_HIGH>,
@ -83,12 +85,13 @@
interrupt-names = "hif0", "hif1", "hif2",
"hif3", "hif4", "hif5", "hif6", "hif7",
"hif8", "link";
port-id = <0>;
reg = <0>;
port-id = <0>; /* For backward compatibility. */
gop-port-id = <0>;
status = "disabled";
};
CP11X_LABEL(eth1): eth1 {
CP11X_LABEL(eth1): ethernet-port@1 {
interrupts = <40 IRQ_TYPE_LEVEL_HIGH>,
<44 IRQ_TYPE_LEVEL_HIGH>,
<48 IRQ_TYPE_LEVEL_HIGH>,
@ -102,12 +105,13 @@
interrupt-names = "hif0", "hif1", "hif2",
"hif3", "hif4", "hif5", "hif6", "hif7",
"hif8", "link";
port-id = <1>;
reg = <1>;
port-id = <1>; /* For backward compatibility. */
gop-port-id = <2>;
status = "disabled";
};
CP11X_LABEL(eth2): eth2 {
CP11X_LABEL(eth2): ethernet-port@2 {
interrupts = <41 IRQ_TYPE_LEVEL_HIGH>,
<45 IRQ_TYPE_LEVEL_HIGH>,
<49 IRQ_TYPE_LEVEL_HIGH>,
@ -121,7 +125,8 @@
interrupt-names = "hif0", "hif1", "hif2",
"hif3", "hif4", "hif5", "hif6", "hif7",
"hif8", "link";
port-id = <2>;
reg = <2>;
port-id = <2>; /* For backward compatibility. */
gop-port-id = <3>;
status = "disabled";
};

View File

@ -77,6 +77,47 @@
no-map;
reg = <0 0x4fc00000 0 0x00100000>;
};
wo_emi0: wo-emi@4fd00000 {
reg = <0 0x4fd00000 0 0x40000>;
no-map;
};
wo_emi1: wo-emi@4fd40000 {
reg = <0 0x4fd40000 0 0x40000>;
no-map;
};
wo_ilm0: wo-ilm@151e0000 {
reg = <0 0x151e0000 0 0x8000>;
no-map;
};
wo_ilm1: wo-ilm@151f0000 {
reg = <0 0x151f0000 0 0x8000>;
no-map;
};
wo_data: wo-data@4fd80000 {
reg = <0 0x4fd80000 0 0x240000>;
no-map;
};
wo_dlm0: wo-dlm@151e8000 {
reg = <0 0x151e8000 0 0x2000>;
no-map;
};
wo_dlm1: wo-dlm@151f8000 {
reg = <0 0x151f8000 0 0x2000>;
no-map;
};
wo_boot: wo-boot@15194000 {
reg = <0 0x15194000 0 0x1000>;
no-map;
};
};
timer {
@ -298,6 +339,11 @@
reg = <0 0x15010000 0 0x1000>;
interrupt-parent = <&gic>;
interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
memory-region = <&wo_emi0>, <&wo_ilm0>, <&wo_dlm0>,
<&wo_data>, <&wo_boot>;
memory-region-names = "wo-emi", "wo-ilm", "wo-dlm",
"wo-data", "wo-boot";
mediatek,wo-ccif = <&wo_ccif0>;
};
wed1: wed@15011000 {
@ -306,6 +352,25 @@
reg = <0 0x15011000 0 0x1000>;
interrupt-parent = <&gic>;
interrupts = <GIC_SPI 206 IRQ_TYPE_LEVEL_HIGH>;
memory-region = <&wo_emi1>, <&wo_ilm1>, <&wo_dlm1>,
<&wo_data>, <&wo_boot>;
memory-region-names = "wo-emi", "wo-ilm", "wo-dlm",
"wo-data", "wo-boot";
mediatek,wo-ccif = <&wo_ccif1>;
};
wo_ccif0: syscon@151a5000 {
compatible = "mediatek,mt7986-wo-ccif", "syscon";
reg = <0 0x151a5000 0 0x1000>;
interrupt-parent = <&gic>;
interrupts = <GIC_SPI 211 IRQ_TYPE_LEVEL_HIGH>;
};
wo_ccif1: syscon@151ad000 {
compatible = "mediatek,mt7986-wo-ccif", "syscon";
reg = <0 0x151ad000 0 0x1000>;
interrupt-parent = <&gic>;
interrupts = <GIC_SPI 212 IRQ_TYPE_LEVEL_HIGH>;
};
eth: ethernet@15100000 {

View File

@ -1649,13 +1649,8 @@ static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
struct bpf_prog *p = l->link.prog;
int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
if (p->aux->sleepable) {
enter_prog = (u64)__bpf_prog_enter_sleepable;
exit_prog = (u64)__bpf_prog_exit_sleepable;
} else {
enter_prog = (u64)__bpf_prog_enter;
exit_prog = (u64)__bpf_prog_exit;
}
enter_prog = (u64)bpf_trampoline_enter(p);
exit_prog = (u64)bpf_trampoline_exit(p);
if (l->cookie == 0) {
/* if cookie is zero, one instruction is enough to store it */

View File

@ -284,7 +284,6 @@ CONFIG_IXGB=m
CONFIG_SKGE=m
CONFIG_SKY2=m
CONFIG_MYRI10GE=m
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NS83820=m
CONFIG_S2IO=m

View File

@ -55,7 +55,8 @@ fman@400000 {
reg = <0xe0000 0x1000>;
fsl,fman-ports = <&fman0_rx_0x08 &fman0_tx_0x28>;
ptp-timer = <&ptp_timer0>;
pcsphy-handle = <&pcsphy0>;
pcsphy-handle = <&pcsphy0>, <&pcsphy0>;
pcs-handle-names = "sgmii", "qsgmii";
};
mdio@e1000 {

View File

@ -52,7 +52,15 @@ fman@400000 {
compatible = "fsl,fman-memac";
reg = <0xf0000 0x1000>;
fsl,fman-ports = <&fman0_rx_0x10 &fman0_tx_0x30>;
pcsphy-handle = <&pcsphy6>;
pcsphy-handle = <&pcsphy6>, <&qsgmiib_pcs2>, <&pcsphy6>;
pcs-handle-names = "sgmii", "qsgmii", "xfi";
};
mdio@e9000 {
qsgmiib_pcs2: ethernet-pcs@2 {
compatible = "fsl,lynx-pcs";
reg = <2>;
};
};
mdio@f1000 {

View File

@ -55,7 +55,15 @@ fman@400000 {
reg = <0xe2000 0x1000>;
fsl,fman-ports = <&fman0_rx_0x09 &fman0_tx_0x29>;
ptp-timer = <&ptp_timer0>;
pcsphy-handle = <&pcsphy1>;
pcsphy-handle = <&pcsphy1>, <&qsgmiia_pcs1>;
pcs-handle-names = "sgmii", "qsgmii";
};
mdio@e1000 {
qsgmiia_pcs1: ethernet-pcs@1 {
compatible = "fsl,lynx-pcs";
reg = <1>;
};
};
mdio@e3000 {

Some files were not shown because too many files have changed in this diff Show More