.. SPDX-License-Identifier: BSD-3-Clause

=======================
Introduction to Netlink
=======================

Netlink is often described as an ioctl() replacement.
It aims to replace fixed-format C structures as supplied
to ioctl() with a format which allows an easy way to add
or extend the arguments.

To achieve this Netlink uses a minimal fixed-format metadata header
followed by multiple attributes in the TLV (type, length, value) format.

Unfortunately the protocol has evolved over the years, in an organic
and undocumented fashion, making it hard to coherently explain.
To make the most practical sense this document starts by describing
netlink as it is used today and dives into more "historical" uses
in later sections.

Opening a socket
================

Netlink communication happens over sockets, so a socket needs to be
opened first:

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

The use of sockets allows for a natural way of exchanging information
in both directions (to and from the kernel). The operations are still
performed synchronously when applications send() the request but
a separate recv() system call is needed to read the reply.

A very simplified flow of a Netlink "call" will therefore look
something like:

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

  /* format the request */
  send(fd, &request, sizeof(request), 0);
  n = recv(fd, &response, RSP_BUFFER_SIZE, 0);
  /* interpret the response */

Netlink also provides natural support for "dumping", i.e. communicating
to user space all objects of a certain type (e.g. dumping all network
interfaces).

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

  /* format the dump request */
  send(fd, &request, sizeof(request), 0);
  while (1) {
    n = recv(fd, &buffer, RSP_BUFFER_SIZE, 0);
    /* one recv() call can read multiple messages, hence the loop below */
    for (nl_msg in buffer) {  /* pseudocode */
      if (nl_msg.nlmsg_type == NLMSG_DONE)
        goto dump_finished;
      /* process the object */
    }
  }
  dump_finished:
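
The inner loop of the dump flow above is pseudocode. A minimal sketch of how
it could be written in C follows, assuming the ``NLMSG_OK()`` and
``NLMSG_NEXT()`` helper macros from ``<linux/netlink.h>``. The function name
``walk_dump_buffer()`` is made up for this illustration, and error handling
is reduced to stopping at ``NLMSG_ERROR``.

```c
#include <linux/netlink.h>

/*
 * Walk every Netlink message stored in one recv() buffer.
 * Returns the number of object messages seen and sets *done to 1
 * once NLMSG_DONE is reached (hypothetical helper, not a kernel API).
 */
static int walk_dump_buffer(void *buf, int len, int *done)
{
	struct nlmsghdr *nlh;
	int objects = 0;

	*done = 0;
	/* NLMSG_NEXT() advances nlh and decrements len by the aligned size */
	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;
			break;
		}
		if (nlh->nlmsg_type == NLMSG_ERROR)
			break;	/* error handling left out of this sketch */
		objects++;	/* process the object here */
	}
	return objects;
}
```

The caller would keep calling recv() and this helper until ``*done`` is set.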
|
|
|
|
The first two arguments of the socket() call require little explanation -
it is opening a Netlink socket, with all headers provided by the user
(hence NETLINK, RAW). The last argument is the protocol within Netlink.
This field is used to identify the subsystem with which the socket will
communicate.

Classic vs Generic Netlink
--------------------------

The initial implementation of Netlink depended on a static allocation
of IDs to subsystems and provided little supporting infrastructure.
Let us refer to those protocols collectively as **Classic Netlink**.
They are listed at the top of the ``include/uapi/linux/netlink.h``
file and include, among others, general networking (NETLINK_ROUTE),
iSCSI (NETLINK_ISCSI), and audit (NETLINK_AUDIT).

**Generic Netlink** (introduced in 2005) allows for dynamic registration of
subsystems (and subsystem ID allocation), introspection and simplifies
implementing the kernel side of the interface.

The following section describes how to use Generic Netlink, as the
number of subsystems using Generic Netlink outnumbers the older
protocols by an order of magnitude. There are also no plans for adding
more Classic Netlink protocols to the kernel.
Basic information on how communicating with core networking parts of
the Linux kernel (or another of the 20 subsystems using Classic
Netlink) differs from Generic Netlink is provided later in this document.

Generic Netlink
===============

In addition to the Netlink fixed metadata header each Netlink protocol
defines its own fixed metadata header. (Similarly to how network
headers stack - Ethernet > IP > TCP we have Netlink > Generic N. > Family.)

A Netlink message always starts with struct nlmsghdr, which is followed
by a protocol-specific header. In case of Generic Netlink the protocol
header is struct genlmsghdr.

The practical meaning of the fields in case of Generic Netlink is as follows:

.. code-block:: c

  struct nlmsghdr {
        __u32   nlmsg_len;      /* Length of message including headers */
        __u16   nlmsg_type;     /* Generic Netlink Family (subsystem) ID */
        __u16   nlmsg_flags;    /* Flags - request or dump */
        __u32   nlmsg_seq;      /* Sequence number */
        __u32   nlmsg_pid;      /* Port ID, set to 0 */
  };
  struct genlmsghdr {
        __u8    cmd;            /* Command, as defined by the Family */
        __u8    version;        /* Irrelevant, set to 1 */
        __u16   reserved;       /* Reserved, set to 0 */
  };
  /* TLV attributes follow... */

In Classic Netlink :c:member:`nlmsghdr.nlmsg_type` used to identify
which operation within the subsystem the message was referring to
(e.g. get information about a netdev). Generic Netlink needs to mux
multiple subsystems in a single protocol so it uses this field to
identify the subsystem, and :c:member:`genlmsghdr.cmd` identifies
the operation instead. (See :ref:`res_fam` for
information on how to find the Family ID of the subsystem of interest.)
Note that the first 16 values (0 - 15) of this field are reserved for
control messages both in Classic Netlink and Generic Netlink.
See :ref:`nl_msg_type` for more details.

There are 3 usual types of message exchanges on a Netlink socket:

- performing a single action (``do``);
- dumping information (``dump``);
- getting asynchronous notifications (``multicast``).

Classic Netlink is very flexible and presumably allows other types
of exchanges to happen, but in practice those are the three that get
used.

Asynchronous notifications are sent by the kernel and received by
the user sockets which subscribed to them. ``do`` and ``dump`` requests
are initiated by the user. :c:member:`nlmsghdr.nlmsg_flags` should
be set as follows:

- for ``do``: ``NLM_F_REQUEST | NLM_F_ACK``
- for ``dump``: ``NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP``

:c:member:`nlmsghdr.nlmsg_seq` should be set to a monotonically
increasing value. The value gets echoed back in responses and doesn't
matter in practice, but setting it to an increasing value for each
message sent is considered good hygiene. The purpose of the field is
matching responses to requests. Asynchronous notifications will have
:c:member:`nlmsghdr.nlmsg_seq` of ``0``.

:c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
This field can be set to ``0`` when talking to the kernel.
See :ref:`nlmsg_pid` for the (uncommon) uses of the field.

The expected use for :c:member:`genlmsghdr.version` was to allow
versioning of the APIs provided by the subsystems. No subsystem to
date made significant use of this field, so setting it to ``1`` seems
like a safe bet.
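
The field values recommended above can be put together as in the following
sketch. It assumes the ``NLMSG_LENGTH()``, ``NLMSG_DATA()`` and
``GENL_HDRLEN`` definitions from the uAPI headers; ``genl_put_headers()`` and
its parameters are hypothetical names chosen for this example, and the caller
is assumed to provide a 4-byte aligned buffer of sufficient size.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <string.h>

/*
 * Fill struct nlmsghdr and struct genlmsghdr at the start of buf.
 * "dump" selects between a do and a dump request.
 * Returns the message length so far (headers only, no attributes yet).
 */
static unsigned int genl_put_headers(void *buf, unsigned short family_id,
				     unsigned char cmd, int dump,
				     unsigned int seq)
{
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gnlh;

	memset(buf, 0, NLMSG_LENGTH(GENL_HDRLEN));
	nlh->nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	nlh->nlmsg_type = family_id;		/* which subsystem */
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	if (dump)
		nlh->nlmsg_flags |= NLM_F_DUMP;
	nlh->nlmsg_seq = seq;			/* echoed back in replies */
	nlh->nlmsg_pid = 0;			/* talking to the kernel */

	gnlh = NLMSG_DATA(nlh);
	gnlh->cmd = cmd;			/* which operation */
	gnlh->version = 1;
	gnlh->reserved = 0;

	return nlh->nlmsg_len;
}
```

Attributes would be appended after the returned offset, with
:c:member:`nlmsghdr.nlmsg_len` updated accordingly.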
|
|
|
|
.. _nl_msg_type:

Netlink message types
---------------------

As previously mentioned :c:member:`nlmsghdr.nlmsg_type` carries
protocol specific values but the first 16 identifiers are reserved
(first subsystem specific message type should be equal to
``NLMSG_MIN_TYPE`` which is ``0x10``).

There are only 4 Netlink control messages defined:

- ``NLMSG_NOOP`` - ignore the message, not used in practice;
- ``NLMSG_ERROR`` - carries the return code of an operation;
- ``NLMSG_DONE`` - marks the end of a dump;
- ``NLMSG_OVERRUN`` - socket buffer has overflown, not used to date.

``NLMSG_ERROR`` and ``NLMSG_DONE`` are of practical importance.
They carry return codes for operations. Note that unless
the ``NLM_F_ACK`` flag is set on the request Netlink will not respond
with ``NLMSG_ERROR`` if there is no error. To avoid having to special-case
this quirk it is recommended to always set ``NLM_F_ACK``.

The format of ``NLMSG_ERROR`` is described by struct nlmsgerr::

  ----------------------------------------------
  | struct nlmsghdr - response header          |
  ----------------------------------------------
  |    int error                               |
  ----------------------------------------------
  | struct nlmsghdr - original request header  |
  ----------------------------------------------
  | ** optionally (1) payload of the request   |
  ----------------------------------------------
  | ** optionally (2) extended ACK             |
  ----------------------------------------------

There are two instances of struct nlmsghdr here, first of the response
and second of the request. ``NLMSG_ERROR`` carries the information about
the request which led to the error. This could be useful when trying
to match requests to responses or re-parse the request to dump it into
logs.

The payload of the request is not echoed in messages reporting success
(``error == 0``) or if ``NETLINK_CAP_ACK`` setsockopt() was set.
The latter is common
and perhaps recommended as having to read a copy of every request back
from the kernel is rather wasteful. The absence of request payload
is indicated by ``NLM_F_CAPPED`` in :c:member:`nlmsghdr.nlmsg_flags`.

The second optional element of ``NLMSG_ERROR`` consists of the extended ACK
attributes. See :ref:`ext_ack` for more details. The presence
of extended ACK is indicated by ``NLM_F_ACK_TLVS`` in
:c:member:`nlmsghdr.nlmsg_flags`.

``NLMSG_DONE`` is simpler; the request is never echoed but the extended
ACK attributes may be present::

  ----------------------------------------------
  | struct nlmsghdr - response header          |
  ----------------------------------------------
  |    int error                               |
  ----------------------------------------------
  | ** optionally extended ACK                 |
  ----------------------------------------------
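
The ``NLMSG_ERROR`` layout described above could be decoded roughly as in
the following sketch. It relies on struct nlmsgerr and the ``NLM_F_CAPPED``
and ``NLM_F_ACK_TLVS`` flags from ``<linux/netlink.h>``; the function
``parse_nlmsg_error()`` is a hypothetical helper written for this example
and performs only basic length checks.

```c
#include <linux/netlink.h>
#include <stddef.h>

/*
 * Inspect an NLMSG_ERROR message.
 * Stores the return code in *error and the offset (from the start of the
 * message) of the extended ACK attributes in *tlv_off, 0 if there are none.
 * Returns 0 on success, -1 if this is not a well-formed NLMSG_ERROR.
 */
static int parse_nlmsg_error(const struct nlmsghdr *nlh, int *error,
			     size_t *tlv_off)
{
	const struct nlmsgerr *err;
	size_t off;

	if (nlh->nlmsg_type != NLMSG_ERROR ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
		return -1;

	err = (const struct nlmsgerr *)((const char *)nlh + NLMSG_HDRLEN);
	*error = err->error;

	/* response header + int error + copy of the request header */
	off = NLMSG_HDRLEN + sizeof(*err);
	/* the request payload follows unless the message was capped */
	if (!(nlh->nlmsg_flags & NLM_F_CAPPED))
		off += NLMSG_ALIGN(err->msg.nlmsg_len) - NLMSG_HDRLEN;

	*tlv_off = 0;
	if ((nlh->nlmsg_flags & NLM_F_ACK_TLVS) && off < nlh->nlmsg_len)
		*tlv_off = off;
	return 0;
}
```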
|
|
|
|
.. _res_fam:

Resolving the Family ID
-----------------------

This section explains how to find the Family ID of a subsystem.
It also serves as an example of Generic Netlink communication.

Generic Netlink is itself a subsystem exposed via the Generic Netlink API.
To avoid a circular dependency Generic Netlink has a statically allocated
Family ID (``GENL_ID_CTRL`` which is equal to ``NLMSG_MIN_TYPE``).
The Generic Netlink family implements a command used to find out information
about other families (``CTRL_CMD_GETFAMILY``).

To get information about the Generic Netlink family named for example
``"test1"`` we need to send a message on the previously opened Generic Netlink
socket. The message should target the Generic Netlink Family (1), be a
``do`` (2) call to ``CTRL_CMD_GETFAMILY`` (3). A ``dump`` version of this
call would make the kernel respond with information about *all* the families
it knows about. Last but not least the name of the family in question has
to be specified (4) as an attribute with the appropriate type::

  struct nlmsghdr:
    __u32 nlmsg_len:    32
    __u16 nlmsg_type:   GENL_ID_CTRL               // (1)
    __u16 nlmsg_flags:  NLM_F_REQUEST | NLM_F_ACK  // (2)
    __u32 nlmsg_seq:    1
    __u32 nlmsg_pid:    0

  struct genlmsghdr:
    __u8 cmd:           CTRL_CMD_GETFAMILY         // (3)
    __u8 version:       2 /* or 1, doesn't matter */
    __u16 reserved:     0

  struct nlattr:                                   // (4)
    __u16 nla_len:      10
    __u16 nla_type:     CTRL_ATTR_FAMILY_NAME
    char data:          test1\0

  (padding:)
    char data:          \0\0

The length fields in Netlink (:c:member:`nlmsghdr.nlmsg_len`
and :c:member:`nlattr.nla_len`) always *include* the header.
Attribute headers in netlink must be aligned to 4 bytes from the start
of the message, hence the extra ``\0\0`` after ``CTRL_ATTR_FAMILY_NAME``.
The attribute lengths *exclude* the padding.
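
The length and padding rules above can be illustrated with a small attribute
writer. This is only a sketch: ``put_attr()`` is a hypothetical helper (real
programs would typically use a library for this), it assumes the caller has
made sure the buffer is 4-byte aligned and large enough, and it leaves
updating :c:member:`nlmsghdr.nlmsg_len` to the caller. ``NLA_HDRLEN`` and
``NLA_ALIGN()`` come from ``<linux/netlink.h>``.

```c
#include <linux/netlink.h>
#include <string.h>

/*
 * Append one attribute at byte offset "off" of the message in buf.
 * nla_len covers the attribute header plus the data, but not the padding.
 * Returns the 4-byte aligned offset where the next attribute would start.
 */
static size_t put_attr(char *buf, size_t off, unsigned short type,
		       const void *data, unsigned short len)
{
	struct nlattr *nla = (struct nlattr *)(buf + off);

	nla->nla_type = type;
	nla->nla_len = NLA_HDRLEN + len;
	memcpy(buf + off + NLA_HDRLEN, data, len);
	/* zero the padding so no stale bytes are sent to the kernel */
	memset(buf + off + nla->nla_len, 0,
	       NLA_ALIGN(nla->nla_len) - nla->nla_len);

	return off + NLA_ALIGN(nla->nla_len);
}
```

Writing the 6 bytes of ``"test1"`` (including the terminating ``\0``) right
after the 20 bytes of headers gives ``nla_len`` of 10 and a returned offset
of 32, matching the lengths in the request shown earlier.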
|
|
|
|
If the family is found the kernel will reply with two messages. The first is
the response with all the information about the family::

  /* Message #1 - reply */
  struct nlmsghdr:
    __u32 nlmsg_len:    136
    __u16 nlmsg_type:   GENL_ID_CTRL
    __u16 nlmsg_flags:  0
    __u32 nlmsg_seq:    1    /* echoed from our request */
    __u32 nlmsg_pid:    5831 /* The PID of our user space process */

  struct genlmsghdr:
    __u8 cmd:           CTRL_CMD_GETFAMILY
    __u8 version:       2
    __u16 reserved:     0

  struct nlattr:
    __u16 nla_len:      10
    __u16 nla_type:     CTRL_ATTR_FAMILY_NAME
    char data:          test1\0

  (padding:)
    char data:          \0\0

  struct nlattr:
    __u16 nla_len:      6
    __u16 nla_type:     CTRL_ATTR_FAMILY_ID
    __u16:              123  /* The Family ID we are after */

  (padding:)
    char data:          \0\0

  struct nlattr:
    __u16 nla_len:      8
    __u16 nla_type:     CTRL_ATTR_VERSION
    __u32:              1

  /* ... etc, more attributes will follow. */

The second is the error code (success), sent because ``NLM_F_ACK`` had been
set on the request::

  /* Message #2 - the ACK */
  struct nlmsghdr:
    __u32 nlmsg_len:    36
    __u16 nlmsg_type:   NLMSG_ERROR
    __u16 nlmsg_flags:  NLM_F_CAPPED /* There won't be a payload */
    __u32 nlmsg_seq:    1    /* echoed from our request */
    __u32 nlmsg_pid:    5831 /* The PID of our user space process */

  int error:            0

  struct nlmsghdr: /* Copy of the request header as we sent it */
    __u32 nlmsg_len:    32
    __u16 nlmsg_type:   GENL_ID_CTRL
    __u16 nlmsg_flags:  NLM_F_REQUEST | NLM_F_ACK
    __u32 nlmsg_seq:    1
    __u32 nlmsg_pid:    0

The order of attributes (struct nlattr) is not guaranteed so the user
has to walk the attributes and parse them.
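
Walking the attributes of the reply could look like the sketch below, which
searches for ``CTRL_ATTR_FAMILY_ID`` regardless of its position. The helper
name ``genl_family_id()`` is an assumption made for this example; it expects
a complete ``CTRL_CMD_GETFAMILY`` reply and skips the two fixed headers
before iterating.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <string.h>

/*
 * Walk the attributes of a CTRL_CMD_GETFAMILY reply and return the value of
 * CTRL_ATTR_FAMILY_ID, or -1 if the attribute is missing or malformed.
 */
static int genl_family_id(const struct nlmsghdr *nlh)
{
	/* attributes start after struct nlmsghdr and struct genlmsghdr */
	const char *pos = (const char *)nlh + NLMSG_LENGTH(GENL_HDRLEN);
	const char *end = (const char *)nlh + nlh->nlmsg_len;

	while (end - pos >= NLA_HDRLEN) {
		struct nlattr nla;
		unsigned short id;

		memcpy(&nla, pos, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > end - pos)
			return -1;	/* attribute overruns the message */

		if ((nla.nla_type & NLA_TYPE_MASK) == CTRL_ATTR_FAMILY_ID) {
			if (nla.nla_len < NLA_HDRLEN + sizeof(id))
				return -1;
			memcpy(&id, pos + NLA_HDRLEN, sizeof(id));
			return id;
		}
		/* nla_len excludes padding, so align before moving on */
		pos += NLA_ALIGN(nla.nla_len);
	}
	return -1;
}
```

The returned value is what later requests to the family would place in
:c:member:`nlmsghdr.nlmsg_type`.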
|
|
|
|
Note that Generic Netlink sockets are not associated or bound to a single
family. A socket can be used to exchange messages with many different
families, selecting the recipient family on message-by-message basis using
the :c:member:`nlmsghdr.nlmsg_type` field.

.. _ext_ack:

Extended ACK
------------

Extended ACK controls reporting of additional error/warning TLVs
in ``NLMSG_ERROR`` and ``NLMSG_DONE`` messages. To maintain backward
compatibility this feature has to be explicitly enabled by setting
the ``NETLINK_EXT_ACK`` setsockopt() to ``1``.
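
Enabling the option might look like the following sketch. The wrapper name
``nl_enable_ext_ack()`` is hypothetical, and the fallback definition of
``SOL_NETLINK`` is an assumption for C libraries whose headers do not
provide it.

```c
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270	/* assumed fallback for older C libraries */
#endif

/*
 * Enable extended ACK reporting on an open Netlink socket.
 * Returns 0 on success or -1 with errno set, like setsockopt().
 */
static int nl_enable_ext_ack(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_NETLINK, NETLINK_EXT_ACK,
			  &one, sizeof(one));
}
```

It would be called once, right after socket(), before any request is sent.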
|
|
|
|
Types of extended ACK attributes are defined in enum nlmsgerr_attrs.
The most commonly used attributes are ``NLMSGERR_ATTR_MSG``,
``NLMSGERR_ATTR_OFFS`` and ``NLMSGERR_ATTR_MISS_*``.

``NLMSGERR_ATTR_MSG`` carries a message in English describing
the encountered problem. These messages are far more detailed
than what can be expressed through standard UNIX error codes.

``NLMSGERR_ATTR_OFFS`` points to the attribute which caused the problem.

``NLMSGERR_ATTR_MISS_TYPE`` and ``NLMSGERR_ATTR_MISS_NEST``
inform about a missing attribute.

Extended ACKs can be reported on errors as well as in case of success.
The latter should be treated as a warning.

Extended ACKs greatly improve the usability of Netlink and should
always be enabled, appropriately parsed and reported to the user.

Advanced topics
===============

Dump consistency
----------------

Some of the data structures the kernel uses for storing objects make
it hard to provide an atomic snapshot of all the objects in a dump
(without impacting the fast-paths updating them).

The kernel may set the ``NLM_F_DUMP_INTR`` flag on any message in a dump
(including the ``NLMSG_DONE`` message) if the dump was interrupted and
may be inconsistent (e.g. missing objects). User space should retry
the dump if it sees the flag set.

Introspection
-------------

The basic introspection abilities are enabled by access to the Family
object as reported in :ref:`res_fam`. User can query information about
the Generic Netlink family, including which operations are supported
by the kernel and what attributes the kernel understands.
Family information includes the highest ID of an attribute the kernel can
parse; a separate command (``CTRL_CMD_GETPOLICY``) provides detailed
information about supported attributes, including ranges of values the
kernel accepts.

Querying family information is useful in cases when user space needs
to make sure that the kernel has support for a feature before issuing
a request.

.. _nlmsg_pid:

nlmsg_pid
---------

:c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
It is referred to as Port ID, sometimes Process ID because for historical
reasons if the application does not select (bind() to) an explicit Port ID
the kernel will automatically assign it the ID equal to its Process ID
(as reported by the getpid() system call).

Similarly to the bind() semantics of the TCP/IP network protocols the value
of zero means "assign automatically", hence it is common for applications
to leave the :c:member:`nlmsghdr.nlmsg_pid` field initialized to ``0``.

The field is still used today in rare cases when the kernel needs to send
a unicast notification. A user space application can use bind() to associate
its socket with a specific PID; it then communicates its PID to the kernel.
This way the kernel can reach the specific user space process.

This sort of communication is utilized in UMH (User Mode Helper)-like
scenarios when the kernel needs to trigger user space processing or ask user
space for a policy decision.

Multicast notifications
-----------------------

One of the strengths of Netlink is the ability to send event notifications
to user space. This is a unidirectional form of communication (kernel ->
user) and does not involve any control messages like ``NLMSG_ERROR`` or
``NLMSG_DONE``.

For example the Generic Netlink family itself defines a set of multicast
notifications about registered families. When a new family is added the
sockets subscribed to the notifications will get the following message::

  struct nlmsghdr:
    __u32 nlmsg_len:    136
    __u16 nlmsg_type:   GENL_ID_CTRL
    __u16 nlmsg_flags:  0
    __u32 nlmsg_seq:    0
    __u32 nlmsg_pid:    0

  struct genlmsghdr:
    __u8 cmd:           CTRL_CMD_NEWFAMILY
    __u8 version:       2
    __u16 reserved:     0

  struct nlattr:
    __u16 nla_len:      10
    __u16 nla_type:     CTRL_ATTR_FAMILY_NAME
    char data:          test1\0

  (padding:)
    char data:          \0\0

  struct nlattr:
    __u16 nla_len:      6
    __u16 nla_type:     CTRL_ATTR_FAMILY_ID
    __u16:              123  /* The Family ID we are after */

  (padding:)
    char data:          \0\0

  struct nlattr:
    __u16 nla_len:      8
    __u16 nla_type:     CTRL_ATTR_VERSION
    __u32:              1

  /* ... etc, more attributes will follow. */

The notification contains the same information as the response
to the ``CTRL_CMD_GETFAMILY`` request.

The Netlink headers of the notification are mostly 0 and irrelevant.
The :c:member:`nlmsghdr.nlmsg_seq` may be either zero or a monotonically
increasing notification sequence number maintained by the family.

To receive notifications the user socket must subscribe to the relevant
notification group. Much like the Family ID, the Group ID for a given
multicast group is dynamic and can be found inside the Family information.
The ``CTRL_ATTR_MCAST_GROUPS`` attribute contains nests with names
(``CTRL_ATTR_MCAST_GRP_NAME``) and IDs (``CTRL_ATTR_MCAST_GRP_ID``) of
the family's groups.

Once the Group ID is known a setsockopt() call adds the socket to the group:

.. code-block:: c

  unsigned int group_id;

  /* .. find the group ID... */

  setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
             &group_id, sizeof(group_id));

The socket will now receive notifications.

It is recommended to use separate sockets for receiving notifications
and sending requests to the kernel. The asynchronous nature of notifications
means that they may get mixed in with the responses making the message
handling much harder.

Buffer sizing
-------------

Netlink sockets are datagram sockets rather than stream sockets,
meaning that each message must be received in its entirety by a single
recv()/recvmsg() system call. If the buffer provided by the user is too
short, the message will be truncated and the ``MSG_TRUNC`` flag set
in struct msghdr (struct msghdr is the second argument
of the recvmsg() system call, *not* a Netlink header).

Upon truncation the remaining part of the message is discarded.

Netlink expects that the user buffer will be at least 8kB or a page
size of the CPU architecture, whichever is bigger. Particular Netlink
families may, however, require a larger buffer. 32kB buffer is recommended
for most efficient handling of dumps (larger buffer fits more dumped
objects and therefore fewer recvmsg() calls are needed).
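
The sizing and truncation rules above can be sketched as follows. Both
``nl_min_rcvbuf()`` and ``nl_recv()`` are hypothetical helper names; the
code only uses the standard recvmsg() and sysconf() interfaces and does not
depend on the socket being a Netlink socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Smallest receive buffer Netlink expects: 8kB or a page, whichever is bigger. */
static size_t nl_min_rcvbuf(void)
{
	long page = sysconf(_SC_PAGESIZE);

	return page > 8192 ? (size_t)page : 8192;
}

/*
 * Receive one datagram with recvmsg(). *truncated is set to 1 when the
 * buffer was too short and the tail of the message has been discarded.
 * Returns the number of bytes stored in buf, or -1 on error.
 */
static ssize_t nl_recv(int fd, void *buf, size_t len, int *truncated)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	ssize_t n;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	n = recvmsg(fd, &msg, 0);
	/* MSG_TRUNC is reported in msg_flags of struct msghdr */
	*truncated = n >= 0 && (msg.msg_flags & MSG_TRUNC) != 0;
	return n;
}
```

A caller seeing ``*truncated`` set cannot recover the lost data and would
have to repeat the request with a larger buffer.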
|
|
|
|
.. _classic_netlink:

Classic Netlink
===============

The main differences between Classic and Generic Netlink are the dynamic
allocation of subsystem identifiers and availability of introspection.
In theory the protocol does not differ significantly, however, in practice
Classic Netlink experimented with concepts which were abandoned in Generic
Netlink (really, they usually only found use in a small corner of a single
subsystem). This section is meant as an explainer of a few of such concepts,
with the explicit goal of giving the Generic Netlink
users the confidence to ignore them when reading the uAPI headers.

Most of the concepts and examples here refer to the ``NETLINK_ROUTE`` family,
which covers much of the configuration of the Linux networking stack.
Real documentation of that family deserves a chapter (or a book) of its own.

Families
--------

Netlink refers to subsystems as families. This is a remnant of using
sockets and the concept of protocol families, which are part of message
demultiplexing in ``NETLINK_ROUTE``.

Sadly every layer of encapsulation likes to refer to whatever it's carrying
as "families" making the term very confusing:

1. AF_NETLINK is a bona fide socket protocol family
2. AF_NETLINK's documentation refers to what comes after its own
   header (struct nlmsghdr) in a message as a "Family Header"
3. Generic Netlink is a family for AF_NETLINK (struct genlmsghdr follows
   struct nlmsghdr), yet it also calls its users "Families".

Note that the Generic Netlink Family IDs are in a different "ID space"
and overlap with Classic Netlink protocol numbers (e.g. ``NETLINK_CRYPTO``
has the Classic Netlink protocol ID of 21 which Generic Netlink will
happily allocate to one of its families as well).

Strict checking
---------------

The ``NETLINK_GET_STRICT_CHK`` socket option enables strict input checking
in ``NETLINK_ROUTE``. It was needed because historically the kernel did not
validate the fields of structures it didn't process. This made it impossible
to start using those fields later without risking regressions in applications
which initialized them incorrectly or not at all.

``NETLINK_GET_STRICT_CHK`` declares that the application is initializing
all fields correctly. It also opts into validating that the message does not
contain trailing data and requests that the kernel reject attributes with
a type higher than the largest attribute type known to the kernel.

``NETLINK_GET_STRICT_CHK`` is not used outside of ``NETLINK_ROUTE``.

Unknown attributes
------------------

Historically Netlink ignored all unknown attributes. The thinking was that
it would free the application from having to probe what the kernel supports.
The application could make a request to change the state and check which
parts of the request "stuck".

This is no longer the case for new Generic Netlink families and those opting
in to strict checking. See enum netlink_validation for validation types
performed.

Fixed metadata and structures
-----------------------------

Classic Netlink made liberal use of fixed-format structures within
the messages. Messages would commonly have a structure with
a considerable number of fields after struct nlmsghdr. It was also
common to put structures with multiple members inside attributes,
without breaking each member into an attribute of its own.

This has caused problems with validation and extensibility and
therefore using binary structures is actively discouraged for new
attributes.

Request types
-------------

``NETLINK_ROUTE`` categorized requests into 4 types ``NEW``, ``DEL``, ``GET``,
and ``SET``. Each object can handle all or some of those requests
(objects being netdevs, routes, addresses, qdiscs etc.) Request type
is defined by the 2 lowest bits of the message type, so commands for
new objects would always be allocated with a stride of 4.
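
The "2 lowest bits" rule can be checked against the ``RTM_*`` message types
from ``<linux/rtnetlink.h>``. In the sketch below the enum and both helper
functions are hypothetical names introduced for illustration; only the
``RTM_*`` constants come from the uAPI header.

```c
#include <linux/rtnetlink.h>

/* Request types in the order NETLINK_ROUTE allocates them */
enum req_kind { REQ_KIND_NEW, REQ_KIND_DEL, REQ_KIND_GET, REQ_KIND_SET };

/* The request type lives in the 2 lowest bits of the message type. */
static enum req_kind rtm_msg_kind(unsigned short nlmsg_type)
{
	return (enum req_kind)(nlmsg_type & 3);
}

/* First (NEW) message type of the object the message refers to. */
static unsigned short rtm_object_base(unsigned short nlmsg_type)
{
	return nlmsg_type & ~3;
}
```

For example ``RTM_GETADDR`` maps to ``REQ_KIND_GET`` and shares its object
base with ``RTM_NEWADDR``.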
|
|
|
|
Each object would also have its own fixed metadata shared by all request
types (e.g. struct ifinfomsg for netdev requests, struct ifaddrmsg for address
requests, struct tcmsg for qdisc requests).

Even though other protocols and Generic Netlink commands often use
the same verbs in their message names (``GET``, ``SET``) the concept
of request types did not find wider adoption.

Notification echo
-----------------

``NLM_F_ECHO`` requests that notifications resulting from the request
be queued onto the requesting socket. This is useful to discover
the impact of the request.

Note that this feature is not universally implemented.

Other request-type-specific flags
---------------------------------

Classic Netlink defined various flags for its ``GET``, ``NEW``
and ``DEL`` requests in the upper byte of nlmsg_flags in struct nlmsghdr.
Since request types have not been generalized the request type specific
flags are rarely used (and considered deprecated for new families).

For ``GET`` - ``NLM_F_ROOT`` and ``NLM_F_MATCH`` are combined into
``NLM_F_DUMP``, and not used separately. ``NLM_F_ATOMIC`` is never used.

For ``DEL`` - ``NLM_F_NONREC`` is only used by nftables and ``NLM_F_BULK``
only by some FDB operations.

The flags for ``NEW`` are used most commonly in classic Netlink. Unfortunately,
the meaning is not crystal clear. The following description is based on the
best guess of the intention of the authors, and in practice all families
stray from it in one way or another. ``NLM_F_REPLACE`` asks to replace
an existing object, if no matching object exists the operation should fail.
``NLM_F_EXCL`` has the opposite semantics and only succeeds if the object
did not already exist.
``NLM_F_CREATE`` asks for the object to be created if it does not
exist, it can be combined with ``NLM_F_REPLACE`` and ``NLM_F_EXCL``.

A comment in the main Netlink uAPI header states::

   4.4BSD ADD           NLM_F_CREATE|NLM_F_EXCL
   4.4BSD CHANGE        NLM_F_REPLACE

   True CHANGE          NLM_F_CREATE|NLM_F_REPLACE
   Append               NLM_F_CREATE
   Check                NLM_F_EXCL

which seems to indicate that those flags predate request types.
``NLM_F_REPLACE`` without ``NLM_F_CREATE`` was initially used instead
of ``SET`` commands.
``NLM_F_EXCL`` without ``NLM_F_CREATE`` was used to check if object exists
without creating it, presumably predating ``GET`` commands.

``NLM_F_APPEND`` indicates that if one key can have multiple objects associated
with it (e.g. multiple next-hop objects for a route) the new object should be
added to the list rather than replacing the entire list.

uAPI reference
==============

.. kernel-doc:: include/uapi/linux/netlink.h