This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.
Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 4^2 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
http://www.cse.usf.edu/~ligatti/projects/grouper/
Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.
In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.
The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.
A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.
A stand-alone implementation of this algorithm is available at:
https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).
This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.
At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):
TEST: performance
net,port [ OK ]
baseline (drop from netdev hook): 10190076pps
baseline hash (non-ranged entries): 6179564pps
baseline rbtree (match on first field only): 2950341pps
set with 1000 full, ranged entries: 2304165pps
port,net [ OK ]
baseline (drop from netdev hook): 10143615pps
baseline hash (non-ranged entries): 6135776pps
baseline rbtree (match on first field only): 4311934pps
set with 100 full, ranged entries: 4131471pps
net6,port [ OK ]
baseline (drop from netdev hook): 9730404pps
baseline hash (non-ranged entries): 4809557pps
baseline rbtree (match on first field only): 1501699pps
set with 1000 full, ranged entries: 1092557pps
port,proto [ OK ]
baseline (drop from netdev hook): 10812426pps
baseline hash (non-ranged entries): 6929353pps
baseline rbtree (match on first field only): 3027105pps
set with 30000 full, ranged entries: 284147pps
net6,port,mac [ OK ]
baseline (drop from netdev hook): 9660114pps
baseline hash (non-ranged entries): 3778877pps
baseline rbtree (match on first field only): 3179379pps
set with 10 full, ranged entries: 2082880pps
net6,port,mac,proto [ OK ]
baseline (drop from netdev hook): 9718324pps
baseline hash (non-ranged entries): 3799021pps
baseline rbtree (match on first field only): 1506689pps
set with 1000 full, ranged entries: 783810pps
net,mac [ OK ]
baseline (drop from netdev hook): 10190029pps
baseline hash (non-ranged entries): 5172218pps
baseline rbtree (match on first field only): 2946863pps
set with 1000 full, ranged entries: 1279122pps
v4:
- fix build for 32-bit architectures: 64-bit division needs
div_u64() (kbuild test robot <lkp@intel.com>)
v3:
- rework interface for field length specification,
NFT_SET_SUBKEY disappears and information is stored in
description
- remove scratch area to store closing element of ranges,
as elements now come with an actual attribute to specify
the upper range limit (Pablo Neira Ayuso)
- also remove pointer to 'start' element from mapping table,
closing key is now accessible via extension data
- use bytes right away instead of bits for field lengths,
this way we can also double the inner loop of the lookup
function to take care of upper and lower bits in a single
iteration (minor performance improvement)
- make it clearer that set operations are actually atomic
API-wise, but we can't e.g. implement flush() as one-shot
action
- fix type for 'dup' in nft_pipapo_insert(), check for
duplicates only in the next generation, and in general take
care of differentiating generation mask cases depending on
the operation (Pablo Neira Ayuso)
- report C implementation matching rate in commit message, so
that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
- protect access to scratch maps in nft_pipapo_lookup() with
local_bh_disable/enable() (Florian Westphal)
- drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
already implied (Florian Westphal)
- explain why partial allocation failures don't need handling
in pipapo_realloc_scratch(), rename 'm' to clone and update
related kerneldoc to make it clear we're not operating on
the live copy (Florian Westphal)
- add expicit check for priv->start_elem in
nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
with a NULL start element, and also zero it out in every
operation that might make it invalid, so that insertion
doesn't proceed with an invalid element (Florian Westphal)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.
This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.
In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.
While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.
For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:
NFTA_SET_ELEM_KEY: 192.0.2.0 . 22
NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25
and NFTA_SET_DESC_CONCAT attribute would contain:
NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 4
NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 2
v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.
This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.
v4: No changes
v3: New patch
[sbrivio: refactor error paths and labels; add corresponding
nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If nf_flow_offload_add() fails to add the flow to hardware, then the
NF_FLOW_HW_REFRESH flag bit is set and the flow remains in the flowtable
software path.
If flowtable hardware offload is enabled, this patch enqueues a new
request to offload this flow to hardware.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This function checks for the NF_FLOWTABLE_HW_OFFLOAD flag, meaning that
the flowtable hardware offload is enabled.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Originally, all flow flag bits were set on only from the workqueue. With
the introduction of the flow teardown state and hardware offload this is
no longer true. Let's be safe and use atomic bitwise operation to
operation with flow flags.
Fixes: 59c466dd68 ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dying bit removes the conntrack entry if the netdev that owns this
flow is going down. Instead, use the teardown mechanism to push back the
flow to conntrack to let the classic software path decide what to do
with it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nf_flowtable_time_stamp and updates the existing code to
use it.
This patch is also implicitly fixing up hardware statistic fetching via
nf_flow_offload_stats() where casting to u32 is missing. Use
nf_flow_timeout_delta() to fix this.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().
This patch is generated using following script:
EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do
if [[ "$file" =~ $EXCLUDE_FILES ]]; then
continue
fi
sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done
Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
This patch constifies the pointer to source register data that is passed
as an input parameter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet
flowtable type definitions. Rename the nf_flow_rule_route() function to
nf_flow_rule_route_ipv4().
Adjust maximum number of actions, which now becomes 16 to leave
sufficient room for the IPv6 address mangling for NAT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:
* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
from hardware. This might be triggered by either packet path (via TCP
RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
the hardware, the software garbage collector can remove it from the
software flowtable.
This patch supports for:
* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
at this stage.
This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the infrastructure to support for flow entry types.
The initial type is NF_FLOW_OFFLOAD_ROUTE that stores the routing
information into the flow entry to define a fastpath for the classic
forwarding path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move rcu_head to struct flow_offload, then remove the flow_offload_entry
structure definition.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers do not have access to the flow_offload structure, hence remove
this union from this flow_offload object as well as the original comment
on top of it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify this code by storing the pointer to conntrack object in the
flow_offload structure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Invoking the following commands on a 32-bit architecture with strict
alignment requirements (such as an ARMv7-based Raspberry Pi) results
in an alignment exception:
# nft add table ip test-ip4
# nft add chain ip test-ip4 output { type filter hook output priority 0; }
# nft add rule ip test-ip4 output quota 1025 bytes
Alignment trap: not handling instruction e1b26f9f at [<7f4473f8>]
Unhandled fault: alignment exception (0x001) at 0xb832e824
Internal error: : 1 [#1] PREEMPT SMP ARM
Hardware name: BCM2835
[<7f4473fc>] (nft_quota_do_init [nft_quota])
[<7f447448>] (nft_quota_init [nft_quota])
[<7f4260d0>] (nf_tables_newrule [nf_tables])
[<7f4168dc>] (nfnetlink_rcv_batch [nfnetlink])
[<7f416bd0>] (nfnetlink_rcv [nfnetlink])
[<8078b334>] (netlink_unicast)
[<8078b664>] (netlink_sendmsg)
[<8071b47c>] (sock_sendmsg)
[<8071bd18>] (___sys_sendmsg)
[<8071ce3c>] (__sys_sendmsg)
[<8071ce94>] (sys_sendmsg)
The reason is that nft_quota_do_init() calls atomic64_set() on an
atomic64_t which is only aligned to 32-bit, not 64-bit, because it
succeeds struct nft_expr in memory which only contains a 32-bit pointer.
Fix by aligning the nft_expr private data to 64-bit.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to register one netdev basechain to multiple
devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify
the list of netdevices. Basechains store a list of hooks.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Rise the maximum limit of devices per flowtable up to 256. Rename
NFT_FLOWTABLE_DEVICE_MAX to NFT_NETDEVICE_MAX in preparation to reuse
the netdev hook parser for ingress basechain.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Hardware offload needs access to the priority field, store this field in
the nf_flowtable object.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of waiting for rcu grace period just free it directly.
This is safe because conntrack lookup doesn't consider extensions.
Other accesses happen while ct->ext can't be free'd, either because
a ct refcount was taken or because the conntrack hash bucket lock or
the dying list spinlock have been taken.
This allows to remove __krealloc in a followup patch, netfilter was the
only user.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The deletion of a flowtable after a flush in the same transaction
results in EBUSY. This patch adds an activation and deactivation of
flowtables in order to update the _use_ counter.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Default policy is defined as a unsigned 8-bit field, do not use a
negative value to leave it unset, use this new NFT_CHAIN_POLICY_UNSET
instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Two inline functions defined in nf_conntrack_timestamp.h,
`nf_ct_tstamp_enabled` and `nf_ct_set_tstamp`, are not called anywhere.
Remove them.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_zones.h was wrapped in a CONFIG_NF_CONNTRACK check in order
to fix compilation failures:
37ee3d5b3e ("netfilter: nf_defrag_ipv4: fix compilation error with NF_CONNTRACK=n")
Subsequent changes mean that these failures will no longer occur and the
check is unnecessary. Remove it.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
`struct nf_hook_ops`, `struct nf_hook_state` and the `nf_hookfn`
function typedef appear in function and struct declarations and
definitions in a number of netfilter headers. The structs and typedef
themselves are defined by linux/netfilter.h but only when
CONFIG_NETFILTER is enabled. Define them unconditionally and add
forward declarations in order to remove CONFIG_NETFILTER conditionals
from the other headers.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a superfluous `#if IS_ENABLED(CONFIG_NF_CONNTRACK)` check
wrapping some function declarations. Remove it.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move some `struct nf_conntrack` code from linux/skbuff.h to
linux/nf_conntrack_common.h. Together with a couple of helpers for
getting and setting skb->_nfct, it allows us to remove
CONFIG_NF_CONNTRACK checks from net/netfilter/nf_conntrack.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The real br_nf_pre_routing_ipv6 function, defined when CONFIG_IPV6 is
enabled, expects `void *priv`, not `const struct nf_hook_ops *ops`.
Update the stub br_nf_pre_routing_ipv6, defined when CONFIG_IPV6 is
disabled, to match.
Fixes: 06198b34a3 ("netfilter: Pass priv instead of nf_hook_ops to netfilter hooks")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_synproxy.h contains three inline functions. The contents
of two of them are wrapped in CONFIG_NETFILTER_SYNPROXY checks and just
return NULL if it is not enabled. The third does nothing if they return
NULL, so wrap its contents as well.
nf_ct_timeout_data is only called if CONFIG_NETFILTER_TIMEOUT is
enabled. Wrap its contents in a CONFIG_NETFILTER_TIMEOUT check like the
other inline functions in nf_conntrack_timeout.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A few headers contain instances of:
#if defined(CONFIG_XXX) or defined(CONFIG_XXX_MODULE)
Replace them with:
#if IS_ENABLED(CONFIG_XXX)
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The header contains some inline functions defined as:
static inline f (...)
{
#ifdef CONFIG_NF_CONNTRACK_EVENTS
...
#else
...
#endif
}
and a few others as:
#ifdef CONFIG_NF_CONNTRACK_EVENTS
static inline f (...)
{
...
}
#else
static inline f (...)
{
...
}
#endif
Prefer the former style, which is more numerous.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a struct definition function in nf_conntrack_bridge.h which is
not specific to conntrack and is used elswhere in netfilter. Move it
into netfilter_bridge.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is some non-conntrack code in the nf_conntrack_synproxy.h header.
Move it to the nf_synproxy.h header.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_icmpv6.h contains two object macros which duplicate macros
in linux/icmpv6.h. The latter definitions are also visible wherever it
is included, so remove it.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Include some headers in files which require them, and remove others
which are not required.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Several header-files, Kconfig files and Makefiles have trailing
white-space. Remove it.
In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
correctly.
There are semicolons at the end of two function definitions in
include/net/netfilter/nf_conntrack_acct.h and
include/net/netfilter/nf_conntrack_ecache.h. Remove them.
Fix indentation in nf_conntrack_l4proto.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_labels.h has no include guard. Add it.
The comment following the #endif in the nf_flow_table.h include guard
referred to the wrong macro. Fix it.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the net_device unregisters, clean up the offload rules before the
chain is destroy.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds support for packet mirroring and redirection. The
nft_fwd_dup_netdev_offload() function configures the flow_action object
for the fwd and the dup actions.
Extend nft_flow_rule_destroy() to release the net_device object when the
flow_rule object is released, since nft_fwd_dup_netdev_offload() bumps
the net_device reference counter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
Add nft_offload_init() and nft_offload_exit() function to deal with the
init and the exit path of the offload infrastructure.
Rename nft_indr_block_get_and_ing_cmd() to nft_indr_block_cb().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the infrastructure needed for the stateful object update
support.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch implements the delete operation from the ruleset.
It implements a new delete() function in nft_set_rhash. It is simpler
to use than the already existing remove(), because it only takes the set
and the key as arguments, whereas remove() expects a full
nft_set_elem structure.
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce new helper functions to load/store 64-bit values onto/from
registers:
- nft_reg_store64
- nft_reg_load64
This commit also re-orders all these helpers from smallest to largest
target bit size.
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds initial support for offloading basechains using the
priority range from 1 to 65535. This is restricting the netfilter
priority range to 16-bit integer since this is what most drivers assume
so far from tc. It should be possible to extend this range of supported
priorities later on once drivers are updated to support for 32-bit
integer priorities.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>