In the commit referenced below, hw_stats_type of an entry is set for every
entry that corresponds to a pedit action. However, the assignment is only
done after the entry pointer is bumped, and therefore could overwrite
memory outside of the entries array.
The reason for this positioning may have been that the current entry's
hw_stats_type is already set above, before the action-type dispatch.
However, if there are no more actions, the assignment is wrong. And if
there are, the next round of the for_each_action loop will make the
assignment before the action-type dispatch anyway.
Therefore fix this issue by simply reordering the two lines.
Fixes: 74522e7baa ("net: sched: set the hw_stats_type in pedit loop")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the following commits:
8537f78647 ("netfilter: Introduce egress hook")
5418d3881e ("netfilter: Generalize ingress hook")
b030f194ae ("netfilter: Rename ingress hook include file")
>From the discussion in [0], the author's main motivation to add a hook
in fast path is for an out of tree kernel module, which is a red flag
to begin with. Other mentioned potential use cases like NAT{64,46}
is on future extensions w/o concrete code in the tree yet. Revert as
suggested [1] given the weak justification to add more hooks to critical
fast-path.
[0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
[1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Miller <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey.
2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi.
Follow up patch to clean up this new mode, from Dan Carpenter.
3) Add support for geneve tunnel options, from Xin Long.
4) Make sets built-in and remove modular infrastructure for sets,
from Florian Westphal.
5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing.
6) Statify nft_pipapo_get, from Chen Wandun.
7) Use C99 flexible-array member, from Gustavo A. R. Silva.
8) More descriptive variable names for bitwise, from Jeremy Sowden.
9) Four patches to add tunnel device hardware offload to the flowtable
infrastructure, from wenxu.
10) pipapo set supports for 8-bit grouping, from Stefano Brivio.
11) pipapo can switch between nibble and byte grouping, also from
Stefano.
12) Add AVX2 vectorized version of pipapo, from Stefano Brivio.
13) Update pipapo to be use it for single ranges, from Stefano.
14) Add stateful expression support to elements via control plane,
eg. counter per element.
15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal.
15) Add new egress hook, from Lukas Wunner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 58b0991962 ("mptcp: create msk early"), the
msk socket is already available at subflow_syn_recv_sock()
time. Let's move there the state update, to mirror more
closely the first subflow state.
The above will also help multiple subflow supports.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for manipulating vlan/tunnel mappings. The
tunnel ids are globally unique and are one per-vlan. There were two
trickier issues - first in order to support vlan ranges we have to
compute the current tunnel id in the following way:
- base tunnel id (attr) + current vlan id - starting vlan id
This is in line how the old API does vlan/tunnel mapping with ranges. We
already have the vlan range present, so it's redundant to add another
attribute for the tunnel range end. It's simply base tunnel id + vlan
range. And second to support removing mappings we need an out-of-band way
to tell the option manipulating function because there are no
special/reserved tunnel id values, so we use a vlan flag to denote the
operation is tunnel mapping removal.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new option - BRIDGE_VLANDB_ENTRY_TUNNEL_ID which is used to dump
the tunnel id mapping. Since they're unique per vlan they can enter a
vlan range if they're consecutive, thus we can calculate the tunnel id
range map simply as: vlan range end id - vlan range start id. The
starting point is the tunnel id in BRIDGE_VLANDB_ENTRY_TUNNEL_ID. This
is similar to how the tunnel entries can be created in a range via the
old API (a vlan range maps to a tunnel range).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vlan tunnel code changes vlan options, it shouldn't touch port or
bridge options so we can constify the port argument. This would later help
us to re-use these functions from the vlan options code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is more appropriate name as it shows the intent of why we need to
check the options' state. It also allows us to give meaning to the two
arguments of the function: the first is the current vlan (v_curr) being
checked if it could enter the range ending in the second one (range_end).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_watchdog_schedule_range_ns() can use the newly added slack
and avoid rearming the hrtimer a bit earlier than the current
value. This patch has no effect if delta_ns parameter
is zero.
Note that this means the max slack is potentially doubled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some packet schedulers might want to add a slack
when programming hrtimers. This can reduce number
of interrupts and increase batch sizes and thus
give good xmit_more savings.
This commit adds qdisc_watchdog_schedule_range_ns()
helper, with an extra delta_ns parameter.
Legacy qdisc_watchdog_schedule_n() becomes an inline
passing a zero slack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flow_action_hw_stats_types_check() helper takes one of the
FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
the arguments to the opening bracket of the helper there
is no way to call this helper and stay under 80 characters.
Remove the "types" part from the new flow_action helpers
and enum values.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all in-tree drivers have been updated we can
make the supported_coalesce_params mandatory.
To save debugging time in case some driver was missed
(or is out of tree) add a warning when netdev is registered
with set_coalesce but without supported_coalesce_params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.
Allow the same on egress. Position the hook immediately before a packet
is handed to tc and then sent out on an interface, thereby mirroring the
ingress order. This order allows marking packets in the netfilter
egress hook and subsequently using the mark in tc. Another benefit of
this order is consistency with a lot of existing documentation which
says that egress tc is performed after netfilter hooks.
Egress hooks already exist for the most common protocols, such as
NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
they are executed earlier during packet processing. However for more
exotic protocols, there is currently no provision to apply netfilter on
egress. A common workaround is to enslave the interface to a bridge and
use ebtables, or to resort to tc. But when the ingress hook was
introduced, consensus was that users should be given the choice to use
netfilter or tc, whichever tool suits their needs best:
https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
This hook is also useful for NAT46/NAT64, tunneling and filtering of
locally generated af_packet traffic such as dhclient.
There have also been occasional user requests for a netfilter egress
hook in the past, e.g.:
https://www.spinics.net/lists/netfilter/msg50038.html
Performance measurements with pktgen surprisingly show a speedup rather
than a slowdown with this commit:
* Without this commit:
Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
2920481pps 1401Mb/sec (1401830880bps) errors: 0
* With this commit:
Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
2941410pps 1411Mb/sec (1411876800bps) errors: 0
* Without this commit + tc egress:
Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
2562631pps 1230Mb/sec (1230062880bps) errors: 0
* With this commit + tc egress:
Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
2659259pps 1276Mb/sec (1276444320bps) errors: 0
* With this commit + nft egress:
Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
2413320pps 1158Mb/sec (1158393600bps) errors: 0
Tested on a bare-metal Core i7-3615QM, each measurement was performed
three times to verify that the numbers are stable.
Commands to perform a measurement:
modprobe pktgen
echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000
Commands for testing tc egress:
tc qdisc add dev lo clsact
tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32
Commands for testing nft egress:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop
All testing was performed on the loopback interface to avoid distorting
measurements by the packet handling in the low-level Ethernet driver.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").
In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change Yeah to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().
In addition, we re-implemented the scalable path using
tcp_cong_avoid_ai() and removed the pkts_acked variable.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Veno to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No code logic has been changed in this patch.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Scalable to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets to
tcp_cong_avoid_ai().
In addition, because we are now precisely accounting for
stretch ACKs, including delayed ACKs, we can now change
TCP_SCALABLE_AI_CNT to 100.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes BIC to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This path fixes the suspicious RCU usage warning reported by
kernel test robot.
net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section
There is no need to use list_for_each_entry_rcu() in
kcm_stats_seq_show() as the list is always traversed under
knet->mutex held.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For a single pedit action, multiple offload entries may be used. Set the
hw_stats_type to all of them.
Fixes: 44f8658017 ("sched: act: allow user to specify type of HW stats for a filter")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Issue a warning to the kernel log if phylink_mac_link_state() returns
an error. This should not occur, but let's make it visible.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit b884fa4617 ("netfilter: conntrack: unify sysctl handling")
conntrack no longer exposes most of its sysctls (e.g. tcp timeouts
settings) to network namespaces that are not owned by the initial user
namespace.
This patch exposes all sysctls even if the namespace is unpriviliged.
compared to a 4.19 kernel, the newly visible and writeable sysctls are:
net.netfilter.nf_conntrack_acct
net.netfilter.nf_conntrack_timestamp
.. to allow to enable accouting and timestamp extensions.
net.netfilter.nf_conntrack_events
.. to turn off conntrack event notifications.
net.netfilter.nf_conntrack_checksum
.. to disable checksum validation.
net.netfilter.nf_conntrack_log_invalid
.. to enable logging of packets deemed invalid by conntrack.
newly visible sysctls that are only exported as read-only:
net.netfilter.nf_conntrack_count
.. current number of conntrack entries living in this netns.
net.netfilter.nf_conntrack_max
.. global upperlimit (maximum size of the table).
net.netfilter.nf_conntrack_buckets
.. size of the conntrack table (hash buckets).
net.netfilter.nf_conntrack_expect_max
.. maximum number of permitted expectations in this netns.
net.netfilter.nf_conntrack_helper
.. conntrack helper auto assignment.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink
attribute. This patch allows users to to add elements with stateful
expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.
In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.
That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.
Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.
However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.
---------------.-----------------------------------.------------.
AMD Epyc 7402 | baselines, Mpps | this patch |
1 thread |___________________________________|____________|
3.35GHz | | | | | |
768KiB L1D$ | netdev | hash | rbtree | | |
---------------| hook | no | single | | pipapo |
type entries | drop | ranges | field | pipapo | AVX2 |
---------------|--------|--------|--------|--------|------------|
net,port | | | | | |
1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% |
---------------|--------|--------|--------|--------|------------|
port,net | | | | | |
100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% |
---------------|--------|--------|--------|--------|------------|
net6,port | | | | | |
1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% |
---------------|--------|--------|--------|--------|------------|
port,proto | | | | | |
30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac | | | | | |
10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, | | | | | |
proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% |
---------------|--------|--------|--------|--------|------------|
net,mac | | | | | |
1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% |
---------------'--------'--------'--------'--------'------------'
A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.
No functional changes are intended here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).
Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.
Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While grouping matching bits in groups of four saves memory compared
to the more natural choice of 8-bit words (lookup table size is one
eighth), it comes at a performance cost, as the number of lookup
comparisons is doubled, and those also needs bitshifts and masking.
Introduce support for 8-bit lookup groups, together with a mapping
mechanism to dynamically switch, based on defined per-table size
thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
grow and shrink. Empty sets start with 8-bit groups, and per-field
tables are converted to 4-bit groups if they get too big.
An alternative approach would have been to swap per-set lookup
operation functions as needed, but this doesn't allow for different
group sizes in the same set, which looks desirable if some fields
need significantly more matching data compared to others due to
heavier impact of ranges (e.g. a big number of subnets with
relatively simple port specifications).
Allowing different group sizes for the same lookup functions implies
the need for further conditional clauses, whose cost, however,
appears to be negligible in tests.
The matching rate figures below were obtained for x86_64 running
the nft_concat_range.sh "performance" cases, averaged over five
runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
clocked at a stable 2147MHz frequency:
---------------.-----------------------------------.------------.
AMD Epyc 7402 | baselines, Mpps | this patch |
1 thread |___________________________________|____________|
3.35GHz | | | | | |
768KiB L1D$ | netdev | hash | rbtree | | |
---------------| hook | no | single | pipapo | pipapo |
type entries | drop | ranges | field | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port | | | | | |
1000 | 19.0 | 10.4 | 3.8 | 2.8 | 4.0 +43% |
---------------|--------|--------|--------|--------|------------|
port,net | | | | | |
100 | 18.8 | 10.3 | 5.8 | 5.5 | 6.3 +14% |
---------------|--------|--------|--------|--------|------------|
net6,port | | | | | |
1000 | 16.4 | 7.6 | 1.8 | 1.3 | 2.1 +61% |
---------------|--------|--------|--------|--------|------------|
port,proto | | | | | [1] |
30000 | 19.6 | 11.6 | 3.9 | 0.3 | 0.5 +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac | | | | | |
10 | 16.5 | 5.4 | 4.3 | 2.6 | 3.4 +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, | | | | | |
proto 1000 | 16.5 | 5.7 | 1.9 | 1.0 | 1.4 +40% |
---------------|--------|--------|--------|--------|------------|
net,mac | | | | | |
1000 | 19.0 | 8.4 | 3.9 | 1.7 | 2.5 +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto',
to 4-bit groups
---------------.-----------------------------------.------------.
BCM2711 | baselines, Mpps | this patch |
1 thread |___________________________________|____________|
2147MHz | | | | | |
32KiB L1D$ | netdev | hash | rbtree | | |
---------------| hook | no | single | pipapo | pipapo |
type entries | drop | ranges | field | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port | | | | | |
1000 | 1.63 | 1.37 | 0.87 | 0.61 | 0.70 +17% |
---------------|--------|--------|--------|--------|------------|
port,net | | | | | |
100 | 1.64 | 1.36 | 1.02 | 0.78 | 0.81 +4% |
---------------|--------|--------|--------|--------|------------|
net6,port | | | | | |
1000 | 1.56 | 1.27 | 0.65 | 0.34 | 0.50 +47% |
---------------|--------|--------|--------|--------|------------|
port,proto [2] | | | | | |
10000 | 1.68 | 1.43 | 0.84 | 0.30 | 0.40 +13% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac | | | | | |
10 | 1.56 | 1.14 | 1.02 | 0.62 | 0.66 +6% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, | | | | | |
proto 1000 | 1.56 | 1.12 | 0.64 | 0.27 | 0.40 +48% |
---------------|--------|--------|--------|--------|------------|
net,mac | | | | | |
1000 | 1.63 | 1.26 | 0.87 | 0.41 | 0.53 +29% |
---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too
long for the test script to generate all of them
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Get rid of all hardcoded assumptions that buckets in lookup tables
correspond to four-bit groups, and replace them with appropriate
calculations based on a variable group size, now stored in struct
field.
The group size could now be in principle any divisor of eight. Note,
though, that lookup and get functions need an implementation
intimately depending on the group size, and the only supported size
there, currently, is four bits, which is also the initial and only
used size at the moment.
While at it, drop 'groups' from struct nft_pipapo: it was never used.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch add tunnel encap decap action offload in the flowtable
offload.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and
tunnel_dst match for flowtable offload
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add etfilter flowtable support indr-block setup. It makes flowtable offload
vlan and tunnel device.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These lines were indented wrong so Smatch complained.
net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Name the mask and xor data variables, "mask" and "xor," instead of "d1"
and "d2."
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix the following sparse warning:
net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static?
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcf
("netfilter: fix netns dependencies with conntrack templates")
PFX is not used after commit 8bee4bad03 ("netfilter: xt
extensions: use pr_<level>")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
They do not need to be writeable anymore.
v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Placing nftables set support in an extra module is pointless:
1. nf_tables needs dynamic registeration interface for sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
"nft ... tcp dport { 80, 443 }" will not work with _SETS=n.
IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.
With extra module:
307K net/netfilter/nf_tables.ko
79K net/netfilter/nf_tables_set.ko
text data bss dec filename
146416 3072 545 150033 nf_tables.ko
35496 1817 0 37313 nf_tables_set.ko
This patch:
373K net/netfilter/nf_tables.ko
178563 4049 545 183157 nf_tables.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Like vxlan and erspan opts, geneve opts should also be supported in
nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14)
allows a geneve packet to carry multiple geneve opts. So with this
patch, nftables/libnftnl would do:
# nft add table ip filter
# nft add chain ip filter input { type filter hook input priority 0 \; }
# nft add tunnel filter geneve_02 { type geneve\; id 2\; \
ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
# nft list tunnels table filter
table ip filter {
tunnel geneve_02 {
id 2
ip saddr 192.168.1.1
ip daddr 192.168.1.2
sport 9000
dport 9001
tos 18
ttl 64
flags 1
geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
}
}
v1->v2:
- no changes, just post it separately.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a snapshot of hardidletimer netfilter target.
This patch implements a hardidletimer Xtables target that can be
used to identify when interfaces have been idle for a certain period
of time.
Timers are identified by labels and are created when a rule is set
with a new label. The rules also take a timeout value (in seconds) as
an option. If more than one rule uses the same timer label, the timer
will be restarted whenever any of the rules get a hit.
One entry for each timer is created in sysfs. This attribute contains
the timer remaining for the timer to expire. The attributes are
located under the xt_idletimer class:
/sys/class/xt_idletimer/timers/<label>
When the timer expires, the target module sends a sysfs notification
to the userspace, which can then decide what to do (eg. disconnect to
save power)
Compared to IDLETIMER, HARDIDLETIMER can send notifications when
CPU is in suspend too, to notify the timer expiry.
v1->v2: Moved all functionality into IDLETIMER module to avoid
code duplication per comment from Florian.
Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>