Commit Graph

96 Commits

Author SHA1 Message Date
Xin Long
0564c3e51b net: sched: update the nat flag for icmp error packets in ct_nat_execute
In ovs_ct_nat_execute(), the packet flow key nat flags are updated
when it processes ICMP(v6) error packets translation successfully.

In ct_nat_execute() when processing ICMP(v6) error packets translation
successfully, it should have done the same in ct_nat_execute() to set
post_ct_s/dnat flag, which will be used to update flow key nat flags
in OVS module later.

Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12 10:14:03 +00:00
Pedro Tammela
871cf386dd net/sched: avoid indirect act functions on retpoline kernels
Expose the necessary tc act functions and wire up act_api to use
direct calls in retpoline kernels.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09 09:18:07 +00:00
Jakub Kicinski
f2bb566f5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c
  927cbb478a ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0a ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29 13:04:52 -08:00
Daniel Xu
52d1aa8b82 netfilter: conntrack: Fix data-races around ct mark
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-18 15:21:00 +01:00
Xin Long
a21b06e731 net: sched: add helper support in act_ct
This patch is to add helper support in act_ct for OVS actions=ct(alg=xxx)
offloading, which is corresponding to Commit cae3a26275 ("openvswitch:
Allow attaching helpers to ct action") in OVS kernel part.

The difference is when adding TC actions family and proto cannot be got
from the filter/match, other than helper name in tb[TCA_CT_HELPER_NAME],
we also need to send the family in tb[TCA_CT_HELPER_FAMILY] and the
proto in tb[TCA_CT_HELPER_PROTO] to kernel.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08 12:15:19 +01:00
Xin Long
1913894100 net: sched: call tcf_ct_params_free to free params in tcf_ct_init
This patch is to make the err path simple by calling tcf_ct_params_free(),
so that it won't cause problems when more members are added into param and
need freeing on the err path.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08 12:15:19 +01:00
Jakub Kicinski
accc3b4a57 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29 14:30:51 -07:00
Hangyu Hua
6e23ec0ba9 net: sched: act_ct: fix possible refcount leak in tcf_ct_init()
nf_ct_put need to be called to put the refcount got by tcf_ct_fill_params
to avoid possible refcount leak when tcf_ct_flow_table_get fails.

Fixes: c34b961a24 ("net/sched: act_ct: Create nf flow table per zone")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Link: https://lore.kernel.org/r/20220923020046.8021-1-hbh25y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26 12:40:39 -07:00
Jinpeng Cui
2a566f0148 net: sched: act_ct: remove redundant variable err
Return value directly from pskb_trim_rcsum() instead of
getting value from redundant variable err.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-21 12:49:32 +01:00
Zhengchao Shao
cb967ace0a net: sched: act_ct: get rid of tcf_ct_walker and tcf_ct_search
tcf_ct_walker() and tcf_ct_search() do the same thing as generic
walk/search function, so remove them.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-09 08:24:41 +01:00
Zhengchao Shao
acd0a7ab63 net: sched: act: move global static variable net_id to tc_action_ops
Each tc action module has a corresponding net_id, so put net_id directly
into the structure tc_action_ops.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-09 08:24:41 +01:00
Vlad Buslov
fc54d9065f net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
Following patches in series use the pointer to access flow table offload
debug variables.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-07-11 16:25:14 +02:00
Dan Carpenter
86360030cc net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()
The tcf_ct_flow_table_fill_tuple_ipv6() function is supposed to return
false on failure.  It should not return negatives because that means
succes/true.

Fixes: fcb6aa8653 ("act_ct: Support GRE offload")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://lore.kernel.org/r/YpYFnbDxFl6tQ3Bn@kili
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-01 13:32:04 +02:00
Ido Schimmel
c2ccf84ecb net/sched: act_api: Add extack to offload_act_setup() callback
The callback is used by various actions to populate the flow action
structure prior to offload. Pass extack to this callback so that the
various actions will be able to report accurate error messages to user
space.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-08 13:45:43 +01:00
Marcelo Ricardo Leitner
bcb74e132a net/sched: act_ct: fix ref leak when switching zones
When switching zones or network namespaces without doing a ct clear in
between, it is now leaking a reference to the old ct entry. That's
because tcf_ct_skb_nfct_cached() returns false and
tcf_ct_flow_table_lookup() may simply overwrite it.

The fix is to, as the ct entry is not reusable, free it already at
tcf_ct_skb_nfct_cached().

Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 2f131de361 ("net/sched: act_ct: Fix flow table lookup after ct clear or switching zones")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-26 17:00:51 -07:00
Jakub Kicinski
abe2fec8ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) Revert CHECKSUM_UNNECESSARY for UDP packet from conntrack.

2) Reject unsupported families when creating tables, from Phil Sutter.

3) GRE support for the flowtable, from Toshiaki Makita.

4) Add GRE offload support for act_ct, also from Toshiaki.

5) Update mlx5 driver to support for GRE flowtable offload,
   from Toshiaki Makita.

6) Oneliner to clean up incorrect indentation in nf_conntrack_bridge,
   from Jiapeng Chong.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: bridge: clean up some inconsistent indenting
  net/mlx5: Support GRE conntrack offload
  act_ct: Support GRE offload
  netfilter: flowtable: Support GRE
  netfilter: nf_tables: Reject tables of unsupported family
  Revert "netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY"
====================

Link: https://lore.kernel.org/r/20220315091513.66544-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-15 11:52:25 -07:00
Jakub Kicinski
80901bff81 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/batman-adv/hard-interface.c
  commit 690bb6fb64 ("batman-adv: Request iflink once in batadv-on-batadv check")
  commit 6ee3c393ee ("batman-adv: Demote batadv-on-batadv skip error message")
https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/

net/smc/af_smc.c
  commit 4d08b7b57e ("net/smc: Fix cleanup when register ULP fails")
  commit 462791bbfa ("net/smc: add sysctl interface for SMC")
https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-03 11:55:12 -08:00
Toshiaki Makita
fcb6aa8653 act_ct: Support GRE offload
Support GREv0 without NAT.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-03 15:20:51 +01:00
Paul Blakey
db6140e5e3 net/sched: act_ct: Fix flow table lookup failure with no originating ifindex
After cited commit optimizted hw insertion, flow table entries are
populated with ifindex information which was intended to only be used
for HW offload. This tuple ifindex is hashed in the flow table key, so
it must be filled for lookup to be successful. But tuple ifindex is only
relevant for the netfilter flowtables (nft), so it's not filled in
act_ct flow table lookup, resulting in lookup failure, and no SW
offload and no offload teardown for TCP connection FIN/RST packets.

To fix this, add new tc ifindex field to tuple, which will
only be used for offloading, not for lookup, as it will not be
part of the tuple hash.

Fixes: 9795ded7f9 ("net/sched: act_ct: Fill offloading tuple iifidx")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-01 22:08:31 +01:00
Jakub Kicinski
aaa25a2fa7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/mptcp/mptcp_join.sh
  34aa6e3bcc ("selftests: mptcp: add ip mptcp wrappers")

  857898eb4b ("selftests: mptcp: add missing join check")
  6ef84b1517 ("selftests: mptcp: more robust signal race test")
https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/

drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h
drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c
  fb7e76ea3f ("net/mlx5e: TC, Skip redundant ct clear actions")
  c63741b426 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information")

  09bf979232 ("net/mlx5e: TC, Move pedit_headers_action to parse_attr")
  84ba8062e3 ("net/mlx5e: Test CT and SAMPLE on flow attr")
  efe6f961cd ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow")
  3b49a7edec ("net/mlx5e: TC, Reject rules with multiple CT actions")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24 17:54:25 -08:00
Paul Blakey
2f131de361 net/sched: act_ct: Fix flow table lookup after ct clear or switching zones
Flow table lookup is skipped if packet either went through ct clear
action (which set the IP_CT_UNTRACKED flag on the packet), or while
switching zones and there is already a connection associated with
the packet. This will result in no SW offload of the connection,
and the and connection not being removed from flow table with
TCP teardown (fin/rst packet).

To fix the above, remove these unneccary checks in flow
table lookup.

Fixes: 46475bb20f ("net/sched: act_ct: Software offload of established flows")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-18 11:02:48 +00:00
Florian Westphal
1015c3de23 netfilter: conntrack: remove extension register api
These no longer register/unregister a meaningful structure so remove it.

Cc: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
5f31edc067 netfilter: conntrack: move extension sizes into core
No need to specify this in the registration modules, we already
collect all sizes for build-time checks on the maximum combined size.

After this change, all extensions except nat have no meaningful content
in their nf_ct_ext_type struct definition.

Next patch handles nat, this will then allow to remove the dynamic
register api completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
bb62a765b1 netfilter: conntrack: make all extensions 8-byte alignned
All extensions except one need 8 byte alignment, so just make that the
default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Jakub Kicinski
8aaaf2f3af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in fixes directly in prep for the 5.17 merge window.
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 17:00:17 -08:00
Paul Blakey
6f022c2ddb net: openvswitch: Fix ct_state nat flags for conns arriving from tc
Netfilter conntrack maintains NAT flags per connection indicating
whether NAT was configured for the connection. Openvswitch maintains
NAT flags on the per packet flow key ct_state field, indicating
whether NAT was actually executed on the packet.

When a packet misses from tc to ovs the conntrack NAT flags are set.
However, NAT was not necessarily executed on the packet because the
connection's state might still be in NEW state. As such, openvswitch
wrongly assumes that NAT was executed and sets an incorrect flow key
NAT flags.

Fix this, by flagging to openvswitch which NAT was actually done in
act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so
the packet flow key NAT flags will be correctly set.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:24:12 -08:00
Jakub Kicinski
77bbcb60f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. This
includes one patch to update ovs and act_ct to use nf_ct_put() instead
of nf_conntrack_put().

1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet.

2) Remove redundant rcu read-size lock in nf_tables packet path.

3) Replace BUG() by WARN_ON_ONCE() in nft_payload.

4) Consolidate rule verdict tracing.

5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core.

6) Make counter support built-in in nf_tables.

7) Add new field to conntrack object to identify locally generated
   traffic, from Florian Westphal.

8) Prevent NAT from shadowing well-known ports, from Florian Westphal.

9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from
   Florian.

10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King.

11) Replace opencoded max() in conntrack, from Jiapeng Chong.

12) Update conntrack to use refcount_t API, from Florian Westphal.

13) Move ip_ct_attach indirection into the nf_ct_hook structure.

14) Constify several pointer object in the netfilter codebase,
    from Florian Westphal.

15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also
    from Florian.

16) Fix egress splat due to incorrect rcu notation, from Florian.

17) Move stateful fields of connlimit, last, quota, numgen and limit
    out of the expression data area.

18) Build a blob to represent the ruleset in nf_tables, this is a
    requirement of the new register tracking infrastructure.

19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers.

20) Add register tracking infrastructure to skip redundant
    store-to-register operations, this includes support for payload,
    meta and bitwise expresssions.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits)
  netfilter: nft_meta: cancel register tracking after meta update
  netfilter: nft_payload: cancel register tracking after payload update
  netfilter: nft_bitwise: track register operations
  netfilter: nft_meta: track register operations
  netfilter: nft_payload: track register operations
  netfilter: nf_tables: add register tracking infrastructure
  netfilter: nf_tables: add NFT_REG32_NUM
  netfilter: nf_tables: add rule blob layout
  netfilter: nft_limit: move stateful fields out of expression data
  netfilter: nft_limit: rename stateful structure
  netfilter: nft_numgen: move stateful fields out of expression data
  netfilter: nft_quota: move stateful fields out of expression data
  netfilter: nft_last: move stateful fields out of expression data
  netfilter: nft_connlimit: move stateful fields out of expression data
  netfilter: egress: avoid a lockdep splat
  net: prefer nf_ct_put instead of nf_conntrack_put
  netfilter: conntrack: avoid useless indirection during conntrack destruction
  netfilter: make function op structures const
  netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
  netfilter: conntrack: convert to refcount_t api
  ...
====================

Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 15:59:23 -08:00
Florian Westphal
408bdcfce8 net: prefer nf_ct_put instead of nf_conntrack_put
Its the same as nf_conntrack_put(), but without the
need for an indirect call.  The downside is a module dependency on
nf_conntrack, but all of these already depend on conntrack anyway.

Cc: Paul Blakey <paulb@mellanox.com>
Cc: dev@openvswitch.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:30:14 +01:00
Florian Westphal
7197743776 netfilter: conntrack: convert to refcount_t api
Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.

For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.

Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().

This is safe because the new entry is not (yet) visible to other cpus.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:30:13 +01:00
Paul Blakey
9795ded7f9 net/sched: act_ct: Fill offloading tuple iifidx
Driver offloading ct tuples can use the information of which devices
received the packets that created the offloaded connections, to
more efficiently offload them only to the relevant device.

Add new act_ct nf conntrack extension, which is used to store the skb
devices before offloading the connection, and then fill in the tuple
iifindex so drivers can get the device via metadata dissector match.

Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04 12:12:55 +00:00
Jakub Kicinski
8b3f913322 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  commit 8f905c0e73 ("inet: fully convert sk->sk_rx_dst to RCU rules")
  commit 43f51df417 ("net: move early demux fields close to sk_refcnt")
  https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23 16:09:58 -08:00
Baowen Zheng
8cbfe939ab flow_offload: allow user to offload tc action to net device
Use flow_indr_dev_register/flow_indr_dev_setup_offload to
offload tc action.

We need to call tc_cleanup_flow_action to clean up tc action entry since
in tc_setup_action, some actions may hold dev refcnt, especially the mirror
action.

Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-19 14:08:48 +00:00
Baowen Zheng
c54e1d920f flow_offload: add ops to tc_action_ops for flow action setup
Add a new ops to tc_action_ops for flow action setup.

Refactor function tc_setup_flow_action to use this new ops.

We make this change to facilitate to add standalone action module.

We will also use this ops to offload action independent of filter
in following patch.

Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-19 14:08:48 +00:00
Paul Blakey
3849595866 net/sched: flow_dissector: Fix matching on zone id for invalid conns
If ct rejects a flow, it removes the conntrack info from the skb.
act_ct sets the post_ct variable so the dissector will see this case
as an +tracked +invalid state, but the zone id is lost with the
conntrack info.

To restore the zone id on such cases, set the last executed zone,
via the tc control block, when passing ct, and read it back in the
dissector if there is no ct info on the skb (invalid connection).

Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-17 18:06:35 -08:00
Paul Blakey
ec624fe740 net/sched: Extend qdisc control block with tc control block
BPF layer extends the qdisc control block via struct bpf_skb_data_end
and because of that there is no more room to add variables to the
qdisc layer control block without going over the skb->cb size.

Extend the qdisc control block with a tc control block,
and move all tc related variables to there as a pre-step for
extending the tc control block with additional members.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-17 18:06:35 -08:00
Chris Mi
43332cf974 net/sched: act_ct: Offload only ASSURED connections
Short-lived connections increase the insertion rate requirements,
fill the offload table and provide very limited offload value since
they process a very small amount of packets. The ct ASSURED flag is
designed to filter short-lived connections for early expiration.

Offload connections when they are ESTABLISHED and ASSURED.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-03 11:02:25 +00:00
Paul Blakey
2dc4e9e88c net/sched: act_ct: Fix byte count on fragmented packets
First fragmented packets (frag offset = 0) byte len is zeroed
when stolen by ip_defrag(). And since act_ct update the stats
only afterwards (at end of execute), bytes aren't correctly
accounted for such packets.

To fix this, move stats update to start of action execute.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 13:31:58 +01:00
Cong Wang
695176bfe5 net_sched: refactor TC action init API
TC action ->init() API has 10 parameters, it becomes harder
to read. Some of them are just boolean and can be replaced
by flags. Similarly for the internal API tcf_action_init()
and tcf_exts_validate().

This patch converts them to flags and fold them into
the upper 16 bits of "flags", whose lower 16 bits are still
reserved for user-space. More specifically, the following
kernel flags are introduced:

TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to
distinguish whether it is compatible with policer.

TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether
this action is bound to a filter.

TCA_ACT_FLAGS_REPLACE  replaces 'ovr' in most contexts,
means we are replacing an existing action.

TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the
opposite meaning, because we still hold RTNL in most
cases.

The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is
untouched and still stored as before.

I have tested this patch with tdc and I do not see any
failure related to this patch.

Tested-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02 10:24:38 +01:00
Louis Peens
77ac5e40c4 net/sched: act_ct: remove and free nf_table callbacks
When cleaning up the nf_table in tcf_ct_flow_table_cleanup_work
there is no guarantee that the callback list, added to by
nf_flow_table_offload_add_cb, is empty. This means that it is
possible that the flow_block_cb memory allocated will be lost.

Fix this by iterating the list and freeing the flow_block_cb entries
before freeing the nf_table entry (via freeing ct_ft).

Fixes: 978703f425 ("netfilter: flowtable: Add API for registering to flow table events")
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:36:35 -07:00
wenxu
8955b90c3c net/sched: act_ct: fix err check for nf_conntrack_confirm
The confirm operation should be checked. If there are any failed,
the packet should be dropped like in ovs and netfilter.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 12:07:08 -07:00
Marcelo Ricardo Leitner
13c62f5371 net/sched: act_ct: handle DNAT tuple collision
This this the counterpart of 8aa7b526dc ("openvswitch: handle DNAT
tuple collision") for act_ct. From that commit changelog:

"""
With multiple DNAT rules it's possible that after destination
translation the resulting tuples collide.

...

Netfilter handles this case by allocating a null binding for SNAT at
egress by default.  Perform the same operation in openvswitch for DNAT
if no explicit SNAT is requested by the user and allocate a null binding
for SNAT for packets in the "original" direction.
"""

Fixes: 95219afbb9 ("act_ct: support asymmetric conntrack")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 15:34:51 -07:00
Ariel Levkovich
fb91702b74 net/sched: act_ct: Fix ct template allocation for zone 0
Fix current behavior of skipping template allocation in case the
ct action is in zone 0.

Skipping the allocation may cause the datapath ct code to ignore the
entire ct action with all its attributes (commit, nat) in case the ct
action in zone 0 was preceded by a ct clear action.

The ct clear action sets the ct_state to untracked and resets the
skb->_nfct pointer. Under these conditions and without an allocated
ct template, the skb->_nfct pointer will remain NULL which will
cause the tc ct action handler to exit without handling commit and nat
actions, if such exist.

For example, the following rule in OVS dp:
recirc_id(0x2),ct_state(+new-est-rel-rpl+trk),ct_label(0/0x1), \
in_port(eth0),actions:ct_clear,ct(commit,nat(src=10.11.0.12)), \
recirc(0x37a)

Will result in act_ct skipping the commit and nat actions in zone 0.

The change removes the skipping of template allocation for zone 0 and
treats it the same as any other zone.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/20210526170110.54864-1-lariel@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 14:54:23 -07:00
Paul Blakey
0cc254e5aa net/sched: act_ct: Offload connections with commit action
Currently established connections are not offloaded if the filter has a
"ct commit" action. This behavior will not offload connections of the
following scenario:

$ tc_filter add dev $DEV ingress protocol ip prio 1 flower \
  ct_state -trk \
  action ct commit action goto chain 1

$ tc_filter add dev $DEV ingress protocol ip chain 1 prio 1 flower \
  action mirred egress redirect dev $DEV2

$ tc_filter add dev $DEV2 ingress protocol ip prio 1 flower \
  action ct commit action goto chain 1

$ tc_filter add dev $DEV2 ingress protocol ip prio 1 chain 1 flower \
  ct_state +trk+est \
  action mirred egress redirect dev $DEV

Offload established connections, regardless of the commit flag.

Fixes: 46475bb20f ("net/sched: act_ct: Software offload of established flows")
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/1622029449-27060-1-git-send-email-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 14:54:14 -07:00
Roi Dayan
9be02dd385 net/sched: act_ct: Remove redundant ct get and check
The assignment is not being used and redundant.
The check for null is redundant as nf_conntrack_put() also
checks this.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/20210428060532.3330974-1-roid@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-28 13:51:18 -07:00
Davide Caratti
f77bd544a6 net/sched: act_ct: fix wild memory access when clearing fragments
while testing re-assembly/re-fragmentation using act_ct, it's possible to
observe a crash like the following one:

 KASAN: maybe wild-memory-access in range [0x0001000000000448-0x000100000000044f]
 CPU: 50 PID: 0 Comm: swapper/50 Tainted: G S                5.12.0-rc7+ #424
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
 RIP: 0010:inet_frag_rbtree_purge+0x50/0xc0
 Code: 00 fc ff df 48 89 c3 31 ed 48 89 df e8 a9 7a 38 ff 4c 89 fe 48 89 df 49 89 c6 e8 5b 3a 38 ff 48 8d 7b 40 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 75 59 48 8d bb d0 00 00 00 4c 8b 6b 40 48 89 f8 48
 RSP: 0018:ffff888c31449db8 EFLAGS: 00010203
 RAX: 0000200000000089 RBX: 000100000000040e RCX: ffffffff989eb960
 RDX: 0000000000000140 RSI: ffffffff97cfb977 RDI: 000100000000044e
 RBP: 0000000000000900 R08: 0000000000000000 R09: ffffed1186289350
 R10: 0000000000000003 R11: ffffed1186289350 R12: dffffc0000000000
 R13: 000100000000040e R14: 0000000000000000 R15: ffff888155e02160
 FS:  0000000000000000(0000) GS:ffff888c31440000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005600cb70a5b8 CR3: 0000000a2c014005 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <IRQ>
  inet_frag_destroy+0xa9/0x150
  call_timer_fn+0x2d/0x180
  run_timer_softirq+0x4fe/0xe70
  __do_softirq+0x197/0x5a0
  irq_exit_rcu+0x1de/0x200
  sysvec_apic_timer_interrupt+0x6b/0x80
  </IRQ>

when act_ct temporarily stores an IP fragment, restoring the skb qdisc cb
results in putting random data in FRAG_CB(), and this causes those "wild"
memory accesses later, when the rbtree is purged. Never overwrite the skb
cb in case tcf_ct_handle_fragments() returns -EINPROGRESS.

Fixes: ae372cb175 ("net/sched: act_ct: fix restore the qdisc_skb_cb after defrag")
Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-26 13:06:39 -07:00
Marcelo Ricardo Leitner
8ca1b090e5 net/sched: act_ct: clear post_ct if doing ct_clear
Invalid detection works with two distinct moments: act_ct tries to find
a conntrack entry and set post_ct true, indicating that that was
attempted. Then, when flow dissector tries to dissect CT info and no
entry is there, it knows that it was tried and no entry was found, and
synthesizes/sets
                  key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                                  TCA_FLOWER_KEY_CT_FLAGS_INVALID;
mimicing what OVS does.

OVS has this a bit more streamlined, as it recomputes the key after
trying to find a conntrack entry for it.

Issue here is, when we have 'tc action ct clear', it didn't clear
post_ct, causing a subsequent match on 'ct_state -trk' to fail, due to
the above. The fix, thus, is to clear it.

Reproducer rules:
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 0 \
	protocol ip flower ip_proto tcp ct_state -trk \
	action ct zone 1 pipe \
	action goto chain 2
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 2 \
	protocol ip flower \
	action ct clear pipe \
	action goto chain 4
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 4 \
	protocol ip flower ct_state -trk \
	action mirred egress redirect dev enp130s0f1np1_0

With the fix, the 3rd rule matches, like it does with OVS kernel
datapath.

Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 14:32:26 -07:00
Paul Blakey
941eff5aea net: flow_offload: Add original direction flag to ct_metadata
Give offloading drivers the direction of the offloaded ct flow,
this will be used for matches on direction (ct_state +/-rpl).

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
wenxu
7baf2429a1 net/sched: cls_flower add CT_FLAGS_INVALID flag support
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:09:44 -08:00
Marcelo Ricardo Leitner
3567e23379 net/sched: act_ct: enable stats for HW offloaded entries
By setting NF_FLOWTABLE_COUNTER. Otherwise, the updates added by
commit ef803b3cf9 ("netfilter: flowtable: add counter support in HW
offload") are not effective when using act_ct.

While at it, now that we have the flag set, protect the call to
nf_ct_acct_update() by commit beb97d3a31 ("net/sched: act_ct: update
nf_conn_acct for act_ct SW offload in flowtable") with the check on
NF_FLOWTABLE_COUNTER, as also done on other places.

Note that this shouldn't impact performance as these stats are only
enabled when net.netfilter.nf_conntrack_acct is enabled.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/481a65741261fd81b0a0813e698af163477467ec.1606415787.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:43:40 -08:00
wenxu
c129412f74 net/sched: sch_frag: add generic packet fragment support.
Currently kernel tc subsystem can do conntrack in cat_ct. But when several
fragment packets go through the act_ct, function tcf_ct_handle_fragments
will defrag the packets to a big one. But the last action will redirect
mirred to a device which maybe lead the reassembly big packet over the mtu
of target device.

This patch add support for a xmit hook to mirred, that gets executed before
xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
The frag xmit hook maybe reused by other modules.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:36:02 -08:00