The bug is here:
ret = i40e_add_macvlan_filter(hw, ch->seid, vdev->dev_addr, &aq_err);
The list iterator 'ch' will point to a bogus position containing
HEAD if the list is empty or no element is found. This case must
be checked before any use of the iterator, otherwise it will
lead to a invalid memory access.
To fix this bug, use a new variable 'iter' as the list iterator,
while use the origin variable 'ch' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 1d8d80b4e4 ("i40e: Add macvlan support on i40e")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20220510204846.2166999-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.
The above causes some sporadic MPTCP self-test failures, due to
this code:
tc -n $ns2 filter add dev ns2eth$i egress \
protocol ip prio 1000 \
handle 42 fw \
action pedit munge offset 148 u8 invert \
pipe csum tcp \
index 100
The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.
This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.
Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.
Fixes: db2c24175d ("act_pedit: access skb->data safely")
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexandra Winter says:
====================
s390/net: Cleanup some code checker findings
clean up smatch findings in legacy code. I was not able to provoke
any real failures on my systems, but other hardware reactions,
timing conditions or compiler output, may cause failures.
There are still 2 smatch warnings left in s390/net:
drivers/s390/net/ctcm_main.c:1326 add_channel() warn: missing error code 'rc'
This one is a false positive.
drivers/s390/net/netiucv.c:1355 netiucv_check_user() warn: argument 3 to %02x specifier has type 'char'
Postponing this one, need to better understand string handling in iucv.
There are several sparse warnings left in ctcm, like:
drivers/s390/net/ctcm_fsms.c:573:9: warning: context imbalance in 'ctcm_chx_setmode' - different lock contexts for basic block
Those are mentioned in the source, no plan to rework.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
smatch complains about
drivers/s390/net/lcs.c:1741 lcs_get_control() warn: variable dereferenced before check 'card->dev' (see line 1739)
Fixes: 27eb5ac8f0 ("[PATCH] s390: lcs driver bug fixes and improvements [1/2]")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smatch complains about
drivers/s390/net/ctcm_mpc.c:1210 ctcmpc_unpack_skb() warn: possible memory leak of 'mpcginfo'
mpc_action_discontact() did not free mpcginfo. Consolidate the freeing in
ctcmpc_unpack_skb().
Fixes: 293d984f0e ("ctcm: infrastructure for replaced ctc driver")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Found by cppcheck and smatch.
smatch complains about
drivers/s390/net/ctcm_sysfs.c:43 ctcm_buffer_write() warn: variable dereferenced before check 'priv' (see line 42)
Fixes: 3c09e2647b ("ctcm: rename READ/WRITE defines to avoid redefinitions")
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grant Grundler says:
====================
net: atlantic: more fuzzing fixes
It essentially describes four problems:
1) validate rxd_wb->next_desc_ptr before populating buff->next
2) "frag[0] not initialized" case in aq_ring_rx_clean()
3) limit iterations handling fragments in aq_ring_rx_clean()
4) validate hw_head_ in hw_atl_b0_hw_ring_tx_head_update()
(1) was fixed by Zekun Shen <bruceshenzk@gmail.com> around the same time with
"atlantic: Fix buff_ring OOB in aq_ring_rx_clean" (SHA1 5f50153288).
I've added one "clean up" contribution:
"net: atlantic: reduce scope of is_rsc_complete"
I tested the "original" patches using chromeos-v5.4 kernel branch:
https://chromium-review.googlesource.com/q/hashtag:pcinet-atlantic-2022q1+(status:open%20OR%20status:merged)
I've forward ported those patches to 5.18-rc2 and compiled them but am
unable to test them on 5.18-rc2 kernel (logistics problems).
Credit largely goes to ChromeOS Fuzzing team members:
Aashay Shringarpure, Yi Chou, Shervin Oloumi
V2 changes:
o drop first patch - was already fixed upstream differently
o reduce (4) "validate hw_head_" to simple bounds checking.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bounds check hw_head index provided by NIC to verify it lies
within the TX buffer ring.
Reported-by: Aashay Shringarpure <aashay@google.com>
Reported-by: Yi Chou <yich@google.com>
Reported-by: Shervin Oloumi <enlightened@google.com>
Signed-off-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enforce that the CPU can not get stuck in an infinite loop.
Reported-by: Aashay Shringarpure <aashay@google.com>
Reported-by: Yi Chou <yich@google.com>
Reported-by: Shervin Oloumi <enlightened@google.com>
Signed-off-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't defer handling the err case outside the loop. That's pointless.
And since is_rsc_complete is only used inside this loop, declare
it inside the loop to reduce it's scope.
Signed-off-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In aq_ring_rx_clean(), if buff->is_eop is not set AND
buff->len < AQ_CFG_RX_HDR_SIZE, then hdr_len remains equal to
buff->len and skb_add_rx_frag(xxx, *0*, ...) is not called.
The loop following this code starts calling skb_add_rx_frag() starting
with i=1 and thus frag[0] is never initialized. Since i is initialized
to zero at the top of the primary loop, we can just reference and
post-increment i instead of hardcoding the 0 when calling
skb_add_rx_frag() the first time.
Reported-by: Aashay Shringarpure <aashay@google.com>
Reported-by: Yi Chou <yich@google.com>
Reported-by: Shervin Oloumi <enlightened@google.com>
Signed-off-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch to using pcim_enable_device() to avoid missing pci_disable_device().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20220510031316.1780409-1-yangyingliang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In lanphy_read_page_reg, calling __phy_read() might return a negative
error code. Use 'int' to check the error code.
Fixes: 7c2dcfa295 ("net: phy: micrel: Add support for LAN8804 PHY")
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220509144519.2343399-1-wanjiabing@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clang's structure layout randomization feature gets upset when it sees
struct neighbor (which is randomized) cast to struct dn_neigh:
net/decnet/dn_route.c:1123:15: error: casting from randomized structure pointer type 'struct neighbour *' to 'struct dn_neigh *'
gateway = ((struct dn_neigh *)neigh)->addr;
^
Update all the open-coded casts to use container_of() to do the conversion
instead of depending on strict member ordering.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202205041247.WKBEHGS5-lkp@intel.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Zheng Yongjun <zhengyongjun3@huawei.com>
Cc: Bill Wendling <morbo@google.com>
Cc: linux-decnet-user@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220508102217.2647184-1-keescook@chromium.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The impact of this regression is the same for resume that I saw on
thaw: the kernel hangs and nothing except SysRq rebooting can be done.
Fixes regression in commit cbe6c3a8f8 ("net: atlantic: invert deep
par in pm functions, preventing null derefs"), where I disabled deep
pm resets in suspend and resume, trying to make sense of the
atl_resume_common() deep parameter in the first place.
It turns out, that atlantic always has to deep reset on pm
operations. Even though I expected that and tested resume, I screwed
up by kexec-rebooting into an unpatched kernel, thus missing the
breakage.
This fixup obsoletes the deep parameter of atl_resume_common, but I
leave the cleanup for the maintainers to post to mainline.
Suspend and hibernation were successfully tested by the reporters.
Fixes: cbe6c3a8f8 ("net: atlantic: invert deep par in pm functions, preventing null derefs")
Link: https://lore.kernel.org/regressions/9-Ehc_xXSwdXcvZqKD5aSqsqeNj5Izco4MYEwnx5cySXVEc9-x_WC4C3kAoCqNTi-H38frroUK17iobNVnkLtW36V6VWGSQEOHXhmVMm5iQ=@protonmail.com/
Reported-by: Jordan Leppert <jordanleppert@protonmail.com>
Reported-by: Holger Hoffstaette <holger@applied-asynchrony.com>
Tested-by: Jordan Leppert <jordanleppert@protonmail.com>
Tested-by: Holger Hoffstaette <holger@applied-asynchrony.com>
CC: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Manuel Ullmann <labre@posteo.de>
Link: https://lore.kernel.org/r/87bkw8dfmp.fsf@posteo.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmJ3whYWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoYV9D/wKIqVyKZrMd+O7lougzQJt8ME5
J0mSlBCNoxP7d9WuMdnfjXH7O2hq323Qtw/x47dv0ln+H69XMdXrsiuQxJk2OELk
/2CkswpOVEwGaxZ93FM7vIHwmR0Rw/bd3rC+d83YzCzfKg5QDXNDoxQAMBX45hVv
7DHJRql6sEAAcoGQleM6LmtHN66sxXEcwPnhDvEvrSDDPW6YD7tRvkPPWAPledO7
EdZmr8WWWl4jgL8sPaIGIKgz+XGECMbLO1n7PtUtQ1IH9Xib8FoN9g6JzUUFXA0I
v23zWqTb75xjZvz8gJQUoVpLxEsTM33ai0S6Q26YAUkqKP/70yzjYPGHu+jpPj8I
SLmwzY8NHrfCyAkMxeIChDYQTK8nMFHWDA+lq1BsduBeb/oBNO9WFfuQOopUgSmo
5tYNlBb6Jar7UC17C/bGgV4y8UaTLSygwj/566Ty1b40bqh/+oycjt5rEXFqZN/3
A1doCLKC1pVl9anOyEo70w71VlhH1IKXBIqUF+UDFGiF/zvImfkAvPHSZXFEB13l
rF9+O+J824+yTOoo3BWFjX+A+xrS8rATGt5I6SqrhfCtuRTUCiNj/uAhmDC3Cziu
nR775J2r81y6iyCfY8GOOyXpNjLQR/2ApOa6mmliSCCym3aHBl2A/Dpc3hopcqHN
UX/7ovSwCzZcYhlvew==
=S0xN
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
* tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge:
batman-adv: Don't skb_split skbuffs with frag_list
====================
Link: https://lore.kernel.org/r/20220508132110.20451-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a race between switchdev_bridge_port_offload() and the
dsa_port_switchdev_sync_attrs() call right below it.
When switchdev_bridge_port_offload() finishes, FDB entries have been
replayed by the bridge, but are scheduled for deferred execution later.
However dsa_port_switchdev_sync_attrs -> dsa_port_can_apply_vlan_filtering()
may impose restrictions on the vlan_filtering attribute and refuse
offloading.
When this happens, the delayed FDB entries will dereference dp->bridge,
which is a NULL pointer because we have stopped the process of
offloading this bridge.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
pc : dsa_port_bridge_host_fdb_del+0x64/0x100
lr : dsa_slave_switchdev_event_work+0x130/0x1bc
Call trace:
dsa_port_bridge_host_fdb_del+0x64/0x100
dsa_slave_switchdev_event_work+0x130/0x1bc
process_one_work+0x294/0x670
worker_thread+0x80/0x460
---[ end trace 0000000000000000 ]---
Error: dsa_core: Must first remove VLAN uppers having VIDs also present in bridge.
Fix the bug by doing what we do on the normal bridge leave path as well,
which is to wait until the deferred FDB entries complete executing, then
exit.
The placement of dsa_flush_workqueue() after switchdev_bridge_port_unoffload()
guarantees that both the FDB additions and deletions on rollback are waited for.
Fixes: d7d0d423db ("net: dsa: flush switchdev workqueue when leaving the bridge")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220507134550.1849834-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-05-06
This series contains updates to ice driver only.
Ivan Vecera fixes a race with aux plug/unplug by delaying setting adev
until initialization is complete and adding locking.
Anatolii ensures VF queues are completely disabled before attempting to
reconfigure them.
Michal ensures stale Tx timestamps are cleared from hardware.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: fix PTP stale Tx timestamps cleanup
ice: clear stale Tx queue settings before configuring
ice: Fix race during aux device (un)plugging
====================
Link: https://lore.kernel.org/r/20220506174129.4976-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This fixes the following error caused by a race condition between
phydev->adjust_link() and a MDIO transaction in the phy interrupt
handler. The issue was reproduced with the ethernet FEC driver and a
micrel KSZ9031 phy.
[ 146.195696] fec 2188000.ethernet eth0: MDIO read timeout
[ 146.201779] ------------[ cut here ]------------
[ 146.206671] WARNING: CPU: 0 PID: 571 at drivers/net/phy/phy.c:942 phy_error+0x24/0x6c
[ 146.214744] Modules linked in: bnep imx_vdoa imx_sdma evbug
[ 146.220640] CPU: 0 PID: 571 Comm: irq/128-2188000 Not tainted 5.18.0-rc3-00080-gd569e86915b7 #9
[ 146.229563] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 146.236257] unwind_backtrace from show_stack+0x10/0x14
[ 146.241640] show_stack from dump_stack_lvl+0x58/0x70
[ 146.246841] dump_stack_lvl from __warn+0xb4/0x24c
[ 146.251772] __warn from warn_slowpath_fmt+0x5c/0xd4
[ 146.256873] warn_slowpath_fmt from phy_error+0x24/0x6c
[ 146.262249] phy_error from kszphy_handle_interrupt+0x40/0x48
[ 146.268159] kszphy_handle_interrupt from irq_thread_fn+0x1c/0x78
[ 146.274417] irq_thread_fn from irq_thread+0xf0/0x1dc
[ 146.279605] irq_thread from kthread+0xe4/0x104
[ 146.284267] kthread from ret_from_fork+0x14/0x28
[ 146.289164] Exception stack(0xe6fa1fb0 to 0xe6fa1ff8)
[ 146.294448] 1fa0: 00000000 00000000 00000000 00000000
[ 146.302842] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 146.311281] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 146.318262] irq event stamp: 12325
[ 146.321780] hardirqs last enabled at (12333): [<c01984c4>] __up_console_sem+0x50/0x60
[ 146.330013] hardirqs last disabled at (12342): [<c01984b0>] __up_console_sem+0x3c/0x60
[ 146.338259] softirqs last enabled at (12324): [<c01017f0>] __do_softirq+0x2c0/0x624
[ 146.346311] softirqs last disabled at (12319): [<c01300ac>] __irq_exit_rcu+0x138/0x178
[ 146.354447] ---[ end trace 0000000000000000 ]---
With the FEC driver phydev->adjust_link() calls fec_enet_adjust_link()
calls fec_stop()/fec_restart() and both these function reset and
temporary disable the FEC disrupting any MII transaction that
could be happening at the same time.
fec_enet_adjust_link() and phy_read() can be running at the same time
when we have one additional interrupt before the phy_state_machine() is
able to terminate.
Thread 1 (phylib WQ) | Thread 2 (phy interrupt)
|
| phy_interrupt() <-- PHY IRQ
| handle_interrupt()
| phy_read()
| phy_trigger_machine()
| --> schedule phylib WQ
|
|
phy_state_machine() |
phy_check_link_status() |
phy_link_change() |
phydev->adjust_link() |
fec_enet_adjust_link() |
--> FEC reset | phy_interrupt() <-- PHY IRQ
| phy_read()
|
Fix this by acquiring the phydev lock in phy_interrupt().
Link: https://lore.kernel.org/all/20220422152612.GA510015@francesco-nb.int.toradex.com/
Fixes: c974bdbc3e ("net: phy: Use threaded IRQ, to allow IRQ from sleeping devices")
cc: <stable@vger.kernel.org>
Signed-off-by: Francesco Dolcini <francesco.dolcini@toradex.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220506060815.327382-1-francesco.dolcini@toradex.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The W=2 build pointed out that the code wasn't initializing all the
variables in the dim_cq_moder declarations with the struct initializers.
The net change here is zero since these structs were already static
const globals and were initialized with zeros by the compiler, but
removing compiler warnings has value in and of itself.
lib/dim/net_dim.c: At top level:
lib/dim/net_dim.c:54:9: warning: missing initializer for field ‘comps’ of ‘const struct dim_cq_moder’ [-Wmissing-field-initializers]
54 | NET_DIM_RX_EQE_PROFILES,
| ^~~~~~~~~~~~~~~~~~~~~~~
In file included from lib/dim/net_dim.c:6:
./include/linux/dim.h:45:13: note: ‘comps’ declared here
45 | u16 comps;
| ^~~~~
and repeats for the tx struct, and once you fix the comps entry then
the cq_period_mode field needs the same treatment.
Use the commonly accepted style to indicate to the compiler that we
know what we're doing, and add a comma at the end of each struct
initializer to clean up the issue, and use explicit initializers
for the fields we are initializing which makes the compiler happy.
While here and fixing these lines, clean up the code slightly with
a fix for the super long lines by removing the word "_MODERATION" from a
couple defines only used in this file.
Fixes: f8be17b81d ("lib/dim: Fix -Wunused-const-variable warnings")
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20220507011038.14568-1-jesse.brandeburg@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial code used roundup() to round the starting time to
a multiple of a period. This generated an error on 32-bit
systems, so was replaced with DIV_ROUND_UP_ULL().
However, this truncates to 32-bits on a 64-bit system. Replace
with DIV64_U64_ROUND_UP() instead.
Fixes: b325af3cfa ("ptp: ocp: Add signal generators and update sysfs nodes")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/r/20220506223739.1930-2-jonathan.lemon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the missing pci_disable_device() before return
from tulip_init_one() in the error handling case.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20220506094250.3630615-1-yangyingliang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When NET_F_F_GRO_FRAGLIST is enabled and bpf_skb_change_proto is used,
check if udp packets and tcp packets are successfully delivered to user
space. If wrong udp packets are delivered, udpgso_bench_rx will exit
with "Initial byte out of range"
Signed-off-by: Maciej enczykowski <maze@google.com>
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When clatd starts with ebpf offloaing, and NETIF_F_GRO_FRAGLIST is enable,
several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's
ipv6 header will be changed to ipv4 after bpf_skb_proto_6_to_4,
network_header\transport_header\mac_header have been updated as ipv4 acts,
but other skbs in frag_list didnot update anything, just ipv6 packets.
udp_queue_rcv_skb will call skb_segment_list to traverse other skbs in
frag_list and make sure right udp payload is delivered to user space.
Unfortunately, other skbs in frag_list who are still ipv6 packets are
updated like the first skb and will have wrong transport header length.
e.g.before bpf_skb_proto_6_to_4,the first skb and other skbs in frag_list
has the same network_header(24)& transport_header(64), after
bpf_skb_proto_6_to_4, ipv6 protocol has been changed to ipv4, the first
skb's network_header is 44,transport_header is 64, other skbs in frag_list
didnot change.After skb_segment_list, the other skbs in frag_list has
different network_header(24) and transport_header(44), so there will be 20
bytes different from original,that is difference between ipv6 header and
ipv4 header. Just change transport_header to be the same with original.
Actually, there are two solutions to fix it, one is traversing all skbs
and changing every skb header in bpf_skb_proto_6_to_4, the other is
modifying frag_list skb's header in skb_segment_list. Considering
efficiency, adopt the second one--- when the first skb and other skbs in
frag_list has different network_header length, restore them to make sure
right udp payload is delivered to user space.
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It fixes memory leak in ring buffer change logic.
When ring buffer size is changed(ethtool -G eth0 rx 4096), sfc driver
works like below.
1. stop all channels and remove ring buffers.
2. allocates new buffer array.
3. allocates rx buffers.
4. start channels.
While the above steps are working, it skips some steps if the channel
doesn't have a ->copy callback function.
Due to ptp channel doesn't have ->copy callback, these above steps are
skipped for ptp channel.
It eventually makes some problems.
a. ptp channel's ring buffer size is not changed, it works only
1024(default).
b. memory leak.
The reason for memory leak is to use the wrong ring buffer values.
There are some values, which is related to ring buffer size.
a. efx->rxq_entries
- This is global value of rx queue size.
b. rx_queue->ptr_mask
- used for access ring buffer as circular ring.
- roundup_pow_of_two(efx->rxq_entries) - 1
c. rx_queue->max_fill
- efx->rxq_entries - EFX_RXD_HEAD_ROOM
These all values should be based on ring buffer size consistently.
But ptp channel's values are not.
a. efx->rxq_entries
- This is global(for sfc) value, always new ring buffer size.
b. rx_queue->ptr_mask
- This is always 1023(default).
c. rx_queue->max_fill
- This is new ring buffer size - EFX_RXD_HEAD_ROOM.
Let's assume we set 4096 for rx ring buffer,
normal channel ptp channel
efx->rxq_entries 4096 4096
rx_queue->ptr_mask 4095 1023
rx_queue->max_fill 4086 4086
sfc driver allocates rx ring buffers based on these values.
When it allocates ptp channel's ring buffer, 4086 ring buffers are
allocated then, these buffers are attached to the allocated array.
But ptp channel's ring buffer array size is still 1024(default)
and ptr_mask is still 1023 too.
So, 3062 ring buffers will be overwritten to the array.
This is the reason for memory leak.
Test commands:
ethtool -G <interface name> rx 4096
while :
do
ip link set <interface name> up
ip link set <interface name> down
done
In order to avoid this problem, it adds ->copy callback to ptp channel
type.
So that rx_queue->ptr_mask value will be updated correctly.
Fixes: 7c236c43b8 ("sfc: Add support for IEEE-1588 PTP")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using min_t(int, ...) as a potential array index implies to the compiler
that negative offsets should be allowed. This is not the case, though.
Replace "int" with "unsigned int". Fixes the following warning exposed
under future CONFIG_FORTIFY_SOURCE improvements:
In file included from include/linux/string.h:253,
from include/linux/bitmap.h:11,
from include/linux/cpumask.h:12,
from include/linux/smp.h:13,
from include/linux/lockdep.h:14,
from include/linux/rcupdate.h:29,
from include/linux/rculist.h:11,
from include/linux/pid.h:5,
from include/linux/sched.h:14,
from include/linux/delay.h:23,
from drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:35:
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c: In function 't4_get_raw_vpd_params':
include/linux/fortify-string.h:46:33: warning: '__builtin_memcpy' pointer overflow between offset 29 and size [2147483648, 4294967295] [-Warray-bounds]
46 | #define __underlying_memcpy __builtin_memcpy
| ^
include/linux/fortify-string.h:388:9: note: in expansion of macro '__underlying_memcpy'
388 | __underlying_##op(p, q, __fortify_size); \
| ^~~~~~~~~~~~~
include/linux/fortify-string.h:433:26: note: in expansion of macro '__fortify_memcpy_chk'
433 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \
| ^~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2796:9: note: in expansion of macro 'memcpy'
2796 | memcpy(p->id, vpd + id, min_t(int, id_len, ID_LEN));
| ^~~~~~
include/linux/fortify-string.h:46:33: warning: '__builtin_memcpy' pointer overflow between offset 0 and size [2147483648, 4294967295] [-Warray-bounds]
46 | #define __underlying_memcpy __builtin_memcpy
| ^
include/linux/fortify-string.h:388:9: note: in expansion of macro '__underlying_memcpy'
388 | __underlying_##op(p, q, __fortify_size); \
| ^~~~~~~~~~~~~
include/linux/fortify-string.h:433:26: note: in expansion of macro '__fortify_memcpy_chk'
433 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \
| ^~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2798:9: note: in expansion of macro 'memcpy'
2798 | memcpy(p->sn, vpd + sn, min_t(int, sn_len, SERNUM_LEN));
| ^~~~~~
Additionally remove needless cast from u8[] to char * in last strim()
call.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202205031926.FVP7epJM-lkp@intel.com
Fixes: fc9279298e ("cxgb4: Search VPD with pci_vpd_find_ro_info_keyword()")
Fixes: 24c521f81c ("cxgb4: Use pci_vpd_find_id_string() to find VPD ID string")
Cc: Raju Rangoju <rajur@chelsio.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220505233101.1224230-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_recvmsg() does not need to change transport header.
If transport header was needed, it should have been reset
by the producer (netlink_dump()), not the consumer(s).
The following trace probably happened when multiple threads
were using MSG_PEEK.
BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg
write to 0xffff88811e9f15b2 of 2 bytes by task 32012 on cpu 1:
skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
sock_recvmsg_nosec net/socket.c:948 [inline]
sock_recvmsg net/socket.c:966 [inline]
__sys_recvfrom+0x204/0x2c0 net/socket.c:2097
__do_sys_recvfrom net/socket.c:2115 [inline]
__se_sys_recvfrom net/socket.c:2111 [inline]
__x64_sys_recvfrom+0x74/0x90 net/socket.c:2111
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff88811e9f15b2 of 2 bytes by task 32005 on cpu 0:
skb_reset_transport_header include/linux/skbuff.h:2760 [inline]
netlink_recvmsg+0x1de/0x790 net/netlink/af_netlink.c:1978
____sys_recvmsg+0x162/0x2f0
___sys_recvmsg net/socket.c:2674 [inline]
__sys_recvmsg+0x209/0x3f0 net/socket.c:2704
__do_sys_recvmsg net/socket.c:2714 [inline]
__se_sys_recvmsg net/socket.c:2711 [inline]
__x64_sys_recvmsg+0x42/0x50 net/socket.c:2711
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xffff -> 0x0000
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 32005 Comm: syz-executor.4 Not tainted 5.18.0-rc1-syzkaller-00328-ge1f700ebd6be-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220505161946.2867638-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Read stale PTP Tx timestamps from PHY on cleanup.
After running out of Tx timestamps request handlers, hardware (HW) stops
reporting finished requests. Function ice_ptp_tx_tstamp_cleanup() used
to only clean up stale handlers in driver and was leaving the hardware
registers not read. Not reading stale PTP Tx timestamps prevents next
interrupts from arriving and makes timestamping unusable.
Fixes: ea9b847cda ("ice: enable transmit timestamps for E810 devices")
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The iAVF driver uses 3 virtchnl op codes to communicate with the PF
regarding the VF Tx queues:
* VIRTCHNL_OP_CONFIG_VSI_QUEUES configures the hardware and firmware
logic for the Tx queues
* VIRTCHNL_OP_ENABLE_QUEUES configures the queue interrupts
* VIRTCHNL_OP_DISABLE_QUEUES disables the queue interrupts and Tx rings.
There is a bug in the iAVF driver due to the race condition between VF
reset request and shutdown being executed in parallel. This leads to a
break in logic and VIRTCHNL_OP_DISABLE_QUEUES is not being sent.
If this occurs, the PF driver never cleans up the Tx queues. This results
in leaving behind stale Tx queue settings in the hardware and firmware.
The most obvious outcome is that upon the next
VIRTCHNL_OP_CONFIG_VSI_QUEUES, the PF will fail to program the Tx
scheduler node due to a lack of space.
We need to protect ICE driver against such situation.
To fix this, make sure we clear existing stale settings out when
handling VIRTCHNL_OP_CONFIG_VSI_QUEUES. This ensures we remove the
previous settings.
Calling ice_vf_vsi_dis_single_txq should be safe as it will do nothing if
the queue is not configured. The function already handles the case when the
Tx queue is not currently configured and exits with a 0 return in that
case.
Fixes: 7ad15440ac ("ice: Refactor VIRTCHNL_OP_CONFIG_VSI_QUEUES handling")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Function ice_plug_aux_dev() assigns pf->adev field too early prior
aux device initialization and on other side ice_unplug_aux_dev()
starts aux device deinit and at the end assigns NULL to pf->adev.
This is wrong because pf->adev should always be non-NULL only when
aux device is fully initialized and ready. This wrong order causes
a crash when ice_send_event_to_aux() call occurs because that function
depends on non-NULL value of pf->adev and does not assume that
aux device is half-initialized or half-destroyed.
After order correction the race window is tiny but it is still there,
as Leon mentioned and manipulation with pf->adev needs to be protected
by mutex.
Fix (un-)plugging functions so pf->adev field is set after aux device
init and prior aux device destroy and protect pf->adev assignment by
new mutex. This mutex is also held during ice_send_event_to_aux()
call to ensure that aux device is valid during that call.
Note that device lock used ice_send_event_to_aux() needs to be kept
to avoid race with aux drv unload.
Reproducer:
cycle=1
while :;do
echo "#### Cycle: $cycle"
ip link set ens7f0 mtu 9000
ip link add bond0 type bond mode 1 miimon 100
ip link set bond0 up
ifenslave bond0 ens7f0
ip link set bond0 mtu 9000
ethtool -L ens7f0 combined 1
ip link del bond0
ip link set ens7f0 mtu 1500
sleep 1
let cycle++
done
In short when the device is added/removed to/from bond the aux device
is unplugged/plugged. When MTU of the device is changed an event is
sent to aux device asynchronously. This can race with (un)plugging
operation and because pf->adev is set too early (plug) or too late
(unplug) the function ice_send_event_to_aux() can touch uninitialized
or destroyed fields. In the case of crash below pf->adev->dev.mutex.
Crash:
[ 53.372066] bond0: (slave ens7f0): making interface the new active one
[ 53.378622] bond0: (slave ens7f0): Enslaving as an active interface with an u
p link
[ 53.386294] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 53.549104] bond0: (slave ens7f1): Enslaving as a backup interface with an up
link
[ 54.118906] ice 0000:ca:00.0 ens7f0: Number of in use tx queues changed inval
idating tc mappings. Priority traffic classification disabled!
[ 54.233374] ice 0000:ca:00.1 ens7f1: Number of in use tx queues changed inval
idating tc mappings. Priority traffic classification disabled!
[ 54.248204] bond0: (slave ens7f0): Releasing backup interface
[ 54.253955] bond0: (slave ens7f1): making interface the new active one
[ 54.274875] bond0: (slave ens7f1): Releasing backup interface
[ 54.289153] bond0 (unregistering): Released all slaves
[ 55.383179] MII link monitoring set to 100 ms
[ 55.398696] bond0: (slave ens7f0): making interface the new active one
[ 55.405241] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 55.405289] bond0: (slave ens7f0): Enslaving as an active interface with an u
p link
[ 55.412198] #PF: supervisor write access in kernel mode
[ 55.412200] #PF: error_code(0x0002) - not-present page
[ 55.412201] PGD 25d2ad067 P4D 0
[ 55.412204] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 55.412207] CPU: 0 PID: 403 Comm: kworker/0:2 Kdump: loaded Tainted: G S
5.17.0-13579-g57f2d6540f03 #1
[ 55.429094] bond0: (slave ens7f1): Enslaving as a backup interface with an up
link
[ 55.430224] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/
2021
[ 55.430226] Workqueue: ice ice_service_task [ice]
[ 55.468169] RIP: 0010:mutex_unlock+0x10/0x20
[ 55.472439] Code: 0f b1 13 74 96 eb e0 4c 89 ee eb d8 e8 79 54 ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 40 ef 01 00 31 d2 <f0> 48 0f b1 17 75 01 c3 e9 e3 fe ff ff 0f 1f 00 0f 1f 44 00 00 48
[ 55.491186] RSP: 0018:ff4454230d7d7e28 EFLAGS: 00010246
[ 55.496413] RAX: ff1a79b208b08000 RBX: ff1a79b2182e8880 RCX: 0000000000000001
[ 55.503545] RDX: 0000000000000000 RSI: ff4454230d7d7db0 RDI: 0000000000000080
[ 55.510678] RBP: ff1a79d1c7e48b68 R08: ff4454230d7d7db0 R09: 0000000000000041
[ 55.517812] R10: 00000000000000a5 R11: 00000000000006e6 R12: ff1a79d1c7e48bc0
[ 55.524945] R13: 0000000000000000 R14: ff1a79d0ffc305c0 R15: 0000000000000000
[ 55.532076] FS: 0000000000000000(0000) GS:ff1a79d0ffc00000(0000) knlGS:0000000000000000
[ 55.540163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.545908] CR2: 0000000000000080 CR3: 00000003487ae003 CR4: 0000000000771ef0
[ 55.553041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 55.560173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 55.567305] PKRU: 55555554
[ 55.570018] Call Trace:
[ 55.572474] <TASK>
[ 55.574579] ice_service_task+0xaab/0xef0 [ice]
[ 55.579130] process_one_work+0x1c5/0x390
[ 55.583141] ? process_one_work+0x390/0x390
[ 55.587326] worker_thread+0x30/0x360
[ 55.590994] ? process_one_work+0x390/0x390
[ 55.595180] kthread+0xe6/0x110
[ 55.598325] ? kthread_complete_and_exit+0x20/0x20
[ 55.603116] ret_from_fork+0x1f/0x30
[ 55.606698] </TASK>
Fixes: f9f5301e7e ("ice: Register auxiliary device to provide RDMA")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Vladimir Oltean says:
====================
Ocelot VCAP fixes
Changes in v2:
fix the NPDs and UAFs caused by filter->trap_list in a more robust way
that actually does not introduce bugs of its own (1/5)
This series fixes issues found while running
tools/testing/selftests/net/forwarding/tc_actions.sh on the ocelot
switch:
- NULL pointer dereference when failing to offload a filter
- NULL pointer dereference after deleting a trap
- filters still having effect after being deleted
- dropped packets still being seen by software
- statistics counters showing double the amount of hits
- statistics counters showing inexistent hits
- invalid configurations not rejected
====================
Link: https://lore.kernel.org/r/20220504235503.4161890-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Given the following order of operations:
(1) we add filter A using tc-flower
(2) we send a packet that matches it
(3) we read the filter's statistics to find a hit count of 1
(4) we add a second filter B with a higher preference than A, and A
moves one position to the right to make room in the TCAM for it
(5) we send another packet, and this matches the second filter B
(6) we read the filter statistics again.
When this happens, the hit count of filter A is 2 and of filter B is 1,
despite a single packet having matched each filter.
Furthermore, in an alternate history, reading the filter stats a second
time between steps (3) and (4) makes the hit count of filter A remain at
1 after step (6), as expected.
The reason why this happens has to do with the filter->stats.pkts field,
which is written to hardware through the call path below:
vcap_entry_set
/ | \
/ | \
/ | \
/ | \
es0_entry_set is1_entry_set is2_entry_set
\ | /
\ | /
\ | /
vcap_data_set(data.counter, ...)
The primary role of filter->stats.pkts is to transport the filter hit
counters from the last readout all the way from vcap_entry_get() ->
ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
The reason why vcap_entry_set() writes it to hardware is so that the
counters (saturating and having a limited bit width) are cleared
after each user space readout.
The writing of filter->stats.pkts to hardware during the TCAM entry
movement procedure is an unintentional consequence of the code design,
because the hit count isn't up to date at this point.
So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
make room for filter B, the hardware hit count is 0 (no packet matched
on it in the meantime), but filter->stats.pkts is 1, because the last
readout saw the earlier packet. The movement procedure programs the old
hit count back to hardware, so this creates the impression to user space
that more packets have been matched than they really were.
The bug can be seen when running the gact_drop_and_ok_test() from the
tc_actions.sh selftest.
Fix the issue by reading back the hit count to tmp->stats.pkts before
migrating the VCAP filter. Sure, this is a best-effort technique, since
the packets that hit the rule between vcap_entry_get() and
vcap_entry_set() won't be counted, but at least it allows the counters
to be reliably used for selftests where the traffic is under control.
The vcap_entry_get() name is a bit unintuitive, but it only reads back
the counter portion of the TCAM entry, not the entire entry.
The index from which we retrieve the counter is also a bit unintuitive
(i - 1 during add, i + 1 during del), but this is the way in which TCAM
entry movement works. The "entry index" isn't a stored integer for a
TCAM filter, instead it is dynamically computed by
ocelot_vcap_block_get_filter_index() based on the entry's position in
the &block->rules list. That position (as well as block->count) is
automatically updated by ocelot_vcap_filter_add_to_block() on add, and
by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
index, and "i - 1" or "i + 1" respectively are the old addresses of that
TCAM entry (we only support installing/deleting one filter at a time).
Fixes: b596229448 ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once the CPU port was added to the destination port mask of a packet, it
can never be cleared, so even packets marked as dropped by the MASK_MODE
of a VCAP IS2 filter will still reach it. This is why we need the
OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
stop seeing them.
We disallow policer rules from being put on any other chain than the one
for the first lookup, but we don't do this for "drop" rules, although we
should. This change is merely ascertaining that the rules dont't
(completely) work and letting the user know.
The blamed commit is the one that introduced the multi-chain architecture
in ocelot. Prior to that, we should have always offloaded the filters to
VCAP IS2 lookup 0, where they did work.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
configured to only match during the first, second lookup, or both, or
none.
The blamed commit wrote the code for making VCAP IS2 filters match only
on the given lookup. But right below that code, there was another line
that explicitly made the lookup a "don't care", and this is overwriting
the lookup we've selected. So the code had no effect.
Some of the more noticeable effects of having filters match on both
lookups:
- in "tc -s filter show dev swp0 ingress", we see each packet matching a
VCAP IS2 filter counted twice. This throws off scripts such as
tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
fail.
- a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
because once the CPU port becomes a member of the destination port
mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
the action vector can only appear in the first lookup. What happens
when a filter matches both lookups is that the action vector is
combined, and this makes the POLICE_ENA bit ineffective, since the
last lookup in which it has appeared is the second one. In other
words, "tc-drop" actions do not drop packets for the CPU port, dropped
packets are still seen by software unless there was an FDB entry that
directed those packets to some other place different from the CPU.
The last bit used to work, because in the initial commit b596229448
("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
field of the VCAP IS2 half key with a 1, not with a "don't care".
The change to "don't care" was made inadvertently by me in commit
c1c3993edb ("net: mscc: ocelot: generalize existing code for VCAP"),
which I just realized, and which needs a separate fix from this one,
for "stable" kernels that lack the commit blamed below.
Fixes: 226e9cd82a ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ocelot_vcap_filter_del() works by moving the next filters over the
current one, and then deleting the last filter by calling vcap_entry_set()
with a del_filter which was specially created by memsetting its memory
to zeroes. vcap_entry_set() then programs this to the TCAM and action
RAM via the cache registers.
The problem is that vcap_entry_set() is a dispatch function which looks
at del_filter->block_id. But since del_filter is zeroized memory, the
block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
or IS2.
The code was not always like this. vcap_entry_set() used to simply be
is2_entry_set(), and then, the logic used to work.
Restore the functionality by populating the block_id of the del_filter
based on the VCAP block of the filter that we're deleting. This makes
vcap_entry_set() know what to do.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the blamed commit, VCAP filters can appear on more than one list.
If their action is "trap", they are chained on ocelot->traps via
filter->trap_list. This is in addition to their normal placement on the
VCAP block->rules list head.
Therefore, when we free a VCAP filter, we must remove it from all lists
it is a member of, including ocelot->traps.
There are at least 2 bugs which are direct consequences of this design
decision.
First is the incorrect usage of list_empty(), meant to denote whether
"filter" is chained into ocelot->traps via filter->trap_list.
This does not do the correct thing, because list_empty() checks whether
"head->next == head", but in our case, head->next == head->prev == NULL.
So we dereference NULL pointers and die when we call list_del().
Second is the fact that not all places that should remove the filter
from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
which is where we have the main kfree(filter). By keeping freed filters
in ocelot->traps we end up in a use-after-free in
felix_update_trapping_destinations().
Attempting to fix all the buggy patterns is a whack-a-mole game which
makes the driver unmaintainable. Actually this is what the previous
patch version attempted to do:
https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
but it introduced another set of bugs, because there are other places in
which create VCAP filters, not just ocelot_vcap_filter_create():
- ocelot_trap_add()
- felix_tag_8021q_vlan_add_rx()
- felix_tag_8021q_vlan_add_tx()
Relying on the convention that all those code paths must call
INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
So let's do what should have been done in the first place and keep a
bool in struct ocelot_vcap_filter which denotes whether we are looking
at a trapping rule or not. Iterating now happens over the main VCAP IS2
block->rules. The advantage is that we no longer risk having stale
references to a freed filter, since it is only present in that list.
Fixes: e42bd4ed09 ("net: mscc: ocelot: keep traps in a list")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The find_next_netdev_feature() macro gets the "remaining length",
not bit index.
Passing "bit - 1" for the following iteration is wrong as it skips
the adjacent bit. Pass "bit" instead.
Fixes: 3b89ea9c59 ("net: Fix for_each_netdev_feature on Big endian")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nicolas Dichtel says:
====================
vrf: fix address binding with icmp socket
The first patch fixes the issue.
The second patch adds related tests in selftests.
====================
Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'ping' utility is able to manage two kind of sockets (raw or icmp),
depending on the sysctl ping_group_range. By default, ping_group_range is
set to '1 0', which forces ping to use an ip raw socket.
Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
After the previous patch, ipv4 tests results are the same with both kinds
of socket. For ipv6, there are a lot a new failures (the previous patch
fixes only two cases).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.
Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0 2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address
CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the kszphy_suspend/
resume hooks are used.
These functions require the probe function to be called so that
priv can be allocated.
Otherwise, a NULL pointer dereference happens inside
kszphy_config_reset().
Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-2-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the following
NULL pointer dereference is observed on a board with KSZ8061:
# udhcpc -i eth0
udhcpc: started, v1.35.0
8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = f73cef4e
[00000008] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty #94
Hardware name: Freescale i.MX6 SoloX (Device Tree)
PC is at kszphy_config_reset+0x10/0x114
LR is at kszphy_resume+0x24/0x64
...
The KSZ8061 phy_driver structure does not have the .probe/..driver_data
fields, which means that priv is not allocated.
This causes the NULL pointer dereference inside kszphy_config_reset().
Fix the problem by using the generic suspend/resume functions as before.
Another alternative would be to provide the .probe and .driver_data
information into the structure, but to be on the safe side, let's
just restore Ethernet functionality by using the generic suspend/resume.
Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
delayed works queued in rds_wq might be invoked after a net namespace's
refcount already reached 0.
Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
it is guaranteed that we can instead use maybe_get_net() from delayed work
functions until rds_tcp_exit_net() returns.
Note that I'm not convinced that all works which might access a net
namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
will fail to wait for work functions, and kmem_cache_free() could be
called from net_free() before maybe_get_net() is called from
rds_tcp_tune().
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 3a58f13a88 ("net: rds: acquire refcount on TCP sockets")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard
Previous releases - regressions:
- igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
- mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
- rds: acquire netns refcount on TCP sockets
- rxrpc: enable IPv6 checksums on transport socket
- nic: hinic: fix bug of wq out of bound access
- nic: thunder: don't use pci_irq_vector() in atomic context
- nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS flag
- nic: mlx5e:
- lag, fix use-after-free in fib event handler
- fix deadlock in sync reset flow
Previous releases - always broken:
- tcp: fix insufficient TCP source port randomness
- can: grcan: grcan_close(): fix deadlock
- nfc: reorder destructive operations in to avoid bugs
Misc:
- wireguard: improve selftests reliability
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmJznX8SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkDrgP/R9tErvWO/uvXpNgDr6Qh8osYt5Z297l
EWyhz7cUm4LKi6MYWrRKR4uRK9n43DK+OVws5LXrYL0tIdJH3uYBE0RS67W9WmjA
kE2Srq1A6wUi4koiYKeYDXtodCJLC93n+QnLBfih44Pc+xmk8t+G6qZ1n45qjRss
gzV75AlIfErmjqyYi81DaZ6Z0TV4H5qPM4ZXRViIzH+Ccyx6rk/KNqU4wepoqRSi
lCckTvMt9V7OiYHzM5Pu1kTUV07Jtiy7xkIQMdKYXCZpyqkmqyPFMM+0B7fDOEeP
WZnkdUwi69WMVmeefcpEn7XsoNbVadGkTQM2EcUWvrxuCeawmGxYoORvvFs0IpAX
YkYXk1US0Sd1L2XlMaus+HLsmmx4fWnb/hWqGL/D+arZOvTCOhBQItSRmKA6d+kM
OLfj/gh0YLBsHVrCiHUN06oopvhWuBEBAJbVFkbJCvXoFGqHigijBCVjFBVH1p4o
L5bWVEAQ8tkFdofXw0nOe6vRCD5BGN34N5DkqC5E8mj/uLP0FVEWOISV3TzKKF5B
mEDGZAGN5bTf/ScvbF8XEaqtdk/cxv2ohWNn9wtgoaNBorgKtpTf99pXJtxV2+fs
3RiPM0My9uz8/wMveSfKShQntMSdnmQPMpJ4Vm0e4bOS1K0LRGUgZxOpX2/BTokq
Iv5msx85X5/S
=XuN7
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, rxrpc and wireguard.
Previous releases - regressions:
- igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
- mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
- rds: acquire netns refcount on TCP sockets
- rxrpc: enable IPv6 checksums on transport socket
- nic: hinic: fix bug of wq out of bound access
- nic: thunder: don't use pci_irq_vector() in atomic context
- nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
flag
- nic: mlx5e:
- lag, fix use-after-free in fib event handler
- fix deadlock in sync reset flow
Previous releases - always broken:
- tcp: fix insufficient TCP source port randomness
- can: grcan: grcan_close(): fix deadlock
- nfc: reorder destructive operations in to avoid bugs
Misc:
- wireguard: improve selftests reliability"
* tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
NFC: netlink: fix sleep in atomic bug when firmware download timeout
selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
tcp: drop the hash_32() part from the index calculation
tcp: increase source port perturb table to 2^16
tcp: dynamically allocate the perturb table used by source ports
tcp: add small random increments to the source port
tcp: resalt the secret every 10 seconds
tcp: use different parts of the port_offset for index and offset
secure_seq: use the 64 bits of the siphash for port offset calculation
wireguard: selftests: set panic_on_warn=1 from cmdline
wireguard: selftests: bump package deps
wireguard: selftests: restore support for ccache
wireguard: selftests: use newer toolchains to fill out architectures
wireguard: selftests: limit parallelism to $(nproc) tests at once
wireguard: selftests: make routing loop test non-fatal
net/mlx5: Fix matching on inner TTC
net/mlx5: Avoid double clear or set of sync reset requested
net/mlx5: Fix deadlock in sync reset flow
net/mlx5e: Fix trust state reset in reload
net/mlx5e: Avoid checking offload capability in post_parse action
...
There are sleep in atomic bug that could cause kernel panic during
firmware download process. The root cause is that nlmsg_new with
GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer
handler. The call trace is shown below:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
Call Trace:
kmem_cache_alloc_node
__alloc_skb
nfc_genl_fw_download_done
call_timer_fn
__run_timers.part.0
run_timer_softirq
__do_softirq
...
The nlmsg_new with GFP_KERNEL parameter may sleep during memory
allocation process, and the timer handler is run as the result of
a "software interrupt" that should not call any other function
that could sleep.
This patch changes allocation mode of netlink message from GFP_KERNEL
to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: 9674da8759 ("NFC: Add firmware upload netlink command")
Fixes: 9ea7187c53 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As discussed here with Ido Schimmel:
https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
the default conform-exceed action is "reclassify", for a reason we don't
really understand.
The point is that hardware can't offload that police action, so not
specifying "conform-exceed" was always wrong, even though the command
used to work in hardware (but not in software) until the kernel started
adding validation for it.
Fix the command used by the selftest by making the policer drop on
exceed, and pass the packet to the next action (goto) on conform.
Fixes: 8cd6b020b6 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Willy Tarreau says:
====================
insufficient TCP source port randomness
In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
report being able to accurately identify a client by forcing it to emit
only 40 times more connections than the number of entries in the
table_perturb[] table, which is indexed by hashing the connection tuple.
The current 2^8 setting allows them to perform that attack with only 10k
connections, which is not hard to achieve in a few seconds.
Eric, Amit and I have been working on this for a few weeks now imagining,
testing and eliminating a number of approaches that Amit and his team were
still able to break or that were found to be too risky or too expensive,
and ended up with the simple improvements in this series that resists to
the attack, doesn't degrade the performance, and preserves a reliable port
selection algorithm to avoid connection failures, including the odd/even
port selection preference that allows bind() to always find a port quickly
even under strong connect() stress.
The approach relies on several factors:
- resalting the hash secret that's used to choose the table_perturb[]
entry every 10 seconds to eliminate slow attacks and force the
attacker to forget everything that was learned after this delay.
This already eliminates most of the problem because if a client
stays silent for more than 10 seconds there's no link between the
previous and the next patterns, and 10s isn't yet frequent enough
to cause too frequent repetition of a same port that may induce a
connection failure ;
- adding small random increments to the source port. Previously, a
random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
added after each port. This means that with the default 32768-60999
range, a worst case rollover happens after 1764 connections, and
an average of 3137. This doesn't stop statistical attacks but
requires significantly more iterations of the same attack to
confirm a guess.
- increasing the table_perturb[] size from 2^8 to 2^16, which Amit
says will require 2.6 million connections to be attacked with the
changes above, making it pointless to get a fingerprint that will
only last 10 seconds. Due to the size, the table was made dynamic.
- a few minor improvements on the bits used from the hash, to eliminate
some unfortunate correlations that may possibly have been exploited
to design future attack models.
These changes were tested under the most extreme conditions, up to
1.1 million connections per second to one and a few targets, showing no
performance regression, and only 2 connection failures within 13 billion,
which is less than 2^-32 and perfectly within usual values.
The series is split into small reviewable changes and was already reviewed
by Amit and Eric.
====================
Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>