linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-15 00:04:15 +08:00

Author	SHA1	Message	Date
Florian Westphal	bde7170a04	netfilter: xtables: disable 32bit compat interface by default This defaulted to 'y' because before this knob existed the 32bit compat layer was always compiled in if CONFIG_COMPAT was set. 32bit iptables on 64bit kernel isn't common anymore, so remove the default-y now. Signed-off-by: Florian Westphal <fw@strlen.de>	2023-03-22 21:48:59 +01:00
Jeremy Sowden	f6ca5d5ed7	netfilter: nft_masq: deduplicate eval call-backs nft_masq has separate ipv4 and ipv6 call-backs which share much of their code, and an inet one switch containing a switch that calls one of the others based on the family of the packet. Merge the ipv4 and ipv6 ones into the inet one in order to get rid of the duplicate code. Const-qualify the `priv` pointer since we don't need to write through it. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-03-22 21:48:59 +01:00
Jeremy Sowden	6f56ad1b92	netfilter: nft_redir: use `struct nf_nat_range2` throughout and deduplicate eval call-backs `nf_nat_redirect_ipv4` takes a `struct nf_nat_ipv4_multi_range_compat`, but converts it internally to a `struct nf_nat_range2`. Change the function to take the latter, factor out the code now shared with `nf_nat_redirect_ipv6`, move the conversion to the xt_REDIRECT module, and update the ipv4 range initialization in the nft_redir module. Replace a bare hex constant for 127.0.0.1 with a macro. Remove `WARN_ON`. `nf_nat_setup_info` calls `nf_ct_is_confirmed`: /* Can't setup nat info for confirmed ct. */ if (nf_ct_is_confirmed(ct)) return NF_ACCEPT; This means that `ct` cannot be null or the kernel will crash, and implies that `ctinfo` is `IP_CT_NEW` or `IP_CT_RELATED`. nft_redir has separate ipv4 and ipv6 call-backs which share much of their code, and an inet one switch containing a switch that calls one of the others based on the family of the packet. Merge the ipv4 and ipv6 ones into the inet one in order to get rid of the duplicate code. Const-qualify the `priv` pointer since we don't need to write through it. Assign `priv->flags` to the range instead of OR-ing it in. Set the `NF_NAT_RANGE_PROTO_SPECIFIED` flag once during init, rather than on every eval. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-03-22 21:48:59 +01:00
Jesper Dangaard Brouer	915efd8a44	xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support When driver doesn't implement a bpf_xdp_metadata kfunc the fallback implementation returns EOPNOTSUPP, which indicate device driver doesn't implement this kfunc. Currently many drivers also return EOPNOTSUPP when the hint isn't available, which is ambiguous from an API point of view. Instead change drivers to return ENODATA in these cases. There can be natural cases why a driver doesn't provide any hardware info for a specific hint, even on a frame to frame basis (e.g. PTP). Lets keep these cases as separate return codes. When describing the return values, adjust the function kernel-doc layout to get proper rendering for the return values. Fixes: `ab46182d0d` ("net/mlx4_en: Support RX XDP metadata") Fixes: `bc8d405b1b` ("net/mlx5e: Support RX XDP metadata") Fixes: `306531f024` ("veth: Support RX XDP metadata") Fixes: `3d76a4d3d4` ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/167940675120.2718408.8176058626864184420.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-22 09:11:09 -07:00
Ido Schimmel	bb765a7433	mlxsw: spectrum_fid: Fix incorrect local port type Local port is a 10-bit number, but it was mistakenly stored in a u8, resulting in firmware errors when using a netdev corresponding to a local port higher than 255. Fix by storing the local port in u16, as is done in the rest of the code. Fixes: `bf73904f5f` ("mlxsw: Add support for 802.1Q FID family") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/eace1f9d96545ab8a2775db857cb7e291a9b166b.1679398549.git.petrm@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 15:50:32 +01:00
Xiaoyan Li	5c5945dc69	selftests/net: Add SHA256 computation over data sent in tcp_mmap Add option to compute and send SHA256 over data sent (-i). This is to ensure the correctness of data received. Data is randomly populated from /dev/urandom. Tested: ./tcp_mmap -s -z -i ./tcp_mmap -z -H $ADDR -i SHA256 is correct ./tcp_mmap -s -i ./tcp_mmap -H $ADDR -i SHA256 is correct Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230321081202.2370275-2-lixiaoyan@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 15:34:31 +01:00
Xiaoyan Li	593ef60c74	net-zerocopy: Reduce compound page head access When compound pages are enabled, although the mm layer still returns an array of page pointers, a subset (or all) of them may have the same page head since a max 180kb skb can span 2 hugepages if it is on the boundary, be a mix of pages and 1 hugepage, or fit completely in a hugepage. Instead of referencing page head on all page pointers, use page length arithmetic to only call page head when referencing a known different page head to avoid touching a cold cacheline. Tested: See next patch with changes to tcp_mmap Correntess: On a pair of separate hosts as send with MSG_ZEROCOPY will force a copy on tx if using loopback alone, check that the SHA on the message sent is equivalent to checksum on the message received, since the current program already checks for the length. echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages ./tcp_mmap -s -z ./tcp_mmap -H $DADDR -z SHA256 is correct received 2 MB (100 % mmap'ed) in 0.005914 s, 2.83686 Gbit cpu usage user:0.001984 sys:0.000963, 1473.5 usec per MB, 10 c-switches Performance: Run neper between adjacent hosts with the same config tcp_stream -Z --skip-rx-copy -6 -T 20 -F 1000 --stime-use-proc --test-length=30 Before patch: stime_end=37.670000 After patch: stime_end=30.310000 Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230321081202.2370275-1-lixiaoyan@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 15:34:31 +01:00
Masami Hiramatsu (Google)	caa0708a81	bootconfig: Change message if no bootconfig with CONFIG_BOOT_CONFIG_FORCE=y Change no bootconfig data error message if user do not specify 'bootconfig' option but CONFIG_BOOT_CONFIG_FORCE=y. With CONFIG_BOOT_CONFIG_FORCE=y, the kernel proceeds bootconfig check even if user does not specify 'bootconfig' option. So the current error message is confusing. Let's show just an information message to notice skipping the bootconfig in that case. Link: https://lore.kernel.org/all/167754610254.318944.16848412476667893329.stgit@devnote2/ Fixes: `b743852ccc` ("Allow forcing unconditional bootconfig processing") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/all/CAMuHMdV9jJvE2y8gY5V_CxidUikCf5515QMZHzTA3rRGEOj6=w@mail.gmail.com/ Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>	2023-03-22 22:21:43 +09:00
Felix Fietkau	f355f70145	wifi: mac80211: fix mesh path discovery based on unicast packets If a packet has reached its intended destination, it was bumped to the code that accepts it, without first checking if a mesh_path needs to be created based on the discovered source. Fix this by moving the destination address check further down. Cc: stable@vger.kernel.org Fixes: `986e43b19a` ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230314095956.62085-3-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2023-03-22 13:46:46 +01:00
Felix Fietkau	4e348c6c6e	wifi: mac80211: fix qos on mesh interfaces When ieee80211_select_queue is called for mesh, the sta pointer is usually NULL, since the nexthop is looked up much later in the tx path. Explicitly check for unicast address in that case in order to make qos work again. Cc: stable@vger.kernel.org Fixes: `50e2ab3929` ("wifi: mac80211: fix queue selection for mesh/OCB interfaces") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230314095956.62085-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2023-03-22 13:46:38 +01:00
Wolfram Sang	ce1fdb0656	sh_eth: remove open coded netif_running() It had a purpose back in the days, but today we have a handy helper. Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230321065826.2044-1-wsa+renesas@sang-engineering.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 13:23:57 +01:00
Johannes Berg	923bf981eb	wifi: iwlwifi: mvm: protect TXQ list manipulation Some recent upstream debugging uncovered the fact that in iwlwifi, the TXQ list manipulation is racy. Introduce a new state bit for when the TXQ is completely ready and can be used without locking, and if that's not set yet acquire the lock to check everything correctly. Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Tested-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2023-03-22 13:14:24 +01:00
Johannes Berg	b58e3d4311	wifi: iwlwifi: mvm: fix mvmtxq->stopped handling This could race if the queue is redirected while full, then the flushing internally would start it while it's not yet usable again. Fix it by using two state bits instead of just one. Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Tested-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2023-03-22 13:14:18 +01:00
Grygorii Strashko	7849c42da2	net: ethernet: ti: am65-cpts: adjust estf following ptp changes When the CPTS clock is synced/adjusted by running linuxptp (ptp4l/phc2sys), it will cause the TSN EST schedule to drift away over time. This is because the schedule is driven by the EstF periodic counter whose pulse length is defined in ref_clk cycles and it does not automatically sync to CPTS clock. _______ _\| ^ expected cycle start time boundary _______________ _\|_\|___\|_\| ^ EstF drifted away -> direction To fix it, the same PPM adjustment has to be applied to EstF as done to the PHC CPTS clock, in order to correct the TSN EST cycle length and keep them in sync. Drifted cycle: AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230373377017 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230373877017 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230374377017 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230374877017 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230375377017 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230375877023 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230376377018 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230376877018 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635968230377377018 Stable cycle: AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863193375473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863193875473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863194375473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863194875473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863195375473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863195875473 AM65_CPTS_EVT: 7 e1:01770001 e2:000000ff t:1635966863196375473 Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://lore.kernel.org/r/20230321062600.2539544-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 13:00:14 +01:00
Jason Xing	59da2d7b0e	net-sysfs: display two backlog queue len separately Sometimes we need to know which one of backlog queue can be exactly long enough to cause some latency when debugging this part is needed. Thus, we can then separate the display of both. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230321015746.96994-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 12:03:52 +01:00
Arseniy Krasnov	4d1f515517	virtio/vsock: check transport before skb allocation Pointer to transport could be checked before allocation of skbuff, thus there is no need to free skbuff when this pointer is NULL. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/08d61bef-0c11-c7f9-9266-cb2109070314@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-22 11:02:40 +01:00
Jakub Kicinski	56c874f7db	tools: ynl: skip the explicit op array size when not needed Jiri suggests it reads more naturally to skip the explicit array size when possible. When we export the symbol we want to make sure that the size is right but for statics its not needed. Link: https://lore.kernel.org/r/20230321044159.1031040-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:45:31 -07:00
Jakub Kicinski	85496c9b3b	Merge branch 'net-remove-some-rcu_bh-cruft' Eric Dumazet says: ==================== net: remove some rcu_bh cruft There is no point using rcu_bh variant hoping to free objects faster, especially hen using call_rcu() or kfree_rcu(). Disabling/enabling BH has a non-zero cost, and adds distracting hot spots in kernel profiles eg. in ip6_xmit(). ==================== Link: https://lore.kernel.org/r/20230321040115.787497-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:32:21 -07:00
Eric Dumazet	fe602c87df	net: remove rcu_dereference_bh_rtnl() This helper is no longer used in the tree. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:32:18 -07:00
Eric Dumazet	09eed1192c	neighbour: switch to standard rcu, instead of rcu_bh rcu_bh is no longer a win, especially for objects freed with standard call_rcu(). Switch neighbour code to no longer disable BH when not necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:32:18 -07:00
Eric Dumazet	4c5c496a94	ipv6: flowlabel: do not disable BH where not needed struct ip6_flowlabel are rcu managed, and call_rcu() is used to delay fl_free_rcu() after RCU grace period. There is no point disabling BH for pure RCU lookups. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:32:18 -07:00
Zhang Changzhong	4107b8746d	net/sonic: use dma_mapping_error() for error check The DMA address returned by dma_map_single() should be checked with dma_mapping_error(). Fix it accordingly. Fixes: `efcce83936` ("[PATCH] macsonic/jazzsonic network drivers update") Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/6645a4b5c1e364312103f48b7b36783b94e197a2.1679370343.git.fthain@linux-m68k.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:29:34 -07:00
Jakub Kicinski	82463d9dbb	Merge branch 'fix-trainwreck-with-ocelot-switch-statistics-counters' Vladimir Oltean says: ==================== Fix trainwreck with Ocelot switch statistics counters While testing the patch set for preemptible traffic classes with some controlled traffic and measuring counter deltas: https://lore.kernel.org/netdev/20230220122343.1156614-1-vladimir.oltean@nxp.com/ I noticed that in the output of "ethtool -S swp0 --groups eth-mac eth-phy eth-ctrl rmon -- --src emac \| grep -v ': 0'", the TX counters were off. Quickly I realized that their values were permutated by 1 compared to their names, and that for example tx-rmon-etherStatsPkts64to64Octets was incrementing when tx-rmon-etherStatsPkts65to127Octets should have. Initially I suspected something having to do with the bulk reading logic, and indeed I found a bug there (fixed as 1/3), but that was not the source of the problems. Instead it revealed other problems. While dumping the regions created by the driver on my switch, I figured out that it sees a discontinuity which shouldn't have existed between reg 0x278 and reg 0x280. Discontinuity between last reg 0x0 and new reg 0x0, creating new region Discontinuity between last reg 0x108 and new reg 0x200, creating new region Discontinuity between last reg 0x278 and new reg 0x280, creating new region Discontinuity between last reg 0x2b0 and new reg 0x400, creating new region region of 67 contiguous counters starting with SYS:STAT:CNT[0x000] region of 31 contiguous counters starting with SYS:STAT:CNT[0x080] region of 13 contiguous counters starting with SYS:STAT:CNT[0x0a0] region of 18 contiguous counters starting with SYS:STAT:CNT[0x100] That is where TX_MM_HOLD should have been, and that was the bug, since it was missing. After adding it, the regions look like this and the off-by-one issue is resolved: Discontinuity between last reg 0x000000 and new reg 0x000000, creating new region Discontinuity between last reg 0x000108 and new reg 0x000200, creating new region Discontinuity between last reg 0x0002b0 and new reg 0x000400, creating new region region of 67 contiguous counters starting with SYS:STAT:CNT[0x000] region of 45 contiguous counters starting with SYS:STAT:CNT[0x080] region of 18 contiguous counters starting with SYS:STAT:CNT[0x100] However, as I am thinking out loud, it should have not reported the other counters as off by one even when skipping TX_MM_HOLD... after all, on Ocelot/Seville, there are more counters which need to be skipped. Which is when I investigated and noticed the bug solved in 2/3. I've validated that both on native VSC9959 (which uses ocelot_mm_stats_layout) as well as by faking the other switches by making VSC9959 use the plain ocelot_stats_layout. To summarize: on all Ocelot switches, the TX counters and drop counters are completely broken. The RX counters are mostly fine. With this occasion, I have collected more cleanup patches in this area, which I'm going to submit after the net -> net-next merge. ==================== Link: https://lore.kernel.org/r/20230321010325.897817-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:28:08 -07:00
Vladimir Oltean	5291099e0f	net: mscc: ocelot: add TX_MM_HOLD to ocelot_mm_stats_layout The lack of a definition for this counter is what initially prompted me to investigate a problem which really manifested itself as the previous change, "net: mscc: ocelot: fix transfer from region->buf to ocelot->stats". When TX_MM_HOLD is defined in enum ocelot_stat but not in struct ocelot_stat_layout ocelot_mm_stats_layout, this creates a hole, which due to the aforementioned bug, makes all counters following TX_MM_HOLD be recorded off by one compared to their correct position. So for example, a non-zero TX_PMAC_OCTETS would be reported as TX_MERGE_FRAGMENTS, TX_PMAC_UNICAST would be reported as TX_PMAC_OCTETS, TX_PMAC_64 would be reported as TX_PMAC_PAUSE, etc etc. This is because the size of the hole (1) is much smaller than the size of the region, so the phenomenon where the stats are off-by-one, rather than lost, prevails. However, the phenomenon where stats are lost can be seen too, for example with DROP_LOCAL, which is at the beginning of its own region (offset 0x000400 vs the previous 0x0002b0 constitutes a discontinuity). This is also reported as off by one and saved to TX_PMAC_1527_MAX, but that counter is not reported to the unstructured "ethtool -S", as opposed to DROP_LOCAL which is (as "drop_local"). Fixes: `ab3f97a961` ("net: mscc: ocelot: export ethtool MAC Merge stats for Felix VSC9959") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:28:07 -07:00
Vladimir Oltean	17dfd21045	net: mscc: ocelot: fix transfer from region->buf to ocelot->stats To understand the problem, we need some definitions. The driver is aware of multiple counters (enum ocelot_stat), yet not all switches supported by the driver implement all counters. There are 2 statistics layouts: ocelot_stats_layout and ocelot_mm_stats_layout, the latter having 36 counters more than the former. ocelot->stats[] is not a compact array, i.e. there are elements within it which are not going to be populated for ocelot_stats_layout. On the other hand, ocelot->stats[] is easily indexable, for example "tx_octets" for port 3 can be found at ocelot->stats[3 * OCELOT_NUM_STATS + OCELOT_STAT_TX_OCTETS], and that is why we keep it sparse. Regions, as created by ocelot_prepare_stats_regions(), are compact (every element from region->buf will correspond to a counter that is present in this switch's layout) but are not easily indexable. Let's define holes as the ranges of values of enum ocelot_stat for which ocelot_stats_layout doesn't have a "reg" defined. For example, there is a hole between OCELOT_STAT_RX_GREEN_PRIO_7 and OCELOT_STAT_TX_OCTETS which is of 23 elements that are only present on ocelot_mm_stats_layout, and as such, they are also present in enum ocelot_stat. Let's define the left extremity of the hole - the last enum ocelot_stat still defined - as A (in this case OCELOT_STAT_RX_GREEN_PRIO_7) and the right extremity - the first enum ocelot_stat that is defined after a series of undefined ones - as B (in this case OCELOT_STAT_TX_OCTETS). There is a bug in the procedure which transfers stats from region->buf[] to ocelot->stats[]. For each hole in the ocelot_stats_layout, the logic transfers the stats starting with enum ocelot_stat B to ocelot->stats[] index A + 1. So all stats after a hole are saved to a position which is off by B - A + 1 elements. This causes 2 kinds of issues: (a) counters which shouldn't increment increment (b) counters which should increment don't Holes in the ocelot_stat_layout automatically imply the end of a region and the beginning of a new one; however the reverse is not necessarily true. For example, for ocelot_mm_stat_layout, there could be multiple regions (which indicate discontinuities in register addresses) while there is no hole (which indicates discontinuities in enum ocelot_stat values). In the example above, the stats from the second region->buf[] are not transferred to ocelot->stats starting with index "port * OCELOT_NUM_STATS + OCELOT_STAT_TX_OCTETS" as they should, but rather, starting with element "port * OCELOT_NUM_STATS + OCELOT_STAT_RX_GREEN_PRIO_7 + 1". That stats[] array element is not reported to user space for switches that use ocelot_stat_layout, and that is how issue (b) occurs. However, if the length of the second region is larger than the hole, then some stats will start to be transferred to the ocelot->stats[] indices which are reported to user space, but those indices contain wrong values (corresponding to unexpected counters). This is how issue (a) occurs. The procedure, as it was introduced in commit `d87b1c08f3` ("net: mscc: ocelot: use bulk reads for stats"), was not buggy, because there were no holes in the struct ocelot_stat_layout instances at that time. The problem is that when those holes were introduced, the function was not updated to take them into consideration. To update the procedure, we need to know, for each region, which enum ocelot_stat corresponds to its region->base. We have no way of deducing that based on the contents of struct ocelot_stats_region, so we need to add this information. Fixes: `ab3f97a961` ("net: mscc: ocelot: export ethtool MAC Merge stats for Felix VSC9959") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:27:51 -07:00
Vladimir Oltean	6acc72a43e	net: mscc: ocelot: fix stats region batching The blamed commit changed struct ocelot_stat_layout :: "u32 offset" to "u32 reg". However, "u32 reg" is not quite a register address, but an enum ocelot_reg, which in itself encodes an enum ocelot_target target in the upper bits, and an index into the ocelot->map[target][] array in the lower bits. So, whereas the previous code comparison between stats_layout[i].offset and last + 1 was correct (because those "offsets" at the time were 32-bit relative addresses), the new code, comparing layout[i].reg to last + 4 is not correct, because the "reg" here is an enum/index, not an actual register address. What we want to compare are indeed register addresses, but to do that, we need to actually go through the same motions as __ocelot_bulk_read_ix() itself. With this bug, all statistics counters are deemed by ocelot_prepare_stats_regions() as constituting their own region. (Truncated) log on VSC9959 (Felix) below (prints added by me): Before: region of 1 contiguous counters starting with SYS:STAT:CNT[0x000] region of 1 contiguous counters starting with SYS:STAT:CNT[0x001] region of 1 contiguous counters starting with SYS:STAT:CNT[0x002] ... region of 1 contiguous counters starting with SYS:STAT:CNT[0x041] region of 1 contiguous counters starting with SYS:STAT:CNT[0x042] region of 1 contiguous counters starting with SYS:STAT:CNT[0x080] region of 1 contiguous counters starting with SYS:STAT:CNT[0x081] ... region of 1 contiguous counters starting with SYS:STAT:CNT[0x0ac] region of 1 contiguous counters starting with SYS:STAT:CNT[0x100] region of 1 contiguous counters starting with SYS:STAT:CNT[0x101] ... region of 1 contiguous counters starting with SYS:STAT:CNT[0x111] After: region of 67 contiguous counters starting with SYS:STAT:CNT[0x000] region of 45 contiguous counters starting with SYS:STAT:CNT[0x080] region of 18 contiguous counters starting with SYS:STAT:CNT[0x100] Since commit `d87b1c08f3` ("net: mscc: ocelot: use bulk reads for stats") intended bulking as a performance improvement, and since now, with trivial-sized regions, performance is even worse than without bulking at all, this could easily qualify as a performance regression. Fixes: `d4c3676507` ("net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:27:10 -07:00
Tom Rix	f6f4e739b1	net: atheros: atl1c: remove unused atl1c_irq_reset function clang with W=1 reports drivers/net/ethernet/atheros/atl1c/atl1c_main.c:214:20: error: unused function 'atl1c_irq_reset' [-Werror,-Wunused-function] static inline void atl1c_irq_reset(struct atl1c_adapter *adapter) ^ This function is not used, so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/20230320232317.1729464-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:18:54 -07:00
Eric Dumazet	8e50ed7745	erspan: do not use skb_mac_header() in ndo_start_xmit() Drivers should not assume skb_mac_header(skb) == skb->data in their ndo_start_xmit(). Use skb_network_offset() and skb_transport_offset() which better describe what is needed in erspan_fb_xmit() and ip6erspan_tunnel_xmit() syzbot reported: WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 skb_mac_header include/linux/skbuff.h:2873 [inline] WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962 Modules linked in: CPU: 0 PID: 5083 Comm: syz-executor406 Not tainted 6.3.0-rc2-syzkaller-00866-gd4671cb96fa3 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 RIP: 0010:skb_mac_header include/linux/skbuff.h:2873 [inline] RIP: 0010:ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962 Code: 04 02 41 01 de 84 c0 74 08 3c 03 0f 8e 1c 0a 00 00 45 89 b4 24 c8 00 00 00 c6 85 77 fe ff ff 01 e9 33 e7 ff ff e8 b4 27 a1 f8 <0f> 0b e9 b6 e7 ff ff e8 a8 27 a1 f8 49 8d bf f0 0c 00 00 48 b8 00 RSP: 0018:ffffc90003b2f830 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 000000000000ffff RCX: 0000000000000000 RDX: ffff888021273a80 RSI: ffffffff88e1bd4c RDI: 0000000000000003 RBP: ffffc90003b2f9d8 R08: 0000000000000003 R09: 000000000000ffff R10: 000000000000ffff R11: 0000000000000000 R12: ffff88802b28da00 R13: 00000000000000d0 R14: ffff88807e25b6d0 R15: ffff888023408000 FS: 0000555556a61300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055e5b11eb6e8 CR3: 0000000027c1b000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __netdev_start_xmit include/linux/netdevice.h:4900 [inline] netdev_start_xmit include/linux/netdevice.h:4914 [inline] __dev_direct_xmit+0x504/0x730 net/core/dev.c:4300 dev_direct_xmit include/linux/netdevice.h:3088 [inline] packet_xmit+0x20a/0x390 net/packet/af_packet.c:285 packet_snd net/packet/af_packet.c:3075 [inline] packet_sendmsg+0x31a0/0x5150 net/packet/af_packet.c:3107 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0xde/0x190 net/socket.c:747 __sys_sendto+0x23a/0x340 net/socket.c:2142 __do_sys_sendto net/socket.c:2154 [inline] __se_sys_sendto net/socket.c:2150 [inline] __x64_sys_sendto+0xe1/0x1b0 net/socket.c:2150 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f123aaa1039 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc15d12058 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f123aaa1039 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000020000040 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f123aa648c0 R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000 Fixes: `1baf5ebf89` ("erspan: auto detect truncated packets.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230320163427.8096-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 21:16:26 -07:00
Li Zetao	4fe3c88552	atm: idt77252: fix kmemleak when rmmod idt77252 There are memory leaks reported by kmemleak: unreferenced object 0xffff888106500800 (size 128): comm "modprobe", pid 1017, jiffies 4297787785 (age 67.152s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000970ce626>] __kmem_cache_alloc_node+0x20c/0x380 [<00000000fb5f78d9>] kmalloc_trace+0x2f/0xb0 [<000000000e947e2a>] idt77252_init_one+0x2847/0x3c90 [idt77252] [<000000006efb048e>] local_pci_probe+0xeb/0x1a0 ... unreferenced object 0xffff888106500b00 (size 128): comm "modprobe", pid 1017, jiffies 4297787785 (age 67.152s) hex dump (first 32 bytes): 00 20 3d 01 80 88 ff ff 00 20 3d 01 80 88 ff ff . =...... =..... f0 23 3d 01 80 88 ff ff 00 20 3d 01 00 00 00 00 .#=...... =..... backtrace: [<00000000970ce626>] __kmem_cache_alloc_node+0x20c/0x380 [<00000000fb5f78d9>] kmalloc_trace+0x2f/0xb0 [<00000000f451c5be>] alloc_scq.constprop.0+0x4a/0x400 [idt77252] [<00000000e6313849>] idt77252_init_one+0x28cf/0x3c90 [idt77252] The root cause is traced to the vc_maps which alloced in open_card_oam() are not freed in close_card_oam(). The vc_maps are used to record open connections, so when close a vc_map in close_card_oam(), the memory should be freed. Moreover, the ubr0 is not closed when close a idt77252 device, leading to the memory leak of vc_map and scq_info. Fix them by adding kfree in close_card_oam() and implementing new close_card_ubr0() to close ubr0. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Francois Romieu <romieu@fr.zoreil.com> Link: https://lore.kernel.org/r/20230320143318.2644630-1-lizetao1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 20:19:28 -07:00
Álvaro Fernández Rojas	032a954061	net: dsa: tag_brcm: legacy: fix daisy-chained switches When BCM63xx internal switches are connected to switches with a 4-byte Broadcom tag, it does not identify the packet as VLAN tagged, so it adds one based on its PVID (which is likely 0). Right now, the packet is received by the BCM63xx internal switch and the 6-byte tag is properly processed. The next step would to decode the corresponding 4-byte tag. However, the internal switch adds an invalid VLAN tag after the 6-byte tag and the 4-byte tag handling fails. In order to fix this we need to remove the invalid VLAN tag after the 6-byte tag before passing it to the 4-byte tag decoding. Fixes: `964dbf186e` ("net: dsa: tag_brcm: add support for legacy tags") Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230319095540.239064-1-noltari@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-21 17:29:13 -07:00
Linus Torvalds	a1effab7a3	VFIO fixes for v6.3-rc4 - Fix dirty size reporting for pre-copy in mlx5 variant driver (Yishai Hadas) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmQaLRwbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiQjMP+we+dpw91BTxcLShKrrc A1Mzil0gKLeztDFNH04rbi45JshjXZkzbFKH/e0LXumud3Rnywna6Lk63TiwM7qB pAWdy+SXffOMCAIeEep6McjblWheYTmydJjHtuPLOqC5ZJEN2aD9adBzqvtSyG19 COxkZafBZl0m0AfMkEbguv6b14ywQgTkhq8lYpYLsYiZYUOd2M5jy3cATJ9KmcXR 4XSJmv6P9jDa0cci+g+gphDAz90zI8kdTD4Mq9InE56JE1bBMr4/fes79Rbheiui cpZCkFoBSMg0sOqrn2189Hnzc9gnbInPTMfx3Nz6mh5XH3PlAJurfMRd9I9itUQx D/gAysnnlLP8MQntJfCf2o0Njb7a3PqO7BFjXmG7ZK2ZOP+6G8ZIeALQXn/lkbYF RPJwXPEzlo+Pe8oFg/rTD9PqDVFXK6TjQuStIZNT0u1XpJwNZuYr7qSGXD/OdoUa Lkurr7rM/KsOA2fYrrW0jm9tPxEI9ilTSz9ILxkMc5RAzvF2DXkJ2VNLtLY+Lc7d c4IWSgUFBCO+vtvYJyMVsTeVcqkbKauYOQ8ItQCv1j9kg6xb5b8qzerj9CJDFTSE T1PG9wvsZBW9pWbBzAgXHgUt7ohiF2UynOO4jZUCcEXKTOrzxAK8rwoKGhS6j1CF ID36fHwRLnTx8z9yH1WiXoGd =VSl2 -----END PGP SIGNATURE----- Merge tag 'vfio-v6.3-rc4' of https://github.com/awilliam/linux-vfio Pull VFIO fix from Alex Williamson: - Fix dirty size reporting for pre-copy in mlx5 variant driver (Yishai Hadas) * tag 'vfio-v6.3-rc4' of https://github.com/awilliam/linux-vfio: vfio/mlx5: Fix the report of dirty_bytes upon pre-copy	2023-03-21 15:46:39 -07:00
Linus Torvalds	a2eaf246f5	nfsd-6.3 fixes: - Fix a crash during NFS READs from certain client implementations - Address a minor kbuild regression in v6.3 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmQaEJgACgkQM2qzM29m f5ceYBAAiGHz+ISIIirg2+B5DttDL5ssPoqbgfZ16JZfBgaDduOg8oK7pbILv1nf 4ufZQRcLpJDSnzhqNXxQE/L4Vsei8+tqkbwILSqFsCBZk5ei2AYSsZMu7Ptf8X8X Y+KZ55Bko0uTYAsglnPiEaDEGeq0ZYwfUc5/bnw7ZzXdAHhlwgjoAads6Q25A9f1 6ZuKQ6nIis/ok9q88GdLf7x51IdKrXTrkC1iYAeVTXCgR3qXPezqmPOzW44rZs0t TMzUgDuyPgvB5eT/N+pwXoI9Bli9v1PFoXZjm2NchhJ/Iy6Gt1eEEB93J/qRSU/x ziJLkk0jnJEPWJ0Ht5li4+60b58t2OoWCTJTr3FzHCcF5Iqr7YjenSi2U8IT84ct EeFdqR2nZpJ4Y1wbAhTjQ2w6rw5WpaXDLHCRRdE7OF/D121n3FkSNo8DtSn+fbR8 sFDpOruZn4tjgpf3AjCR1zbyKw9baC6clhnLrW6AE6T4ht9XwfE6eCUSy48FSMzg 4IZKLcofQh2UEvlVtCvGOjtcK151eFKocsq+p6/yhYE6BM20Z3dBqhw9iBzqOAPg 0qZk60AWl9W5h7M+ovkGAZpwXv4djsdWf7LbFuYOQ0kodzXOQ/5sFQm4zC5F3I1Z 8WPPA9SkYyNiNSBUt65tca4JNw5h6x9A3scPWtuHB7Lz3fuGTnE= =frHb -----END PGP SIGNATURE----- Merge tag 'nfsd-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a crash during NFS READs from certain client implementations - Address a minor kbuild regression in v6.3 * tag 'nfsd-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't replace page in rq_pages if it's a continuation of last page NFS & NFSD: Update GSS dependencies	2023-03-21 14:48:38 -07:00
Dan Carpenter	640fcdbcf2	net/mlx5: E-Switch, Fix an Oops in error handling code The error handling dereferences "vport". There is nothing we can do if it is an error pointer except returning the error code. Fixes: `133dcfc577` ("net/mlx5: E-Switch, Alloc and free unique metadata for match") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:32 -07:00
Maher Sanalla	44d553188c	net/mlx5: Read the TC mapping of all priorities on ETS query When ETS configurations are queried by the user to get the mapping assignment between packet priority and traffic class, only priorities up to maximum TCs are queried from QTCT register in FW to retrieve their assigned TC, leaving the rest of the priorities mapped to the default TC #0 which might be misleading. Fix by querying the TC mapping of all priorities on each ETS query, regardless of the maximum number of TCs configured in FW. Fixes: `820c2c5e77` ("net/mlx5e: Read ETS settings directly from firmware") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:32 -07:00
Emeel Hakim	7e3fce82d9	net/mlx5e: Overcome slow response for first macsec ASO WQE First ASO WQE poll causes a cache miss in hardware hence the resut is delayed. It causes to the situation where such WQE is polled earlier than it is needed. Add logic to retry ASO CQ polling operation. Fixes: `739cfa3451` ("net/mlx5: Make ASO poll CQ usable in atomic context") Signed-off-by: Emeel Hakim <ehakim@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:32 -07:00
Roy Novich	6e9d51b1a5	net/mlx5e: Initialize link speed to zero mlx5e_port_max_linkspeed does not guarantee value assignment for speed. Avoid cases where link_speed might be used uninitialized. In case mlx5e_port_max_linkspeed fails, a default link speed of 50000 will be used for the calculations. Fixes: `3f6d08d196` ("net/mlx5e: Add RSS support for hairpin") Signed-off-by: Roy Novich <royno@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:31 -07:00
Lama Kayal	922f56e9a7	net/mlx5: Fix steering rules cleanup vport's mc, uc and multicast rules are not deleted in teardown path when EEH happens. Since the vport's promisc settings(uc, mc and all) in firmware are reset after EEH, mlx5 driver will try to delete the above rules in the initialization path. This cause kernel crash because these software rules are no longer valid. Fix by nullifying these rules right after delete to avoid accessing any dangling pointers. Call Trace: __list_del_entry_valid+0xcc/0x100 (unreliable) tree_put_node+0xf4/0x1b0 [mlx5_core] tree_remove_node+0x30/0x70 [mlx5_core] mlx5_del_flow_rules+0x14c/0x1f0 [mlx5_core] esw_apply_vport_rx_mode+0x10c/0x200 [mlx5_core] esw_update_vport_rx_mode+0xb4/0x180 [mlx5_core] esw_vport_change_handle_locked+0x1ec/0x230 [mlx5_core] esw_enable_vport+0x130/0x260 [mlx5_core] mlx5_eswitch_enable_sriov+0x2a0/0x2f0 [mlx5_core] mlx5_device_enable_sriov+0x74/0x440 [mlx5_core] mlx5_load_one+0x114c/0x1550 [mlx5_core] mlx5_pci_resume+0x68/0xf0 [mlx5_core] eeh_report_resume+0x1a4/0x230 eeh_pe_dev_traverse+0x98/0x170 eeh_handle_normal_event+0x3e4/0x640 eeh_handle_event+0x4c/0x370 eeh_event_handler+0x14c/0x210 kthread+0x168/0x1b0 ret_from_kernel_thread+0x5c/0x84 Fixes: `a35f71f27a` ("net/mlx5: E-Switch, Implement promiscuous rx modes vf request handling") Signed-off-by: Huy Nguyen <huyn@mellanox.com> Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:31 -07:00
Gavin Li	662404b24a	net/mlx5e: Block entering switchdev mode with ns inconsistency Upon entering switchdev mode, VF/SF representors are spawned in the devlink instance's net namespace, whereas the PF net device transforms into the uplink representor, remaining in the net namespace the PF net device was in. Therefore, if a PF net device's namespace is different from its parent devlink net namespace, entering switchdev mode can create an illegal situation where all representors sharing the same core device are NOT in the same net namespace. To avoid this issue, block entering switchdev mode for devices whose child netdev net namespace has diverged from the parent devlink's. Fixes: `7768d1971d` ("net/mlx5: E-Switch, Add control for encapsulation") Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:31 -07:00
Gavin Li	c83172b063	net/mlx5e: Set uplink rep as NETNS_LOCAL Previously, NETNS_LOCAL was not set for uplink representors, inconsistent with VF representors, and allowed the uplink representor to be moved between net namespaces and separated from the VF representors it shares the core device with. Such usage would break the isolation model of namespaces, as devices in different namespaces would have access to shared memory. To solve this issue, set NETNS_LOCAL for uplink representors if eswitch is in switchdev mode. Fixes: `7a9fb35e8c` ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-21 14:06:31 -07:00
Daniel Borkmann	10ec8ca8ec	bpf: Adjust insufficient default bpf_jit_limit We've seen recent AWS EKS (Kubernetes) user reports like the following: After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error: Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7[...]": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads. We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo \| grep bpf_jit \| awk '{s+=$2} END {print s}' was steadily increasing. I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem their sizes passed in as well as bpf_jit_current under tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and the behavior all looks sane and expected, that is nothing "leaking" from an upstream perspective. The bpf_jit_limit knob was originally added in order to avoid a situation where unprivileged applications loading BPF programs (e.g. seccomp BPF policies) consuming all the module memory space via BPF JIT such that loading of kernel modules would be prevented. The default limit was defined back in 2018 and while good enough back then, we are generally seeing far more BPF consumers today. Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the module memory space to better reflect today's needs and avoid more users running into potentially hard to debug issues. Fixes: `fdadd04931` ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Reported-by: Stephen Haynes <sh@synk.net> Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://github.com/awslabs/amazon-eks-ami/issues/1179 Link: https://github.com/awslabs/amazon-eks-ami/issues/1219 Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-21 12:43:05 -07:00
Linus Torvalds	2faac9a98f	keyrings fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmQZ234ACgkQ+7dXa6fL C2t9Yw/+O5uA0kyRp/3xgJ/6T2zbQEFdzUZVSif2P/Kv5uYx+1qC/E00bYOWNIIX dWey23HHHyNm0dWmR39SY9flixnIRFT83W4+chmHkeboXuXzL4I1s2+W2WOKqdoX H3X+uoK6eew3S4OEMwtlWoKebIzzzJls72kaSiCW63yhVOpiOI94muNdUVRQy7ll iXDiI5pB/cBvOAfLazrTnnhmcsUoGhW9qfJFI7kZM7JsVxvk0Fw6G1T35yEiwR6M GO9ZvJPNtan3jzKzJaUc5qtYclqSj9wzxiY3x3uPe2afLsb2KiFKWh0e89DNmr9D DyPfPvj7gJdFVLeE/wkdJNkAPgC8oN7F3uUnZaJjwtqItuCOp5HTYQtZlK532Xej 0OnLcxPj18/SkbIZ2HcE2akwhfChbLghdAioqgsKU/pE1nUP69HBsiLn5PkL/1zK f1dUN+L6MggyQGJRKaQdKPNaY26bHg9GkSZdxPokkQ8QF/GSvmw1cqQZrox3qUQp 9RhfHyh1GRI1MFlVfhVGfhTO8F7y6JwG1xTUJpaCVFgLVTNT9y7L9DAzH9YKGb79 pWgf9OspfxoDJrKREPK6inC3x2O5I6ZruET44S5VY9iWlRmMY/CTQ/fJu6rIcCr+ kHejN8rgU8N0Hc9WnZrSe1BS6e18jhZMMQtrF8/6md/OkoX13fU= =RrmT -----END PGP SIGNATURE----- Merge tag 'keys-fixes-20230321' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull keyrings fixes from David Howells: - Fix request_key() so that it doesn't cache a looked up key on the current thread if that thread is a kernel thread. The cache is cleared during notify_resume - but that doesn't happen in kernel threads. This is causing cifs DNS keys to be un-invalidateable. - Fix a wrapper check in verify_pefile() to not round up the length. - Change asymmetric_keys code to log errors to make it easier for users to work out why failures occurred. * tag 'keys-fixes-20230321' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: asymmetric_keys: log on fatal failures in PE/pkcs7 verify_pefile: relax wrapper length check keys: Do not cache key in task struct if key is requested from kernel thread	2023-03-21 11:38:58 -07:00
Sasha Neftin	65364bbe0b	igc: Remove obsolete DMA coalescing code DMA coalescing is not applicable for i225 parts. This patch comes to tidy up the driver code. Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 11:37:16 -07:00
Dawid Wesierski	5a9b7bfb0d	igbvf: add PCI reset handler functions There was a problem with resuming ping after conducting a PCI reset. This commit adds two functions, igbvf_io_prepare and igbvf_io_done, which, after being added to the pci_error_handlers struct, will prepare the drivers for a PCI reset and then bring the interface up and reset it after. This will prevent the driver from ending up in incorrect state. Test_and_set_bit is highly reliable in this context, so we are not including a timeout in this commit This introduces 900ms - 1100ms of overhead to this operation but it's in non-time-critical flow. And also allows the driver to continue functioning after the reset. Functionality documented in ethernet-controller-i350-datasheet 4.2.1.3 https://www.intel.com/content/www/us/en/products/details/ethernet/gigabit-controllers/i350-controllers/docs.html Signed-off-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 11:37:16 -07:00
Andrii Staikov	d71980d47e	igb: refactor igb_ptp_adjfine_82580 to use diff_by_scaled_ppm Driver's .adjfine interface functions use adjust_by_scaled_ppm and diff_by_scaled_ppm introduced in commit `1060707e38` ("ptp: introduce helpers to adjust by scaled parts per million") to calculate the required adjustment in a concise manner, but not igb_ptp_adjfine_82580. Fix it by introducing IGB_82580_BASE_PERIOD and changing function logic to use diff_by_scaled_ppm. Signed-off-by: Andrii Staikov <andrii.staikov@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 11:37:16 -07:00
Radoslaw Tyl	c672297bbc	i40e: fix flow director packet filter programming Initialize to zero structures to build a valid Tx Packet used for the filter programming. Fixes: `a9219b332f` ("i40e: VLAN field for flow director") Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 11:11:08 -07:00
Stefan Assmann	4e264be98b	iavf: fix hang on reboot with ice When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: `974578017f` ("iavf: Add waiting so the port is initialized in remove") Fixes: `a8417330f8` ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <mcornea@redhat.com> Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 11:11:08 -07:00
Michal Swiatkowski	7d46c0e670	ice: remove filters only if VSI is deleted Filters shouldn't be removed in VSI rebuild path. Removing them on PF VSI results in no rule for PF MAC after changing for example queues amount. Remove all filters only in the VSI remove flow. As unload should also cause the filter to be removed introduce, a new function ice_stop_eth(). It will unroll ice_start_eth(), so remove filters and close VSI. Fixes: `6624e780a5` ("ice: split ice_vsi_setup into smaller functions") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 10:02:48 -07:00
Michal Swiatkowski	83b49e7f63	ice: check if VF exists before mode check Setting trust on VF should return EINVAL when there is no VF. Move checking for switchdev mode after checking if VF exists. Fixes: `c54d209c78` ("ice: Wait for VF to be reset/ready before configuration") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com> Signed-off-by: Kalyan Kodamagula <kalyan.kodamagula@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 10:02:48 -07:00
Piotr Raczynski	387d42ae6d	ice: fix rx buffers handling for flow director packets Adding flow director filters stopped working correctly after commit `2fba7dc515` ("ice: Add support for XDP multi-buffer on Rx side"). As a result, only first flow director filter can be added, adding next filter leads to NULL pointer dereference attached below. Rx buffer handling and reallocation logic has been optimized, however flow director specific traffic was not accounted for. As a result driver handled those packets incorrectly since new logic was based on ice_rx_ring::first_desc which was not set in this case. Fix this by setting struct ice_rx_ring::first_desc to next_to_clean for flow director received packets. [ 438.544867] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 438.551840] #PF: supervisor read access in kernel mode [ 438.556978] #PF: error_code(0x0000) - not-present page [ 438.562115] PGD 7c953b2067 P4D 0 [ 438.565436] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 438.569794] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.2.0-net-bug #1 [ 438.577531] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022 [ 438.588470] RIP: 0010:ice_clean_rx_irq+0x2b9/0xf20 [ice] [ 438.593860] Code: 45 89 f7 e9 ac 00 00 00 8b 4d 78 41 31 4e 10 41 09 d5 4d 85 f6 0f 84 82 00 00 00 49 8b 4e 08 41 8b 76 1c 65 8b 3d 47 36 4a 3f <48> 8b 11 48 c1 ea 36 39 d7 0f 85 a6 00 00 00 f6 41 08 02 0f 85 9c [ 438.612605] RSP: 0018:ff8c732640003ec8 EFLAGS: 00010082 [ 438.617831] RAX: 0000000000000800 RBX: 00000000000007ff RCX: 0000000000000000 [ 438.624957] RDX: 0000000000000800 RSI: 0000000000000000 RDI: 0000000000000000 [ 438.632089] RBP: ff4ed275a2158200 R08: 00000000ffffffff R09: 0000000000000020 [ 438.639222] R10: 0000000000000000 R11: 0000000000000020 R12: 0000000000001000 [ 438.646356] R13: 0000000000000000 R14: ff4ed275d0daffe0 R15: 0000000000000000 [ 438.653485] FS: 0000000000000000(0000) GS:ff4ed2738fa00000(0000) knlGS:0000000000000000 [ 438.661563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.667310] CR2: 0000000000000000 CR3: 0000007c9f0d6006 CR4: 0000000000771ef0 [ 438.674444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.681573] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.688697] PKRU: 55555554 [ 438.691404] Call Trace: [ 438.693857] <IRQ> [ 438.695877] ? profile_tick+0x17/0x80 [ 438.699542] ice_msix_clean_ctrl_vsi+0x24/0x50 [ice] [ 438.702571] ice 0000:b1:00.0: VF 1: ctrl_vsi irq timeout [ 438.704542] __handle_irq_event_percpu+0x43/0x1a0 [ 438.704549] handle_irq_event+0x34/0x70 [ 438.704554] handle_edge_irq+0x9f/0x240 [ 438.709901] iavf 0000:b1:01.1: Failed to add Flow Director filter with status: 6 [ 438.714571] __common_interrupt+0x63/0x100 [ 438.714580] common_interrupt+0xb4/0xd0 [ 438.718424] iavf 0000:b1:01.1: Rule ID: 127 dst_ip: 0.0.0.0 src_ip 0.0.0.0 UDP: dst_port 4 src_port 0 [ 438.722255] </IRQ> [ 438.722257] <TASK> [ 438.722257] asm_common_interrupt+0x22/0x40 [ 438.722262] RIP: 0010:cpuidle_enter_state+0xc8/0x430 [ 438.722267] Code: 6e e9 25 ff e8 f9 ef ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 d7 f1 24 ff 45 84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d [ 438.722269] RSP: 0018:ffffffff86003e50 EFLAGS: 00000246 [ 438.784108] RAX: ff4ed2738fa00000 RBX: ffbe72a64fc01020 RCX: 0000000000000000 [ 438.791234] RDX: 0000000000000000 RSI: ffffffff858d84de RDI: ffffffff85893641 [ 438.798365] RBP: 0000000000000002 R08: 0000000000000002 R09: 000000003158af9d [ 438.805490] R10: 0000000000000008 R11: 0000000000000354 R12: ffffffff862365a0 [ 438.812622] R13: 000000661b472a87 R14: 0000000000000002 R15: 0000000000000000 [ 438.819757] cpuidle_enter+0x29/0x40 [ 438.823333] do_idle+0x1b6/0x230 [ 438.826566] cpu_startup_entry+0x19/0x20 [ 438.830492] rest_init+0xcb/0xd0 [ 438.833717] arch_call_rest_init+0xa/0x30 [ 438.837731] start_kernel+0x776/0xb70 [ 438.841396] secondary_startup_64_no_verify+0xe5/0xeb [ 438.846449] </TASK> Fixes: `2fba7dc515` ("ice: Add support for XDP multi-buffer on Rx side") Signed-off-by: Piotr Raczynski <piotr.raczynski@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-21 10:02:48 -07:00
Robbie Harwood	3584c1dbff	asymmetric_keys: log on fatal failures in PE/pkcs7 These particular errors can be encountered while trying to kexec when secureboot lockdown is in place. Without this change, even with a signed debug build, one still needs to reboot the machine to add the appropriate dyndbg parameters (since lockdown blocks debugfs). Accordingly, upgrade all pr_debug() before fatal error into pr_warn(). Signed-off-by: Robbie Harwood <rharwood@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Jarkko Sakkinen <jarkko@kernel.org> cc: Eric Biederman <ebiederm@xmission.com> cc: Herbert Xu <herbert@gondor.apana.org.au> cc: keyrings@vger.kernel.org cc: linux-crypto@vger.kernel.org cc: kexec@lists.infradead.org Link: https://lore.kernel.org/r/20230220171254.592347-3-rharwood@redhat.com/ # v2	2023-03-21 16:23:56 +00:00

... 3 4 5 6 7 ...

1170818 Commits