linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-17 09:14:19 +08:00

Author	SHA1	Message	Date
Tung Nguyen	ceb1eb2fb6	tipc: fix memory leak caused by tipc_buf_append() Commit `ed42989eab` ("tipc: fix the skb_unshare() in tipc_buf_append()") replaced skb_unshare() with skb_copy() to not reduce the data reference counter of the original skb intentionally. This is not the correct way to handle the cloned skb because it causes memory leak in 2 following cases: 1/ Sending multicast messages via broadcast link The original skb list is cloned to the local skb list for local destination. After that, the data reference counter of each skb in the original list has the value of 2. This causes each skb not to be freed after receiving ACK: tipc_link_advance_transmq() { ... /* release skb */ __skb_unlink(skb, &l->transmq); kfree_skb(skb); <-- memory exists after being freed } 2/ Sending multicast messages via replicast link Similar to the above case, each skb cannot be freed after purging the skb list: tipc_mcast_xmit() { ... __skb_queue_purge(pkts); <-- memory exists after being freed } This commit fixes this issue by using skb_unshare() instead. Besides, to avoid use-after-free error reported by KASAN, the pointer to the fragment is set to NULL before calling skb_unshare() to make sure that the original skb is not freed after freeing the fragment 2 times in case skb_unshare() returns NULL. Fixes: `ed42989eab` ("tipc: fix the skb_unshare() in tipc_buf_append()") Acked-by: Jon Maloy <jmaloy@redhat.com> Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-29 09:51:52 -07:00
Magnus Karlsson	e5e1a4bc91	xsk: Fix possible memory leak at socket close Fix a possible memory leak at xsk socket close that is caused by the refcounting of the umem object being wrong. The reference count of the umem was decremented only after the pool had been freed. Note that if the buffer pool is destroyed, it is important that the umem is destroyed after the pool, otherwise the umem would disappear while the driver is still running. And as the buffer pool needs to be destroyed in a work queue, the umem is also (if its refcount reaches zero) destroyed after the buffer pool in that same work queue. What was missing is that the refcount also needs to be decremented when the pool is not freed and when the pool has not even been created. The first case happens when the refcount of the pool is higher than 1, i.e. it is still being used by some other socket using the same device and queue id. In this case, it is safe to decrement the refcount of the umem outside of the work queue as the umem will never be freed because the refcount of the umem is always greater than or equal to the refcount of the buffer pool. The second case is if the buffer pool has not been created yet, i.e. the socket was closed before it was bound but after the umem was created. In this case, it is safe to destroy the umem outside of the work queue, since there is no pool that can use it by definition. Fixes: `1c1efc2af1` ("xsk: Create and free buffer pool independently from umem") Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com	2020-10-29 15:19:56 +01:00
Jason Gunthorpe	071ba4cc55	RDMA: Add rdma_connect_locked() There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the handler triggers a completion and another thread does rdma_connect() or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handler's are invoked this is already held by the core code. This causes ULPs using the 2nd method to deadlock. Provide a rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: `2a7cec5381` ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2020-10-28 09:14:49 -03:00
Leon Romanovsky	d6535dca28	net: protect tcf_block_unbind with block lock The tcf_block_unbind() expects that the caller will take block->cb_lock before calling it, however the code took RTNL lock and dropped cb_lock instead. This causes to the following kernel panic. WARNING: CPU: 1 PID: 13524 at net/sched/cls_api.c:1488 tcf_block_unbind+0x2db/0x420 Modules linked in: mlx5_ib mlx5_core mlxfw ptp pps_core act_mirred act_tunnel_key cls_flower vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm ib_uverbs ib_core overlay [last unloaded: mlxfw] CPU: 1 PID: 13524 Comm: test-ecmp-add-v Tainted: G W 5.9.0+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:tcf_block_unbind+0x2db/0x420 Code: ff 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8d bc 24 30 01 00 00 be ff ff ff ff e8 7d 7f 70 00 85 c0 0f 85 7b fd ff ff <0f> 0b e9 74 fd ff ff 48 c7 c7 dc 6a 24 84 e8 02 ec fe fe e9 55 fd RSP: 0018:ffff888117d17968 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88812f713c00 RCX: 1ffffffff0848d5b RDX: 0000000000000001 RSI: ffff88814fbc8130 RDI: ffff888107f2b878 RBP: 1ffff11022fa2f3f R08: 0000000000000000 R09: ffffffff84115a87 R10: fffffbfff0822b50 R11: ffff888107f2b898 R12: ffff88814fbc8000 R13: ffff88812f713c10 R14: ffff888117d17a38 R15: ffff88814fbc80c0 FS: 00007f6593d36740(0000) GS:ffff8882a4f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005607a00758f8 CR3: 0000000131aea006 CR4: 0000000000170ea0 Call Trace: tc_block_indr_cleanup+0x3e0/0x5a0 ? tcf_block_unbind+0x420/0x420 ? __mutex_unlock_slowpath+0xe7/0x610 flow_indr_dev_unregister+0x5e2/0x930 ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core] ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core] ? flow_indr_block_cb_alloc+0x3c0/0x3c0 ? mlx5_db_free+0x37c/0x4b0 [mlx5_core] mlx5e_cleanup_rep_tx+0x8b/0xc0 [mlx5_core] mlx5e_detach_netdev+0xe5/0x120 [mlx5_core] mlx5e_vport_rep_unload+0x155/0x260 [mlx5_core] esw_offloads_disable+0x227/0x2b0 [mlx5_core] mlx5_eswitch_disable_locked.cold+0x38e/0x699 [mlx5_core] mlx5_eswitch_disable+0x94/0xf0 [mlx5_core] mlx5_device_disable_sriov+0x183/0x1f0 [mlx5_core] mlx5_core_sriov_configure+0xfd/0x230 [mlx5_core] sriov_numvfs_store+0x261/0x2f0 ? sriov_drivers_autoprobe_store+0x110/0x110 ? sysfs_file_ops+0x170/0x170 ? sysfs_file_ops+0x117/0x170 ? sysfs_file_ops+0x170/0x170 kernfs_fop_write+0x1ff/0x3f0 ? rcu_read_lock_any_held+0x6e/0x90 vfs_write+0x1f3/0x620 ksys_write+0xf9/0x1d0 ? __x64_sys_read+0xb0/0xb0 ? lockdep_hardirqs_on_prepare+0x273/0x3f0 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 <...> ---[ end trace bfdd028ada702879 ]--- Fixes: `0fdcf78d59` ("net: use flow_indr_dev_setup_offload()") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20201026123327.1141066-1-leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-27 17:58:36 -07:00
Guillaume Nault	501b72ae24	net/sched: act_mpls: Add softdep on mpls_gso.ko TCA_MPLS_ACT_PUSH and TCA_MPLS_ACT_MAC_PUSH might be used on gso packets. Such packets will thus require mpls_gso.ko for segmentation. v2: Drop dependency on CONFIG_NET_MPLS_GSO in Kconfig (from Jakub and David). Fixes: `2a2ea50870` ("net: sched: add mpls manipulation actions to TC") Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/1f6cab15bbd15666795061c55563aaf6a386e90e.1603708007.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-27 17:17:06 -07:00
Dan Carpenter	0d8cb9464a	devlink: Unlock on error in dumpit() This needs to unlock before returning. Fixes: `544e7c33ec` ("net: devlink: Add support for port regions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20201026080127.GB1628785@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-27 17:05:57 -07:00
Dan Carpenter	6c211809c8	devlink: Fix some error codes These paths don't set the error codes. It's especially important in devlink_nl_region_notify_build() where it leads to a NULL dereference in the caller. Fixes: `544e7c33ec` ("net: devlink: Add support for port regions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20201026080059.GA1628785@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-27 17:05:57 -07:00
Karsten Graul	96d6fded95	net/smc: fix suppressed return code The patch that repaired the invalid return code in smcd_new_buf_create() missed to take care of errno ENOSPC which has a special meaning that no more DMBEs can be registered on the device. Fix that by keeping this errno value during the translation of the return code. Fixes: `6b1bbf94ab` ("net/smc: fix invalid return code in smcd_new_buf_create()") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-26 16:29:14 -07:00
Karsten Graul	4a9baf45fd	net/smc: fix null pointer dereference in smc_listen_decline() smc_listen_work() calls smc_listen_decline() on label out_decl, providing the ini pointer variable. But this pointer can still be null when the label out_decl is reached. Fix this by checking the ini variable in smc_listen_work() and call smc_listen_decline() with the result directly. Fixes: `a7c9c5f4af` ("net/smc: CLC accept / confirm V2") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-26 16:29:14 -07:00
Jeff Vander Stoep	af545bb5ee	vsock: use ns_capable_noaudit() on socket create During __vsock_create() CAP_NET_ADMIN is used to determine if the vsock_sock->trusted should be set to true. This value is used later for determing if a remote connection should be allowed to connect to a restricted VM. Unfortunately, if the caller doesn't have CAP_NET_ADMIN, an audit message such as an selinux denial is generated even if the caller does not want a trusted socket. Logging errors on success is confusing. To avoid this, switch the capable(CAP_NET_ADMIN) check to the noaudit version. Reported-by: Roman Kiryanov <rkir@google.com> https://android-review.googlesource.com/c/device/generic/goldfish/+/1468545/ Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/r/20201023143757.377574-1-jeffv@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-26 16:22:42 -07:00
Eric Biggers	23224e4500	mm: remove kzfree() compatibility definition Commit `453431a549` ("mm, treewide: rename kzfree() to kfree_sensitive()") renamed kzfree() to kfree_sensitive(), but it left a compatibility definition of kzfree() to avoid being too disruptive. Since then a few more instances of kzfree() have slipped in. Just get rid of them and remove the compatibility definition once and for all. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-25 11:39:02 -07:00
Willy Tarreau	3744741ada	random32: add noise from network and scheduling activity With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak. This patch restores the spirit of commit `f227e3ec3b` ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's statee using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation. The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu>	2020-10-24 20:21:57 +02:00
Arjun Roy	435ccfa894	tcp: Prevent low rmem stalls with SO_RCVLOWAT. With SO_RCVLOWAT, under memory pressure, it is possible to enter a state where: 1. We have not received enough bytes to satisfy SO_RCVLOWAT. 2. We have not entered buffer pressure (see tcp_rmem_pressure()). 3. But, we do not have enough buffer space to accept more packets. In this case, we advertise 0 rwnd (due to #3) but the application does not drain the receive queue (no wakeup because of #1 and #2) so the flow stalls. Modify the heuristic for SO_RCVLOWAT so that, if we are advertising rwnd<=rcv_mss, force a wakeup to prevent a stall. Without this patch, setting tcp_rmem to 6143 and disabling TCP autotune causes a stalled flow. With this patch, no stall occurs. This is with RPC-style traffic with large messages. Fixes: `03f45c883c` ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Arjun Roy <arjunroy@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-23 19:11:20 -07:00
Linus Torvalds	3cb12d27ff	Fixes for 5.10-rc1 from the networking tree: Cross-tree/merge window issues: - rtl8150: don't incorrectly assign random MAC addresses; fix late in the 5.9 cycle started depending on a return code from a function which changed with the 5.10 PR from the usb subsystem Current release - regressions: - Revert "virtio-net: ethtool configurable RXCSUM", it was causing crashes at probe when control vq was not negotiated/available Previous releases - regressions: - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO bus, only first device would be probed correctly - nexthop: Fix performance regression in nexthop deletion by effectively switching from recently added synchronize_rcu() to synchronize_rcu_expedited() - netsec: ignore 'phy-mode' device property on ACPI systems; the property is not populated correctly by the firmware, but firmware configures the PHY so just keep boot settings Previous releases - always broken: - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing bulk transfers getting "stuck" - icmp: randomize the global rate limiter to prevent attackers from getting useful signal - r8169: fix operation under forced interrupt threading, make the driver always use hard irqs, even on RT, given the handler is light and only wants to schedule napi (and do so through a _irqoff() variant, preferably) - bpf: Enforce pointer id generation for all may-be-null register type to avoid pointers erroneously getting marked as null-checked - tipc: re-configure queue limit for broadcast link - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels - fix various issues in chelsio inline tls driver Misc: - bpf: improve just-added bpf_redirect_neigh() helper api to support supplying nexthop by the caller - in case BPF program has already done a lookup we can avoid doing another one - remove unnecessary break statements - make MCTCP not select IPV6, but rather depend on it Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+R+5UACgkQMUZtbf5S Irt9KxAAiYme2aSvMOni0NQsOgQ5mVsy7tk0/4dyRqkAx0ggrfGcFuhgZYNm8ZKY KoQsQyn30Wb/2wAp1vX2I4Fod67rFyBfQg/8iWiEAu47X7Bj1lpPPJexSPKhF9/X e0TuGxZtoaDuV9C3Su/FOjRmnShGSFQu1SCyJThshwaGsFL3YQ0Ut07VRgRF8x05 A5fy2SVVIw0JOQgV1oH0GP5oEK3c50oGnaXt8emm56PxVIfAYY0oq69hQUzrfMFP zV9R0XbnbCIibT8R3lEghjtXavtQTzK5rYDKazTeOyDU87M+yuykNYj7MhgDwl9Q UdJkH2OpMlJylEH3asUjz/+ObMhXfOuj/ZS3INtO5omBJx7x76egDZPMQe4wlpcC NT5EZMS7kBdQL8xXDob7hXsvFpuEErSUGruYTHp4H52A9ke1dRTH2kQszcKk87V3 s+aVVPtJ5bHzF3oGEvfwP0DFLTF6WvjD0Ts0LmTY2DhpE//tFWV37j60Ni5XU21X fCPooihQbLOsq9D8zc0ydEvCg2LLWMXM5ovCkqfIAJzbGVYhnxJSryZwpOlKDS0y LiUmLcTZDoNR/szx0aJhVHdUUVgXDX/GsllHoc1w7ZvDRMJn40K+xnaF3dSMwtIl imhfc5pPi6fdBgjB0cFYRPfhwiwlPMQ4YFsOq9JvynJzmt6P5FQ= =ceke -----END PGP SIGNATURE----- Merge tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Cross-tree/merge window issues: - rtl8150: don't incorrectly assign random MAC addresses; fix late in the 5.9 cycle started depending on a return code from a function which changed with the 5.10 PR from the usb subsystem Current release regressions: - Revert "virtio-net: ethtool configurable RXCSUM", it was causing crashes at probe when control vq was not negotiated/available Previous release regressions: - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO bus, only first device would be probed correctly - nexthop: Fix performance regression in nexthop deletion by effectively switching from recently added synchronize_rcu() to synchronize_rcu_expedited() - netsec: ignore 'phy-mode' device property on ACPI systems; the property is not populated correctly by the firmware, but firmware configures the PHY so just keep boot settings Previous releases - always broken: - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing bulk transfers getting "stuck" - icmp: randomize the global rate limiter to prevent attackers from getting useful signal - r8169: fix operation under forced interrupt threading, make the driver always use hard irqs, even on RT, given the handler is light and only wants to schedule napi (and do so through a _irqoff() variant, preferably) - bpf: Enforce pointer id generation for all may-be-null register type to avoid pointers erroneously getting marked as null-checked - tipc: re-configure queue limit for broadcast link - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels - fix various issues in chelsio inline tls driver Misc: - bpf: improve just-added bpf_redirect_neigh() helper api to support supplying nexthop by the caller - in case BPF program has already done a lookup we can avoid doing another one - remove unnecessary break statements - make MCTCP not select IPV6, but rather depend on it" * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) tcp: fix to update snd_wl1 in bulk receiver fast path net: Properly typecast int values to set sk_max_pacing_rate netfilter: nf_fwd_netdev: clear timestamp in forwarding path ibmvnic: save changed mac address to adapter->mac_addr selftests: mptcp: depends on built-in IPv6 Revert "virtio-net: ethtool configurable RXCSUM" rtnetlink: fix data overflow in rtnl_calcit() net: ethernet: mtk-star-emac: select REGMAP_MMIO net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh() bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop mptcp: depends on IPV6 but not as a module sfc: move initialisation of efx->filter_sem to efx_init_struct() mpls: load mpls_gso after mpls_iptunnel net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() net: dsa: bcm_sf2: make const array static, makes object smaller mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it ...	2020-10-23 12:05:49 -07:00
zhuoliang zhang	a779d91314	net: xfrm: fix a race condition during allocing spi we found that the following race condition exists in xfrm_alloc_userspi flow: user thread state_hash_work thread ---- ---- xfrm_alloc_userspi() __find_acq_core() /alloc new xfrm_state:x/ xfrm_state_alloc() /schedule state_hash_work thread/ xfrm_hash_grow_check() xfrm_hash_resize() xfrm_alloc_spi /hold lock/ x->id.spi = htonl(spi) spin_lock_bh(&net->xfrm.xfrm_state_lock) /waiting lock release/ xfrm_hash_transfer() spin_lock_bh(&net->xfrm.xfrm_state_lock) /add x into hlist:net->xfrm.state_byspi/ hlist_add_head_rcu(&x->byspi) spin_unlock_bh(&net->xfrm.xfrm_state_lock) /add x into hlist:net->xfrm.state_byspi 2 times/ hlist_add_head_rcu(&x->byspi) 1. a new state x is alloced in xfrm_state_alloc() and added into the bydst hlist in __find_acq_core() on the LHS; 2. on the RHS, state_hash_work thread travels the old bydst and tranfers every xfrm_state (include x) into the new bydst hlist and new byspi hlist; 3. user thread on the LHS gets the lock and adds x into the new byspi hlist again. So the same xfrm_state (x) is added into the same list_hash (net->xfrm.state_byspi) 2 times that makes the list_hash become an inifite loop. To fix the race, x->id.spi = htonl(spi) in the xfrm_alloc_spi() is moved to the back of spin_lock_bh, sothat state_hash_work thread no longer add x which id.spi is zero into the hash_list. Fixes: `f034b5d4ef` ("[XFRM]: Dynamic xfrm_state hash table sizing.") Signed-off-by: zhuoliang zhang <zhuoliang.zhang@mediatek.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2020-10-23 09:08:55 +02:00
Neal Cardwell	18ded910b5	tcp: fix to update snd_wl1 in bulk receiver fast path In the header prediction fast path for a bulk data receiver, if no data is newly acknowledged then we do not call tcp_ack() and do not call tcp_ack_update_window(). This means that a bulk receiver that receives large amounts of data can have the incoming sequence numbers wrap, so that the check in tcp_may_update_window fails: after(ack_seq, tp->snd_wl1) If the incoming receive windows are zero in this state, and then the connection that was a bulk data receiver later wants to send data, that connection can find itself persistently rejecting the window updates in incoming ACKs. This means the connection can persistently fail to discover that the receive window has opened, which in turn means that the connection is unable to send anything, and the connection's sending process can get permanently "stuck". The fix is to update snd_wl1 in the header prediction fast path for a bulk data receiver, so that it keeps up and does not see wrapping problems. This fix is based on a very nice and thorough analysis and diagnosis by Apollon Oikonomopoulos (see link below). This is a stable candidate but there is no Fixes tag here since the bug predates current git history. Just for fun: looks like the bug dates back to when header prediction was added in Linux v2.1.8 in Nov 1996. In that version tcp_rcv_established() was added, and the code only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer: receiver" code path it does not call tcp_ack(). This fix seems to apply cleanly at least as far back as v3.2. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Apollon Oikonomopoulos <apoikos@dmesg.gr> Tested-by: Apollon Oikonomopoulos <apoikos@dmesg.gr> Link: https://www.spinics.net/lists/netdev/msg692430.html Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-22 12:26:57 -07:00
Ke Li	700465fd33	net: Properly typecast int values to set sk_max_pacing_rate In setsockopt(SO_MAX_PACING_RATE) on 64bit systems, sk_max_pacing_rate, after extended from 'u32' to 'unsigned long', takes unintentionally hiked value whenever assigned from an 'int' value with MSB=1, due to binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes 0xFFFFFFFF80000000. Thus inflated sk_max_pacing_rate causes subsequent getsockopt to return ~0U unexpectedly. It may also result in increased pacing rate. Fix by explicitly casting the 'int' value to 'unsigned int' before assigning it to sk_max_pacing_rate, for zero extension to happen. Fixes: `76a9ebe811` ("net: extend sk_pacing_rate to unsigned long") Signed-off-by: Ji Li <jli@akamai.com> Signed-off-by: Ke Li <keli@akamai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-22 12:18:25 -07:00
Jakub Kicinski	594850ca43	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Update debugging in IPVS tcp protocol handler to make it easier to understand, from longguang.yue 2) Update TCP tracker to deal with keepalive packet after re-registration, from Franceso Ruggeri. 3) Missing IP6SKB_FRAGMENTED from netfilter fragment reassembly, from Georg Kohmann. 4) Fix bogus packet drop in ebtables nat extensions, from Thimothee Cocault. 5) Fix typo in flowtable documentation. 6) Reset skb timestamp in nft_fwd_netdev. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-22 12:00:37 -07:00
Jakub Kicinski	d2775984d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2020-10-22 1) Fix enforcing NULL check in verifier for new helper return types of RET_PTR_TO_{BTF_ID,MEM_OR_BTF_ID}_OR_NULL, from Martin KaFai Lau. 2) Fix bpf_redirect_neigh() helper API before it becomes frozen by adding nexthop information as argument, from Toke Høiland-Jørgensen. 3) Guard & fix compilation of bpf_tail_call_static() when __bpf__ arch is not defined by compiler or clang too old, from Daniel Borkmann. 4) Remove misplaced break after return in attach_type_to_prog_type(), from Tom Rix. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-22 09:51:41 -07:00
Linus Torvalds	24717cfbbb	The one new feature this time, from Anna Schumaker, is READ_PLUS, which has the same arguments as READ but allows the server to return an array of data and hole extents. Otherwise it's a lot of cleanup and bugfixes. -----BEGIN PGP SIGNATURE----- iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl+Q5vsVHGJmaWVsZHNA ZmllbGRzZXMub3JnAAoJECebzXlCjuG+DUAP/RlALnXbaoWi8YCcEcc9U1LoQKbD CJpDR+FqCOyGwRuzWung/5pvkOO50fGEeAroos+2rF/NgRkQq8EFr9AuBhNOYUFE IZhWEOfu/r2ukXyBmcu21HGcWLwPnyJehvjuzTQW2wOHlBi/sdoL5Ap1sVlwVLj5 EZ5kqJLD+ioG2sufW99Spi55l1Cy+3Y0IhLSWl4ZAE6s8hmFSYAJZFsOeI0Afx57 USPTDRaeqjyEULkb+f8IhD0eRApOUo4evDn9dwQx+of7HPa1CiygctTKYwA3hnlc gXp2KpVA1REaiYVgOPwYlnqBmJ2K9X0wCRzcWy2razqEcVAX/2j7QCe9M2mn4DC8 xZ2q4SxgXu9yf0qfUSVnDxWmP6ipqq7OmsG0JXTFseGKBdpjJY1qHhyqanVAGvEg I+xHnnWfGwNCftwyA3mt3RfSFPsbLlSBIMZxvN4kn8aVlqszGITOQvTdQcLYA6kT xWllBf4XKVXMqF0PzerxPDmfzBfhx6b1VPWOIVcu7VLBg3IXoEB2G5xG8MUJiSch OUTCt41LUQkerQlnzaZYqwmFdSBfXJefmcE/x/vps4VtQ/fPHX1jQyD7iTu3HfSP bRlkKHvNVeTodlBDe/HTPiTA99MShhBJyvtV5wfzIqwjc1cNreed+ePppxn8mxJi SmQ2uZk/MpUl7/V0 =rcOj -----END PGP SIGNATURE----- Merge tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "The one new feature this time, from Anna Schumaker, is READ_PLUS, which has the same arguments as READ but allows the server to return an array of data and hole extents. Otherwise it's a lot of cleanup and bugfixes" * tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits) NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() sunrpc: raise kernel RPC channel buffer size svcrdma: fix bounce buffers for unaligned offsets and multiple pages nfsd: remove unneeded break net/sunrpc: Fix return value for sysctl sunrpc.transports NFSD: Encode a full READ_PLUS reply NFSD: Return both a hole and a data segment NFSD: Add READ_PLUS hole segment encoding NFSD: Add READ_PLUS data support NFSD: Hoist status code encoding into XDR encoder functions NFSD: Map nfserr_wrongsec outside of nfsd_dispatch NFSD: Remove the RETURN_STATUS() macro NFSD: Call NFSv2 encoders on error returns NFSD: Fix .pc_release method for NFSv2 NFSD: Remove vestigial typedefs NFSD: Refactor nfsd_dispatch() error paths NFSD: Clean up nfsd_dispatch() variables NFSD: Clean up stale comments in nfsd_dispatch() NFSD: Clean up switch statement in nfsd_dispatch() ...	2020-10-22 09:44:27 -07:00
Linus Torvalds	334d431f65	9p pull request for inclusion in 5.10 A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized variable warning) and code cleanup (xen) -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl+Rc/EACgkQq06b7GqY 5nACqxAAnrPj6uA4IQGCQCbRKetkidZrIGre/8RyEAyfxj+8gV0POYp/1SFokAu+ RzNkeDgKXkV8L/6BEc/XTi/Iy2HGlPOQLjpBcDLwOIe+ZSlXYK544zUqs20W8/ZH NHtUY9rvgEWaTyYCi0wjS51xqBs2/Km4OIUKj7OZHXx9rWmpPX/6xwgcR4tNAWEJ 7AYkq9G0OiDbY9TaHRK0dOmVq1Xs+H3s5Ci4FMh/uFcQBI3UgfpYAegqTiMh8/AF aWgtMdidHaBIWdzb1W4UbBXVd14dFfcIqMTzstFO+mElO4VtQCAQtC16cJX9Dqgh 2YQxZgF5uVAi3DogHLDT0iT6RiqJrU8WeDLfOq8AuOy29l5jlGB6y1GppmS4z0DK sw+fy0lMH9q2ahbJs3RMk05m9vK8L9HCjqnMbTD0TohtkEEWyhHJ2RadtkDnRAL+ Vy+tlfENloVW4zJIFT8RI5pLXqLG4NGV5Tlp5D//5abV1V8uEeVFHe97JTBMsafi FAGWqghPPxRB7UhtbnNHiIGDClOctT9y+SY0mxIkq0Bw+26K5G/eXWwF+Wgelfrs n0nCrbi5rVdJJhcrZid4r/tUDpXydxgQmrXV8jsgpHH22B02WXKtAzcgWJsB9vec syicGh9Shr4cF1S0Rq/Brgl47WGwHjufsvWSwnukvZ1+X4Rs9Qw= =ONeq -----END PGP SIGNATURE----- Merge tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux Pull 9p updates from Dominique Martinet: "A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized variable warning) and code cleanup (xen)" * tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux: net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid 9p/xen: Fix format argument warning 9P: Cast to loff_t before multiplying	2020-10-22 09:33:20 -07:00
Pablo Neira Ayuso	c77761c8a5	netfilter: nf_fwd_netdev: clear timestamp in forwarding path Similar to `7980d2eabd` ("ipvs: clear skb->tstamp in forwarding path"). fq qdisc requires tstamp to be cleared in forwarding path. Fixes: `8203e2d844` ("net: clear skb->tstamp in forwarding paths") Fixes: `fb420d5d91` ("tcp/fq: move back to CLOCK_MONOTONIC") Fixes: `80b14dee2b` ("net: Add a new socket option for a future transmit time.") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-22 14:49:36 +02:00
Di Zhu	ebfe3c5183	rtnetlink: fix data overflow in rtnl_calcit() "ip addr show" command execute error when we have a physical network card with a large number of VFs The return value of if_nlmsg_size() in rtnl_calcit() will exceed range of u16 data type when any network cards has a larger number of VFs. rtnl_vfinfo_size() will significant increase needed dump size when the value of num_vfs is larger. Eventually we get a wrong value of min_ifinfo_dump_size because of overflow which decides the memory size needed by netlink dump and netlink_dump() will return -EMSGSIZE because of not enough memory was allocated. So fix it by promoting min_dump_alloc data type to u32 to avoid whole netlink message size overflow and it's also align with the data type of struct netlink_callback{}.min_dump_alloc which is assigned by return value of rtnl_calcit() Signed-off-by: Di Zhu <zhudi21@huawei.com> Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-21 18:24:08 -07:00
Toke Høiland-Jørgensen	ba452c9e99	bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop Based on the discussion in [0], update the bpf_redirect_neigh() helper to accept an optional parameter specifying the nexthop information. This makes it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without incurring a duplicate FIB lookup - since the FIB lookup helper will return the nexthop information even if no neighbour is present, this can simply be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen. [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk	2020-10-22 01:28:54 +02:00
Linus Torvalds	ed7cfefe44	We have: - a patch that removes crush_workspace_mutex (myself). CRUSH computations are no longer serialized and can run in parallel. - a couple new filesystem client metrics for "ceph fs top" command (Xiubo Li) - a fix for a very old messenger bug that affected the filesystem, marked for stable (myself) - assorted fixups and cleanups throughout the codebase from Jeff and others. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl+QN8cTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi4iwCACLwUvO0ONTzUUb2D8ftfyXjo/jIod5 eO7RHfCHOP2A83GQpdYdktVDSVeOUIOiPhxAxo1dL4GieI2/saXrnoevam7ogZkA OmR4drdtRVUqF/aATrtjiDMg2ge0dnx5gfjMxSP/FiPPpXOdrtxng/7tv8yo+03q AlMqg/YcxO06t1M1qh9SEyfzjcHyPnJU2i0ienngxnGxQ7QiMOR6anF1LNhtN803 4fTBX2tLqDNAa+x5yF1kKSn9OmFNhc0oUsqef+Ck0Vw1LC0/SuxzE2J904iehAwy /HzJCer0zcp+eZhn3HhoHmp0UZpOC1dzMxa3CHyRJFrxCSTEhY8iqPiJ =wGr7 -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: - a patch that removes crush_workspace_mutex (myself). CRUSH computations are no longer serialized and can run in parallel. - a couple new filesystem client metrics for "ceph fs top" command (Xiubo Li) - a fix for a very old messenger bug that affected the filesystem, marked for stable (myself) - assorted fixups and cleanups throughout the codebase from Jeff and others. * tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits) libceph: clear con->out_msg on Policy::stateful_server faults libceph: format ceph_entity_addr nonces as unsigned libceph: fix ENTITY_NAME format suggestion libceph: move a dout in queue_con_delay() ceph: comment cleanups and clarifications ceph: break up send_cap_msg ceph: drop separate mdsc argument from __send_cap ceph: promote to unsigned long long before shifting ceph: don't SetPageError on readpage errors ceph: mark ceph_fmt_xattr() as printf-like for better type checking ceph: fold ceph_update_writeable_page into ceph_write_begin ceph: fold ceph_sync_writepages into writepage_nounlock ceph: fold ceph_sync_readpages into ceph_readpage ceph: don't call ceph_update_writeable_page from page_mkwrite ceph: break out writeback of incompatible snap context to separate function ceph: add a note explaining session reject error string libceph: switch to the new "osd blocklist add" command libceph, rbd, ceph: "blacklist" -> "blocklist" ceph: have ceph_writepages_start call pagevec_lookup_range_tag ceph: use kill_anon_super helper ...	2020-10-21 10:34:10 -07:00
Matthieu Baerts	0ed37ac586	mptcp: depends on IPV6 but not as a module Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP IPv6' support also depends on CONFIG_IPV6. But not all functions from IPv6 code are exported. To simplify the code and reduce modifications outside MPTCP, it was decided from the beginning to support MPTCP with IPv6 only if CONFIG_IPV6 was built inlined. That's also why CONFIG_MPTCP_IPV6 was created. More modifications are needed to support CONFIG_IPV6=m. Even if it was not explicit, until recently, we were forcing CONFIG_IPV6 to be built-in because we had "select IPV6" in Kconfig. Now that we have "depends on IPV6", we have to explicitly set "IPV6=y" to force CONFIG_IPV6 not to be built as a module. In other words, we can now only have CONFIG_MPTCP_IPV6=y if CONFIG_IPV6=y. Note that the new dependency might hide the fact IPv6 is not supported in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did before was forcing it to be built-in while it was maybe not what the user wants. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: `010b430d5d` ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-21 08:05:40 -07:00
Alexander Ovechkin	b7c24497ba	mpls: load mpls_gso after mpls_iptunnel mpls_iptunnel is used only for mpls encapsuation, and if encaplusated packet is larger than MTU we need mpls_gso for segmentation. Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/r/20201020114333.26866-1-ovov@yandex-team.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 21:16:45 -07:00
Davide Caratti	a7a12b5a0f	net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels the following command # tc action add action tunnel_key \ > set src_ip 2001:db8::1 dst_ip 2001:db8::2 id 10 erspan_opts 1:6789:0:0 generates the following splat: BUG: KASAN: slab-out-of-bounds in tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key] Write of size 4 at addr ffff88813f5f1cc8 by task tc/873 CPU: 2 PID: 873 Comm: tc Not tainted 5.9.0+ #282 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 Call Trace: dump_stack+0x99/0xcb print_address_description.constprop.7+0x1e/0x230 kasan_report.cold.13+0x37/0x7c tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key] tunnel_key_init+0x160c/0x1f40 [act_tunnel_key] tcf_action_init_1+0x5b5/0x850 tcf_action_init+0x15d/0x370 tcf_action_add+0xd9/0x2f0 tc_ctl_action+0x29b/0x3a0 rtnetlink_rcv_msg+0x341/0x8d0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f872a96b338 Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55 RSP: 002b:00007ffffe367518 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 000000005f8f5aed RCX: 00007f872a96b338 RDX: 0000000000000000 RSI: 00007ffffe367580 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000001c R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001 R13: 0000000000686760 R14: 0000000000000601 R15: 0000000000000000 Allocated by task 873: kasan_save_stack+0x19/0x40 __kasan_kmalloc.constprop.7+0xc1/0xd0 __kmalloc+0x151/0x310 metadata_dst_alloc+0x20/0x40 tunnel_key_init+0xfff/0x1f40 [act_tunnel_key] tcf_action_init_1+0x5b5/0x850 tcf_action_init+0x15d/0x370 tcf_action_add+0xd9/0x2f0 tc_ctl_action+0x29b/0x3a0 rtnetlink_rcv_msg+0x341/0x8d0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff88813f5f1c00 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 200 bytes inside of 256-byte region [ffff88813f5f1c00, ffff88813f5f1d00) The buggy address belongs to the page: page:0000000011b48a19 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13f5f0 head:0000000011b48a19 order:1 compound_mapcount:0 flags: 0x17ffffc0010200(slab\|head) raw: 0017ffffc0010200 0000000000000000 0000000d00000001 ffff888107c43400 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88813f5f1b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88813f5f1c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88813f5f1c80: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc ^ ffff88813f5f1d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88813f5f1d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc using IPv6 tunnels, act_tunnel_key allocates a fixed amount of memory for the tunnel metadata, but then it expects additional bytes to store tunnel specific metadata with tunnel_key_copy_opts(). Fix the arguments of __ipv6_tun_set_dst(), so that 'md_size' contains the size previously computed by tunnel_key_get_opts_len(), like it's done for IPv4 tunnels. Fixes: `0ed5269f9e` ("net/sched: add tunnel option support to act_tunnel_key") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/36ebe969f6d13ff59912d6464a4356fe6f103766.1603231100.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 21:10:41 -07:00
Guillaume Nault	b130762161	net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() We need to jump to the "err_out_locked" label when tcf_gate_get_entries() fails. Otherwise, tc_setup_flow_action() exits with ->tcfa_lock still held. Fixes: `d29bdd69ec` ("net: schedule: add action gate offloading") Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/12f60e385584c52c22863701c0185e40ab08a7a7.1603207948.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 21:00:52 -07:00
Geert Uytterhoeven	010b430d5d	mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it MPTCP_IPV6 selects IPV6, thus enabling an optional feature the user may not want to enable. Fix this by making MPTCP_IPV6 depend on IPV6, like is done for all other IPv6 features. Fixes: `f870fa0b57` ("mptcp: Add MPTCP socket stubs") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20201020073839.29226-1-geert@linux-m68k.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 20:54:08 -07:00
Defang Bo	280e3ebdaf	nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME attribute in nfc_genl_fw_download() Check that the NFC_ATTR_FIRMWARE_NAME attributes are provided by the netlink client prior to accessing them.This prevents potential unhandled NULL pointer dereference exceptions which can be triggered by malicious user-mode programs, if they omit one or both of these attributes. Similar to commit `a0323b979f` ("nfc: Ensure presence of required attributes in the activate_target handler"). Fixes: `9674da8759` ("NFC: Add firmware upload netlink command") Signed-off-by: Defang Bo <bodefang@126.com> Link: https://lore.kernel.org/r/1603107538-4744-1-git-send-email-bodefang@126.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 17:06:22 -07:00
Geert Uytterhoeven	b142083b58	mptcp: MPTCP_KUNIT_TESTS should depend on MPTCP instead of selecting it MPTCP_KUNIT_TESTS selects MPTCP, thus enabling an optional feature the user may not want to enable. Fix this by making the test depend on MPTCP instead. Fixes: `a00a582203` ("mptcp: move crypto test to KUNIT") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20201019113240.11516-1-geert@linux-m68k.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 16:50:06 -07:00
Geliang Tang	65b8c8a620	mptcp: move mptcp_options_received's port initialization Move mptcp_options_received's port initialization from mptcp_parse_option to mptcp_get_options, put it together with the other fields initializations of mptcp_options_received. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 16:32:36 -07:00
Geliang Tang	fe2d9b1a0e	mptcp: initialize mptcp_options_received's ahmac Initialize mptcp_options_received's ahmac to zero, otherwise it will be a random number when receiving ADD_ADDR suboption with echo-flag=1. Fixes: `3df523ab58` ("mptcp: Add ADD_ADDR handling") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 16:32:14 -07:00
Roi Dayan	47b5d2a107	net/sched: act_ct: Fix adding udp port mangle operation Need to use the udp header type and not tcp. Fixes: `9c26ba9b1f` ("net/sched: act_ct: Instantiate flow table entry actions") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Link: https://lore.kernel.org/r/20201019090244.3015186-1-roid@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-20 16:15:51 -07:00
Linus Torvalds	59f0e7eb2f	NFS Client Updates for Linux 5.10 - Stable Fixes: - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+ - Fix nfs_path in case of a rename retry - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag - New features and improvements: - Replace dprintk() calls with tracepoints - Make cache consistency bitmap dynamic - Added support for the NFS v4.2 READ_PLUS operation - Improvements to net namespace uniquifier - Other bugfixes and cleanups - Remove redundant clnt pointer - Don't update timeout values on connection resets - Remove redundant tracepoints - Various cleanups to comments - Fix oops when trying to use copy_file_range with v4.0 source server - Improvements to flexfiles mirrors - Add missing "local_lock=posix" mount option -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl+PJv0ACgkQ18tUv7Cl QOtB+hAAza4BwQSAZs0QbszPcoNV0f0qND99hWcJ2ad5KCr8S5eNYaMqB8G8lbS+ Ahuv6xRy69l2vjHvINXoSaSiXClYNAlAr5myiW0DqMLDpVV8VEMAhqgFJK97VpqQ AsbsdFwC4xAcQ+6nJ+IfK9f8AOYhvARkOP3171qkq0jX3Eq4j1IiQARn+JrGFZFL waC4UtVhCnx5z5pg7jyu7Ar4N0Xky72+Fm1EvRog4gndX2JbNNSkwavD33mWZ8iN 3q6DRq/AtEk9F4ttPwVeALvpNCBqjoGcNw5FCBil7x4V+xIbDI2FBhusKas3GzXm W2tyUuJMTGpA6ZjkyYDcg0jC8vKUiUpMWgT3oZoQ/6g7P4vPzB/XB/RQcAInMRjS /MhWZ3YNQmApnVCIZSwOa9+oNUc69dM0+vqmesLvvb/aNjzRnGwiJNiuWGyrwoGX 3SzpPUcJixa3+eKLi9uEHojKB+SuPIrB/HW4UYh/ZpOMyilVqUnX8RYbaK3qQ/gK Un++LjTmao9cXyjIDCiHJB630W01YSWXq1nwuSYkLuXCuALmmtgA4z8pm+4K+w+G SctETKVcD0qi3Yx5mFdU5A1dHXMB7nzYNaiRWYLvLO5qAR33nUPI4lV8rog2TUPe 2iW4n0j2YxnTlwR3QMPYJRndP7KeR5XvOoNuh3XpKRVbmbc7/A4= =6RM5 -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "Stable Fixes: - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+ - Fix nfs_path in case of a rename retry - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag New features and improvements: - Replace dprintk() calls with tracepoints - Make cache consistency bitmap dynamic - Added support for the NFS v4.2 READ_PLUS operation - Improvements to net namespace uniquifier Other bugfixes and cleanups: - Remove redundant clnt pointer - Don't update timeout values on connection resets - Remove redundant tracepoints - Various cleanups to comments - Fix oops when trying to use copy_file_range with v4.0 source server - Improvements to flexfiles mirrors - Add missing 'local_lock=posix' mount option" * tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits) NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag NFSv4: Fix up RCU annotations for struct nfs_netns_client NFS: Only reference user namespace from nfs4idmap struct instead of cred nfs: add missing "posix" local_lock constant table definition NFSv4: Use the net namespace uniquifier if it is set NFSv4: Clean up initialisation of uniquified client id strings NFS: Decode a full READ_PLUS reply SUNRPC: Add an xdr_align_data() function NFS: Add READ_PLUS hole segment decoding SUNRPC: Add the ability to expand holes in data pages SUNRPC: Split out _shift_data_right_tail() SUNRPC: Split out xdr_realign_pages() from xdr_align_pages() NFS: Add READ_PLUS data segment support NFS: Use xdr_page_pos() in NFSv4 decode_getacl() SUNRPC: Implement a xdr_page_pos() function SUNRPC: Split out a function for setting current page NFS: fix nfs_path in case of a rename retry fs: nfs: return per memcg count for xattr shrinkers NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE nfs: remove incorrect fallthrough label ...	2020-10-20 13:26:30 -07:00
Martijn de Gouw	d48c812474	SUNRPC: fix copying of multiple pages in gss_read_proxy_verf() When the passed token is longer than 4032 bytes, the remaining part of the token must be copied from the rqstp->rq_arg.pages. But the copy must make sure it happens in a consecutive way. With the existing code, the first memcpy copies 'length' bytes from argv->iobase, but since the header is in front, this never fills the whole first page of in_token->pages. The mecpy in the loop copies the following bytes, but starts writing at the next page of in_token->pages. This leaves the last bytes of page 0 unwritten. Symptoms were that users with many groups were not able to access NFS exports, when using Active Directory as the KDC. Signed-off-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com> Fixes: `5866efa8cb` "SUNRPC: Fix svcauth_gss_proxy_init()" Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-10-20 13:21:30 -04:00
Roberto Bergantinos Corpas	27a1e8a0f7	sunrpc: raise kernel RPC channel buffer size Its possible that using AUTH_SYS and mountd manage-gids option a user may hit the 8k RPC channel buffer limit. This have been observed on field, causing unanswered RPCs on clients after mountd fails to write on channel : rpc.mountd[11231]: auth_unix_gid: error writing reply Userland nfs-utils uses a buffer size of 32k (RPC_CHAN_BUF_SIZE), so lets match those two. Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-10-20 13:21:30 -04:00
Saeed Mirzamohammadi	31cc578ae2	netfilter: nftables_offload: KASAN slab-out-of-bounds Read in nft_flow_rule_create This patch fixes the issue due to: BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2 net/netfilter/nf_tables_offload.c:40 Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244 The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds. This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue. Add nft_expr_more() and use it to fix this problem. Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:54 +02:00
Timothée COCAULT	63137bc588	netfilter: ebtables: Fixes dropping of small packets in bridge nat Fixes an error causing small packets to get dropped. skb_ensure_writable expects the second parameter to be a length in the ethernet payload.=20 If we want to write the ethernet header (src, dst), we should pass 0. Otherwise, packets with small payloads (< ETH_ALEN) will get dropped. Fixes: `c1a8311679` ("netfilter: bridge: convert skb_make_writable to skb_ensure_writable") Signed-off-by: Timothée COCAULT <timothee.cocault@orange.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:53 +02:00
Georg Kohmann	68f9f9c2c3	netfilter: Drop fragmented ndisc packets assembled in netfilter Fragmented ndisc packets assembled in netfilter not dropped as specified in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18. Setting IP6SKB_FRAGMENTED flag during reassembly. References: commit `b800c3b966` ("ipv6: drop fragmented ndisc packets by default (RFC 6980)") Signed-off-by: Georg Kohmann <geokohma@cisco.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:53 +02:00
Francesco Ruggeri	4f25434bcc	netfilter: conntrack: connection timeout after re-register If the first packet conntrack sees after a re-register is an outgoing keepalive packet with no data (SEG.SEQ = SND.NXT-1), td_end is set to SND.NXT-1. When the peer correctly acknowledges SND.NXT, tcp_in_window fails check III (Upper bound for valid (s)ack: sack <= receiver.td_end) and returns false, which cascades into nf_conntrack_in setting skb->_nfct = 0 and in later conntrack iptables rules not matching. In cases where iptables are dropping packets that do not match conntrack rules this can result in idle tcp connections to time out. v2: adjust td_end when getting the reply rather than when sending out the keepalive packet. Fixes: `f94e63801a` ("netfilter: conntrack: reset tcp maxwin on re-register") Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:53 +02:00
longguang.yue	79dce09ab0	ipvs: adjust the debug info in function set_tcp_state Outputting client,virtual,dst addresses info when tcp state changes, which makes the connection debug more clear Signed-off-by: longguang.yue <bigclouds@163.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-20 13:54:46 +02:00
Ido Schimmel	df6afe2f7c	nexthop: Fix performance regression in nexthop deletion While insertion of 16k nexthops all using the same netdev ('dummy10') takes less than a second, deletion takes about 130 seconds: # time -p ip -b nexthop.batch real 0.29 user 0.01 sys 0.15 # time -p ip link set dev dummy10 down real 131.03 user 0.06 sys 0.52 This is because of repeated calls to synchronize_rcu() whenever a nexthop is removed from a nexthop group: # /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K ... b'finish_task_switch' b'schedule' b'schedule_timeout' b'wait_for_completion' b'__wait_rcu_gp' b'synchronize_rcu.part.0' b'synchronize_rcu' b'__remove_nexthop' b'remove_nexthop' b'nexthop_flush_dev' b'nh_netdev_event' b'raw_notifier_call_chain' b'call_netdevice_notifiers_info' b'__dev_notify_flags' b'dev_change_flags' b'do_setlink' b'__rtnl_newlink' b'rtnl_newlink' b'rtnetlink_rcv_msg' b'netlink_rcv_skb' b'rtnetlink_rcv' b'netlink_unicast' b'netlink_sendmsg' b'____sys_sendmsg' b'___sys_sendmsg' b'__sys_sendmsg' b'__x64_sys_sendmsg' b'do_syscall_64' b'entry_SYSCALL_64_after_hwframe' - ip (277) 126554955 Since nexthops are always deleted under RTNL, synchronize_net() can be used instead. It will call synchronize_rcu_expedited() which only blocks for several microseconds as opposed to multiple milliseconds like synchronize_rcu(). With this patch deletion of 16k nexthops takes less than a second: # time -p ip link set dev dummy10 down real 0.12 user 0.00 sys 0.04 Tested with fib_nexthops.sh which includes torture tests that prompted the initial change: # ./fib_nexthops.sh ... Tests passed: 134 Tests failed: 0 Fixes: `90f33bffa3` ("nexthops: don't modify published nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-19 20:07:15 -07:00
Christian Eggers	bc7e343dbd	net: dsa: tag_ksz: KSZ8795 and KSZ9477 also use tail tags The Marvell 88E6060 uses tag_trailer.c and the KSZ8795, KSZ9477 and KSZ9893 switches also use tail tags. Fixes: `7a6ffe764b` ("net: dsa: point out the tail taggers") Signed-off-by: Christian Eggers <ceggers@arri.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20201016171603.10587-1-ceggers@arri.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-19 17:32:50 -07:00
Taehee Yoo	0e8b8d6a2d	net: core: use list_del_init() instead of list_del() in netdev_run_todo() dev->unlink_list is reused unless dev is deleted. So, list_del() should not be used. Due to using list_del(), dev->unlink_list can't be reused so that dev->nested_level update logic doesn't work. In order to fix this bug, list_del_init() should be used instead of list_del(). Test commands: ip link add bond0 type bond ip link add bond1 type bond ip link set bond0 master bond1 ip link set bond0 nomaster ip link set bond1 master bond0 ip link set bond1 nomaster Splat looks like: [ 255.750458][ T1030] ============================================ [ 255.751967][ T1030] WARNING: possible recursive locking detected [ 255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted [ 255.754553][ T1030] -------------------------------------------- [ 255.756047][ T1030] ip/1030 is trying to acquire lock: [ 255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150 [ 255.760056][ T1030] [ 255.760056][ T1030] but task is already holding lock: [ 255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.764581][ T1030] [ 255.764581][ T1030] other info that might help us debug this: [ 255.766645][ T1030] Possible unsafe locking scenario: [ 255.766645][ T1030] [ 255.768566][ T1030] CPU0 [ 255.769415][ T1030] ---- [ 255.770259][ T1030] lock(&dev_addr_list_lock_key/1); [ 255.771629][ T1030] lock(&dev_addr_list_lock_key/1); [ 255.772994][ T1030] [ 255.772994][ T1030] * DEADLOCK * [ 255.772994][ T1030] [ 255.775091][ T1030] May be due to missing lock nesting notation [ 255.775091][ T1030] [ 255.777182][ T1030] 2 locks held by ip/1030: [ 255.778299][ T1030] #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0 [ 255.780600][ T1030] #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.783411][ T1030] [ 255.783411][ T1030] stack backtrace: [ 255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772 [ 255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 255.789030][ T1030] Call Trace: [ 255.789850][ T1030] dump_stack+0x99/0xd0 [ 255.790882][ T1030] __lock_acquire.cold.71+0x166/0x3cc [ 255.792285][ T1030] ? register_lock_class+0x1a30/0x1a30 [ 255.793619][ T1030] ? rcu_read_lock_sched_held+0x91/0xc0 [ 255.794963][ T1030] ? rcu_read_lock_bh_held+0xa0/0xa0 [ 255.796246][ T1030] lock_acquire+0x1b8/0x850 [ 255.797332][ T1030] ? dev_mc_sync_multiple+0xc2/0x150 [ 255.798624][ T1030] ? bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.800039][ T1030] ? check_flags+0x50/0x50 [ 255.801143][ T1030] ? lock_contended+0xd80/0xd80 [ 255.802341][ T1030] _raw_spin_lock_nested+0x2e/0x70 [ 255.803592][ T1030] ? dev_mc_sync_multiple+0xc2/0x150 [ 255.804897][ T1030] dev_mc_sync_multiple+0xc2/0x150 [ 255.806168][ T1030] bond_enslave+0x3d58/0x43e0 [bonding] [ 255.807542][ T1030] ? __lock_acquire+0xe53/0x51b0 [ 255.808824][ T1030] ? bond_update_slave_arr+0xdc0/0xdc0 [bonding] [ 255.810451][ T1030] ? check_chain_key+0x236/0x5e0 [ 255.811742][ T1030] ? mutex_is_locked+0x13/0x50 [ 255.812910][ T1030] ? rtnl_is_locked+0x11/0x20 [ 255.814061][ T1030] ? netdev_master_upper_dev_get+0xf/0x120 [ 255.815553][ T1030] do_setlink+0x94c/0x3040 [ ... ] Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com Fixes: `1fc70edb7d` ("net: core: add nested_level variable in net_device") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-18 14:50:25 -07:00
Eelco Chaudron	f981fc3d51	net: openvswitch: fix to make sure flow_lookup() is not preempted The flow_lookup() function uses per CPU variables, which must be called with BH disabled. However, this is fine in the general NAPI use case where the local BH is disabled. But, it's also called from the netlink context. The below patch makes sure that even in the netlink path, the BH is disabled. In addition, u64_stats_update_begin() requires a lock to ensure one writer which is not ensured here. Making it per-CPU and disabling NAPI (softirq) ensures that there is always only one writer. Fixes: `eac87c413b` ("net: openvswitch: reorder masks array based on usage") Reported-by: Juri Lelli <jlelli@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160295903253.7789.826736662555102345.stgit@ebuild Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-18 12:29:36 -07:00
Eric Dumazet	b38e7819ca	icmp: randomize the global rate limiter Keyu Man reported that the ICMP rate limiter could be used by attackers to get useful signal. Details will be provided in an upcoming academic publication. Our solution is to add some noise, so that the attackers no longer can get help from the predictable token bucket limiter. Fixes: `4cdf507d54` ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Keyu Man <kman001@ucr.edu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-16 16:47:09 -07:00
Hoang Huu Le	ec78e31852	tipc: fix incorrect setting window for bcast link In commit `16ad3f4022` ("tipc: introduce variable window congestion control"), we applied the algorithm to select window size from minimum window to the configured maximum window for unicast link, and, besides we chose to keep the window size for broadcast link unchanged and equal (i.e fix window 50) However, when setting maximum window variable via command, the window variable was re-initialized to unexpect value (i.e 32). We fix this by updating the fix window for broadcast as we stated. Fixes: `16ad3f4022` ("tipc: introduce variable window congestion control") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-16 14:09:12 -07:00
Hoang Huu Le	75cee397ae	tipc: re-configure queue limit for broadcast link The queue limit of the broadcast link is being calculated base on initial MTU. However, when MTU value changed (e.g manual changing MTU on NIC device, MTU negotiation etc.,) we do not re-calculate queue limit. This gives throughput does not reflect with the change. So fix it by calling the function to re-calculate queue limit of the broadcast link. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-16 14:09:12 -07:00
Dan Aloni	c327a310ec	svcrdma: fix bounce buffers for unaligned offsets and multiple pages This was discovered using O_DIRECT at the client side, with small unaligned file offsets or IOs that span multiple file pages. Fixes: `e248aa7be8` ("svcrdma: Remove max_sge check at connect time") Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-10-16 15:15:04 -04:00
Artur Molchanov	c09f56b8f6	net/sunrpc: Fix return value for sysctl sunrpc.transports Fix returning value for sysctl sunrpc.transports. Return error code from sysctl proc_handler function proc_do_xprt instead of number of the written bytes. Otherwise sysctl returns random garbage for this key. Since v1: - Handle negative returned value from memory_read_from_buffer as an error Signed-off-by: Artur Molchanov <arturmolchanov@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-10-16 15:15:04 -04:00
Linus Torvalds	9ff9b0d392	networking changes for the 5.10 merge window Add redirect_neigh() BPF packet redirect helper, allowing to limit stack traversal in common container configs and improving TCP back-pressure. Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain. Expand netlink policy support and improve policy export to user space. (Ge)netlink core performs request validation according to declared policies. Expand the expressiveness of those policies (min/max length and bitmasks). Allow dumping policies for particular commands. This is used for feature discovery by user space (instead of kernel version parsing or trial and error). Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge. Allow more than 255 IPv4 multicast interfaces. Add support for Type of Service (ToS) reflection in SYN/SYN-ACK packets of TCPv6. In Multi-patch TCP (MPTCP) support concurrent transmission of data on multiple subflows in a load balancing scenario. Enhance advertising addresses via the RM_ADDR/ADD_ADDR options. Support SMC-Dv2 version of SMC, which enables multi-subnet deployments. Allow more calls to same peer in RxRPC. Support two new Controller Area Network (CAN) protocols - CAN-FD and ISO 15765-2:2016. Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit kernel problem. Add TC actions for implementing MPLS L2 VPNs. Improve nexthop code - e.g. handle various corner cases when nexthop objects are removed from groups better, skip unnecessary notifications and make it easier to offload nexthops into HW by converting to a blocking notifier. Support adding and consuming TCP header options by BPF programs, opening the doors for easy experimental and deployment-specific TCP option use. Reorganize TCP congestion control (CC) initialization to simplify life of TCP CC implemented in BPF. Add support for shipping BPF programs with the kernel and loading them early on boot via the User Mode Driver mechanism, hence reusing all the user space infra we have. Support sleepable BPF programs, initially targeting LSM and tracing. Add bpf_d_path() helper for returning full path for given 'struct path'. Make bpf_tail_call compatible with bpf-to-bpf calls. Allow BPF programs to call map_update_elem on sockmaps. Add BPF Type Format (BTF) support for type and enum discovery, as well as support for using BTF within the kernel itself (current use is for pretty printing structures). Support listing and getting information about bpf_links via the bpf syscall. Enhance kernel interfaces around NIC firmware update. Allow specifying overwrite mask to control if settings etc. are reset during update; report expected max time operation may take to users; support firmware activation without machine reboot incl. limits of how much impact reset may have (e.g. dropping link or not). Extend ethtool configuration interface to report IEEE-standard counters, to limit the need for per-vendor logic in user space. Adopt or extend devlink use for debug, monitoring, fw update in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx, dpaa2-eth). In mlxsw expose critical and emergency SFP module temperature alarms. Refactor port buffer handling to make the defaults more suitable and support setting these values explicitly via the DCBNL interface. Add XDP support for Intel's igb driver. Support offloading TC flower classification and filtering rules to mscc_ocelot switches. Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as fixed interval period pulse generator and one-step timestamping in dpaa-eth. Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3) offload. Add Lynx PHY/PCS MDIO module, and convert various drivers which have this HW to use it. Convert mvpp2 to split PCS. Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as 7-port Mediatek MT7531 IP. Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver, and wcn3680 support in wcn36xx. Improve performance for packets which don't require much offloads on recent Mellanox NICs by 20% by making multiple packets share a descriptor entry. Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto subtree to drivers/net. Move MDIO drivers out of the phy directory. Clean up a lot of W=1 warnings, reportedly the actively developed subsections of networking drivers should now build W=1 warning free. Make sure drivers don't use in_interrupt() to dynamically adapt their code. Convert tasklets to use new tasklet_setup API (sadly this conversion is not yet complete). Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc= =bc1U -----END PGP SIGNATURE----- Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: - Add redirect_neigh() BPF packet redirect helper, allowing to limit stack traversal in common container configs and improving TCP back-pressure. Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain. - Expand netlink policy support and improve policy export to user space. (Ge)netlink core performs request validation according to declared policies. Expand the expressiveness of those policies (min/max length and bitmasks). Allow dumping policies for particular commands. This is used for feature discovery by user space (instead of kernel version parsing or trial and error). - Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge. - Allow more than 255 IPv4 multicast interfaces. - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK packets of TCPv6. - In Multi-patch TCP (MPTCP) support concurrent transmission of data on multiple subflows in a load balancing scenario. Enhance advertising addresses via the RM_ADDR/ADD_ADDR options. - Support SMC-Dv2 version of SMC, which enables multi-subnet deployments. - Allow more calls to same peer in RxRPC. - Support two new Controller Area Network (CAN) protocols - CAN-FD and ISO 15765-2:2016. - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit kernel problem. - Add TC actions for implementing MPLS L2 VPNs. - Improve nexthop code - e.g. handle various corner cases when nexthop objects are removed from groups better, skip unnecessary notifications and make it easier to offload nexthops into HW by converting to a blocking notifier. - Support adding and consuming TCP header options by BPF programs, opening the doors for easy experimental and deployment-specific TCP option use. - Reorganize TCP congestion control (CC) initialization to simplify life of TCP CC implemented in BPF. - Add support for shipping BPF programs with the kernel and loading them early on boot via the User Mode Driver mechanism, hence reusing all the user space infra we have. - Support sleepable BPF programs, initially targeting LSM and tracing. - Add bpf_d_path() helper for returning full path for given 'struct path'. - Make bpf_tail_call compatible with bpf-to-bpf calls. - Allow BPF programs to call map_update_elem on sockmaps. - Add BPF Type Format (BTF) support for type and enum discovery, as well as support for using BTF within the kernel itself (current use is for pretty printing structures). - Support listing and getting information about bpf_links via the bpf syscall. - Enhance kernel interfaces around NIC firmware update. Allow specifying overwrite mask to control if settings etc. are reset during update; report expected max time operation may take to users; support firmware activation without machine reboot incl. limits of how much impact reset may have (e.g. dropping link or not). - Extend ethtool configuration interface to report IEEE-standard counters, to limit the need for per-vendor logic in user space. - Adopt or extend devlink use for debug, monitoring, fw update in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx, dpaa2-eth). - In mlxsw expose critical and emergency SFP module temperature alarms. Refactor port buffer handling to make the defaults more suitable and support setting these values explicitly via the DCBNL interface. - Add XDP support for Intel's igb driver. - Support offloading TC flower classification and filtering rules to mscc_ocelot switches. - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as fixed interval period pulse generator and one-step timestamping in dpaa-eth. - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3) offload. - Add Lynx PHY/PCS MDIO module, and convert various drivers which have this HW to use it. Convert mvpp2 to split PCS. - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as 7-port Mediatek MT7531 IP. - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver, and wcn3680 support in wcn36xx. - Improve performance for packets which don't require much offloads on recent Mellanox NICs by 20% by making multiple packets share a descriptor entry. - Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto subtree to drivers/net. Move MDIO drivers out of the phy directory. - Clean up a lot of W=1 warnings, reportedly the actively developed subsections of networking drivers should now build W=1 warning free. - Make sure drivers don't use in_interrupt() to dynamically adapt their code. Convert tasklets to use new tasklet_setup API (sadly this conversion is not yet complete). * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits) Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH" net, sockmap: Don't call bpf_prog_put() on NULL pointer bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo bpf, sockmap: Add locking annotations to iterator netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements net: fix pos incrementment in ipv6_route_seq_next net/smc: fix invalid return code in smcd_new_buf_create() net/smc: fix valid DMBE buffer sizes net/smc: fix use-after-free of delayed events bpfilter: Fix build error with CONFIG_BPFILTER_UMH cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info bpf: Fix register equivalence tracking. rxrpc: Fix loss of final ack on shutdown rxrpc: Fix bundle counting for exclusive connections netfilter: restore NF_INET_NUMHOOKS ibmveth: Identify ingress large send packets. ibmveth: Switch order of ibmveth_helper calls. cxgb4: handle 4-tuple PEDIT to NAT mode translation selftests: Add VRF route leaking tests ...	2020-10-15 18:42:13 -07:00
Jakub Kicinski	105faa8742	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-10-15 The main changes are: 1) Fix register equivalence tracking in verifier, from Alexei Starovoitov. 2) Fix sockmap error path to not call bpf_prog_put() with NULL, from Alex Dewar. 3) Fix sockmap to add locking annotations to iterator, from Lorenz Bauer. 4) Fix tcp_hdr_options test to use loopback address, from Martin KaFai Lau. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 12:45:00 -07:00
Jakub Kicinski	2295cddf99	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Minor conflicts in net/mptcp/protocol.h and tools/testing/selftests/net/Makefile. In both cases code was added on both sides in the same place so just keep both. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 12:43:21 -07:00
Jakub Kicinski	2ecbc1f684	Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH" This reverts commit `1d273fcc2c`. Alexei points out there's nothing implying headers will be built and therefore exist under usr/include, so this fix doesn't make much sense. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 12:33:24 -07:00
Alex Dewar	83c11c1755	net, sockmap: Don't call bpf_prog_put() on NULL pointer If bpf_prog_inc_not_zero() fails for skb_parser, then bpf_prog_put() is called unconditionally on skb_verdict, even though it may be NULL. Fix and tidy up error path. Fixes: `743df8b774` ("bpf, sockmap: Check skb_verdict and skb_parser programs explicitly") Addresses-Coverity-ID: 1497799: Null pointer dereferences (FORWARD_NULL) Signed-off-by: Alex Dewar <alex.dewar90@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20201012170952.60750-1-alex.dewar90@gmail.com	2020-10-15 21:05:23 +02:00
Lorenz Bauer	f58423aeab	bpf, sockmap: Add locking annotations to iterator The sparse checker currently outputs the following warnings: include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_hash_seq_start' - wrong count at exit include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_map_seq_start' - wrong count at exit Add the necessary __acquires and __release annotations to make the iterator locking schema palatable to sparse. Also add __must_hold for good measure. The kernel codebase uses both __acquires(rcu) and __acquires(RCU). I couldn't find any guidance which one is preferred, so I used what is easier to type out. Fixes: `0365351524` ("net: Allow iterating sockmap and sockhash") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20201012091850.67452-1-lmb@cloudflare.com	2020-10-15 20:49:56 +02:00
Davide Caratti	346e320cb2	netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements nftables payload statements are used to mangle SCTP headers, but they can only replace the Internet Checksum. As a consequence, nftables rules that mangle sport/dport/vtag in SCTP headers potentially generate packets that are discarded by the receiver, unless the CRC-32C is "offloaded" (e.g the rule mangles a skb having 'ip_summed' equal to 'CHECKSUM_PARTIAL'. Fix this extending uAPI definitions and L4 checksum update function, in a way that userspace programs (e.g. nft) can instruct the kernel to compute CRC-32C in SCTP headers. Also ensure that LIBCRC32C is built if NF_TABLES is 'y' or 'm' in the kernel build configuration. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 11:45:19 -07:00
Yonghong Song	6617dfd440	net: fix pos incrementment in ipv6_route_seq_next Commit `4fc427e051` ("ipv6_route_seq_next should increase position index") tried to fix the issue where seq_file pos is not increased if a NULL element is returned with seq_ops->next(). See bug https://bugzilla.kernel.org/show_bug.cgi?id=206283 The commit effectively does: - increase pos for all seq_ops->start() - increase pos for all seq_ops->next() For ipv6_route, increasing pos for all seq_ops->next() is correct. But increasing pos for seq_ops->start() is not correct since pos is used to determine how many items to skip during seq_ops->start(): iter->skip = pos; seq_ops->start() just fetches the current* pos item. The item can be skipped only after seq_ops->show() which essentially is the beginning of seq_ops->next(). For example, I have 7 ipv6 route entries, root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096 00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0 fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001 eth0 00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo 00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0 ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001 eth0 00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo 0+1 records in 0+1 records out 1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s root@arch-fb-vm1:~/net-next In the above, I specify buffer size 4096, so all records can be returned to user space with a single trip to the kernel. If I use buffer size 128, since each record size is 149, internally kernel seq_read() will read 149 into its internal buffer and return the data to user space in two read() syscalls. Then user read() syscall will trigger next seq_ops->start(). Since the current implementation increased pos even for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first record is #1. root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128 00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0 00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0 00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo 4+1 records in 4+1 records out 600 bytes copied, 0.00127758 s, 470 kB/s To fix the problem, create a fake pos pointer so seq_ops->start() won't actually increase seq_file pos. With this fix, the above `dd` command with `bs=128` will show correct result. Fixes: `4fc427e051` ("ipv6_route_seq_next should increase position index") Cc: Alexei Starovoitov <ast@kernel.org> Suggested-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 10:21:28 -07:00
Karsten Graul	6b1bbf94ab	net/smc: fix invalid return code in smcd_new_buf_create() smc_ism_register_dmb() returns error codes set by the ISM driver which are not guaranteed to be negative or in the errno range. Such values would not be handled by ERR_PTR() and finally the return code will be used as a memory address. Fix that by using a valid negative errno value with ERR_PTR(). Fixes: `72b7f6c487` ("net/smc: unique reason code for exceeded max dmb count") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 09:54:43 -07:00
Karsten Graul	ef12ad4588	net/smc: fix valid DMBE buffer sizes The SMCD_DMBE_SIZES should include all valid DMBE buffer sizes, so the correct value is 6 which means 1MB. With 7 the registration of an ISM buffer would always fail because of the invalid size requested. Fix that and set the value to 6. Fixes: `c6ba7c9ba4` ("net/smc: add base infrastructure for SMC-D and ISM") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 09:54:43 -07:00
Karsten Graul	d535ca1367	net/smc: fix use-after-free of delayed events When a delayed event is enqueued then the event worker will send this event the next time it is running and no other flow is currently active. The event handler is called for the delayed event, and the pointer to the event keeps set in lgr->delayed_event. This pointer is cleared later in the processing by smc_llc_flow_start(). This can lead to a use-after-free condition when the processing does not reach smc_llc_flow_start(), but frees the event because of an error situation. Then the delayed_event pointer is still set but the event is freed. Fix this by always clearing the delayed event pointer when the event is provided to the event handler for processing, and remove the code to clear it in smc_llc_flow_start(). Fixes: `555da9af82` ("net/smc: add event-based llc_flow framework") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 09:54:43 -07:00
YueHaibing	1d273fcc2c	bpfilter: Fix build error with CONFIG_BPFILTER_UMH IF CONFIG_BPFILTER_UMH is set, building fails: In file included from /usr/include/sys/socket.h:33:0, from net/bpfilter/main.c:6: /usr/include/bits/socket.h:390:10: fatal error: asm/socket.h: No such file or directory #include <asm/socket.h> ^~~~~~~~~~~~~~ compilation terminated. scripts/Makefile.userprogs:43: recipe for target 'net/bpfilter/main.o' failed make[2]: *** [net/bpfilter/main.o] Error 1 Add missing include path to fix this. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 09:36:14 -07:00
David Howells	ddc7834af8	rxrpc: Fix loss of final ack on shutdown Fix the loss of transmission of a call's final ack when a socket gets shut down. This means that the server will retransmit the last data packet or send a ping ack and then get an ICMP indicating the port got closed. The server will then view this as a failure. Fixes: `3136ef49a1` ("rxrpc: Delay terminal ACK transmission on a client call") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-15 13:28:00 +01:00
David Howells	f3af4ad1e0	rxrpc: Fix bundle counting for exclusive connections Fix rxrpc_unbundle_conn() to not drop the bundle usage count when cleaning up an exclusive connection. Based on the suggested fix from Hillf Danton. Fixes: `245500d853` ("rxrpc: Rewrite the client connection manager") Reported-by: syzbot+d57aaf84dd8a550e6d91@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Hillf Danton <hdanton@sina.com>	2020-10-15 13:28:00 +01:00
Pablo Neira Ayuso	d25e2e9388	netfilter: restore NF_INET_NUMHOOKS This definition is used by the iptables legacy UAPI, restore it. Fixes: `d3519cb89f` ("netfilter: nf_tables: add inet ingress support") Reported-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-14 20:28:05 -07:00
Mathieu Desnoyers	272928d1cd	ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) As per RFC4443, the destination address field for ICMPv6 error messages is copied from the source address field of the invoking packet. In configurations with Virtual Routing and Forwarding tables, looking up which routing table to use for sending ICMPv6 error messages is currently done by using the destination net_device. If the source and destination interfaces are within separate VRFs, or one in the global routing table and the other in a VRF, looking up the source address of the invoking packet in the destination interface's routing table will fail if the destination interface's routing table contains no route to the invoking packet's source address. One observable effect of this issue is that traceroute6 does not work in the following cases: - Route leaking between global routing table and VRF - Route leaking between VRFs Use the source device routing table when sending ICMPv6 error messages. [ In the context of ipv4, it has been pointed out that a similar issue may exist with ICMP errors triggered when forwarding between network namespaces. It would be worthwhile to investigate whether ipv6 has similar issues, but is outside of the scope of this investigation. ] [ Testing shows that similar issues exist with ipv6 unreachable / fragmentation needed messages. However, investigation of this additional failure mode is beyond this investigation's scope. ] Link: https://tools.ietf.org/html/rfc4443 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-14 17:14:23 -07:00
Mathieu Desnoyers	e1e84eb58e	ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) As per RFC792, ICMP errors should be sent to the source host. However, in configurations with Virtual Routing and Forwarding tables, looking up which routing table to use is currently done by using the destination net_device. commit `9d1a6c4ea4` ("net: icmp_route_lookup should use rt dev to determine L3 domain") changes the interface passed to l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev to skb_dst(skb_in)->dev. This effectively uses the destination device rather than the source device for choosing which routing table should be used to lookup where to send the ICMP error. Therefore, if the source and destination interfaces are within separate VRFs, or one in the global routing table and the other in a VRF, looking up the source host in the destination interface's routing table will fail if the destination interface's routing table contains no route to the source host. One observable effect of this issue is that traceroute does not work in the following cases: - Route leaking between global routing table and VRF - Route leaking between VRFs Preferably use the source device routing table when sending ICMP error messages. If no source device is set, fall-back on the destination device routing table. Else, use the main routing table (index 0). [ It has been pointed out that a similar issue may exist with ICMP errors triggered when forwarding between network namespaces. It would be worthwhile to investigate, but is outside of the scope of this investigation. ] [ It has also been pointed out that a similar issue exists with unreachable / fragmentation needed messages, which can be triggered by changing the MTU of eth1 in r1 to 1400 and running: ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2 Some investigation points to raw_icmp_error() and raw_err() as being involved in this last scenario. The focus of this patch is TTL expired ICMP messages, which go through icmp_route_lookup. Investigation of failure modes related to raw_icmp_error() is beyond this investigation's scope. ] Fixes: `9d1a6c4ea4` ("net: icmp_route_lookup should use rt dev to determine L3 domain") Link: https://tools.ietf.org/html/rfc792 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-14 17:14:11 -07:00
Jakub Kicinski	1e40d75ef9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Extend nf_queue selftest to cover re-queueing, non-gso mode and delayed queueing, from Florian Westphal. 2) Clear skb->tstamp in IPVS forwarding path, from Julian Anastasov. 3) Provide netlink extended error reporting for EEXIST case. 4) Missing VLAN offload tag and proto in log target. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 20:02:54 -07:00
Cong Wang	fdafed4599	ip_gre: set dev->hard_header_len and dev->needed_headroom properly GRE tunnel has its own header_ops, ipgre_header_ops, and sets it conditionally. When it is set, it assumes the outer IP header is already created before ipgre_xmit(). This is not true when we send packets through a raw packet socket, where L2 headers are supposed to be constructed by user. Packet socket calls dev_validate_header() to validate the header. But GRE tunnel does not set dev->hard_header_len, so that check can be simply bypassed, therefore uninit memory could be passed down to ipgre_xmit(). Similar for dev->needed_headroom. dev->hard_header_len is supposed to be the length of the header created by dev->header_ops->create(), so it should be used whenever header_ops is set, and dev->needed_headroom should be used when it is not set. Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com Cc: William Tu <u9012063@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 18:35:29 -07:00
Heiner Kallweit	5fc3594d36	xfrm: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/a6b816f4-bbf2-9db0-d59a-7e4e9cc808fe@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	3569939a81	net: openvswitch: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/5e52dc91-97b1-82b0-214b-65d404e4a2ec@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	6401297e76	mac80211: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/93dda477-70ae-0ccf-71b4-bfebb66c9beb@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	cf89f18fa4	iptunnel: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/050f9a83-b195-a3d6-edbd-91a59040be21@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	a0d2698101	net: dsa: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/b6047017-8226-6b7e-a3cd-064e69fdfa27@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	f3f04f0f3a	net: bridge: use new function dev_fetch_sw_netstats Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/d1c3ff29-5691-9d54-d164-16421905fa59@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:49 -07:00
Heiner Kallweit	44fa32f008	net: add function dev_fetch_sw_netstats for fetching pcpu_sw_netstats In several places the same code is used to populate rtnl_link_stats64 fields with data from pcpu_sw_netstats. Therefore factor out this code to a new function dev_fetch_sw_netstats(). v2: - constify argument netstats - don't ignore netstats being NULL or an ERRPTR - switch to EXPORT_SYMBOL_GPL Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/6d16a338-52f5-df69-0020-6bc771a7d498@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:33:48 -07:00
Or Cohen	c9bf52a173	net/af_unix: Remove unused old_pid variable Commit `109f6e39fa` ("af_unix: Allow SO_PEERCRED to work across namespaces.") introduced the old_pid variable in unix_listen, but it's never used. Remove the declaration and the call to put_pid. Signed-off-by: Or Cohen <orcohen@paloaltonetworks.com> Link: https://lore.kernel.org/r/20201011153527.18628-1-orcohen@paloaltonetworks.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:21:46 -07:00
Christian Eggers	4e3bbb33e6	socket: don't clear SOCK_TSTAMP_NEW when SO_TIMESTAMPNS is disabled SOCK_TSTAMP_NEW (timespec64 instead of timespec) is also used for hardware time stamps (configured via SO_TIMESTAMPING_NEW). User space (ptp4l) first configures hardware time stamping via SO_TIMESTAMPING_NEW which sets SOCK_TSTAMP_NEW. In the next step, ptp4l disables SO_TIMESTAMPNS(_NEW) (software time stamps), but this must not switch hardware time stamps back to "32 bit mode". This problem happens on 32 bit platforms were the libc has already switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW socket options). ptp4l complains with "missing timestamp on transmitted peer delay request" because the wrong format is received (and discarded). Fixes: `887feae36a` ("socket: Add SO_TIMESTAMP[NS]_NEW") Fixes: `783da70e83` ("net: add sock_enable_timestamps") Signed-off-by: Christian Eggers <ceggers@arri.de> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:18:18 -07:00
Christian Eggers	59e611a566	socket: fix option SO_TIMESTAMPING_NEW The comparison of optname with SO_TIMESTAMPING_NEW is wrong way around, so SOCK_TSTAMP_NEW will first be set and than reset again. Additionally move it out of the test for SOF_TIMESTAMPING_RX_SOFTWARE as this seems unrelated. This problem happens on 32 bit platforms were the libc has already switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW socket options). ptp4l complains with "missing timestamp on transmitted peer delay request" because the wrong format is received (and discarded). Fixes: `9718475e69` ("socket: Add SO_TIMESTAMPING_NEW") Signed-off-by: Christian Eggers <ceggers@arri.de> Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:18:18 -07:00
Julia Lawall	0403a2b53c	net/tls: use semicolons rather than commas to separate statements Replace commas with semicolons. Commas introduce unnecessary variability in the code structure and are hard to see. What is done is essentially described by the following Coccinelle semantic patch (http://coccinelle.lip6.fr/): // <smpl> @@ expression e1,e2; @@ e1 -, +; e2 ... when any // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/1602412498-32025-6-git-send-email-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:11:52 -07:00
Julia Lawall	6159e9633f	net/ipv6: use semicolons rather than commas to separate statements Replace commas with semicolons. Commas introduce unnecessary variability in the code structure and are hard to see. What is done is essentially described by the following Coccinelle semantic patch (http://coccinelle.lip6.fr/): // <smpl> @@ expression e1,e2; @@ e1 -, +; e2 ... when any // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/1602412498-32025-5-git-send-email-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:11:52 -07:00
Julia Lawall	44797589c2	tcp: use semicolons rather than commas to separate statements Replace commas with semicolons. Commas introduce unnecessary variability in the code structure and are hard to see. What is done is essentially described by the following Coccinelle semantic patch (http://coccinelle.lip6.fr/): // <smpl> @@ expression e1,e2; @@ e1 -, +; e2 ... when any // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/1602412498-32025-4-git-send-email-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-13 17:11:52 -07:00
Pablo Neira Ayuso	0d9826bc18	netfilter: nf_log: missing vlan offload tag and proto Dump vlan tag and proto for the usual vlan offload case if the NF_LOG_MACDECODE flag is set on. Without this information the logging is misleading as there is no reference to the VLAN header. [12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0 [12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1 Fixes: `83e96d443b` ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-14 01:25:14 +02:00
Linus Torvalds	6ad4bf6ea1	io_uring-5.10-2020-10-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EXPEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiR4EAC3trm1ojXVF7y9/XRhcPpb4Pror+ZA1coO gyoy+zUuCEl9WCzzHWqXULMYMP0YzNJnJs0oLQPA1s0sx1H4uDMl/UXg0OXZisYG Y59Kca3c1DHFwj9KPQXfGmCEjc/rbDWK5TqRc2iZMz+6E5Mt71UFZHtenwgV1zD8 hTmZZkzLCu2ePfOvrPONgL5tDqPWGVyn61phoC7cSzMF66juXGYuvQGktzi/m6q+ jAxUnhKvKTlLB9wsq3s5X/20/QD56Yuba9U+YxeeNDBE8MDWQOsjz0mZCV1fn4p3 h/6762aRaWaXH7EwMtsHFUWy7arJZg/YoFYNYLv4Ksyy3y4sMABZCy3A+JyzrgQ0 hMu7vjsP+k22X1WH8nyejBfWNEmxu6dpgckKrgF0dhJcXk/acWA3XaDWZ80UwfQy isKRAP1rA0MJKHDMIwCzSQJDPvtUAkPptbNZJcUSU78o+pPoCaQ93V++LbdgGtKn iGJJqX05dVbcsDx5X7fluphjkUTC4yFr7ZgLgbOIedXajWRD8iOkO2xxCHk6SKFl iv9entvRcX9k3SHK9uffIUkRBUujMU0+HCIQFCO1qGmkCaS5nSrovZl4HoL7L/Dj +T8+v7kyJeklLXgJBaE7jk01O4HwZKjwPWMbCjvL9NKk8j7c1soYnRu5uNvi85Mu /9wn671s+w== =udgj -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - Add blkcg accounting for io-wq offload (Dennis) - A use-after-free fix for io-wq (Hillf) - Cancelation fixes and improvements - Use proper files_struct references for offload - Cleanup of io_uring_get_socket() since that can now go into our own header - SQPOLL fixes and cleanups, and support for sharing the thread - Improvement to how page accounting is done for registered buffers and huge pages, accounting the real pinned state - Series cleaning up the xarray code (Willy) - Various cleanups, refactoring, and improvements (Pavel) - Use raw spinlock for io-wq (Sebastian) - Add support for ring restrictions (Stefano) * tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits) io_uring: keep a pointer ref_node in file_data io_uring: refactor *files_register()'s error paths io_uring: clean file_data access in files_register io_uring: don't delay io_init_req() error check io_uring: clean leftovers after splitting issue io_uring: remove timeout.list after hrtimer cancel io_uring: use a separate struct for timeout_remove io_uring: improve submit_state.ios_left accounting io_uring: simplify io_file_get() io_uring: kill extra check in fixed io_file_get() io_uring: clean up ->files grabbing io_uring: don't io_prep_async_work() linked reqs io_uring: Convert advanced XArray uses to the normal API io_uring: Fix XArray usage in io_uring_add_task_file io_uring: Fix use of XArray in __io_uring_files_cancel io_uring: fix break condition for __io_uring_register() waiting io_uring: no need to call xa_destroy() on empty xarray io_uring: batch account ->req_issue and task struct references io_uring: kill callback_head argument for io_req_task_work_add() io_uring: move req preps out of io_issue_sqe() ...	2020-10-13 12:36:21 -07:00
Linus Torvalds	39a5101f98	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Allow DRBG testing through user-space af_alg - Add tcrypt speed testing support for keyed hashes - Add type-safe init/exit hooks for ahash Algorithms: - Mark arc4 as obsolete and pending for future removal - Mark anubis, khazad, sead and tea as obsolete - Improve boot-time xor benchmark - Add OSCCA SM2 asymmetric cipher algorithm and use it for integrity Drivers: - Fixes and enhancement for XTS in caam - Add support for XIP8001B hwrng in xiphera-trng - Add RNG and hash support in sun8i-ce/sun8i-ss - Allow imx-rngc to be used by kernel entropy pool - Use crypto engine in omap-sham - Add support for Ingenic X1830 with ingenic" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (205 commits) X.509: Fix modular build of public_key_sm2 crypto: xor - Remove unused variable count in do_xor_speed X.509: fix error return value on the failed path crypto: bcm - Verify GCM/CCM key length in setkey crypto: qat - drop input parameter from adf_enable_aer() crypto: qat - fix function parameters descriptions crypto: atmel-tdes - use semicolons rather than commas to separate statements crypto: drivers - use semicolons rather than commas to separate statements hwrng: mxc-rnga - use semicolons rather than commas to separate statements hwrng: iproc-rng200 - use semicolons rather than commas to separate statements hwrng: stm32 - use semicolons rather than commas to separate statements crypto: xor - use ktime for template benchmarking crypto: xor - defer load time benchmark to a later time crypto: hisilicon/zip - fix the uninitalized 'curr_qm_qp_num' crypto: hisilicon/zip - fix the return value when device is busy crypto: hisilicon/zip - fix zero length input in GZIP decompress crypto: hisilicon/zip - fix the uncleared debug registers lib/mpi: Fix unused variable warnings crypto: x86/poly1305 - Remove assignments with no effect hwrng: npcm - modify readl to readb ...	2020-10-13 08:50:16 -07:00
Linus Torvalds	85ed13e78d	Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat iovec cleanups from Al Viro: "Christoph's series around import_iovec() and compat variant thereof" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: security/keys: remove compat_keyctl_instantiate_key_iov mm: remove compat_process_vm_{readv,writev} fs: remove compat_sys_vmsplice fs: remove the compat readv/writev syscalls fs: remove various compat readv/writev helpers iov_iter: transparently handle compat iovecs in import_iovec iov_iter: refactor rw_copy_check_uvector and import_iovec iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c compat.h: fix a spelling error in <linux/compat.h>	2020-10-12 16:35:51 -07:00
Linus Torvalds	c90578360c	Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull copy_and_csum cleanups from Al Viro: "Saner calling conventions for csum_and_copy_..._user() and friends" [ Removing 800+ lines of code and cleaning stuff up is good - Linus ] * 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ppc: propagate the calling conventions change down to csum_partial_copy_generic() amd64: switch csum_partial_copy_generic() to new calling conventions sparc64: propagate the calling convention changes down to __csum_partial_copy_...() xtensa: propagate the calling conventions change down into csum_partial_copy_generic() mips: propagate the calling convention change down into __csum_partial_copy_..._user() mips: __csum_partial_copy_kernel() has no users left mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic() i386: propagate the calling conventions change down to csum_partial_copy_generic() sh: propage the calling conventions change down to csum_partial_copy_generic() m68k: get rid of zeroing destination on error in csum_and_copy_from_user() arm: propagate the calling convention changes down to csum_partial_copy_from_user() alpha: propagate the calling convention changes down to csum_partial_copy.c helpers saner calling conventions for csum_and_copy_..._user() csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum csum_partial_copy_nocheck(): drop the last argument unify generic instances of csum_partial_copy_nocheck() icmp_push_reply(): reorder adding the checksum up skb_copy_and_csum_bits(): don't bother with the last argument	2020-10-12 16:24:13 -07:00
Jakub Kicinski	ccdf7fae3a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-10-12 The main changes are: 1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong. 2) libbpf relocation support for different size load/store, from Andrii. 3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel. 4) BPF support for per-cpu variables, form Hao. 5) sockmap improvements, from John. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-12 16:16:50 -07:00
Jakub Kicinski	a308283fdb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Inspect the reply packets coming from DR/TUN and refresh connection state and timeout, from longguang yue and Julian Anastasov. 2) Series to add support for the inet ingress chain type in nf_tables. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-12 15:00:36 -07:00
Pablo Neira Ayuso	98a381a7d4	netfilter: nftables: extend error reporting for chain updates The initial support for netlink extended ACK is missing the chain update path, which results in misleading error reporting in case of EEXIST. Fixes `36dd1bcc07` ("netfilter: nf_tables: initial support for extended ACK reporting") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 16:54:30 +02:00
Ilya Dryomov	28e1581c3b	libceph: clear con->out_msg on Policy::stateful_server faults con->out_msg must be cleared on Policy::stateful_server (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the reconnection attempt, because after writing the banner the messenger moves on to writing the data section of that message (either from where it got interrupted by the connection reset or from the beginning) instead of writing struct ceph_msg_connect. This results in a bizarre error message because the server sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct ceph_msg_connect: libceph: mds0 (1)172.21.15.45:6828 socket error on write ceph: mds0 reconnect start libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN) libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32 libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch AFAICT this bug goes back to the dawn of the kernel client. The reason it survived for so long is that only MDS sessions are stateful and only two MDS messages have a data section: CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare) and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved). The connection has to get reset precisely when such message is being sent -- in this case it was the former. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/47723 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-10-12 15:29:27 +02:00
Ilya Dryomov	a9dfe31e5c	libceph: format ceph_entity_addr nonces as unsigned Match the server side logs. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:27 +02:00
Ilya Dryomov	5a5036c89f	libceph: move a dout in queue_con_delay() The queued con->work can start executing (and therefore logging) before we get to this "con->work has been queued" message, making the logs confusing. Move it up, with the meaning of "con->work is about to be queued". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:27 +02:00
Ilya Dryomov	1b05fae7f2	libceph: switch to the new "osd blocklist add" command Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:26 +02:00
Ilya Dryomov	0b98acd618	libceph, rbd, ceph: "blacklist" -> "blocklist" Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:26 +02:00
Ilya Dryomov	3986f9a42e	libceph: multiple workspaces for CRUSH computations Replace a global map->crush_workspace (protected by a global mutex) with a list of workspaces, up to the number of CPUs + 1. This is based on a patch from Robin Geuze <robing@nl.team.blue>. Robin and his team have observed a 10-20% increase in IOPS on all queue depths and lower CPU usage as well on a high-end all-NVMe 100GbE cluster. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:26 +02:00
Oliver Hartkopp	f726f3d371	can: remove obsolete version strings As pointed out by Jakub Kicinski here: http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com this patch removes the obsolete version information of the different CAN protocols and the AF_CAN core module. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20201012074354.25839-2-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-12 10:06:39 +02:00
Oliver Hartkopp	ac911bfeb3	can: isotp: implement cleanups / improvements from review As pointed out by Jakub Kicinski here: http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com this patch addresses the remarked issues: - remove empty line in comment - remove default=y for CAN_ISOTP in Kconfig - make use of pr_notice_once() - use GFP_ATOMIC instead of gfp_any() in soft hrtimer context The version strings in the CAN subsystem are removed by a separate patch. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20201012074354.25839-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-12 10:06:08 +02:00
Anant Thazhemadam	7ca1db21ef	net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid In p9_fd_create_unix, checking is performed to see if the addr (passed as an argument) is NULL or not. However, no check is performed to see if addr is a valid address, i.e., it doesn't entirely consist of only 0's. The initialization of sun_server.sun_path to be equal to this faulty addr value leads to an uninitialized variable, as detected by KMSAN. Checking for this (faulty addr) and returning a negative error number appropriately, resolves this issue. Link: http://lkml.kernel.org/r/20201012042404.2508-1-anant.thazhemadam@gmail.com Reported-by: syzbot+75d51fe5bf4ebe988518@syzkaller.appspotmail.com Tested-by: syzbot+75d51fe5bf4ebe988518@syzkaller.appspotmail.com Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2020-10-12 10:05:47 +02:00
John Fastabend	ef5659280e	bpf, sockmap: Allow skipping sk_skb parser program Currently, we often run with a nop parser namely one that just does this, 'return skb->len'. This happens when either our verdict program can handle streaming data or it is only looking at socket data such as IP addresses and other metadata associated with the flow. The second case is common for a L3/L4 proxy for instance. So lets allow loading programs without the parser then we can skip the stream parser logic and avoid having to add a BPF program that is effectively a nop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower	2020-10-11 18:09:44 -07:00
John Fastabend	743df8b774	bpf, sockmap: Check skb_verdict and skb_parser programs explicitly We are about to allow skb_verdict to run without skb_parser programs as a first step change code to check each program type specifically. This should be a mechanical change without any impact to actual result. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160239294756.8495.5796595770890272219.stgit@john-Precision-5820-Tower	2020-10-11 18:09:44 -07:00
John Fastabend	0b17ad25d8	bpf, sockmap: Add memory accounting so skbs on ingress lists are visible Move skb->sk assignment out of sk_psock_bpf_run() and into individual callers. Then we can use proper skb_set_owner_r() call to assign a sk to a skb. This improves things by also charging the truesize against the sockets sk_rmem_alloc counter. With this done we get some accounting in place to ensure the memory associated with skbs on the workqueue are still being accounted for somewhere. Finally, by using skb_set_owner_r the destructor is setup so we can just let the normal skb_kfree logic recover the memory. Combined with previous patch dropping skb_orphan() we now can recover from memory pressure and maintain accounting. Note, we will charge the skbs against their originating socket even if being redirected into another socket. Once the skb completes the redirect op the kfree_skb will give the memory back. This is important because if we charged the socket we are redirecting to (like it was done before this series) the sock_writeable() test could fail because of the skb trying to be sent is already charged against the socket. Also TLS case is special. Here we wait until we have decided not to simply PASS the packet up the stack. In the case where we PASS the packet up the stack we already have an skb which is accounted for on the TLS socket context. For the parser case we continue to just set/clear skb->sk this is because the skb being used here may be combined with other skbs or turned into multiple skbs depending on the parser logic. For example the parser could request a payload length greater than skb->len so that the strparser needs to collect multiple skbs. At any rate the final result will be handled in the strparser recv callback. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
John Fastabend	10d58d0063	bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup Calling skb_orphan() is unnecessary in the strp rcv handler because the skb is from a skb_clone() in __strp_recv. So it never has a destructor or a sk assigned. Plus its confusing to read because it might hint to the reader that the skb could have an sk assigned which is not true. Even if we did have an sk assigned it would be cleaner to simply wait for the upcoming kfree_skb(). Additionally, move the comment about strparser clone up so its closer to the logic it is describing and add to it so that it is more complete. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
John Fastabend	9047f19e7c	bpf, sockmap: Remove dropped data on errors in redirect case In the sk_skb redirect case we didn't handle the case where we overrun the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress. Because we didn't have anything implemented we simply dropped the skb. This meant data could be dropped if socket memory accounting was in place. This fixes the above dropped data case by moving the memory checks later in the code where we actually do the send or recv. This pushes those checks into the workqueue and allows us to return an EAGAIN error which in turn allows us to try again later from the workqueue. Fixes: `51199405f9` ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
John Fastabend	29545f4977	bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpage The skb_set_owner_w is unnecessary here. The sendpage call will create a fresh skb and set the owner correctly from workqueue. Its also not entirely harmless because it consumes cycles, but also impacts resource accounting by increasing sk_wmem_alloc. This is charging the socket we are going to send to for the skb, but we will put it on the workqueue for some time before this happens so we are artifically inflating sk_wmem_alloc for this period. Further, we don't know how many skbs will be used to send the packet or how it will be broken up when sent over the new socket so charging it with one big sum is also not correct when the workqueue may break it up if facing memory pressure. Seeing we don't know how/when this is going to be sent drop the early accounting. A later patch will do proper accounting charged on receive socket for the case where skbs get enqueued on the workqueue. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
John Fastabend	9ecbfb06a0	bpf, sockmap: On receive programs try to fast track SK_PASS ingress When we receive an skb and the ingress skb verdict program returns SK_PASS we currently set the ingress flag and put it on the workqueue so it can be turned into a sk_msg and put on the sk_msg ingress queue. Then finally telling userspace with data_ready hook. Here we observe that if the workqueue is empty then we can try to convert into a sk_msg type and call data_ready directly without bouncing through a workqueue. Its a common pattern to have a recv verdict program for visibility that always returns SK_PASS. In this case unless there is an ENOMEM error or we overrun the socket we can avoid the workqueue completely only using it when we fall back to error cases caused by memory pressure. By doing this we eliminate another case where data may be dropped if errors occur on memory limits in workqueue. Fixes: `51199405f9` ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
John Fastabend	cfea28f890	bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits For sk_skb case where skb_verdict program returns SK_PASS to continue to pass packet up the stack, the memory limits were already checked before enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks here. The theory is if the TCP stack believes we have memory to receive the packet then lets trust the stack and not double check the limits. In fact the accounting here can cause a drop if sk_rmem_alloc has increased after the stack accepted this packet, but before the duplicate check here. And worse if this happens because TCP stack already believes the data has been received there is no retransmit. Fixes: `51199405f9` ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower	2020-10-11 18:00:57 -07:00
Julian Anastasov	7980d2eabd	ipvs: clear skb->tstamp in forwarding path fq qdisc requires tstamp to be cleared in forwarding path Reported-by: Evgeny B <abt-admin@mail.ru> Link: https://bugzilla.kernel.org/show_bug.cgi?id=209427 Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: `8203e2d844` ("net: clear skb->tstamp in forwarding paths") Fixes: `fb420d5d91` ("tcp/fq: move back to CLOCK_MONOTONIC") Fixes: `80b14dee2b` ("net: Add a new socket option for a future transmit time.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:59:41 +02:00
Pablo Neira Ayuso	793d5d6124	netfilter: flowtable: reduce calls to pskb_may_pull() Make two unfront calls to pskb_may_pull() to linearize the network and transport header. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:58:10 +02:00
Pablo Neira Ayuso	d3519cb89f	netfilter: nf_tables: add inet ingress support This patch adds a new ingress hook for the inet family. The inet ingress hook emulates the IP receive path code, therefore, unclean packets are drop before walking over the ruleset in this basechain. This patch also introduces the nft_base_chain_netdev() helper function to check if this hook is bound to one or more devices (through the hook list infrastructure). This check allows to perform the same handling for the inet ingress as it would be a netdev ingress chain from the control plane. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:57:34 +02:00
Pablo Neira Ayuso	60a3815da7	netfilter: add inet ingress support This patch adds the NF_INET_INGRESS pseudohook for the NFPROTO_INET family. This is a mapping this new hook to the existing NFPROTO_NETDEV and NF_NETDEV_INGRESS hook. The hook does not guarantee that packets are inet only, users must filter out non-ip traffic explicitly. This infrastructure makes it easier to support this new hook in nf_tables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:57:34 +02:00
Pablo Neira Ayuso	ddcfa710d4	netfilter: add nf_ingress_hook() helper function Add helper function to check if this is an ingress hook. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:57:34 +02:00
Pablo Neira Ayuso	afd9024cd1	netfilter: add nf_static_key_{inc,dec} Add helper functions increment and decrement the hook static keys. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:57:34 +02:00
longguang.yue	073b04e76b	ipvs: inspect reply packets from DR/TUN real servers Just like for MASQ, inspect the reply packets coming from DR/TUN real servers and alter the connection's state and timeout according to the protocol. It's ipvs's duty to do traffic statistic if packets get hit, no matter what mode it is. Signed-off-by: longguang.yue <bigclouds@163.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-12 01:57:34 +02:00
Toke Høiland-Jørgensen	d1c362e1dd	bpf: Always return target ifindex in bpf_fib_lookup The bpf_fib_lookup() helper performs a neighbour lookup for the destination IP and returns BPF_FIB_LKUP_NO_NEIGH if this fails, with the expectation that the BPF program will pass the packet up the stack in this case. However, with the addition of bpf_redirect_neigh() that can be used instead to perform the neighbour lookup, at the cost of a bit of duplicated work. For that we still need the target ifindex, and since bpf_fib_lookup() already has that at the time it performs the neighbour lookup, there is really no reason why it can't just return it in any case. So let's just always return the ifindex if the FIB lookup itself succeeds. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com	2020-10-11 21:59:20 +02:00
Vladimir Oltean	ea440cd2d9	net: dsa: tag_ocelot: use VLAN information from tagging header when available When the Extraction Frame Header contains a valid classified VLAN, use that instead of the VLAN header present in the packet. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-11 11:19:04 -07:00
Daniel Borkmann	4a8f87e60f	bpf: Allow for map-in-map with dynamic inner array map entries Recent work in `f4d0525921` ("bpf: Add map_meta_equal map ops") and `134fede4ee` ("bpf: Relax max_entries check for most of the inner map types") added support for dynamic inner max elements for most map-in-map types. Exceptions were maps like array or prog array where the map_gen_lookup() callback uses the maps' max_entries field as a constant when emitting instructions. We recently implemented Maglev consistent hashing into Cilium's load balancer which uses map-in-map with an outer map being hash and inner being array holding the Maglev backend table for each service. This has been designed this way in order to reduce overall memory consumption given the outer hash map allows to avoid preallocating a large, flat memory area for all services. Also, the number of service mappings is not always known a-priori. The use case for dynamic inner array map entries is to further reduce memory overhead, for example, some services might just have a small number of back ends while others could have a large number. Right now the Maglev backend table for small and large number of backends would need to have the same inner array map entries which adds a lot of unneeded overhead. Dynamic inner array map entries can be realized by avoiding the inlined code generation for their lookup. The lookup will still be efficient since it will be calling into array_map_lookup_elem() directly and thus avoiding retpoline. The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips inline code generation and relaxes array_map_meta_equal() check to ignore both maps' max_entries. This also still allows to have faster lookups for map-in-map when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed. Example code generation where inner map is dynamic sized array: # bpftool p d x i 125 int handle__sys_enter(void * ctx): ; int handle__sys_enter(void ctx) 0: (b4) w1 = 0 ; int key = 0; 1: (63) (u32 )(r10 -4) = r1 2: (bf) r2 = r10 ; 3: (07) r2 += -4 ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key); 4: (18) r1 = map[id:468] 6: (07) r1 += 272 7: (61) r0 = (u32 )(r2 +0) 8: (35) if r0 >= 0x3 goto pc+5 9: (67) r0 <<= 3 10: (0f) r0 += r1 11: (79) r0 = (u64 )(r0 +0) 12: (15) if r0 == 0x0 goto pc+1 13: (05) goto pc+1 14: (b7) r0 = 0 15: (b4) w6 = -1 ; if (!inner_map) 16: (15) if r0 == 0x0 goto pc+6 17: (bf) r2 = r10 ; 18: (07) r2 += -4 ; val = bpf_map_lookup_elem(inner_map, &key); 19: (bf) r1 = r0 \| No inlining but instead 20: (85) call array_map_lookup_elem#149280 \| call to array_map_lookup_elem() ; return val ? val : -1; \| for inner array lookup. 21: (15) if r0 == 0x0 goto pc+1 ; return val ? val : -1; 22: (61) r6 = (u32 *)(r0 +0) ; } 23: (bc) w0 = w6 24: (95) exit Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net	2020-10-11 10:21:04 -07:00
Daniel Borkmann	9aa1206e8f	bpf: Add redirect_peer helper Add an efficient ingress to ingress netns switch that can be used out of tc BPF programs in order to redirect traffic from host ns ingress into a container veth device ingress without having to go via CPU backlog queue [0]. For local containers this can also be utilized and path via CPU backlog queue only needs to be taken once, not twice. On a high level this borrows from ipvlan which does similar switch in __netif_receive_skb_core() and then iterates via another_round. This helps to reduce latency for mentioned use cases. Pod to remote pod with redirect(), TCP_RR [1]: # percpu_netperf 10.217.1.33 RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 ) MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 ) STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 ) MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 ) P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 ) P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 ) P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 ) TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 ) Pod to remote pod with redirect_peer(), TCP_RR: # percpu_netperf 10.217.1.33 RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 ) MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 ) STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 ) MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 ) P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 ) P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 ) P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 ) TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 ) [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net	2020-10-11 10:21:04 -07:00
David Ahern	874fb9e2ca	ipv4: Restore flowi4_oif update before call to xfrm_lookup_route Tobias reported regressions in IPsec tests following the patch referenced by the Fixes tag below. The root cause is dropping the reset of the flowi4_oif after the fib_lookup. Apparently it is needed for xfrm cases, so restore the oif update to ip_route_output_flow right before the call to xfrm_lookup_route. Fixes: `2fbc6e89b2` ("ipv4: Update exception handling for multipath routes via same device") Reported-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 11:38:59 -07:00
Paolo Abeni	0e4f35d788	mptcp: subflows garbage collection The msk can close MP_JOIN subflows if the initial handshake fails. Currently such subflows are kept alive in the conn_list until the msk itself is closed. Beyond the wasted memory, we could end-up sending the DATA_FIN and the DATA_FIN ack on such socket, even after a reset. Fixes: `43b54c6ee3` ("mptcp: Use full MPTCP-level disconnect state machine") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 11:04:53 -07:00
Paolo Abeni	d582484726	mptcp: fix fallback for MP_JOIN subflows Additional/MP_JOIN subflows that do not pass some initial handshake tests currently causes fallback to TCP. That is an RFC violation: we should instead reset the subflow and leave the the msk untouched. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/91 Fixes: `f296234c98` ("mptcp: Add handling of incoming MP_JOIN requests") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 11:04:53 -07:00
Pujin Shi	16cb365380	net: smc: fix missing brace warning for old compilers For older versions of gcc, the array = {0}; will cause warnings: net/smc/smc_llc.c: In function 'smc_llc_add_link_local': net/smc/smc_llc.c:1212:9: warning: missing braces around initializer [-Wmissing-braces] struct smc_llc_msg_add_link add_llc = {0}; ^ net/smc/smc_llc.c:1212:9: warning: (near initialization for 'add_llc.hd') [-Wmissing-braces] net/smc/smc_llc.c: In function 'smc_llc_srv_delete_link_local': net/smc/smc_llc.c:1245:9: warning: missing braces around initializer [-Wmissing-braces] struct smc_llc_msg_del_link del_llc = {0}; ^ net/smc/smc_llc.c:1245:9: warning: (near initialization for 'del_llc.hd') [-Wmissing-braces] 2 warnings generated Fixes: `4dadd151b2` ("net/smc: enqueue local LLC messages") Signed-off-by: Pujin Shi <shipujin.t@gmail.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 10:38:40 -07:00
Pujin Shi	7e94e46c16	net: smc: fix missing brace warning for old compilers For older versions of gcc, the array = {0}; will cause warnings: net/smc/smc_llc.c: In function 'smc_llc_send_link_delete_all': net/smc/smc_llc.c:1317:9: warning: missing braces around initializer [-Wmissing-braces] struct smc_llc_msg_del_link delllc = {0}; ^ net/smc/smc_llc.c:1317:9: warning: (near initialization for 'delllc.hd') [-Wmissing-braces] 1 warnings generated Fixes: `f3811fd7bc` ("net/smc: send DELETE_LINK, ALL message and wait for send to complete") Signed-off-by: Pujin Shi <shipujin.t@gmail.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 10:38:40 -07:00
Jakub Kicinski	b54fa649d7	linux-can-fixes-for-5.9-20201008 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl9/hIITHG1rbEBwZW5n dXRyb25peC5kZQAKCRCpyVqK+u3vqSUNB/9NiVYEPFF2mkNKzHAkhOssFF2NIzo8 0LBsV/YbOFqxSW0nYg0BB2nIdfIIdR+yqBW3ishbvizAWUOzSVvGD03oG0pZkKaZ eMhh+OQ5fBjIjkNYXHC870uUZyJ7FzcyFLpeNoJfrGVuAtTpQts2TwWwevArPhfk 4XswX7DQ5npLwTGWBIvL8a6zYAmjEtT5Fnds0oeGeLfepWePG0ih/uxLN65JTrTU uHt5gSgJLflXF+swErnLosMr1ADy2I+Zsg3XlYp89A9xLy5IEh0vyQPxEl69Wu1x gQgO7L675YRwO9EyVkIfkYUzd/ot3o+virHTN470unu5QCb6O+04nof1 =omsQ -----END PGP SIGNATURE----- Merge tag 'linux-can-fixes-for-5.9-20201008' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can ==================== linux-can-fixes-for-5.9-20201008 The first patch is by Lucas Stach and fixes m_can driver by removing an erroneous call to m_can_class_suspend() in runtime suspend. Which causes the pinctrl state to get stuck on the "sleep" state, which breaks all CAN functionality on SoCs where this state is defined. The last two patches target the j1939 protocol: Cong Wang fixes a syzbot finding of an uninitialized variable in the j1939 transport protocol. I contribute a patch, that fixes the initialization of a same uninitialized variable in a different function. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 10:21:04 -07:00
Jakub Kicinski	16573e7cb5	A handful of changes: * fixes for the recent S1G work * a docbook build time improvement * API to pass beacon rate to lower-level driver -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl9+7MQACgkQB8qZga/f l8SOqw/8Cr0/QTy2DMNGZwsIcNgLjYvgKckZFWFhyyJtYnFHUa+VX1fVO07xoQpG PWgHnDRR39rm6ZJt0n61h0C5wxwibggU8oGBNV3cYRDTwefR/WPWJ9UJs9NSVKA+ PJQt9iqKGneEq3KjISaKvWLFdWUHePorOVyZr8tLYlerGma17kLG6nunB7gg6CCn VSQrnVyeF7oEs7agkFtdaPpIQ2ieTxD9Xl2p2y3bMFRLVJopaBGHrWaGq1e7UKme c5W9cIwOxIvieqcn03/lkwS5KEaG5OXqL+pL1zJiN69OWpjp3X4DFMP8f7xTrmFj xK9SRTnR7CU3yvlfSQuNJgJn4+2zussn+VgzY5sDBW3FPAFy0NNvlA5vVnZgPJ61 AkfE8wNiaVAPLA8ckxN6VV9CiV1JDx5w2FrnCPhA7weX0l3PpAaojA5xG6HCDEHx LfNr9NbRLiTZdz8YkDEUU+GCCfnYBAbYXO/lMQK56L2T+HOu3yVE9nYuUQNRCCkK gDr9biOYrWpYC1r3mD+i+0guaH3ZRpO/skoD0So25YxCP+j/AkDJXSjflTkaU9P2 XmgXxjxVX8JFrg29XciiGW6a4TOWS3caMvwxb0PlzfPaNREuglH3ygFHVnWsAt2G GZIgc3OggRdUIK4zoX3jsZQAgDK4B9ym70zujSY7JdIamxaHmO4= =TrMG -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-net-next-2020-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== A handful of changes: * fixes for the recent S1G work * a docbook build time improvement * API to pass beacon rate to lower-level driver ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-10 09:12:52 -07:00
Johannes Berg	44f3625bc6	netlink: export policy in extended ACK Add a new attribute NLMSGERR_ATTR_POLICY to the extended ACK to advertise the policy, e.g. if an attribute was out of range, you'll know the range that's permissible. Add new NL_SET_ERR_MSG_ATTR_POL() and NL_SET_ERR_MSG_ATTR_POL() macros to set this, since realistically it's only useful to do this when the bad attribute (offset) is also returned. Use it in lib/nlattr.c which practically does all the policy validation. v2: - add and use netlink_policy_dump_attr_size_estimate() v3: - remove redundant break v4: - really remove redundant break ... sorry Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 20:22:32 -07:00
Johannes Berg	d2681e93b0	netlink: policy: refactor per-attr policy writing Refactor the per-attribute policy writing into a new helper function, to be used later for dumping out the policy of a rejected attribute. v2: - fix some indentation v3: - change variable order in netlink_policy_dump_write() Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 20:22:31 -07:00
Hoang Huu Le	7b50ee3dad	tipc: fix NULL pointer dereference in tipc_named_rcv In the function node_lost_contact(), we call __skb_queue_purge() without grabbing the list->lock. This can cause to a race-condition why processing the list 'namedq' in calling path tipc_named_rcv()->tipc_named_dequeue(). [] BUG: kernel NULL pointer dereference, address: 0000000000000000 [] #PF: supervisor read access in kernel mode [] #PF: error_code(0x0000) - not-present page [] PGD 7ca63067 P4D 7ca63067 PUD 6c553067 PMD 0 [] Oops: 0000 [#1] SMP NOPTI [] CPU: 1 PID: 15 Comm: ksoftirqd/1 Tainted: G O 5.9.0-rc6+ #2 [] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS [...] [] RIP: 0010:tipc_named_rcv+0x103/0x320 [tipc] [] Code: 41 89 44 24 10 49 8b 16 49 8b 46 08 49 c7 06 00 00 00 [...] [] RSP: 0018:ffffc900000a7c58 EFLAGS: 00000282 [] RAX: 00000000000012ec RBX: 0000000000000000 RCX: ffff88807bde1270 [] RDX: 0000000000002c7c RSI: 0000000000002c7c RDI: ffff88807b38f1a8 [] RBP: ffff88807b006288 R08: ffff88806a367800 R09: ffff88806a367900 [] R10: ffff88806a367a00 R11: ffff88806a367b00 R12: ffff88807b006258 [] R13: ffff88807b00628a R14: ffff888069334d00 R15: ffff88806a434600 [] FS: 0000000000000000(0000) GS:ffff888079480000(0000) knlGS:0[...] [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [] CR2: 0000000000000000 CR3: 0000000077320000 CR4: 00000000000006e0 [] Call Trace: [] ? tipc_bcast_rcv+0x9a/0x1a0 [tipc] [] tipc_rcv+0x40d/0x670 [tipc] [] ? _raw_spin_unlock+0xa/0x20 [] tipc_l2_rcv_msg+0x55/0x80 [tipc] [] __netif_receive_skb_one_core+0x8c/0xa0 [] process_backlog+0x98/0x140 [] net_rx_action+0x13a/0x420 [] __do_softirq+0xdb/0x316 [] ? smpboot_thread_fn+0x2f/0x1e0 [] ? smpboot_thread_fn+0x74/0x1e0 [] ? smpboot_thread_fn+0x14e/0x1e0 [] run_ksoftirqd+0x1a/0x40 [] smpboot_thread_fn+0x149/0x1e0 [] ? sort_range+0x20/0x20 [] kthread+0x131/0x150 [] ? kthread_unuse_mm+0xa0/0xa0 [] ret_from_fork+0x22/0x30 [] Modules linked in: veth tipc(O) ip6_udp_tunnel udp_tunnel [...] [] CR2: 0000000000000000 [] ---[ end trace 65c276a8e2e2f310 ]--- To fix this, we need to grab the lock of the 'namedq' list on both path calling. Fixes: `cad2929dc4` ("tipc: update a binding service via broadcast") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 18:29:06 -07:00
Cong Wang	ed42989eab	tipc: fix the skb_unshare() in tipc_buf_append() skb_unshare() drops a reference count on the old skb unconditionally, so in the failure case, we end up freeing the skb twice here. And because the skb is allocated in fclone and cloned by caller tipc_msg_reassemble(), the consequence is actually freeing the original skb too, thus triggered the UAF by syzbot. Fix this by replacing this skb_unshare() with skb_cloned()+skb_copy(). Fixes: `ff48b6222e` ("tipc: use skb_unshare() instead in tipc_buf_append()") Reported-and-tested-by: syzbot+e96a7ba46281824cc46a@syzkaller.appspotmail.com Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 18:22:56 -07:00
Karsten Graul	f29fa00399	net/smc: restore smcd_version when all ISM V2 devices failed to init Field ini->smcd_version is set to SMC_V2 before calling smc_listen_ism_init(). This clears the V1 bit that may be set. When all matching ISM V2 devices fail to initialize then the smcd_version field needs to get restored to allow any possible V1 devices to initialize. And be consistent, always go to the not_found label when no device was found. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 18:15:47 -07:00
Karsten Graul	9047a617dc	net/smc: cleanup buffer usage in smc_listen_work() coccinelle informs about net/smc/af_smc.c:1770:10-11: WARNING: opportunity for kzfree/kvfree_sensitive Its not that kzfree() would help here, the memset() is done to prepare the buffer for another socket receive. Fix that warning message by reordering the calls, while at it eliminate the unneeded variable cclc2 and use sizeof(*buf) as above in the same function. No functional changes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 18:15:47 -07:00
Karsten Graul	c60a2cefb3	net/smc: consolidate unlocking in same function Static code checkers warn of inconsistent returns because the lgr mutex is locked in one function and unlocked in a function called by the locking function: net/smc/af_smc.c:823 smc_connect_rdma() warn: inconsistent returns 'smc_client_lgr_pending'. net/smc/af_smc.c:897 smc_connect_ism() warn: inconsistent returns 'smc_server_lgr_pending'. Make the code consistent by doing the unlock in the same function that fetches the lock. No functional changes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 18:15:47 -07:00
Jakub Kicinski	8f5e71b9d3	linux-can-next-for-5.10-20201007 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl9+MLsTHG1rbEBwZW5n dXRyb25peC5kZQAKCRCpyVqK+u3vqScfCACBLbt8heMyVQQYd130/oIAvQ2o8Bjf DPa36CVMYOGH1KOCnEyuc+oKsXeJfy33Faxe3+s7q2aMkddH7zhHQNiohPgwWAsM AuAjRgbAVkDd9BTAbOS/cujftZBBccGRUuusbCg7lsBdwhGQbCggfYbOmGt2B8Gv Wt3s2td8i90WutFb3UN3ec5N44jTn+WQvuOjX0Dzt/qi3r5qyC5JvcdkW3LAfxEG X7bJ6cf8HRgUyPAALJGdoWdKT+ImmFbUJc8WuX9PdlYzxR+FyPKQgxD187ARTeLA OqQSMHi4jRNywei0WUg5YoR0qPAHWKOLIMBK45TWNfa0p2+JzRbxTf3o =7UL+ -----END PGP SIGNATURE----- Merge tag 'linux-can-next-for-5.10-20201007' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== linux-can-next-for-5.10-20201007 The first 3 patches are by me and fix several warnings found when compiling the kernel with W=1. Lukas Bulwahn's patch adjusts the MAINTAINERS file, to accommodate the renaming of the mcp251xfd driver. Vincent Mailhol contributes 3 patches for the CAN networking layer. First error queue support is added the the CAN RAW protocol. The second patch converts the get_can_dlc() and get_canfd_dlc() in-Kernel-only macros from using __u8 to u8. The third patch adds a helper function to calculate the length of one bit in in multiple of time quanta. Oliver Hartkopp's patch add support for the ISO 15765-2:2016 transport protocol to the CAN stack. Three patches by Lad Prabhakar add documentation for various new rcar controllers to the device tree bindings of the rcar_can and rcan_canfd driver. Michael Walle's patch adds various processors to the flexcan driver binding documentation. The next two patches are by me and target the flexcan driver aswell. The remove the ack_grp and ack_bit from the fsl,stop-mode DT property and the driver, as they are not used anymore. As these are the last two arguments this change will not break existing device trees. The last three patches are by Srinivas Neeli and target the xilinx_can driver. The first one increases the lower limit for the bit rate prescaler to 2, the other two fix sparse and coverity findings. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 17:58:46 -07:00
Rohit Maheshwari	ea1dd3e9d0	net/tls: sendfile fails with ktls offload At first when sendpage gets called, if there is more data, 'more' in tls_push_data() gets set which later sets pending_open_record_frags, but when there is no more data in file left, and last time tls_push_data() gets called, pending_open_record_frags doesn't get reset. And later when 2 bytes of encrypted alert comes as sendmsg, it first checks for pending_open_record_frags, and since this is set, it creates a record with 0 data bytes to encrypt, meaning record length is prepend_size + tag_size only, which causes problem. We should set/reset pending_open_record_frags based on more bit. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 16:42:02 -07:00
Georg Kohmann	4a65dff81a	net: ipv6: Discard next-hop MTU less than minimum link MTU When a ICMPV6_PKT_TOOBIG report a next-hop MTU that is less than the IPv6 minimum link MTU, the estimated path MTU is reduced to the minimum link MTU. This behaviour breaks TAHI IPv6 Core Conformance Test v6LC4.1.6: Packet Too Big Less than IPv6 MTU. Referring to RFC 8201 section 4: "If a node receives a Packet Too Big message reporting a next-hop MTU that is less than the IPv6 minimum link MTU, it must discard it. A node must not reduce its estimate of the Path MTU below the IPv6 minimum link MTU on receipt of a Packet Too Big message." Drop the path MTU update if reported MTU is less than the minimum link MTU. Signed-off-by: Georg Kohmann <geokohma@cisco.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 16:22:59 -07:00
Manjunath Patil	9f0bb95eee	net/rds: suppress page allocation failure error in recv buffer refill RDS/IB tries to refill the recv buffer in softirq context using GFP_NOWAIT flag. However alloc failure is handled by queueing a work to refill the recv buffer with GFP_KERNEL flag. This means failure to allocate with GFP_NOWAIT isn't fatal. Do not print the PAF warnings if softirq context fails to refill the recv buffer. We will see the PAF warnings when worker also fails to allocate. Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:32:03 -07:00
Moshe Shemesh	195d9dece1	devlink: Add enable_remote_dev_reset generic parameter The enable_remote_dev_reset devlink param flags that the host admin allows device resets that can be initiated by other hosts. This parameter is useful for setups where a device is shared by different hosts, such as multi-host setup. Once the user set this parameter to false, the driver should NACK any attempt to reset the device while the driver is loaded. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:53 -07:00
Moshe Shemesh	77069ba2e3	devlink: Add remote reload stats Add remote reload stats to hold the history of actions performed due devlink reload commands initiated by remote host. For example, in case firmware activation with reset finished successfully but was initiated by remote host. The function devlink_remote_reload_actions_performed() is exported to enable drivers update on remote reload actions performed as it was not initiated by their own devlink instance. Expose devlink remote reload stats to the user through devlink dev get command. Examples: $ devlink dev show pci/0000:82:00.0: stats: reload: driver_reinit 2 fw_activate 1 fw_activate_no_reset 0 remote_reload: driver_reinit 0 fw_activate 0 fw_activate_no_reset 0 pci/0000:82:00.1: stats: reload: driver_reinit 1 fw_activate 0 fw_activate_no_reset 0 remote_reload: driver_reinit 1 fw_activate 1 fw_activate_no_reset 0 $ devlink dev show -jp { "dev": { "pci/0000:82:00.0": { "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 }, "remote_reload": { "driver_reinit": 0, "fw_activate": 0, "fw_activate_no_reset": 0 } } }, "pci/0000:82:00.1": { "stats": { "reload": { "driver_reinit": 1, "fw_activate": 0, "fw_activate_no_reset": 0 }, "remote_reload": { "driver_reinit": 1, "fw_activate": 1, "fw_activate_no_reset": 0 } } } } } Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:52 -07:00
Moshe Shemesh	a254c26426	devlink: Add reload stats Add reload stats to hold the history per reload action type and limit. For example, the number of times fw_activate has been performed on this device since the driver module was added or if the firmware activation was performed with or without reset. Add devlink notification on stats update. Expose devlink reload stats to the user through devlink dev get command. Examples: $ devlink dev show pci/0000:82:00.0: stats: reload: driver_reinit 2 fw_activate 1 fw_activate_no_reset 0 pci/0000:82:00.1: stats: reload: driver_reinit 1 fw_activate 0 fw_activate_no_reset 0 $ devlink dev show -jp { "dev": { "pci/0000:82:00.0": { "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 } } }, "pci/0000:82:00.1": { "stats": { "reload": { "driver_reinit": 1, "fw_activate": 0, "fw_activate_no_reset": 0 } } } } } Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:52 -07:00
Moshe Shemesh	dc64cc7c63	devlink: Add devlink reload limit option Add reload limit to demand restrictions on reload actions. Reload limits supported: no_reset: No reset allowed, no down time allowed, no link flap and no configuration is lost. By default reload limit is unspecified and so no constraints on reload actions are required. Some combinations of action and limit are invalid. For example, driver can not reinitialize its entities without any downtime. The no_reset reload limit will have usecase in this patchset to implement restricted fw_activate on mlx5. Have the uapi parameter of reload limit ready for future support of multiselection. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:52 -07:00
Moshe Shemesh	ccdf07219d	devlink: Add reload action option to devlink reload command Add devlink reload action to allow the user to request a specific reload action. The action parameter is optional, if not specified then devlink driver re-init action is used (backward compatible). Note that when required to do firmware activation some drivers may need to reload the driver. On the other hand some drivers may need to reset the firmware to reinitialize the driver entities. Therefore, the devlink reload command returns the actions which were actually performed. Reload actions supported are: driver_reinit: driver entities re-initialization, applying devlink-param and devlink-resource values. fw_activate: firmware activate. command examples: $devlink dev reload pci/0000:82:00.0 action driver_reinit reload_actions_performed: driver_reinit $devlink dev reload pci/0000:82:00.0 action fw_activate reload_actions_performed: driver_reinit fw_activate Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:52 -07:00
Moshe Shemesh	69d56e0ea0	devlink: Change devlink_reload_supported() param type Change devlink_reload_supported() function to get devlink_ops pointer param instead of devlink pointer param. This change will be used in the next patch to check if devlink reload is supported before devlink instance is allocated. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 12:06:52 -07:00
Davide Caratti	37198e93ce	net: mptcp: make DACK4/DACK8 usage consistent among all subflows using packetdrill it's possible to observe the same MPTCP DSN being acked by different subflows with DACK4 and DACK8. This is in contrast with what specified in RFC8684 §3.3.2: if an MPTCP endpoint transmits a 64-bit wide DSN, it MUST be acknowledged with a 64-bit wide DACK. Fix 'use_64bit_ack' variable to make it a property of MPTCP sockets, not TCP subflows. Fixes: `a0c1d0eafd` ("mptcp: Use 32-bit DATA_ACK when possible") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 08:25:48 -07:00
Nikita V. Shirokov	eca43ee6c4	bpf: Add tcp_notsent_lowat bpf setsockopt Adding support for TCP_NOTSENT_LOWAT sockoption (https://lwn.net/Articles/560082/) in tcp bpf programs. Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201009070325.226855-1-tehnerd@tehnerd.com	2020-10-09 17:12:03 +02:00
Eric Dumazet	846e463a70	net/sched: get rid of qdisc->padded kmalloc() of sufficiently big portion of memory is cache-aligned in regular conditions. If some debugging options are used, there is no reason qdisc structures would need 64-byte alignment if most other kernel structures are not aligned. This get rid of QDISC_ALIGN and QDISC_ALIGNTO. Addition of privdata field will help implementing the reverse of qdisc_priv() and documents where the private data is. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Allen Pais <allen.lkml@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-09 08:08:08 -07:00
Magnus Karlsson	c3f01fdced	xsk: Introduce padding between ring pointers Introduce one cache line worth of padding between the producer and consumer pointers in all the lockless rings. This so that the HW adjacency prefetcher will not prefetch the consumer pointer when the producer pointer is used and vice versa. This improves throughput performance for the l2fwd sample app with 2% on my machine with HW prefetching turned on. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com	2020-10-09 16:35:01 +02:00
Xin Long	7fe94612dd	xfrm: interface: fix the priorities for ipip and ipv6 tunnels As Nicolas noticed in his case, when xfrm_interface module is installed the standard IP tunnels will break in receiving packets. This is caused by the IP tunnel handlers with a higher priority in xfrm interface processing incoming packets by xfrm_input(), which would drop the packets and return 0 instead when anything wrong happens. Rather than changing xfrm_input(), this patch is to adjust the priority for the IP tunnel handlers in xfrm interface, so that the packets would go to xfrmi's later than the others', as the others' would not drop the packets when the handlers couldn't process them. Note that IPCOMP also defines its own IPIP tunnel handler and it calls xfrm_input() as well, so we must make its priority lower than xfrmi's, which means having xfrmi loaded would still break IPCOMP. We may seek another way to fix it in xfrm_input() in the future. Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Fixes: `da9bbf0598` ("xfrm: interface: support IPIP and IPIP6 tunnels processing with .cb_handler") FIxes: `d7b360c286` ("xfrm: interface: support IP6IP6 and IP6IP tunnels processing with .cb_handler") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2020-10-09 12:29:48 +02:00
Ye Bin	316a1bef0d	9p/xen: Fix format argument warning Fix follow warnings: [net/9p/trans_xen.c:454]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. [net/9p/trans_xen.c:460]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. Link: http://lkml.kernel.org/r/20201009080552.89918-1-yebin10@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2020-10-09 10:23:29 +02:00
Paolo Abeni	d9fb8c507d	mptcp: fix infinite loop on recvmsg()/worker() race. If recvmsg() and the workqueue race to dequeue the data pending on some subflow, the current mapping for such subflow covers several skbs and some of them have not reached yet the received, either the worker or recvmsg() can find a subflow with the data_avail flag set - since the current mapping is valid and in sequence - but no skbs in the receive queue - since the other entity just processed them. The above will lead to an unbounded loop in __mptcp_move_skbs() and a subsequent hang of any task trying to acquiring the msk socket lock. This change addresses the issue stopping the __mptcp_move_skbs() loop as soon as we detect the above race (empty receive queue with data_avail set). Reported-and-tested-by: syzbot+fcf8ca5817d6e92c6567@syzkaller.appspotmail.com Fixes: `ab174ad8ef` ("mptcp: move ooo skbs into msk out of order queue.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 17:24:04 -07:00
Johannes Berg	fd15dd0586	ethtool: correct policy for ETHTOOL_MSG_CHANNELS_SET This accidentally got wired up to the get policy instead of the set policy, causing operations to be rejected. Fix it by wiring up the correct policy instead. Fixes: `5028588b62` ("ethtool: wire up set policies to ops") Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 16:06:01 -07:00
Johannes Berg	db972e5325	ethtool: strset: allow ETHTOOL_A_STRSET_COUNTS_ONLY attr The ETHTOOL_A_STRSET_COUNTS_ONLY flag attribute was previously not allowed to be used, but now due to the policy size reduction we would access the tb[] array out of bounds since we tried to check for the attribute despite it not being accepted. Fix both issues by adding it correctly to the appropriate policy. Fixes: `ff419afa43` ("ethtool: trim policy tables") Fixes: `71921690f9` ("ethtool: provide string sets with STRSET_GET request") Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 16:06:01 -07:00
Jakub Kicinski	9d49aea13f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 15:44:50 -07:00
Marc Kleine-Budde	13ba4c4344	net: j1939: j1939_session_fresh_new(): fix missing initialization of skbcnt This patch add the initialization of skbcnt, similar to: `e009f95b15` can: j1935: j1939_tp_tx_dat_new(): fix missing initialization of skbcnt Let's play save and initialize this skbcnt as well. Suggested-by: Jakub Kicinski <kuba@kernel.org> Fixes: `9d71dd0c70` ("can: add support of SAE J1939 protocol") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-08 23:28:09 +02:00
Cong Wang	e009f95b15	can: j1935: j1939_tp_tx_dat_new(): fix missing initialization of skbcnt This fixes an uninit-value warning: BUG: KMSAN: uninit-value in can_receive+0x26b/0x630 net/can/af_can.c:650 Reported-and-tested-by: syzbot+3f3837e61a48d32b495f@syzkaller.appspotmail.com Fixes: `9d71dd0c70` ("can: add support of SAE J1939 protocol") Cc: Robin van der Gracht <robin@protonic.nl> Cc: Oleksij Rempel <linux@rempel-privat.de> Cc: Pengutronix Kernel Team <kernel@pengutronix.de> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20201008061821.24663-1-xiyou.wangcong@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-08 23:21:46 +02:00
Dumitru Ceara	8aa7b526dc	openvswitch: handle DNAT tuple collision With multiple DNAT rules it's possible that after destination translation the resulting tuples collide. For example, two openvswitch flows: nw_dst=10.0.0.10,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20)) nw_dst=10.0.0.20,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20)) Assuming two TCP clients initiating the following connections: 10.0.0.10:5000->10.0.0.10:10 10.0.0.10:5000->10.0.0.20:10 Both tuples would translate to 10.0.0.10:5000->20.0.0.1:20 causing nf_conntrack_confirm() to fail because of tuple collision. Netfilter handles this case by allocating a null binding for SNAT at egress by default. Perform the same operation in openvswitch for DNAT if no explicit SNAT is requested by the user and allocate a null binding for SNAT for packets in the "original" direction. Reported-at: https://bugzilla.redhat.com/1877128 Suggested-by: Florian Westphal <fw@strlen.de> Fixes: `05752523e5` ("openvswitch: Interface with NAT.") Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:20:35 -07:00
Eric Dumazet	d42ee76ecb	sctp: fix sctp_auth_init_hmacs() error path After freeing ep->auth_hmacs we have to clear the pointer or risk use-after-free as reported by syzbot: BUG: KASAN: use-after-free in sctp_auth_destroy_hmacs net/sctp/auth.c:509 [inline] BUG: KASAN: use-after-free in sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline] BUG: KASAN: use-after-free in sctp_auth_free+0x17e/0x1d0 net/sctp/auth.c:1070 Read of size 8 at addr ffff8880a8ff52c0 by task syz-executor941/6874 CPU: 0 PID: 6874 Comm: syz-executor941 Not tainted 5.9.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 sctp_auth_destroy_hmacs net/sctp/auth.c:509 [inline] sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline] sctp_auth_free+0x17e/0x1d0 net/sctp/auth.c:1070 sctp_endpoint_destroy+0x95/0x240 net/sctp/endpointola.c:203 sctp_endpoint_put net/sctp/endpointola.c:236 [inline] sctp_endpoint_free+0xd6/0x110 net/sctp/endpointola.c:183 sctp_destroy_sock+0x9c/0x3c0 net/sctp/socket.c:4981 sctp_v6_destroy_sock+0x11/0x20 net/sctp/socket.c:9415 sk_common_release+0x64/0x390 net/core/sock.c:3254 sctp_close+0x4ce/0x8b0 net/sctp/socket.c:1533 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:475 __sock_release+0xcd/0x280 net/socket.c:596 sock_close+0x18/0x20 net/socket.c:1277 __fput+0x285/0x920 fs/file_table.c:281 task_work_run+0xdd/0x190 kernel/task_work.c:141 exit_task_work include/linux/task_work.h:25 [inline] do_exit+0xb7d/0x29f0 kernel/exit.c:806 do_group_exit+0x125/0x310 kernel/exit.c:903 __do_sys_exit_group kernel/exit.c:914 [inline] __se_sys_exit_group kernel/exit.c:912 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:912 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x43f278 Code: Bad RIP value. RSP: 002b:00007fffe0995c38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f278 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 00000000004bf068 R08: 00000000000000e7 R09: ffffffffffffffd0 R10: 0000000020000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 6874: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 kmem_cache_alloc_trace+0x174/0x300 mm/slab.c:3554 kmalloc include/linux/slab.h:554 [inline] kmalloc_array include/linux/slab.h:593 [inline] kcalloc include/linux/slab.h:605 [inline] sctp_auth_init_hmacs+0xdb/0x3b0 net/sctp/auth.c:464 sctp_auth_init+0x8a/0x4a0 net/sctp/auth.c:1049 sctp_setsockopt_auth_supported net/sctp/socket.c:4354 [inline] sctp_setsockopt+0x477e/0x97f0 net/sctp/socket.c:4631 __sys_setsockopt+0x2db/0x610 net/socket.c:2132 __do_sys_setsockopt net/socket.c:2143 [inline] __se_sys_setsockopt net/socket.c:2140 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2140 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6874: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422 __cache_free mm/slab.c:3422 [inline] kfree+0x10e/0x2b0 mm/slab.c:3760 sctp_auth_destroy_hmacs net/sctp/auth.c:511 [inline] sctp_auth_destroy_hmacs net/sctp/auth.c:501 [inline] sctp_auth_init_hmacs net/sctp/auth.c:496 [inline] sctp_auth_init_hmacs+0x2b7/0x3b0 net/sctp/auth.c:454 sctp_auth_init+0x8a/0x4a0 net/sctp/auth.c:1049 sctp_setsockopt_auth_supported net/sctp/socket.c:4354 [inline] sctp_setsockopt+0x477e/0x97f0 net/sctp/socket.c:4631 __sys_setsockopt+0x2db/0x610 net/socket.c:2132 __do_sys_setsockopt net/socket.c:2143 [inline] __se_sys_setsockopt net/socket.c:2140 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2140 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `1f485649f5` ("[SCTP]: Implement SCTP-AUTH internals") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:19:51 -07:00
Jakub Kicinski	a9e54cb3d5	A single fix for missing input validation in nl80211. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl9+7CIACgkQB8qZga/f l8Q63A//U8OEnil62jlD3om0eRYTyI4kIl67DLG0EMK9rlI3BRDqSSNudQ7hJtsw VhHwcXgLF2ztwp1N7dLKl0AJKOsivajZiWdoyEYosCrnyG8ZxEUX22A5AlMO7sWa vREKrtl9AtpPi05lifaEGv0kCkl8Q0gsv0ENCkq4Fs5anVTcUEdUfOiiclwEDtM4 5OPOVTKpzhU1XXBMBWkNp6pqHGRXTLk/PqhjaIsMtaB5qxkrHm3txFTnTrU3+0oA tFmedbWoqVmDdUDaeE2hIyBdIIqNnxPX+ccI5NJC2/ZPkBS3DrtMGRVsSCS2yDIk y5zQnnvkaQPJ5mLLQyyEuIf1tJEavYnT2bHpoy6B12rlBjt5FHodVs3QGvO7qVfm nBBchLmtHcZOYNZ4jRPQriZc9ZkffZbhiDNcydxo4YRQnTMGc4BkfvKUuLSo0/zP 9S0qdFgDipUqzvn6S/ICAEPPe4+JQ3h9DAO1Ky8MbuPlg/up2IK7XRVuamfZokto GmuwawDqPYDH85w+gHwfP5PVrg0ItRba8OD/FeYvwHMBF/WXQdGdLaFXmxWcehUx f7LP8WtkBIo/pzMhYV8wpmy8oEfgVgx4o67TRu5jkrbJZkbdv35bOqgrfCyTMTjb /IXSU9ERBtL2Pt9xZxNfF6hbYjD3FgPUeFpKe7kU8HhyzxRu/BI= =R/2r -----END PGP SIGNATURE----- Merge tag 'mac80211-for-net-2020-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== pull-request: mac80211 2020-10-08 A single fix for missing input validation in nl80211. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:18:34 -07:00
Jakub Kicinski	cfe90f4980	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2020-10-08 The main changes are: 1) Fix "unresolved symbol" build error under CONFIG_NET w/o CONFIG_INET due to missing tcp_timewait_sock and inet_timewait_sock BTF, from Yonghong Song. 2) Fix 32 bit sub-register bounds tracking for OR case, from Daniel Borkmann. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:05:37 -07:00
Henrik Bjoernlund	b6c02ef549	bridge: Netlink interface fix. This commit is correcting NETLINK br_fill_ifinfo() to be able to handle 'filter_mask' with multiple flags asserted. Fixes: `36a8e8e265` ("bridge: Extend br_fill_ifinfo to return MPR status") Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-08 12:05:07 -07:00
Anant Thazhemadam	3dc289f8f1	net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key() In nl80211_parse_key(), key.idx is first initialized as -1. If this value of key.idx remains unmodified and gets returned, and nl80211_key_allowed() also returns 0, then rdev_del_key() gets called with key.idx = -1. This causes an out-of-bounds array access. Handle this issue by checking if the value of key.idx after nl80211_parse_key() is called and return -EINVAL if key.idx < 0. Cc: stable@vger.kernel.org Reported-by: syzbot+b1bb342d1d097516cbda@syzkaller.appspotmail.com Tested-by: syzbot+b1bb342d1d097516cbda@syzkaller.appspotmail.com Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Link: https://lore.kernel.org/r/20201007035401.9522-1-anant.thazhemadam@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-10-08 12:37:25 +02:00
Rajkumar Manoharan	ba6ff70a3b	mac80211: copy configured beacon tx rate to driver The user is allowed to change beacon tx rate (HT/VHT/HE) from hostapd. This information needs to be passed to the driver when the rate control is offloaded to the firmware. The driver capability of allowing beacon rate is already validated in cfg80211, so simply passing the rate information to the driver is enough. Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org> Link: https://lore.kernel.org/r/1601762658-15627-1-git-send-email-rmanohar@codeaurora.org [adjust commit message slightly] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-10-08 12:26:35 +02:00
Thomas Pedersen	c1cd35c606	cfg80211: only allow S1G channels on S1G band As discovered by syzbot, cfg80211 was accepting S1G channel widths on non-S1G bands. Add a check for this, and consolidate the 1MHz frequency check as it ends up being a subset of the others. Reported-by: syzbot+92715a0eccd6c881bc32@syzkaller.appspotmail.com Fixes: `11b34737b1` ("nl80211: support setting S1G channels") Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20201005165122.17583-1-thomas@adapt-ip.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-10-08 10:41:24 +02:00
Thomas Pedersen	12bf8fad4c	mac80211: initialize last_rate for S1G STAs last_rate is initialized to zero by sta_info_alloc(), but this indicates legacy bitrate for the last TX rate (and invalid for the last RX rate). To avoid a warning when decoding the last rate as legacy (before a data frame has been sent), initialize them as S1G MCS. Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20201005164522.18069-2-thomas@adapt-ip.com [rename to ieee80211_s1g_sta_rate_init(), seems more appropriate] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-10-08 10:40:57 +02:00
Thomas Pedersen	8b783d104e	mac80211: handle lack of sband->bitrates in rates Even though a driver or mac80211 shouldn't produce a legacy bitrate if sband->bitrates doesn't exist, don't crash if that is the case either. This fixes a kernel panic if station dump is run before last_rate can be updated with a data frame when sband->bitrates is missing (eg. in S1G bands). Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com> Link: https://lore.kernel.org/r/20201005164522.18069-1-thomas@adapt-ip.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-10-08 10:33:54 +02:00
Oliver Hartkopp	e057dd3fc2	can: add ISO 15765-2:2016 transport protocol CAN Transport Protocols offer support for segmented Point-to-Point communication between CAN nodes via two defined CAN Identifiers. As CAN frames can only transport a small amount of data bytes (max. 8 bytes for 'classic' CAN and max. 64 bytes for CAN FD) this segmentation is needed to transport longer PDUs as needed e.g. for vehicle diagnosis (UDS, ISO 14229) or IP-over-CAN traffic. This protocol driver implements data transfers according to ISO 15765-2:2016 for 'classic' CAN and CAN FD frame types. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20200928200404.82229-1-socketcan@hartkopp.net [mkl: Removed "WITH Linux-syscall-note" from isotp.c. Fixed indention, a checkpatch warning and typos. Replaced __u{8,32} by u{8,32}. Removed always false (optlen < 0) check in isotp_setsockopt().] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-07 23:18:33 +02:00
Anna Schumaker	e6ac0accb2	SUNRPC: Add an xdr_align_data() function For now, this function simply aligns the data at the beginning of the pages. This can eventually be expanded to shift data to the correct offsets when we're ready. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:40 -04:00
Anna Schumaker	84ce182ab8	SUNRPC: Add the ability to expand holes in data pages This patch adds the ability to "read a hole" into a set of XDR data pages by taking the following steps: 1) Shift all data after the current xdr->p to the right, possibly into the tail, 2) Zero the specified range, and 3) Update xdr->p to point beyond the hole. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:39 -04:00
Anna Schumaker	43f0f0816c	SUNRPC: Split out _shift_data_right_tail() xdr_shrink_pagelen() is very similar to what we need for hole expansion, so split out the common code into its own function that can be used by both functions. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:39 -04:00
Anna Schumaker	06216ecbd9	SUNRPC: Split out xdr_realign_pages() from xdr_align_pages() I don't need the entire align pages code for READ_PLUS, so split out the part I do need so I don't need to reimplement anything. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:39 -04:00
Anna Schumaker	cf1f08cac3	SUNRPC: Implement a xdr_page_pos() function I'll need this for READ_PLUS to help figure out the offset where page data is stored at, but it might also be useful for other things. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:39 -04:00
Anna Schumaker	f7d61ee414	SUNRPC: Split out a function for setting current page I'm going to need this bit of code in a few places for READ_PLUS decoding, so let's make it a helper function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-10-07 14:28:39 -04:00
Vincent Mailhol	eb88531bdb	can: raw: add missing error queue support Error queue are not yet implemented in CAN-raw sockets. The problem: a userland call to recvmsg(soc, msg, MSG_ERRQUEUE) on a CAN-raw socket would unqueue messages from the normal queue without any kind of error or warning. As such, it prevented CAN drivers from using the functionalities that relies on the error queue such as skb_tx_timestamp(). SCM_CAN_RAW_ERRQUEUE is defined as the type for the CAN raw error queue. SCM stands for "Socket control messages". The name is inspired from SCM_J1939_ERRQUEUE of include/uapi/linux/can/j1939.h. Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Link: https://lore.kernel.org/r/20200926162527.270030-1-mailhol.vincent@wanadoo.fr Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-06 22:44:27 +02:00
Marc Kleine-Budde	80ede649ea	can: af_can: can_rcv_list_find(): fix kernel doc after variable renaming This patch fixes the kernel doc for can_rcv_list_find() which was broken in commit: `3ee6d2bebe` ("can: af_can: rename find_rcv_list() to can_rcv_list_find()") while renaming a variable, but forgetting to rename the kernel doc, too. Link: http://lore.kernel.org/r/20201006203748.1750156-2-mkl@pengutronix.de Fixes: `3ee6d2bebe` ("can: af_can: rename find_rcv_list() to can_rcv_list_find()") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-10-06 22:42:07 +02:00
Jakub Kicinski	a0de1cd356	ethtool: specify which header flags are supported per command Perform header flags validation through the policy. Only pause command supports ETHTOOL_FLAG_STATS. Create a separate policy to be able to express that in policy dumps to user space. Note that even though the core will validate the header policy, it cannot record multiple layers of attributes and we have to re-parse header sub-attrs. When doing so we could skip attribute validation, or use most permissive policy. Opt for the former. We will no longer return the extack cookie for flags but since we only added first new flag in this release it's not expected that any user space had a chance to make use of it. v2: - remove the re-validation in ethnl_parse_header_dev_get() Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Jakub Kicinski	bdbb4e29df	netlink: add mask validation We don't have good validation policy for existing unsigned int attrs which serve as flags (for new ones we could use NLA_BITFIELD32). With increased use of policy dumping having the validation be expressed as part of the policy is important. Add validation policy in form of a mask of supported/valid bits. Support u64 in the uAPI to be future-proof, but really for now the embedded mask member can only hold 32 bits, so anything with bit 32+ set will always fail validation. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Jakub Kicinski	329d9c333e	ethtool: link up ethnl_header_policy as a nested policy To get the most out of parsing by the core, and to allow dumping full policies we need to specify which policy applies to nested attrs. For headers it's ethnl_header_policy. $ sed -i 's@$ETHTOOL_A_.HEADER\].=$ { .type = NLA_NESTED },@\1\n\t\tNLA_POLICY_NESTED(ethnl_header_policy),@' net/ethtool/* Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Jakub Kicinski	ff419afa43	ethtool: trim policy tables Since ethtool uses strict attribute validation there's no need to initialize all attributes in policy tables. 0 is NLA_UNSPEC which is going to be rejected. Remove the NLA_REJECTs. Similarly attributes above maxattrs are rejected, so there's no need to always size the policy tables to ETHTOOL_A_..._MAX. v2: - new patch Suggested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Jakub Kicinski	5028588b62	ethtool: wire up set policies to ops Similarly to get commands wire up the policies of set commands to get parsing by the core and policy dumps. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Jakub Kicinski	4f30974feb	ethtool: wire up get policies to ops Wire up policies for get commands in struct nla_policy of the ethtool family. Make use of genetlink code attr validation and parsing, as well as allow dumping policies to user space. For every ETHTOOL_MSG_*_GET: - add 'ethnl_' prefix to policy name - add extern declaration in net/ethtool/netlink.h - wire up the policy & attr in ethtool_genl_ops[]. - remove .request_policy and .max_attr from ethnl_request_ops. Obviously core only records the first "layer" of parsed attrs so we still need to parse the sub-attrs of the nested header attribute. v2: - merge of patches 1 and 2 from v1 - remove stray empty lines in ops - also remove .max_attr Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:25:55 -07:00
Fabian Frederick	560b50cf6c	ipv4: use dev_sw_netstats_rx_add() use new helper for netstats settings Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:23:21 -07:00
Fabian Frederick	e40b3727f9	net: openvswitch: use dev_sw_netstats_rx_add() use new helper for netstats settings Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:23:21 -07:00
Fabian Frederick	c852162ea9	xfrm: use dev_sw_netstats_rx_add() use new helper for netstats settings Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:23:21 -07:00
Fabian Frederick	5711eb0502	ipv6: use dev_sw_netstats_rx_add() use new helper for netstats settings Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:23:21 -07:00
David S. Miller	d91dc434f2	rxrpc fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl97RWEACgkQ+7dXa6fL C2sxNBAAhr1dnVfGHAV7mUVAv8BtNwY6B+mczIo48k53oiy0+Ngh83yrcdt2EkmY s3JdbWq1rVlCps6zOOefKYfXG8FS2guFVDjKl9SaC6nYmxdEPnRmbW9mlhiFg/Na xLnYVcJnuHw2ymisaRkARQn4w6F4CfEYBI9pbRpiw2d7vfD+Rziu49JMqVbTc2mF g8tY0KPt81TouPlc//5BrY0dFat06gRbBsYcLmL/x/9aNofWg6F8dse9Evixgl3y sY+ZwQkIxipYVyfuS9Z2UVhFTcYSvbTKWgvE08f9AK7iO6Y35hI4HIkZckIepgU0 rRNZY5AAq6Qb/kbGwIN27GDD/Ef8SqrW5NFdyRQykr8h1DIxGi5BlWRpVcpH1d9x JI4fAp9dAcySOtusETrOBMvczz9wxB1HSe0tmrUP3lx0DLA484zdR8M+rQNPcEOK M/x83hmIkMnmd3dH/eVNx0OwA35KVQ/eW79QsfDhnG2JVms4jwzqe/QfGpwXl2q9 SYNrlJZe6HjypNdWwMPZLswKzKe+7v9zKxY69TvsdKmqycQf2hVwsIxRmAr1GHEc dQX3ag+LzS8elgqWRZ/NC4y8ojUgO73BhgL1DCrSgvu1UIzMC9bNSxrsdN+d3VSt ZKzaFGQ9E9GDGSvfVJt/yRAb7kjQdeXchowWSGg804fPEzlGmds= =dmWc -----END PGP SIGNATURE----- Merge tag 'rxrpc-fixes-20201005' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Miscellaneous fixes Here are some miscellaneous rxrpc fixes: (1) Fix the xdr encoding of the contents read from an rxrpc key. (2) Fix a BUG() for a unsupported encoding type. (3) Fix missing _bh lock annotations. (4) Fix acceptance handling for an incoming call where the incoming call is encrypted. (5) The server token keyring isn't network namespaced - it belongs to the server, so there's no need. Namespacing it means that request_key() fails to find it. (6) Fix a leak of the server keyring. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:18:20 -07:00
Igor Russkikh	c6db31ffe2	ethtool: allow netdev driver to define phy tunables Define get/set phy tunable callbacks in ethtool ops. This will allow MAC drivers with integrated PHY still to implement these tunables. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:16:01 -07:00
Vladimir Oltean	302af7c604	net: always dump full packets with skb_dump Currently skb_dump has a restriction to only dump full packet for the first 5 socket buffers, then only headers will be printed. Remove this arbitrary and confusing restriction, which is only documented vaguely ("up to") in the comments above the prototype. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:14:02 -07:00
Eric Dumazet	86bccd0367	tcp: fix receive window update in tcp_add_backlog() We got reports from GKE customers flows being reset by netfilter conntrack unless nf_conntrack_tcp_be_liberal is set to 1. Traces seemed to suggest ACK packet being dropped by the packet capture, or more likely that ACK were received in the wrong order. wscale=7, SYN and SYNACK not shown here. This ACK allows the sender to send 1871128 bytes from seq 51359321 : New right edge of the window -> 51359321+1871128=51598809 09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0 09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408 09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0 09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768 09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0 09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384 Receiver now allows to send 606128=77568 from seq 51521241 : New right edge of the window -> 51521241+606128=51598809 09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0 09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384 It seems the sender exceeds RWIN allowance, since 51611353 > 51598809 09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728 09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040 09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0 netfilter conntrack is not happy and sends RST 09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0 09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0 Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet. New right edge of the window -> 51521241+859*128=51631193 Normally TCP stack handles OOO packets just fine, but it turns out tcp_add_backlog() does not. It can update the window field of the aggregated packet even if the ACK sequence of the last received packet is too old. Many thanks to Alexandre Ferrieux for independently reporting the issue and suggesting a fix. Fixes: `4f693b55c3` ("tcp: implement coalescing on backlog queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:11:58 -07:00
Paolo Abeni	717f203416	mptcp: don't skip needed ack Currently we skip calling tcp_cleanup_rbuf() when packets are moved into the OoO queue or simply dropped. In both cases we still increment tp->copied_seq, and we should ask the TCP stack to check for ack. Fixes: `c76c695656` ("mptcp: call tcp_cleanup_rbuf on subflows") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:08:06 -07:00
Paolo Abeni	017512a07e	mptcp: more DATA FIN fixes Currently data fin on data packet are not handled properly: the 'rcv_data_fin_seq' field is interpreted as the last sequence number carrying a valid data, but for data fin packet with valid maps we currently store map_seq + map_len, that is, the next value. The 'write_seq' fields carries instead the value subseguent to the last valid byte, so in mptcp_write_data_fin() we never detect correctly the last DSS map. Fixes: `7279da6145` ("mptcp: Use MPTCP-level flag for sending DATA_FIN") Fixes: `1a49b2c2a5` ("mptcp: Handle incoming 32-bit DATA_FIN values") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:06:59 -07:00
Manivannan Sadhasivam	082bb94fe1	net: qrtr: ns: Fix the incorrect usage of rcu_read_lock() The rcu_read_lock() is not supposed to lock the kernel_sendmsg() API since it has the lock_sock() in qrtr_sendmsg() which will sleep. Hence, fix it by excluding the locking for kernel_sendmsg(). While at it, let's also use radix_tree_deref_retry() to confirm the validity of the pointer returned by radix_tree_deref_slot() and use radix_tree_iter_resume() to resume iterating the tree properly before releasing the lock as suggested by Doug. Fixes: `a7809ff90c` ("net: qrtr: ns: Protect radix_tree_deref_slot() using rcu read locks") Reported-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Tested-by: Alex Elder <elder@linaro.org> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-06 06:01:35 -07:00
David S. Miller	8b0308fe31	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-05 18:40:01 -07:00
Linus Torvalds	165563c050	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Make sure SKB control block is in the proper state during IPSEC ESP-in-TCP encapsulation. From Sabrina Dubroca. 2) Various kinds of attributes were not being cloned properly when we build new xfrm_state objects from existing ones. Fix from Antony Antony. 3) Make sure to keep BTF sections, from Tony Ambardar. 4) TX DMA channels need proper locking in lantiq driver, from Hauke Mehrtens. 5) Honour route MTU during forwarding, always. From Maciej Żenczykowski. 6) Fix races in kTLS which can result in crashes, from Rohit Maheshwari. 7) Skip TCP DSACKs with rediculous sequence ranges, from Priyaranjan Jha. 8) Use correct address family in xfrm state lookups, from Herbert Xu. 9) A bridge FDB flush should not clear out user managed fdb entries with the ext_learn flag set, from Nikolay Aleksandrov. 10) Fix nested locking of netdev address lists, from Taehee Yoo. 11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau. 12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit. 13) Don't free command entries in mlx5 while comp handler could still be running, from Eran Ben Elisha. 14) Error flow of request_irq() in mlx5 is busted, due to an off by one we try to free and IRQ never allocated. From Maor Gottlieb. 15) Fix leak when dumping netlink policies, from Johannes Berg. 16) Sendpage cannot be performed when a page is a slab page, or the page count is < 1. Some subsystems such as nvme were doing so. Create a "sendpage_ok()" helper and use it as needed, from Coly Li. 17) Don't leak request socket when using syncookes with mptcp, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits) net/core: check length before updating Ethertype in skb_mpls_{push,pop} net: mvneta: fix double free of txq->buf net_sched: check error pointer in tcf_dump_walker() net: team: fix memory leak in __team_options_register net: typhoon: Fix a typo Typoon --> Typhoon net: hinic: fix DEVLINK build errors net: stmmac: Modify configuration method of EEE timers tcp: fix syn cookied MPTCP request socket leak libceph: use sendpage_ok() in ceph_tcp_sendpage() scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map() drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage() tcp: use sendpage_ok() to detect misused .sendpage nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send net: introduce helper sendpage_ok() in include/linux/net.h net: usb: pegasus: Proper error handing when setting pegasus' MAC address net: core: document two new elements of struct net_device netlink: fix policy dump leak net/mlx5e: Fix race condition on nhe->n pointer in neigh update net/mlx5e: Fix VLAN create flow ...	2020-10-05 11:27:14 -07:00
David Howells	38b1dc47a3	rxrpc: Fix server keyring leak If someone calls setsockopt() twice to set a server key keyring, the first keyring is leaked. Fix it to return an error instead if the server key keyring is already set. Fixes: `17926a7932` ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 17:09:22 +01:00
David Howells	fea9911124	rxrpc: The server keyring isn't network-namespaced The keyring containing the server's tokens isn't network-namespaced, so it shouldn't be looked up with a network namespace. It is expected to be owned specifically by the server, so namespacing is unnecessary. Fixes: `a58946c158` ("keys: Pass the network namespace into request_key mechanism") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 16:36:06 +01:00
David Howells	2d914c1bf0	rxrpc: Fix accept on a connection that need securing When a new incoming call arrives at an userspace rxrpc socket on a new connection that has a security class set, the code currently pushes it onto the accept queue to hold a ref on it for the socket. This doesn't work, however, as recvmsg() pops it off, notices that it's in the SERVER_SECURING state and discards the ref. This means that the call runs out of refs too early and the kernel oopses. By contrast, a kernel rxrpc socket manually pre-charges the incoming call pool with calls that already have user call IDs assigned, so they are ref'd by the call tree on the socket. Change the mode of operation for userspace rxrpc server sockets to work like this too. Although this is a UAPI change, server sockets aren't currently functional. Fixes: `248f219cb8` ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 16:35:57 +01:00
David Howells	fa1d113a0f	rxrpc: Fix some missing _bh annotations on locking conn->state_lock conn->state_lock may be taken in softirq mode, but a previous patch replaced an outer lock in the response-packet event handling code, and lost the _bh from that when doing so. Fix this by applying the _bh annotation to the state_lock locking. Fixes: `a1399f8bb0` ("rxrpc: Call channels should have separate call number spaces") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 16:34:32 +01:00
David Howells	9a059cd5ca	rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read() If rxrpc_read() (which allows KEYCTL_READ to read a key), sees a token of a type it doesn't recognise, it can BUG in a couple of places, which is unnecessary as it can easily get back to userspace. Fix this to print an error message instead. Fixes: `99455153d0` ("RxRPC: Parse security index 5 keys (Kerberos 5)") Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 16:33:37 +01:00
Marc Dionne	56305118e0	rxrpc: Fix rxkad token xdr encoding The session key should be encoded with just the 8 data bytes and no length; ENCODE_DATA precedes it with a 4 byte length, which confuses some existing tools that try to parse this format. Add an ENCODE_BYTES macro that does not include a length, and use it for the key. Also adjust the expected length. Note that commit `774521f353` ("rxrpc: Fix an assertion in rxrpc_read()") had fixed a BUG by changing the length rather than fixing the encoding. The original length was correct. Fixes: `99455153d0` ("RxRPC: Parse security index 5 keys (Kerberos 5)") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2020-10-05 16:33:28 +01:00
Björn Töpel	b75597d894	xsk: Remove internal DMA headers Christoph Hellwig correctly pointed out [1] that the AF_XDP core was pointlessly including internal headers. Let us remove those includes. [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/ Fixes: `1c1efc2af1` ("xsk: Create and free buffer pool independently from umem") Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com	2020-10-05 15:48:00 +02:00
Vladimir Oltean	2e554a7a5d	net: dsa: propagate switchdev vlan_filtering prepare phase to drivers A driver may refuse to enable VLAN filtering for any reason beyond what the DSA framework cares about, such as: - having tc-flower rules that rely on the switch being VLAN-aware - the particular switch does not support VLAN, even if the driver does (the DSA framework just checks for the presence of the .port_vlan_add and .port_vlan_del pointers) - simply not supporting this configuration to be toggled at runtime Currently, when a driver rejects a configuration it cannot support, it does this from the commit phase, which triggers various warnings in switchdev. So propagate the prepare phase to drivers, to give them the ability to refuse invalid configurations cleanly and avoid the warnings. Since we need to modify all function prototypes and check for the prepare phase from within the drivers, take that opportunity and move the existing driver restrictions within the prepare phase where that is possible and easy. Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: Hauke Mehrtens <hauke@hauke-m.de> Cc: Woojung Huh <woojung.huh@microchip.com> Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com> Cc: Sean Wang <sean.wang@mediatek.com> Cc: Landen Chao <Landen.Chao@mediatek.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Jonathan McDowell <noodles@earth.li> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-05 05:56:48 -07:00
Rikard Falkeborn	b980b313e5	net: openvswitch: Constify static struct genl_small_ops The only usage of these is to assign their address to the small_ops field in the genl_family struct, which is a const pointer, and applying ARRAY_SIZE() on them. Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 21:13:36 -07:00
Rikard Falkeborn	674d3ab949	mptcp: Constify mptcp_pm_ops The only usages of mptcp_pm_ops is to assign its address to the small_ops field of the genl_family struct, which is a const pointer, and applying ARRAY_SIZE() on it. Make it const to allow the compiler to put it in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 21:13:36 -07:00
Guillaume Nault	4296adc3e3	net/core: check length before updating Ethertype in skb_mpls_{push,pop} Openvswitch allows to drop a packet's Ethernet header, therefore skb_mpls_push() and skb_mpls_pop() might be called with ethernet=true and mac_len=0. In that case the pointer passed to skb_mod_eth_type() doesn't point to an Ethernet header and the new Ethertype is written at unexpected locations. Fix this by verifying that mac_len is big enough to contain an Ethernet header. Fixes: `fa4e0f8855` ("net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions") Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 15:09:26 -07:00
Cong Wang	580e4273d7	net_sched: check error pointer in tcf_dump_walker() Although we take RTNL on dump path, it is possible to skip RTNL on insertion path. So the following race condition is possible: rtnl_lock() // no rtnl lock mutex_lock(&idrinfo->lock); // insert ERR_PTR(-EBUSY) mutex_unlock(&idrinfo->lock); tc_dump_action() rtnl_unlock() So we have to skip those temporary -EBUSY entries on dump path too. Reported-and-tested-by: syzbot+b47bc4f247856fb4d9e1@syzkaller.appspotmail.com Fixes: `0fedc63fad` ("net_sched: commit action insertions together") Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:53:06 -07:00
Andrew Lunn	08156ba430	net: dsa: Add devlink port regions support to DSA Allow DSA drivers to make use of devlink port regions, via simple wrappers. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:38:53 -07:00
Andrew Lunn	544e7c33ec	net: devlink: Add support for port regions Allow regions to be registered to a devlink port. The same netlink API is used, but the port index is provided to indicate when a region is a port region as opposed to a device region. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:38:53 -07:00
Andrew Lunn	3122433eb5	net: dsa: Register devlink ports before calling DSA driver setup() DSA drivers want to create regions on devlink ports as well as the devlink device instance, in order to export registers and other tables per port. To keep all this code together in the drivers, have the devlink ports registered early, so the setup() method can setup both device and port devlink regions. v3: Remove dp->setup Move common code out of switch statement. Fix wrong goto Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:38:53 -07:00
Andrew Lunn	f15ec13a96	net: dsa: Make use of devlink port flavour unused If a port is unused, still create a devlink port for it, but set the flavour to unused. This allows us to attach devlink regions to the port, etc. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:38:52 -07:00
Andrew Lunn	cf1166349c	net: devlink: Add unused port flavour Not all ports of a switch need to be used, particularly in embedded systems. Add a port flavour for ports which physically exist in the switch, but are not connected to the front panel etc, and so are unused. By having unused ports present in devlink, it gives a more accurate representation of the hardware. It also allows regions to be associated to such ports, so allowing, for example, to determine unused ports are correctly powered off, or to compare probable reset defaults of unused ports to used ports experiences issues. Actually registering unused ports and setting the flavour to unused is optional. The DSA core will register all such switch ports, but such ports are expected to be limited in number. Bigger ASICs may decide not to list unused ports. v2: Expand the description about why it is useful Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:38:52 -07:00
David S. Miller	321e921daa	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Rename 'searched' column to 'clashres' in conntrack /proc/ stats to amend a recent patch, from Florian Westphal. 2) Remove unused nft_data_debug(), from YueHaibing. 3) Remove unused definitions in IPVS, also from YueHaibing. 4) Fix user data memleak in tables and objects, this is also amending a recent patch, from Jose M. Guisado. 5) Use nla_memdup() to allocate user data in table and objects, also from Jose M. Guisado 6) User data support for chains, from Jose M. Guisado 7) Remove unused definition in nf_tables_offload, from YueHaibing. 8) Use kvzalloc() in ip_set_alloc(), from Vasily Averin. 9) Fix false positive reported by lockdep in nfnetlink mutexes, from Florian Westphal. 10) Extend fast variant of cmp for neq operation, from Phil Sutter. 11) Implement fast bitwise variant, also from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-04 14:35:53 -07:00
Phil Sutter	10fdd6d80e	netfilter: nf_tables: Implement fast bitwise expression A typical use of bitwise expression is to mask out parts of an IP address when matching on the network part only. Optimize for this common use with a fast variant for NFT_BITWISE_BOOL-type expressions operating on 32bit-sized values. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-04 21:08:33 +02:00
Phil Sutter	5f48846daf	netfilter: nf_tables: Enable fast nft_cmp for inverted matches Add a boolean indicating NFT_CMP_NEQ. To include it into the match decision, it is sufficient to XOR it with the data comparison's result. While being at it, store the mask that is calculated during expression init and free the eval routine from having to recalculate it each time. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-04 21:08:32 +02:00
Florian Westphal	ab6c41eefd	netfilter: nfnetlink: place subsys mutexes in distinct lockdep classes From time to time there are lockdep reports similar to this one: WARNING: possible circular locking dependency detected ------------------------------------------------------ 000000004f61aa56 (&table[i].mutex){+.+.}, at: nfnl_lock [nfnetlink] but task is already holding lock: [..] (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid [nf_tables] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&net->nft.commit_mutex){+.+.}: [..] nf_tables_valid_genid+0x18/0x60 [nf_tables] nfnetlink_rcv_batch+0x24c/0x620 [nfnetlink] nfnetlink_rcv+0x110/0x140 [nfnetlink] netlink_unicast+0x12c/0x1e0 [..] sys_sendmsg+0x18/0x40 linux_sparc_syscall+0x34/0x44 -> #0 (&table[i].mutex){+.+.}: [..] nfnl_lock+0x24/0x40 [nfnetlink] ip_set_nfnl_get_byindex+0x19c/0x280 [ip_set] set_match_v1_checkentry+0x14/0xc0 [xt_set] xt_check_match+0x238/0x260 [x_tables] __nft_match_init+0x160/0x180 [nft_compat] [..] sys_sendmsg+0x18/0x40 linux_sparc_syscall+0x34/0x44 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&net->nft.commit_mutex); lock(&table[i].mutex); lock(&net->nft.commit_mutex); lock(&table[i].mutex); Lockdep considers this an ABBA deadlock because the different nfnl subsys mutexes reside in the same lockdep class, but this is a false positive. CPU1 table[i] refers to the nftables subsys mutex, whereas CPU1 locks the ipset subsys mutex. Yi Che reported a similar lockdep splat, this time between ipset and ctnetlink subsys mutexes. Time to place them in distinct classes to avoid these warnings. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-04 21:08:32 +02:00
Vasily Averin	9446ab34ac	netfilter: ipset: enable memory accounting for ipset allocations Currently netadmin inside non-trusted container can quickly allocate whole node's memory via request of huge ipset hashtable. Other ipset-related memory allocations should be restricted too. v2: fixed typo ALLOC -> ACCOUNT Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-04 21:08:25 +02:00
YueHaibing	82ec6630f9	netfilter: nf_tables_offload: Remove unused macro FLOW_SETUP_BLOCK commit `9a32669fec` ("netfilter: nf_tables_offload: support indr block call") left behind this. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-10-04 21:07:55 +02:00
Matthieu Baerts	456afe01b1	mptcp: ADD_ADDRs with echo bit are smaller The MPTCP ADD_ADDR suboption with echo-flag=1 has no HMAC, the size is smaller than the one initially sent without echo-flag=1. We then need to use the correct size everywhere when we need this echo bit. Before this patch, the wrong size was reserved but the correct amount of bytes were written (and read): the remaining bytes contained garbage. Fixes: `6a6c05a8b0` ("mptcp: send out ADD_ADDR with echo flag") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/95 Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 17:36:37 -07:00
Guillaume Nault	a45294af9e	net/sched: act_mpls: Add action to push MPLS LSE before Ethernet header Define the MAC_PUSH action which pushes an MPLS LSE before the mac header (instead of between the mac and the network headers as the plain PUSH action does). The only special case is when the skb has an offloaded VLAN. In that case, it has to be inlined before pushing the MPLS header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 17:28:45 -07:00
Guillaume Nault	19fbcb36a3	net/sched: act_vlan: Add {POP,PUSH}_ETH actions Implement TCA_VLAN_ACT_POP_ETH and TCA_VLAN_ACT_PUSH_ETH, to respectively pop and push a base Ethernet header at the beginning of a frame. POP_ETH is just a matter of pulling ETH_HLEN bytes. VLAN tags, if any, must be stripped before calling POP_ETH. PUSH_ETH is restricted to skbs with no mac_header, and only the MAC addresses can be configured. The Ethertype is automatically set from skb->protocol. These restrictions ensure that all skb's fields remain consistent, so that this action can't confuse other part of the networking stack (like GSO). Since openvswitch already had these actions, consolidate the code in skbuff.c (like for vlan and mpls push/pop). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 17:28:45 -07:00
Karsten Graul	fd6ebb6fb2	net/smc: use an array to check fields in system EID The check for old hardware versions that did not have SMCDv2 support was using suspicious pointer magic. Address the fields using an array. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 17:04:48 -07:00
Karsten Graul	839d696ffb	net/smc: send ISM devices with unique chid in CLC proposal When building a CLC proposal message then the list of ISM devices does not need to contain multiple devices that have the same chid value, all these devices use the same function at the end. Improve smc_find_ism_v2_device_clnt() to collect only ISM devices that have unique chid values. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 17:04:48 -07:00
Yuchung Cheng	9cd8b6c905	tcp: account total lost packets properly The retransmission refactoring patch `686989700c` ("tcp: simplify tcp_mark_skb_lost") does not properly update the total lost packet counter which may break the policer mode in BBR. This patch fixes it. Fixes: `686989700c` ("tcp: simplify tcp_mark_skb_lost") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 16:57:18 -07:00
Julian Wiedmann	a29f245ec9	net/iucv: fix indentation in __iucv_message_receive() smatch complains about net/iucv/iucv.c:1119 __iucv_message_receive() warn: inconsistent indenting While touching this line, also make the return logic consistent and thus get rid of a goto label. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 16:51:07 -07:00
Julian Wiedmann	398999bac6	net/af_iucv: right-size the uid variable in iucv_sock_bind() smatch complains about net/iucv/af_iucv.c:624 iucv_sock_bind() error: memcpy() 'sa->siucv_user_id' too small (8 vs 9) Which is absolutely correct - the memcpy() takes 9 bytes (sizeof(uid)) from an 8-byte field (sa->siucv_user_id). Luckily the sockaddr_iucv struct contains more data after the .siucv_user_id field, and we checked the size of the passed data earlier on. So the memcpy() won't accidentally read from an invalid location. Fix the warning by reducing the size of the uid variable to what's actually needed, and thus reducing the amount of copied data. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 16:51:07 -07:00
Jakub Kicinski	e992a6eda9	genetlink: allow dumping command-specific policy Right now CTRL_CMD_GETPOLICY can only dump the family-wide policy. Support dumping policy of a specific op. v3: - rebase after per-op policy export and handle that v2: - make cmd U32, just in case. v1: - don't echo op in the output in a naive way, this should make it cleaner to extend the output format for dumping policies for all the commands at once in the future. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20201001225933.1373426-11-kuba@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 14:18:29 -07:00
Johannes Berg	50a896cf2d	genetlink: properly support per-op policy dumping Add support for per-op policy dumping. The data is pretty much as before, except that now the assumption that the policy with index 0 is "the" policy no longer holds - you now need to look at the new CTRL_ATTR_OP_POLICY attribute which is a nested attr (indexed by op) containing attributes for do and dump policies. When a single op is requested, the CTRL_ATTR_OP_POLICY will be added in the same way, since do and dump policies may differ. v2: - conditionally advertise per-command policies only if there actually is a policy being used for the do/dump and it's present at all Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 14:18:29 -07:00
Johannes Berg	aa85ee5f95	genetlink: factor skb preparation out of ctrl_dumppolicy() We'll need this later for the per-op policy index dump. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 14:18:29 -07:00
Johannes Berg	04a351a62b	netlink: rework policy dump to support multiple policies Rework the policy dump code a bit to support adding multiple policies to a single dump, in order to e.g. support per-op policies in generic netlink. v2: - move kernel-doc to implementation [Jakub] - squash the first patch to not flip-flop on the prototype [Jakub] - merge netlink_policy_dump_get_policy_idx() with the old get_policy_idx() we already had - rebase without Jakub's patch to have per-op dump Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 14:18:29 -07:00
Johannes Berg	899b07c578	netlink: compare policy more accurately The maxtype is really an integral part of the policy, and while we haven't gotten into a situation yet where this happens, it seems that some developer might eventually have two places pointing to identical policies, with different maxattr to exclude some attrs in one of the places. Even if not, it's really the right thing to compare both since the two data items fundamentally belong together. v2: - also do the proper comparison in get_policy_idx() Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-03 14:18:29 -07:00
Christoph Hellwig	89cd35c58b	iov_iter: transparently handle compat iovecs in import_iovec Use in compat_syscall to import either native or the compat iovecs, and remove the now superflous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-10-03 00:02:13 -04:00
Jakub Kicinski	a4bb4f5fc8	genetlink: switch control commands to per-op policies In preparation for adding a new attribute to CTRL_CMD_GETPOLICY split the policies for getpolicy and getfamily apart. This will cause a slight user-visible change in that dumping the policies will switch from per family to per op, but supposedly sniffer-type applications (which are the main use case for policy dumping thus far) should support both, anyway. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:12 -07:00
Jakub Kicinski	8e1ed28fd8	genetlink: use parsed attrs in dumppolicy Attributes are already parsed based on the policy specified in the family and ready-to-use in info->attrs. No need to call genlmsg_parse() again. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:12 -07:00
Jakub Kicinski	48526a0f4c	genetlink: bring back per op policy Add policy to the struct genl_ops structure, this time with maxattr, so it can be used properly. Propagate .policy and .maxattr from the family in genl_get_cmd() if needed, this way the rest of the code does not have to worry if the policy is per op or global. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:12 -07:00
Jakub Kicinski	78ade619c1	genetlink: use .start callback for dumppolicy The structure of ctrl_dumppolicy() is clearly split into init and dumping. Move the init to a .start callback for clarity, it's a more idiomatic netlink dump code structure. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:12 -07:00
Jakub Kicinski	adc848450f	genetlink: add a structure for dump state Whenever netlink dump uses more than 2 cb->args[] entries code gets hard to read. We're about to add more state to ctrl_dumppolicy() so create a structure. Since the structure is typed and clearly named we can remove the local fam_id variable and use ctx->fam_id directly. v3: - rebase onto explicit free fix v1: - s/nl_policy_dump/netlink_policy_dump_state/ - forward declare struct netlink_policy_dump_state, and move from passing unsigned long to actual pointer type - add build bug on - u16 fam_id - s/args/ctx/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:12 -07:00
Jakub Kicinski	66a9b9287d	genetlink: move to smaller ops wherever possible Bulk of the genetlink users can use smaller ops, move them. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:11 -07:00
Jakub Kicinski	0b588afdd1	genetlink: add small version of ops We want to add maxattr and policy back to genl_ops, to enable dumping per command policy to user space. This, however, would cause bloat for all the families with global policies. Introduce smaller version of ops (half the size of genl_ops). Translate these smaller ops into a full blown struct before use in the core. v1: - use struct assignment - put a full copy of the op in struct genl_dumpit_info - s/light/small/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 19:11:11 -07:00
Ioana Ciornei	c50bf2be73	devlink: add .trap_group_action_set() callback Add a new devlink callback, .trap_group_action_set(), which can be used by device drivers which do not support controlling the action (drop, trap) on each trap but rather on the entire group trap. If this new callback is populated, it will take precedence over the .trap_action_set() callback when the user requests a change of all the traps in a group. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 16:31:56 -07:00
Ioana Ciornei	10c24eb23d	devlink: add parser error drop packet traps Add parser error drop packet traps, so that capable device driver could register them with devlink. The new packet trap group holds any drops of packets which were marked by the device as erroneous during header parsing. Add documentation for every added packet trap and packet trap group. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 16:31:55 -07:00
Paolo Abeni	9d8c05ad56	tcp: fix syn cookied MPTCP request socket leak If a syn-cookies request socket don't pass MPTCP-level validation done in syn_recv_sock(), we need to release it immediately, or it will be leaked. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/89 Fixes: `9466a1cceb` ("mptcp: enable JOIN requests even if cookies are in use") Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 15:34:38 -07:00
David S. Miller	26d0a8edca	Another set of changes, this time with: * lots more S1G band support * 6 GHz scanning, finally * kernel-doc fixes * non-split wiphy dump fixes in nl80211 * various other small cleanups/features -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl92/IsACgkQB8qZga/f l8QrBxAAi8HJdFyZtCIuLrXL3KHzey5AjmrYuHdsFcdk8NJZkEco17I3l05D9ek6 76VvqjiYzDwdmgoHr3yz0K7pOAoTRpBKlaecvZLPXWf2bVhebWSU5EPcrTZHolrJ JBoBj4FU6Im/MnFbeiKxPj3M+NTQrLdekODSeaC5hFhi/oSF9lap6RMC8sz4YrVp 9yKzB8zjz+eL4wL3EsztEzpTxbvHTaVMe0XBVou7Fg2ZauJGwqMxpIukpMWUmmNr EequhVFpdlXbVMle8wP4ZR58c4+O1kbRoYL9WhAILtdDhCKfLccWXnlUjzuQlCeB RH/jzG7AlVhm972oUuqG9szAcU8hEgWdsNEML7pilXmFk/ZSNLpUZfZCAILn+Gd3 8oMQnXp2br+DLzf1SO7cxpL2KrTNjrb4gcJVBJ9eBlDjK/64N22MqZkpOKcMxq51 ocmf1MJ1TbAbZn/kY2hsoaPYt2+bm1umMa/t/Pwuds+xKZEOOuPNgZQILcAsfJZB 2OWDDT+RNLo/K4mPETtyQQZoCxAWB9n/CcnU+UTsmnUmMsEnCEbPnbYKBrc6jX1l jSP6XUD8fxhB2lfW+SPtQPnAi86+gblXVvEO8zm0+ez3juItlIsVcRY8ey/j9N1F uorpycvrfl6+Q1+mmhe6et2r+TLdGs73I0PJ44HRA9JZexKBrDg= =B9Xl -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-net-next-2020-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Another set of changes, this time with: * lots more S1G band support * 6 GHz scanning, finally * kernel-doc fixes * non-split wiphy dump fixes in nl80211 * various other small cleanups/features ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 15:33:13 -07:00
Coly Li	40efc4dc73	libceph: use sendpage_ok() in ceph_tcp_sendpage() In libceph, ceph_tcp_sendpage() does the following checks before handle the page by network layer's zero copy sendpage method, if (page_count(page) >= 1 && !PageSlab(page)) This check is exactly what sendpage_ok() does. This patch replace the open coded checks by sendpage_ok() as a code cleanup. Signed-off-by: Coly Li <colyli@suse.de> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 15:27:08 -07:00
Coly Li	cf83a17ede	tcp: use sendpage_ok() to detect misused .sendpage commit `a10674bf24` ("tcp: detecting the misuse of .sendpage for Slab objects") adds the checks for Slab pages, but the pages don't have page_count are still missing from the check. Network layer's sendpage method is not designed to send page_count 0 pages neither, therefore both PageSlab() and page_count() should be both checked for the sending page. This is exactly what sendpage_ok() does. This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused .sendpage, to make the code more robust. Fixes: `a10674bf24` ("tcp: detecting the misuse of .sendpage for Slab objects") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: David S. Miller <davem@davemloft.net> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 15:27:08 -07:00
Coly Li	7b62d31d3f	net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send If a page sent into kernel_sendpage() is a slab page or it doesn't have ref_count, this page is improper to send by the zero copy sendpage() method. Otherwise such page might be unexpected released in network code path and causes impredictable panic due to kernel memory management data structure corruption. This path adds a WARN_ON() on the sending page before sends it into the concrete zero-copy sendpage() method, if the page is improper for the zero-copy sendpage() method, a warning message can be observed before the consequential unpredictable kernel panic. This patch does not change existing kernel_sendpage() behavior for the improper page zero-copy send, it just provides hint warning message for following potential panic due the kernel memory heap corruption. Signed-off-by: Coly Li <colyli@suse.de> Cc: Cong Wang <amwang@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David S. Miller <davem@davemloft.net> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 15:27:08 -07:00
John Fastabend	18ebe16d10	bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payload This implements a new helper skb_adjust_room() so users can push/pop extra bytes from a BPF_SK_SKB_STREAM_VERDICT program. Some protocols may include headers and other information that we may not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT program. One use case is to redirect TLS packets into a receive socket that doesn't expect TLS data. In TLS case the first 13B or so contain the protocol header. With KTLS the payload is decrypted so we should be able to redirect this to a receiving socket, but the receiving socket may not be expecting to receive a TLS header and discard the data. Using the above helper we can pop the header off and put an appropriate header on the payload. This allows for creating a proxy between protocols without extra hops through the stack or userspace. So in order to fix this case add skb_adjust_room() so users can strip the header. After this the user can strip the header and an unmodified receiver thread will work correctly when data is redirected into the ingress path of a sock. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160160099197.7052.8443193973242831692.stgit@john-Precision-5820-Tower	2020-10-02 15:18:39 -07:00
Florian Fainelli	3a68844dd2	net: dsa: Utilize __vlan_find_dev_deep_rcu() Now that we are guaranteed that dsa_untag_bridge_pvid() is called after eth_type_trans() we can utilize __vlan_find_dev_deep_rcu() which will take care of finding an 802.1Q upper on top of a bridge master. A common use case, prior to 12a1526d067 ("net: dsa: untag the bridge pvid from rx skbs") was to configure a bridge 802.1Q upper like this: ip link add name br0 type bridge vlan_filtering 0 ip link add link br0 name br0.1 type vlan id 1 in order to pop the default_pvid VLAN tag. With this change we restore that behavior while still allowing the DSA receive path to automatically pop the VLAN tag. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 13:36:07 -07:00
Florian Fainelli	a348292b63	net: dsa: Obtain VLAN protocol from skb->protocol Now that dsa_untag_bridge_pvid() is called after eth_type_trans() we are guaranteed that skb->protocol will be set to a correct value, thus allowing us to avoid calling vlan_eth_hdr(). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 13:36:07 -07:00
Florian Fainelli	1c5ad5a940	net: dsa: b53: Set untag_bridge_pvid Indicate to the DSA receive path that we need to untage the bridge PVID, this allows us to remove the dsa_untag_bridge_pvid() calls from net/dsa/tag_brcm.c. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 13:36:07 -07:00
Florian Fainelli	1dc0408cdf	net: dsa: Call dsa_untag_bridge_pvid() from dsa_switch_rcv() When a DSA switch driver needs to call dsa_untag_bridge_pvid(), it can set dsa_switch::untag_brige_pvid to indicate this is necessary. This is a pre-requisite to making sure that we are always calling dsa_untag_bridge_pvid() after eth_type_trans() has been called. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-02 13:36:07 -07:00

... 3 4 5 6 7 ...

62922 Commits