linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2025-01-09 15:24:32 +08:00

Author	SHA1	Message	Date
Eric Dumazet	406715df93	fq_codel: do not include <linux/jhash.h> Since commit `342db22182` ("sched: Call skb_get_hash_perturb in sch_fq_codel") we no longer need anything from this file. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 15:31:42 -07:00
Vivien Didelot	7e99e34701	net: dsa: remove dsa_switch_alloc helper Now that ports are dynamically listed in the fabric, there is no need to provide a special helper to allocate the dsa_switch structure. This will give more flexibility to drivers to embed this structure as they wish in their private structure. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	05f294a852	net: dsa: allocate ports on touch Allocate the struct dsa_port the first time it is accessed with dsa_port_touch, and remove the static dsa_port array from the dsa_switch structure. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	da4561cda2	net: dsa: use ports list to setup default CPU port Use the new ports list instead of iterating over switches and their ports when setting up the default CPU port. Unassign it on teardown. Now that we can iterate over multiple CPU ports, remove dst->cpu_dp. At the same time, provide a better error message for CPU-less tree. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	c0b736282c	net: dsa: use ports list to find first CPU port Use the new ports list instead of iterating over switches and their ports when looking up the first CPU port in the tree. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	0cfec588ec	net: dsa: use ports list to setup multiple master devices Now that we have a potential list of CPU ports, make use of it instead of only configuring the master device of an unique CPU port. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	764b7e6242	net: dsa: use ports list to find a port by node Use the new ports list instead of iterating over switches and their ports to find a port from a given node. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:07 -07:00
Vivien Didelot	86bfb2c1f4	net: dsa: use ports list for routing table setup Use the new ports list instead of accessing the dsa_switch array of ports when iterating over DSA ports of a switch to set up the routing table. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:06 -07:00
Vivien Didelot	fb35c60cba	net: dsa: use ports list to setup switches Use the new ports list instead of iterating over switches and their ports when setting up the switches and their ports. At the same time, provide setup states and messages for ports and switches as it is done for the trees. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:06 -07:00
Vivien Didelot	7b9a2f4bac	net: dsa: use ports list to find slave Use the new ports list instead of iterating over switches and their ports when looking for a slave device from a given master interface. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:06 -07:00
Vivien Didelot	ab8ccae122	net: dsa: add ports list in the switch fabric Add a list of switch ports within the switch fabric. This will help the lookup of a port inside the whole fabric, and it is the first step towards supporting multiple CPU ports, before deprecating the usage of the unique dst->cpu_dp pointer. In preparation for a future allocation of the dsa_port structures, return -ENOMEM in case no structure is returned, even though this error cannot be reached yet. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:06 -07:00
Vivien Didelot	68bb8ea8ad	net: dsa: use dsa_to_port helper everywhere Do not let the drivers access the ds->ports static array directly while there is a dsa_to_port helper for this purpose. At the same time, un-const this helper since the SJA1105 driver assigns the priv member of the returned dsa_port structure. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 12:37:06 -07:00
Ursula Braun	81cf4f4707	net/smc: remove close abort worker With the introduction of the link group termination worker there is no longer a need to postpone smc_close_active_abort() to a worker. To protect socket destruction due to normal and abnormal socket closing, the socket refcount is increased. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:44 -07:00
Ursula Braun	f528ba24a8	net/smc: introduce link group termination worker Use a worker for link group termination to guarantee process context. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:44 -07:00
Ursula Braun	2a0674fffb	net/smc: improve abnormal termination of link groups If a link group and its connections must be terminated, * wake up socket waiters * do not enable buffer reuse A linkgroup might be terminated while normal connection closing is running. Avoid buffer reuse and its related LLC DELETE RKEY call, if linkgroup termination has started. And use the earliest indication of linkgroup termination possible, namely the removal from the linkgroup list. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:44 -07:00
Ursula Braun	8317976096	net/smc: tell peers about abnormal link group termination There are lots of link group termination scenarios. Most of them still allow to inform the peer of the terminating sockets about aborting. This patch tries to call smc_close_abort() for terminating sockets. And the internal TCP socket is reset with tcp_abort(). Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:43 -07:00
Ursula Braun	8e316b9e72	net/smc: improve link group freeing Usually link groups are freed delayed to enable quick connection creation for a follow-on SMC socket. Terminated link groups are freed faster. This patch makes sure, fast schedule of link group freeing is not rescheduled by a delayed schedule. And it makes sure link group freeing is not rescheduled, if the real freeing is already running. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:43 -07:00
Ursula Braun	69318b5215	net/smc: improve abnormal termination locking Locking hierarchy requires that the link group conns_lock can be taken if the socket lock is held, but not vice versa. Nevertheless socket termination during abnormal link group termination should be protected by the socket lock. This patch reduces the time segments the link group conns_lock is held to enable usage of lock_sock in smc_lgr_terminate(). Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:43 -07:00
Ursula Braun	8caa654451	net/smc: terminate link group without holding lgr lock When a link group is to be terminated, it is sufficient to hold the lgr lock when unlinking the link group from its list. Move the lock-protected link group unlinking into smc_lgr_terminate(). Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:43 -07:00
Ursula Braun	b290098092	net/smc: cancel send and receive for terminated socket The resources for a terminated socket are being cleaned up. This patch makes sure * no more data is received for an actively terminated socket * no more data is sent for an actively or passively terminated socket Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-10-22 11:23:43 -07:00
Davide Caratti	985fd98ab5	net/sched: act_police: re-use tcf_tm_dump() Use tcf_tm_dump(), instead of an open coded variant (no functional change in this patch). Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-21 11:22:40 -07:00
David S. Miller	2f184393e0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-20 10:43:00 -07:00
Linus Torvalds	531e93d114	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: "I was battling a cold after some recent trips, so quite a bit piled up meanwhile, sorry about that. Highlights: 1) Fix fd leak in various bpf selftests, from Brian Vazquez. 2) Fix crash in xsk when device doesn't support some methods, from Magnus Karlsson. 3) Fix various leaks and use-after-free in rxrpc, from David Howells. 4) Fix several SKB leaks due to confusion of who owns an SKB and who should release it in the llc code. From Eric Biggers. 5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet. 6) Jumbo packets don't work after resume on r8169, as the BIOS resets the chip into non-jumbo mode during suspend. From Heiner Kallweit. 7) Corrupt L2 header during MPLS push, from Davide Caratti. 8) Prevent possible infinite loop in tc_ctl_action, from Eric Dumazet. 9) Get register bits right in bcmgenet driver, based upon chip version. From Florian Fainelli. 10) Fix mutex problems in microchip DSA driver, from Marek Vasut. 11) Cure race between route lookup and invalidation in ipv4, from Wei Wang. 12) Fix performance regression due to false sharing in 'net' structure, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits) net: reorder 'struct net' fields to avoid false sharing net: dsa: fix switch tree list net: ethernet: dwmac-sun8i: show message only when switching to promisc net: aquantia: add an error handling in aq_nic_set_multicast_list net: netem: correct the parent's backlog when corrupted packet was dropped net: netem: fix error path for corrupted GSO frames macb: propagate errors when getting optional clocks xen/netback: fix error path of xenvif_connect_data() net: hns3: fix mis-counting IRQ vector numbers issue net: usb: lan78xx: Connect PHY before registering MAC vsock/virtio: discard packets if credit is not respected vsock/virtio: send a credit update when buffer size is changed mlxsw: spectrum_trap: Push Ethernet header before reporting trap net: ensure correct skb->tstamp in various fragmenters net: bcmgenet: reset 40nm EPHY on energy detect net: bcmgenet: soft reset 40nm EPHYs before MAC init net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: Update address for MediaTek ethernet driver in MAINTAINERS ipv4: fix race condition between route lookup and invalidation ...	2019-10-19 17:09:11 -04:00
Vivien Didelot	50c7d2ba9d	net: dsa: fix switch tree list If there are multiple switch trees on the device, only the last one will be listed, because the arguments of list_add_tail are swapped. Fixes: `83c0afaec7` ("net: dsa: Add new binding implementation") Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-19 12:19:41 -07:00
Jakub Kicinski	e0ad032e14	net: netem: correct the parent's backlog when corrupted packet was dropped If packet corruption failed we jump to finish_segs and return NET_XMIT_SUCCESS. Seeing success will make the parent qdisc increment its backlog, that's incorrect - we need to return NET_XMIT_DROP. Fixes: `6071bd1aa1` ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-19 12:12:36 -07:00
Jakub Kicinski	a7fa12d158	net: netem: fix error path for corrupted GSO frames To corrupt a GSO frame we first perform segmentation. We then proceed using the first segment instead of the full GSO skb and requeue the rest of the segments as separate packets. If there are any issues with processing the first segment we still want to process the rest, therefore we jump to the finish_segs label. Commit `177b800746` ("net: netem: fix backlog accounting for corrupted GSO frames") started using the pointer to the first segment in the "rest of segments processing", but as mentioned above the first segment may had already been freed at this point. Backlog corrections for parent qdiscs have to be adjusted. Fixes: `177b800746` ("net: netem: fix backlog accounting for corrupted GSO frames") Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-19 12:12:35 -07:00
Stefano Garzarella	ae6fcfbf5f	vsock/virtio: discard packets if credit is not respected If the remote peer doesn't respect the credit information (buf_alloc, fwd_cnt), sending more data than it can send, we should drop the packets to prevent a malicious peer from using all of our memory. This is patch follows the VIRTIO spec: "VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has sufficient free buffer space for the payload" Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-18 10:19:43 -07:00
Stefano Garzarella	ec3359b685	vsock/virtio: send a credit update when buffer size is changed When the user application set a new buffer size value, we should update the remote peer about this change, since it uses this information to calculate the credit available. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-18 10:19:43 -07:00
Eric Dumazet	9669fffc14	net: ensure correct skb->tstamp in various fragmenters Thomas found that some forwarded packets would be stuck in FQ packet scheduler because their skb->tstamp contained timestamps far in the future. We thought we addressed this point in commit `8203e2d844` ("net: clear skb->tstamp in forwarding paths") but there is still an issue when/if a packet needs to be fragmented. In order to meet EDT requirements, we have to make sure all fragments get the original skb->tstamp. Note that this original skb->tstamp should be zero in forwarding path, but might have a non zero value in output path if user decided so. Fixes: `fb420d5d91` ("tcp/fq: move back to CLOCK_MONOTONIC") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thomas Bartschies <Thomas.Bartschies@cvk.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-18 10:02:37 -07:00
Wei Wang	5018c59607	ipv4: fix race condition between route lookup and invalidation Jesse and Ido reported the following race condition: <CPU A, t0> - Received packet A is forwarded and cached dst entry is taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set() <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables from multiple ISPs"), route is added / deleted and rt_cache_flush() is called <CPU B, t2> - Received packet B tries to use the same cached dst entry from t0, but rt_cache_valid() is no longer true and it is replaced in rt_cache_route() by the newer one. This calls dst_dev_put() on the original dst entry which assigns the blackhole netdev to 'dst->dev' <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due to 'dst->dev' being the blackhole netdev There are 2 issues in the v4 routing code: 1. A per-netns counter is used to do the validation of the route. That means whenever a route is changed in the netns, users of all routes in the netns needs to redo lookup. v6 has an implementation of only updating fn_sernum for routes that are affected. 2. When rt_cache_valid() returns false, rt_cache_route() is called to throw away the current cache, and create a new one. This seems unnecessary because as long as this route does not change, the route cache does not need to be recreated. To fully solve the above 2 issues, it probably needs quite some code changes and requires careful testing, and does not suite for net branch. So this patch only tries to add the deleted cached rt into the uncached list, so user could still be able to use it to receive packets until it's done. Fixes: `95c47f9cf5` ("ipv4: call dst_dev_put() properly") Signed-off-by: Wei Wang <weiwan@google.com> Reported-by: Ido Schimmel <idosch@idosch.org> Reported-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Tested-by: Jesse Hathaway <jesse@mbuki-mvuki.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-17 16:44:03 -07:00
Stefano Brivio	595e0651d0	ipv4: Return -ENETUNREACH if we can't create route but saddr is valid ...instead of -EINVAL. An issue was found with older kernel versions while unplugging a NFS client with pending RPCs, and the wrong error code here prevented it from recovering once link is back up with a configured address. Incidentally, this is not an issue anymore since commit `4f8943f808` ("SUNRPC: Replace direct task wakeups from softirq context"), included in 5.2-rc7, had the effect of decoupling the forwarding of this error by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin Coddington. To the best of my knowledge, this isn't currently causing any further issue, but the error code doesn't look appropriate anyway, and we might hit this in other paths as well. In detail, as analysed by Gonzalo Siero, once the route is deleted because the interface is down, and can't be resolved and we return -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(), as the socket error seen by tcp_write_err(), called by tcp_retransmit_timer(). In turn, tcp_write_err() indirectly calls xs_error_report(), which wakes up the RPC pending tasks with a status of -EINVAL. This is then seen by call_status() in the SUN RPC implementation, which aborts the RPC call calling rpc_exit(), instead of handling this as a potentially temporary condition, i.e. as a timeout. Return -EINVAL only if the input parameters passed to ip_route_output_key_hash_rcu() are actually invalid (this is the case if the specified source address is multicast, limited broadcast or all zeroes), but return -ENETUNREACH in all cases where, at the given moment, the given source address doesn't allow resolving the route. While at it, drop the initialisation of err to -ENETUNREACH, which was added to __ip_route_output_key() back then by commit `0315e38270` ("net: Fix behaviour of unreachable, blackhole and prohibit routes"), but actually had no effect, as it was, and is, overwritten by the fib_lookup() return code assignment, and anyway ignored in all other branches, including the if (fl4->saddr) one: I find this rather confusing, as it would look like -ENETUNREACH is the "default" error, while that statement has no effect. Also note that after commit `fc75fc8339` ("ipv4: dont create routes on down devices"), we would get -ENETUNREACH if the device is down, but -EINVAL if the source address is specified and we can't resolve the route, and this appears to be rather inconsistent. Reported-by: Stefan Walter <walteste@inf.ethz.ch> Analysed-by: Benjamin Coddington <bcodding@redhat.com> Analysed-by: Gonzalo Siero <gsierohu@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-17 16:36:17 -07:00
Marc Kleine-Budde	4eab421bc3	net: sched: Avoid using yield() in a busy waiting loop With threaded interrupts enabled, the interrupt thread runs as SCHED_RR with priority 50. If a user application with a higher priority preempts the interrupt thread and tries to shutdown the network interface then it will loop forever. The kernel will spin in the loop waiting for the device to become idle and the scheduler will never consider the interrupt thread because its priority is lower. Avoid the problem by sleeping for a jiffy giving other tasks, including the interrupt thread, a chance to run and make progress. In the original thread it has been suggested to use wait_event() and properly waiting for the state to occur. DaveM explained that this would require to add expensive checks in the fast paths of packet processing. Link: https://lkml.kernel.org/r/1393976987-23555-1-git-send-email-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> [bigeasy: Rewrite commit message, add comment, use schedule_timeout_uninterruptible()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-17 15:33:03 -04:00
YueHaibing	ce753e66dc	net/rds: Remove unnecessary null check Null check before dma_pool_destroy is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-17 15:23:03 -04:00
Yunsheng Lin	a8c41a6807	pktgen: remove unnecessary assignment in pktgen_xmit() variable ret is not used after jumping to "unlock" label, so the assignment is redundant. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-17 14:25:13 -04:00
Eric Dumazet	2ca4f6ca45	rxrpc: use rcu protection while reading sk->sk_user_data We need to extend the rcu_read_lock() section in rxrpc_error_report() and use rcu_dereference_sk_user_data() instead of plain access to sk->sk_user_data to make sure all rules are respected. The compiler wont reload sk->sk_user_data at will, and RCU rules prevent memory beeing freed too soon. Fixes: `f0308fb070` ("rxrpc: Fix possible NULL pointer access in ICMP handling") Fixes: `17926a7932` ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-16 12:20:17 -07:00
Mahesh Bandewar	bd74708cd9	Revert "blackhole_netdev: fix syzkaller reported issue" This reverts commit `b0818f80c8`. Started seeing weird behavior after this patch especially in the IPv6 code path. Haven't root caused it, but since this was applied to net branch, taking a precautionary measure to revert it and look / analyze those failures Revert this now and I'll send a better fix after analysing / fixing the weirdness observed. CC: Eric Dumazet <edumazet@google.com> CC: Wei Wang <weiwan@google.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-16 13:41:26 -04:00
Xin Long	63dfb7938b	sctp: change sctp_prot .no_autobind with true syzbot reported a memory leak: BUG: memory leak, unreferenced object 0xffff888120b3d380 (size 64): backtrace: [...] slab_alloc mm/slab.c:3319 [inline] [...] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483 [...] sctp_bucket_create net/sctp/socket.c:8523 [inline] [...] sctp_get_port_local+0x189/0x5a0 net/sctp/socket.c:8270 [...] sctp_do_bind+0xcc/0x200 net/sctp/socket.c:402 [...] sctp_bindx_add+0x4b/0xd0 net/sctp/socket.c:497 [...] sctp_setsockopt_bindx+0x156/0x1b0 net/sctp/socket.c:1022 [...] sctp_setsockopt net/sctp/socket.c:4641 [inline] [...] sctp_setsockopt+0xaea/0x2dc0 net/sctp/socket.c:4611 [...] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3147 [...] __sys_setsockopt+0x10f/0x220 net/socket.c:2084 [...] __do_sys_setsockopt net/socket.c:2100 [inline] It was caused by when sending msgs without binding a port, in the path: inet_sendmsg() -> inet_send_prepare() -> inet_autobind() -> .get_port/sctp_get_port(), sp->bind_hash will be set while bp->port is not. Later when binding another port by sctp_setsockopt_bindx(), a new bucket will be created as bp->port is not set. sctp's autobind is supposed to call sctp_autobind() where it does all things including setting bp->port. Since sctp_autobind() is called in sctp_sendmsg() if the sk is not yet bound, it should have skipped the auto bind. THis patch is to avoid calling inet_autobind() in inet_send_prepare() by changing sctp_prot .no_autobind with true, also remove the unused .get_port. Reported-by: syzbot+d44f7bbebdea49dbc84a@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 20:37:51 -07:00
Vinicius Costa Gomes	28aa7c86c2	sched: etf: Fix ordering of packets with same txtime When a application sends many packets with the same txtime, they may be transmitted out of order (different from the order in which they were enqueued). This happens because when inserting elements into the tree, when the txtime of two packets are the same, the new packet is inserted at the left side of the tree, causing the reordering. The only effect of this change should be that packets with the same txtime will be transmitted in the order they are enqueued. The application in question (the AVTP GStreamer plugin, still in development) is sending video traffic, in which each video frame have a single presentation time, the problem is that when packetizing, multiple packets end up with the same txtime. The receiving side was rejecting packets because they were being received out of order. Fixes: `25db26a913` ("net/sched: Introduce the ETF Qdisc") Reported-by: Ederson de Souza <ederson.desouza@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 20:32:04 -07:00
Eric Dumazet	39f13ea2f6	net: avoid potential infinite loop in tc_ctl_action() tc_ctl_action() has the ability to loop forever if tcf_action_add() returns -EAGAIN. This special case has been done in case a module needed to be loaded, but it turns out that tcf_add_notify() could also return -EAGAIN if the socket sk_rcvbuf limit is hit. We need to separate the two cases, and only loop for the module loading case. While we are at it, add a limit of 10 attempts since unbounded loops are always scary. syzbot repro was something like : socket(PF_NETLINK, SOCK_RAW\|SOCK_NONBLOCK, NETLINK_ROUTE) = 3 write(3, ..., 38) = 38 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0 sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{..., 388}], msg_controllen=0, msg_flags=0x10}, ...) NMI backtrace for cpu 0 CPU: 0 PID: 1054 Comm: khungtaskd Not tainted 5.4.0-rc1+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101 nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline] check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline] watchdog+0x9d0/0xef0 kernel/hung_task.c:289 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 8859 Comm: syz-executor910 Not tainted 5.4.0-rc1+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:751 [inline] RIP: 0010:lockdep_hardirqs_off+0x1df/0x2e0 kernel/locking/lockdep.c:3453 Code: 5c 08 00 00 5b 41 5c 41 5d 5d c3 48 c7 c0 58 1d f3 88 48 ba 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 0f 85 d3 00 00 00 <48> 83 3d 21 9e 99 07 00 0f 84 b9 00 00 00 9c 58 0f 1f 44 00 00 f6 RSP: 0018:ffff8880a6f3f1b8 EFLAGS: 00000046 RAX: 1ffffffff11e63ab RBX: ffff88808c9c6080 RCX: 0000000000000000 RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff88808c9c6914 RBP: ffff8880a6f3f1d0 R08: ffff88808c9c6080 R09: fffffbfff16be5d1 R10: fffffbfff16be5d0 R11: 0000000000000003 R12: ffffffff8746591f R13: ffff88808c9c6080 R14: ffffffff8746591f R15: 0000000000000003 FS: 00000000011e4880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600400 CR3: 00000000a8920000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: trace_hardirqs_off+0x62/0x240 kernel/trace/trace_preemptirq.c:45 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline] _raw_spin_lock_irqsave+0x6f/0xcd kernel/locking/spinlock.c:159 __wake_up_common_lock+0xc8/0x150 kernel/sched/wait.c:122 __wake_up+0xe/0x10 kernel/sched/wait.c:142 netlink_unlock_table net/netlink/af_netlink.c:466 [inline] netlink_unlock_table net/netlink/af_netlink.c:463 [inline] netlink_broadcast_filtered+0x705/0xb80 net/netlink/af_netlink.c:1514 netlink_broadcast+0x3a/0x50 net/netlink/af_netlink.c:1534 rtnetlink_send+0xdd/0x110 net/core/rtnetlink.c:714 tcf_add_notify net/sched/act_api.c:1343 [inline] tcf_action_add+0x243/0x370 net/sched/act_api.c:1362 tc_ctl_action+0x3b5/0x4bc net/sched/act_api.c:1410 rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5386 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5404 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:657 ___sys_sendmsg+0x803/0x920 net/socket.c:2311 __sys_sendmsg+0x105/0x1d0 net/socket.c:2356 __do_sys_sendmsg net/socket.c:2365 [inline] __se_sys_sendmsg net/socket.c:2363 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x440939 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+cf0adbb9c28c8866c788@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 20:20:22 -07:00
Eric Dumazet	e9c43add67	net_sched: sch_fq: remove one obsolete check in fq_dequeue() After commit `eeb84aa0d0` ("net_sched: sch_fq: do not assume EDT packets are ordered"), all skbs get a non zero time_to_send in flow_queue_add() This means @time_next_packet variable in fq_dequeue() can no longer be zero. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 20:19:20 -07:00
Eric Dumazet	cab209e571	tcp: fix a possible lockdep splat in tcp_done() syzbot found that if __inet_inherit_port() returns an error, we call tcp_done() after inet_csk_prepare_forced_close(), meaning the socket lock is no longer held. We might fix this in a different way in net-next, but for 5.4 it seems safer to relax the lockdep check. Fixes: `d983ea6f16` ("tcp: add rcu protection around tp->fastopen_rsk") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 20:13:33 -07:00
Alexander Lobakin	6570bc79c0	net: core: use listified Rx for GRO_NORMAL in napi_gro_receive() Commit `323ebb61e3` ("net: use listified RX for handling GRO_NORMAL skbs") made use of listified skb processing for the users of napi_gro_frags(). The same technique can be used in a way more common napi_gro_receive() to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers including gro_cells and mac80211 users. This slightly changes the return value in cases where skb is being dropped by the core stack, but it seems to have no impact on related drivers' functionality. gro_normal_batch is left untouched as it's very individual for every single system configuration and might be tuned in manual order to achieve an optimal performance. Signed-off-by: Alexander Lobakin <alobakin@dlink.ru> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 18:16:30 -07:00
Himadri Pandya	77ffe33363	hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication Current code assumes PAGE_SIZE (the guest page size) is equal to the page size used to communicate with Hyper-V (which is always 4K). While this assumption is true on x86, it may not be true for Hyper-V on other architectures. For example, Linux on ARM64 may have PAGE_SIZE of 16K or 64K. A new symbol, HV_HYP_PAGE_SIZE, has been previously introduced to use when the Hyper-V page size is intended instead of the guest page size. Make this code work on non-x86 architectures by using the new HV_HYP_PAGE_SIZE symbol instead of PAGE_SIZE, where appropriate. Also replace the now redundant PAGE_SIZE_4K with HV_HYP_PAGE_SIZE. The change has no effect on x86, but lays the groundwork to run on ARM64 and others. Signed-off-by: Himadri Pandya <himadrispandya@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 17:25:17 -07:00
Davide Caratti	fa4e0f8855	net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions the following script: # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress protocol ip matchall \ > action mpls push protocol mpls_uc label 0x355aa bos 1 causes corruption of all IP packets transmitted by eth0. On TC egress, we can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push' operation will result in an overwrite of the first 4 octets in the packet L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same error pattern is present also in the MPLS 'pop' operation. Fix this error in act_mpls data plane, computing 'mac_len' as the difference between the network header and the mac header (when not at TC ingress), and use it in MPLS 'push'/'pop' core functions. v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len' in skb_mpls_pop(), reported by kbuild test robot CC: Lorenzo Bianconi <lorenzo@kernel.org> Fixes: `2a2ea50870` ("net: sched: add mpls manipulation actions to TC") Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 17:14:48 -07:00
Davide Caratti	dedc5a08da	net: avoid errors when trying to pop MLPS header on non-MPLS packets the following script: # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress matchall action mpls pop implicitly makes the kernel drop all packets transmitted by eth0, if they don't have a MPLS header. This behavior is uncommon: other encapsulations (like VLAN) just let the packet pass unmodified. Since the result of MPLS 'pop' operation would be the same regardless of the presence / absence of MPLS header(s) in the original packet, we can let skb_mpls_pop() return 0 when dealing with non-MPLS packets. For the OVS use-case, this is acceptable because __ovs_nla_copy_actions() already ensures that MPLS 'pop' operation only occurs with packets having an MPLS Ethernet type (and there are no other callers in current code, so the semantic change should be ok). v2: better documentation of use-cases for skb_mpls_pop(), thanks to Simon Horman Fixes: `2a2ea50870` ("net: sched: add mpls manipulation actions to TC") Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 17:14:48 -07:00
Mahesh Bandewar	b0818f80c8	blackhole_netdev: fix syzkaller reported issue While invalidating the dst, we assign backhole_netdev instead of loopback device. However, this device does not have idev pointer and hence no ip6_ptr even if IPv6 is enabled. Possibly this has triggered the syzbot reported crash. The syzbot report does not have reproducer, however, this is the only device that doesn't have matching idev created. Crash instruction is : static inline bool ip6_ignore_linkdown(const struct net_device dev) { const struct inet6_dev idev = __in6_dev_get(dev); return !!idev->cnf.ignore_routes_with_linkdown; <= crash } Also ipv6 always assumes presence of idev and never checks for it being NULL (as does the above referenced code). So adding a idev for the blackhole_netdev to avoid this class of crashes in the future. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-15 10:35:23 -07:00
David S. Miller	a98d62c3ee	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-10-14 The following pull-request contains BPF updates for your net-next tree. 12 days of development and 85 files changed, 1889 insertions(+), 1020 deletions(-) The main changes are: 1) auto-generation of bpf_helper_defs.h, from Andrii. 2) split of bpf_helpers.h into bpf_{helpers, helper_defs, endian, tracing}.h and move into libbpf, from Andrii. 3) Track contents of read-only maps as scalars in the verifier, from Andrii. 4) small x86 JIT optimization, from Daniel. 5) cross compilation support, from Ivan. 6) bpf flow_dissector enhancements, from Jakub and Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-14 12:17:21 -07:00
David S. Miller	7e0d15ee0d	A few more small things, nothing really stands out: * minstrel improvements from Felix * a TX aggregation simplification * some additional capabilities for hwsim * minor cleanups & docs updates -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl2gQp8ACgkQB8qZga/f l8T5kA//Yo07t93OespsNwJZXWx7l/WWBtIydnTAk9hNXDV4kd6oLgN0oadqpx7g 5bryCRqmS4vx2IjiEQiSK08JqmhpruQSXYe8oixVVCOknw2pfkK6lr+OHCqJO11A iWu5Nz7bTl4pfSO0XIfWk1STUjNuXWCTUgbLSwU4oaoqk8oib2VeV1QdXX0hvgXF gSlToWQqliI/c6HS69iUJGRqXZCMO7GPWE9Sqj8cvmeAFXWQz9zcan6Fcd2XJyLq qJxNbxGD0JQ6vdbg2bFnio8PlwYMJ7ohrRDds8euYzViVtyTVZ6WtD9/gKB6UGVe RS5NEsmZLISCrQbV8nK/q0G/mBdNNegj4ezUkWxMvuYDEvl83Xniyz5CoAC++9mp 0M0//+NgwoVqDvaoV0s+TZBYv5arJyeUCY9kkmPCFFVV6cvmXfRFpn9yU95he2Eb duY5P+uKNlFU+sYVh1d6QC26mEAIa0y4qZszp3HurVWXe/aG/fLumW2USAOdqDOw 9HF9vOqGc3FRZTX1l15F+5nPn9gMyMJJGqOeT4oS1mQJT/KdzQCGLmhQ+IR+00Un zF6QsfCCtbuO5xLErqoARa7qKzddDxgkEBbdmQmjUwdyzAxSxZxGDBLLcpZ0OQwo Kxx7ELz97f55unLbByDrFMoZvEXaCeGcbZeTJWGvDRElw/BhRJU= =IJ7D -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-net-next-2019-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== A few more small things, nothing really stands out: * minstrel improvements from Felix * a TX aggregation simplification * some additional capabilities for hwsim * minor cleanups & docs updates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-13 11:29:07 -07:00
Michal Kubecek	cb0ce18aaf	genetlink: do not parse attributes for families with zero maxattr Commit `c10e6cf85e` ("net: genetlink: push attrbuf allocation and parsing to a separate function") moved attribute buffer allocation and attribute parsing from genl_family_rcv_msg_doit() into a separate function genl_family_rcv_msg_attrs_parse() which, unlike the previous code, calls __nlmsg_parse() even if family->maxattr is 0 (i.e. the family does its own parsing). The parser error is ignored and does not propagate out of genl_family_rcv_msg_attrs_parse() but an error message ("Unknown attribute type") is set in extack and if further processing generates no error or warning, it stays there and is interpreted as a warning by userspace. Dumpit requests are not affected as genl_family_rcv_msg_dumpit() bypasses the call of genl_family_rcv_msg_attrs_parse() if family->maxattr is zero. Move this logic inside genl_family_rcv_msg_attrs_parse() so that we don't have to handle it in each caller. v3: put the check inside genl_family_rcv_msg_attrs_parse() v2: adjust also argument of genl_family_rcv_msg_attrs_free() Fixes: `c10e6cf85e` ("net: genetlink: push attrbuf allocation and parsing to a separate function") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-13 11:20:03 -07:00
Soheil Hassas Yeganeh	c208bdb937	tcp: improve recv_skip_hint for tcp_zerocopy_receive tcp_zerocopy_receive() rounds down the zc->length a multiple of PAGE_SIZE. This results in two issues: - tcp_zerocopy_receive sets recv_skip_hint to the length of the receive queue if the zc->length input is smaller than the PAGE_SIZE, even though the data in receive queue could be zerocopied. - tcp_zerocopy_receive would set recv_skip_hint of 0, in cases where we have a little bit of data after the perfectly-sized packets. To fix these issues, do not store the rounded down value in zc->length. Round down the length passed to zap_page_range(), and return min(inq, zc->length) when the zap_range is 0. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-13 11:16:25 -07:00

1 2 3 4 5 ...

57966 Commits