The hsr_get_node_list() is to send node addresses to the userspace.
If there are so many nodes, it could fail because of buffer size.
In order to avoid this failure, the restart routine is added.
Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hsr_get_node_{list/status}() are not under rtnl_lock() because
they are callback functions of generic netlink.
But they use __dev_get_by_index() without rtnl_lock().
So, it would use unsafe data.
In order to fix it, rcu_read_lock() and dev_get_by_index_rcu()
are used instead of __dev_get_by_index().
Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).
Synchronize kernel thread ownership of rx ring slots with a bitmap.
Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.
Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.
This is the one of a variety of discussed options (see Link below):
* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.
* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.
* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.
* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.
The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.
Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.
Based on an earlier patchset by Jon Rosen. See links below.
I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ERSPAN shares most of the code path with GRE and gretap code. While that
helps keep the code compact, it is also error prone. Currently a broken
userspace can turn a gretap tunnel into a de facto ERSPAN one by passing
IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the
past.
To prevent these problems in future, split the newlink and changelink code
paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new
function erspan_netlink_parms(). Extract a piece of common logic from
ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup().
Add erspan_newlink() and erspan_changelink().
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpfilter UMH code was recently changed to log its informative messages to
/dev/kmsg, however this interface doesn't support SEEK_CUR yet, used by
dprintf(). As result dprintf() returns -EINVAL and doesn't log anything.
However there already had some discussions about supporting SEEK_CUR into
/dev/kmsg interface in the past it wasn't concluded. Since the only user of
that from userspace perspective inside the kernel is the bpfilter UMH
(userspace) module it's better to correct it here instead waiting a conclusion
on the interface.
Fixes: 36c4357c63 ("net: bpfilter: print umh messages to /dev/kmsg")
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 599be01ee5 ("net_sched: fix an OOB access in cls_tcindex")
I moved cp->hash calculation before the first
tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched.
This difference could lead to another out of bound access.
cp->alloc_hash should always be the size allocated, we should
update it after this tcindex_alloc_perfect_hash().
Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com
Fixes: 599be01ee5 ("net_sched: fix an OOB access in cls_tcindex")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a use-after-free in tcindex_dump(). This is due to
the lack of RTNL in the deferred rcu work. We queue this work with
RTNL in tcindex_change(), later, tcindex_dump() is called:
fh = tp->ops->get(tp, t->tcm_handle);
...
err = tp->ops->change(..., &fh, ...);
tfilter_notify(..., fh, ...);
but there is nothing to serialize the pending
tcindex_partial_destroy_work() with tcindex_dump().
Fix this by simply holding RTNL in tcindex_partial_destroy_work(),
so that it won't be called until RTNL is released after
tc_new_tfilter() is completed.
Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com
Fixes: 3d210534cc ("net_sched: fix a race condition in tcindex_destroy()")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-03-12
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 8 day(s) which contain
a total of 12 files changed, 161 insertions(+), 15 deletions(-).
The main changes are:
1) Andrii fixed two bugs in cgroup-bpf.
2) John fixed sockmap.
3) Luke fixed x32 jit.
4) Martin fixed two issues in struct_ops.
5) Yonghong fixed bpf_send_signal.
6) Yoshiki fixed BTF enum.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.
The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.
The crash log looks like this:
[ 19.937538] ==================================================================
[ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[ 19.939612]
[ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[ 19.941523] Call Trace:
[ 19.941774] <IRQ>
[ 19.941985] dump_stack+0x97/0xe0
[ 19.942323] print_address_description.constprop.0+0x3b/0x60
[ 19.942884] ? taprio_dequeue_soft+0x620/0x780
[ 19.943325] ? taprio_dequeue_soft+0x620/0x780
[ 19.943767] __kasan_report.cold+0x1a/0x32
[ 19.944173] ? taprio_dequeue_soft+0x620/0x780
[ 19.944612] kasan_report+0xe/0x20
[ 19.944954] taprio_dequeue_soft+0x620/0x780
[ 19.945380] __qdisc_run+0x164/0x18d0
[ 19.945749] net_tx_action+0x2c4/0x730
[ 19.946124] __do_softirq+0x268/0x7bc
[ 19.946491] irq_exit+0x17d/0x1b0
[ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
[ 19.947280] apic_timer_interrupt+0xf/0x20
[ 19.947687] </IRQ>
[ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 19.954408] ? default_idle_call+0x2e/0x70
[ 19.954816] ? default_idle+0x1f/0x2d0
[ 19.955192] default_idle_call+0x5e/0x70
[ 19.955584] do_idle+0x3d4/0x500
[ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
[ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
[ 19.956829] ? trace_hardirqs_on+0x30/0x160
[ 19.957242] cpu_startup_entry+0x19/0x20
[ 19.957633] start_secondary+0x2a6/0x380
[ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
[ 19.958486] secondary_startup_64+0xa4/0xb0
[ 19.958921]
[ 19.959078] Allocated by task 33:
[ 19.959412] save_stack+0x1b/0x80
[ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 19.960222] kmem_cache_alloc+0xe4/0x230
[ 19.960617] __alloc_skb+0x91/0x510
[ 19.960967] ndisc_alloc_skb+0x133/0x330
[ 19.961358] ndisc_send_ns+0x134/0x810
[ 19.961735] addrconf_dad_work+0xad5/0xf80
[ 19.962144] process_one_work+0x78e/0x13a0
[ 19.962551] worker_thread+0x8f/0xfa0
[ 19.962919] kthread+0x2ba/0x3b0
[ 19.963242] ret_from_fork+0x3a/0x50
[ 19.963596]
[ 19.963753] Freed by task 33:
[ 19.964055] save_stack+0x1b/0x80
[ 19.964386] __kasan_slab_free+0x12f/0x180
[ 19.964830] kmem_cache_free+0x80/0x290
[ 19.965231] ip6_mc_input+0x38a/0x4d0
[ 19.965617] ipv6_rcv+0x1a4/0x1d0
[ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
[ 19.966437] netif_receive_skb+0x8c/0x3c0
[ 19.966846] br_handle_frame_finish+0x779/0x1310
[ 19.967302] br_handle_frame+0x42a/0x830
[ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
[ 19.968167] __netif_receive_skb_one_core+0x96/0x180
[ 19.968658] process_backlog+0x198/0x650
[ 19.969047] net_rx_action+0x2fa/0xaa0
[ 19.969420] __do_softirq+0x268/0x7bc
[ 19.969785]
[ 19.969940] The buggy address belongs to the object at ffff888112862840
[ 19.969940] which belongs to the cache skbuff_head_cache of size 224
[ 19.971202] The buggy address is located 140 bytes inside of
[ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
[ 19.972344] The buggy address belongs to the page:
[ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[ 19.973930] flags: 0x8000000000010200(slab|head)
[ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[ 19.975915] page dumped because: kasan: bad access detected
[ 19.976461] page_owner tracks the page as allocated
[ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[ 19.978332] prep_new_page+0x24b/0x330
[ 19.978707] get_page_from_freelist+0x2057/0x2c90
[ 19.979170] __alloc_pages_nodemask+0x218/0x590
[ 19.979619] new_slab+0x9d/0x300
[ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
[ 19.980421] __slab_alloc.constprop.0+0x30/0x60
[ 19.980870] kmem_cache_alloc+0x201/0x230
[ 19.981269] __alloc_skb+0x91/0x510
[ 19.981620] alloc_skb_with_frags+0x78/0x4a0
[ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
[ 19.982476] unix_stream_sendmsg+0x399/0x7f0
[ 19.982904] sock_sendmsg+0xe2/0x110
[ 19.983262] ____sys_sendmsg+0x4de/0x6d0
[ 19.983660] ___sys_sendmsg+0xe4/0x160
[ 19.984032] __sys_sendmsg+0xab/0x130
[ 19.984396] do_syscall_64+0xe7/0xae0
[ 19.984761] page last free stack trace:
[ 19.985142] __free_pages_ok+0x432/0xbc0
[ 19.985533] qlist_free_all+0x56/0xc0
[ 19.985907] quarantine_reduce+0x149/0x170
[ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
[ 19.986791] kmem_cache_alloc+0xe4/0x230
[ 19.987182] prepare_creds+0x24/0x440
[ 19.987548] do_faccessat+0x80/0x590
[ 19.987906] do_syscall_64+0xe7/0xae0
[ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 19.988775]
[ 19.988930] Memory state around the buggy address:
[ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 19.991529] ^
[ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Internet Assigned Numbers Authority (IANA) has recently assigned
a protocol number value of 143 for Ethernet [1].
Before this assignment, encapsulation mechanisms such as Segment Routing
used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
payload is an Ethernet frame.
In this patch, we add the definition of the Ethernet protocol number to the
kernel headers and update the SRv6 L2 tunnels to use it.
[1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Acked-by: Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, DSA drivers should configure CPU and DSA ports to their
maximum speed. In many configurations this is sufficient to make the
link work.
In some cases it is necessary to configure the link to run slower,
e.g. because of limitations of the SoC it is connected to. Or back to
back PHYs are used and the PHY needs to be driven in order to
establish link. In this case, phylink is used.
Only instantiate phylink if it is required. If there is no PHY, or no
fixed link properties, phylink can upset a link which works in the
default configuration.
Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In one error case, tpacket_rcv drops packets after incrementing the
ring producer index.
If this happens, it does not update tp_status to TP_STATUS_USER and
thus the reader is stalled for an iteration of the ring, causing out
of order arrival.
The only such error path is when virtio_net_hdr_from_skb fails due
to encountering an unknown GSO type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
caifdevs->list is traversed using list_for_each_entry_rcu()
outside an RCU read-side critical section but under the
protection of rtnl_mutex. Hence, add the corresponding lockdep
expression to silence the following false-positive warning:
[ 10.868467] =============================
[ 10.869082] WARNING: suspicious RCU usage
[ 10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
[ 10.870804] -----------------------------
[ 10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bucket->lock is not needed in the sock_hash_free and sock_map_free
calls, in fact it is causing a splat due to being inside rcu block.
| BUG: sleeping function called from invalid context at net/core/sock.c:2935
| in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
| 3 locks held by kworker/0:1/62:
| #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
| #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
| #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
| dump_stack+0x71/0xa0
| ___might_sleep.cold+0xa6/0xb6
| lock_sock_nested+0x28/0x90
| sock_map_free+0x5f/0x180
| bpf_map_free_deferred+0x58/0x80
| process_one_work+0x260/0x5e0
| worker_thread+0x4d/0x3e0
| kthread+0x108/0x140
| ? process_one_work+0x5e0/0x5e0
| ? kthread_park+0x90/0x90
| ret_from_fork+0x3a/0x50
The reason we have stab->lock and bucket->locks in sockmap code is to
handle checking EEXIST in update/delete cases. We need to be careful during
an update operation that we check for EEXIST and we need to ensure that the
psock object is not in some partial state of removal/insertion while we do
this. So both map_update_common and sock_map_delete need to guard from being
run together potentially deleting an entry we are checking, etc. But by the
time we get to the tear-down code in sock_{ma[|hash}_free we have already
disconnected the map and we just did synchronize_rcu() in the line above so
no updates/deletes should be in flight. Because of this we can drop the
bucket locks from the map free'ing code, noting no update/deletes can be
in-flight.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/158385850787.30597.8346421465837046618.stgit@john-Precision-5820-Tower
When trying to transmit to an unknown destination, the mesh code would
unconditionally transmit a HWMP PREQ even if HWMP is not the current
path selection algorithm.
Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add missing attribute validation for beacon report scanning
to the netlink policy.
Fixes: 1d76250bd3 ("nl80211: support beacon report scanning")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
During IB device removal, cancel the event worker before the device
structure is freed.
Fixes: a4cf0443c4 ("smc: introduce SMC as an IB-client")
Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafał found an issue that for non-Ethernet interface, if we down and up
frequently, the memory will be consumed slowly.
The reason is we add allnodes/allrouters addressed in multicast list in
ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast
addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up()
for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb
getting bigger and bigger. The call stack looks like:
addrconf_notify(NETDEV_REGISTER)
ipv6_add_dev
ipv6_dev_mc_inc(ff01::1)
ipv6_dev_mc_inc(ff02::1)
ipv6_dev_mc_inc(ff02::2)
addrconf_notify(NETDEV_UP)
addrconf_dev_config
/* Alas, we support only Ethernet autoconfiguration. */
return;
addrconf_notify(NETDEV_DOWN)
addrconf_ifdown
ipv6_mc_down
igmp6_group_dropped(ff02::2)
mld_add_delrec(ff02::2)
igmp6_group_dropped(ff02::1)
igmp6_group_dropped(ff01::1)
After investigating, I can't found a rule to disable multicast on
non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM,
tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up()
in inetdev_event(). Even for IPv6, we don't check the dev type and call
ipv6_add_dev(), ipv6_dev_mc_inc() after register device.
So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for
non-Ethernet interface.
v2: Also check IFF_MULTICAST flag to make sure the interface supports
multicast
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <zajec5@gmail.com>
Fixes: 74235a25c6 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
Fixes: 1666d49e1d ("mld: do not remove mld souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a TCP socket is allocated in IRQ context or cloned from unassociated
(i.e. not associated to a memcg) in IRQ context then it will remain
unassociated for its whole life. Almost half of the TCPs created on the
system are created in IRQ context, so, memory used by such sockets will
not be accounted by the memcg.
This issue is more widespread in cgroup v1 where network memory
accounting is opt-in but it can happen in cgroup v2 if the source socket
for the cloning was created in root memcg.
To fix the issue, just do the association of the sockets at the accept()
time in the process context and then force charge the memory buffer
already used and reserved by the socket.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Don't schedule OGM for disabled interface, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl5iPBEWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX+wEAC3nG5q5Yu6G81vSdQiu8ey4xjR
CIGl+kSxX/ryK7GaEhvSZ+/rdFbVsJN276CjvLTbnwIHf0QJV6a6A68+Xo0qG6Nc
7Bpo9VP43GQJfmzFdqMQz5W/oVd88XKCfF7V9Cf+FIMkBTmuuxGfmNKcjpJVWZD4
imQihGpknW/QUjlKuIVHD79ZEBn75iCR0HYK8Xuc88XIDUUDA/fROdXJS8jge4BM
4UBFn46xQndBAVOJkYwd2FKhLwLT0YVJqMLBoQBf3Lq5Td6R2x6yrzXfyzNlKXEp
oaZsHSB0Zo9v7ICxqY+CA0Yk8SxRrbGV2+cxjOHgudDcDnWF0MELcah0KOmcsPcx
+G+dacTECrI0lxDu+LAWL//weKjHVh7WOeJk8wqRDgSV7AmCrSZS83v1cRXPQF41
0UTKbWPnCe6TANmgllt+H7ere2O9vGbrLPD9jLH8kE1eDgfG1lLC8PjIxA5L/6T5
sEvDg2/6JdQtAjqrAc+MP4hm4H2yEzK+n6ozA1Fn9gXn1RiKFr/zOB33PyaHUkUL
K2MXVZANW0RLKRIb/zXQ8bJcMEAU7aeHZytm7n/JmRpSkRV1uvJSgeOF/NvZFg9P
uBE0TO2RzXmJyUBzdSdZfrCSPobFbR1Gs9Uks+1U5FXjxw6j9adx+/5WuRbhJii1
qdv3ueo7awMeX3RrJg==
=JJc9
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20200306' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't schedule OGM for disabled interface, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In our production environment we have faced with problem that updating
classid in cgroup with heavy tasks cause long freeze of the file tables
in this tasks. By heavy tasks we understand tasks with many threads and
opened sockets (e.g. balancers). This freeze leads to an increase number
of client timeouts.
This patch implements following logic to fix this issue:
аfter iterating 1000 file descriptors file table lock will be released
thus providing a time gap for socket creation/deletion.
Now update is non atomic and socket may be skipped using calls:
dup2(oldfd, newfd);
close(oldfd);
But this case is not typical. Moreover before this patch skip is possible
too by hiding socket fd in unix socket buffer.
New sockets will be allocated with updated classid because cgroup state
is updated before start of the file descriptors iteration.
So in common cases this patch has no side effects.
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1ec17dbd90 ("inet_diag: fix reporting cgroup classid and
fallback to priority") croup classid reporting was fixed. But this works
only for TCP sockets because for other socket types icsk parameter can
be NULL and classid code path is skipped. This change moves classid
handling to inet_diag_msg_attrs_fill() function.
Also inet_diag_msg_attrs_size() helper was added and addends in
nlmsg_new() were reordered to save order from inet_sk_diag_fill().
Fixes: 1ec17dbd90 ("inet_diag: fix reporting cgroup classid and fallback to priority")
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Patches to bump position index from sysctl seq_next,
from Vasilin Averin.
2) Release flowtable hook from error path, from Florian Westphal.
3) Patches to add missing netlink attribute validation,
from Jakub Kicinski.
4) Missing NFTA_CHAIN_FLAGS in nf_tables_fill_chain_info().
5) Infinite loop in module autoload if extension is not available,
from Florian Westphal.
6) Missing module ownership in inet/nat chain type definition.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently passive MPTCP socket can skip including the DACK
option - if the peer sends data before accept() completes.
The above happens because the msk 'can_ack' flag is set
only after the accept() call.
Such missing DACK option may cause - as per RFC spec -
unwanted fallback to TCP.
This change addresses the issue using the key material
available in the current subflow, if any, to create a suitable
dack option when msk ack seq is not yet available.
v1 -> v2:
- adavance the generated ack after the initial MPC packet
Fixes: d22f4988ff ("mptcp: process MP_CAPABLE data option")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar to commit 674d9de02a ("NFC: Fix possible memory
corruption when handling SHDLC I-Frame commands") and commit d7ee81ad09
("NFC: nci: Add some bounds checking in nci_hci_cmd_received()") which
added range checks on "pipe".
The "pipe" variable comes skb->data[0] in nfc_hci_msg_rx_work().
It's in the 0-255 range. We're using it as the array index into the
hdev->pipes[] array which has NFC_HCI_MAX_PIPES (128) members.
Fixes: 118278f20a ("NFC: hci: Add pipes table to reference them with a tuple {gate, host}")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nft will loop forever if the kernel doesn't support an expression:
1. nft_expr_type_get() appends the family specific name to the module list.
2. -EAGAIN is returned to nfnetlink, nfnetlink calls abort path.
3. abort path sets ->done to true and calls request_module for the
expression.
4. nfnetlink replays the batch, we end up in nft_expr_type_get() again.
5. nft_expr_type_get attempts to append family-specific name. This
one already exists on the list, so we continue
6. nft_expr_type_get adds the generic expression name to the module
list. -EAGAIN is returned, nfnetlink calls abort path.
7. abort path encounters the family-specific expression which
has 'done' set, so it gets removed.
8. abort path requests the generic expression name, sets done to true.
9. batch is replayed.
If the expression could not be loaded, then we will end up back at 1),
because the family-specific name got removed and the cycle starts again.
Note that userspace can SIGKILL the nft process to stop the cycle, but
the desired behaviour is to return an error after the generic expr name
fails to load the expression.
Fixes: eb014de4fd ("netfilter: nf_tables: autoload modules from the abort path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add missing attribute validation for tunnel source and
destination ports to the netlink policy.
Fixes: af308b94a2 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add missing attribute validation for NFTA_PAYLOAD_CSUM_FLAGS
to the netlink policy.
Fixes: 1814096980 ("netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If hook registration fails, the hooks allocated via nft_netdev_hook_alloc
need to be freed.
We can't change the goto label to 'goto 5' -- while it does fix the memleak
it does cause a warning splat from the netfilter core (the hooks were not
registered).
Fixes: 3f0465a9ef ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables")
Reported-by: syzbot+a2ff6fa45162a5ed4dd3@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If .next function does not change position index,
following .show function will repeat output related
to current position index.
Without patch:
# dd if=/proc/net/ip_tables_matches # original file output
conntrack
conntrack
conntrack
recent
recent
icmp
udplite
udp
tcp
0+1 records in
0+1 records out
65 bytes copied, 5.4074e-05 s, 1.2 MB/s
# dd if=/proc/net/ip_tables_matches bs=62 skip=1
dd: /proc/net/ip_tables_matches: cannot skip to specified offset
cp <<< end of last line
tcp <<< and then unexpected whole last line once again
0+1 records in
0+1 records out
7 bytes copied, 0.000102447 s, 68.3 kB/s
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If .next function does not change position index,
following .show function will repeat output related
to current position index.
Without the patch:
# dd if=/proc/net/xt_recent/SSH # original file outpt
src=127.0.0.4 ttl: 0 last_seen: 6275444819 oldest_pkt: 1 6275444819
src=127.0.0.2 ttl: 0 last_seen: 6275438906 oldest_pkt: 1 6275438906
src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
0+1 records in
0+1 records out
204 bytes copied, 6.1332e-05 s, 3.3 MB/s
Read after lseek into middle of last line (offset 140 in example below)
generates expected end of last line and then unexpected whole last line
once again
# dd if=/proc/net/xt_recent/SSH bs=140 skip=1
dd: /proc/net/xt_recent/SSH: cannot skip to specified offset
127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
0+1 records in
0+1 records out
132 bytes copied, 6.2487e-05 s, 2.1 MB/s
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If .next function does not change position index,
following .show function will repeat output related
to current position index.
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If .next function does not change position index,
following .show function will repeat output related
to current position index.
Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Place phylink_start()/phylink_stop() inside dsa_port_enable() and
dsa_port_disable(), which ensures that we call phylink_stop() before
tearing down phylink - which is a documented requirement. Failure
to do so can cause use-after-free bugs.
Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we modify the peer route and changed it to a new one, we should
remove the old route first. Before the fix:
+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium
After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::3 proto kernel metric 256 pref medium
This patch depend on the previous patch "net/ipv6: need update peer route
when modify metric" to update new peer route after delete old one.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we modify the route metric, the peer address's route need also
be updated. Before the fix:
+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2 metric 60
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 60 pref medium
2001:db8::2 proto kernel metric 60 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 60 pref medium
After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 61 pref medium
Fixes: 8308f3ff17 ("net/ipv6: Add support for specifying metric of connected routes")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for vendor subcommand attributes
to the netlink policy.
Fixes: 9e58095f96 ("NFC: netlink: Implement vendor command support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for NFC_ATTR_TARGET_INDEX
to the netlink policy.
Fixes: 4d63adfe12 ("NFC: Add NFC_CMD_DEACTIVATE_TARGET support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for NFC_ATTR_SE_INDEX
to the netlink policy.
Fixes: 5ce3f32b52 ("NFC: netlink: SE API implementation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for TIPC_NLA_PROP_MTU
to the netlink policy.
Fixes: 901271e040 ("tipc: implement configuration of UDP media MTU")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for TCA_TAPRIO_ATTR_TXTIME_DELAY
to the netlink policy.
Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for TCA_FQ_ORPHAN_MASK
to the netlink policy.
Fixes: 06eb395fa9 ("pkt_sched: fq: better control of DDOS traffic")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing attribute validation for OVS_PACKET_ATTR_HASH
to the netlink policy.
Fixes: bd1903b7c4 ("net: openvswitch: add hash info to upcall")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>