Introduce bpf_mem_alloc_check_size() to check whether the allocation
size exceeds the limitation for the kmalloc-equivalent allocator. The
upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than
non-percpu allocation, so a percpu argument is added to the helper.
The helper will be used in the following patch to check whether the size
parameter passed to bpf_mem_alloc() is too big.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the
bits are dynamically allocated. However, the check is incorrect and may
cause a kmemleak as shown below:
unreferenced object 0xffff88812628c8c0 (size 32):
comm "swapper/0", pid 1, jiffies 4294727320
hex dump (first 32 bytes):
b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U...........
f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 ..............
backtrace (crc 781e32cc):
[<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80
[<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0
[<00000000597124d6>] __alloc.isra.0+0x89/0xb0
[<000000004ebfffcd>] alloc_bulk+0x2af/0x720
[<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0
[<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610
[<000000008b616eac>] bpf_global_ma_init+0x19/0x30
[<00000000fc473efc>] do_one_initcall+0xd3/0x3c0
[<00000000ec81498c>] kernel_init_freeable+0x66a/0x940
[<00000000b119f72f>] kernel_init+0x20/0x160
[<00000000f11ac9a7>] ret_from_fork+0x3c/0x70
[<0000000004671da4>] ret_from_fork_asm+0x1a/0x30
That is because nr_bits will be set as zero in bpf_iter_bits_next()
after all bits have been iterated.
Fix the issue by setting kit->bit to kit->nr_bits instead of setting
kit->nr_bits to zero when the iteration completes in
bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to
check whether the iteration is complete and still use "nr_bits > 64" to
indicate whether bits are dynamically allocated. The "!nr_bits" check is
necessary because bpf_iter_bits_new() may fail before setting
kit->nr_bits, and this condition will stop the iteration early instead
of accessing the zeroed or freed kit->bits.
Considering the initial value of kit->bits is -1 and the type of
kit->nr_bits is unsigned int, change the type of kit->nr_bits to int.
The potential overflow problem will be handled in the following patch.
Fixes: 4665415975 ("bpf: Add bits iterator")
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241030100516.3633640-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Two small fixes, both in drivers (ufs and scsi_debug).
Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZyH+cSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishVdMAQDdOiaS
9DO+ly/Il64wXZqb9WKcVYRIjmz7m7g5xdMgrgEA1yfD6G7GgQ3zvbVPNC7Y9ecr
4O2iR5EGAVb1Y7UaEQU=
=551G
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two small fixes, both in drivers (ufs and scsi_debug)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix another deadlock during RTC update
scsi: scsi_debug: Fix do_device_access() handling of unexpected SG copy length
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If we don't do that, the group is considered usable by userspace, but
all further GROUP_SUBMIT will fail with -EINVAL.
Changes in v3:
- Add R-bs
Changes in v2:
- New patch
Fixes: de85488138 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241029152912.270346-3-boris.brezillon@collabora.com
Userspace can use GROUP_SUBMIT errors as a trigger to check the group
state and recreate the group if it became unusable. Make sure we
report an error when the group became unusable.
Changes in v3:
- None
Changes in v2:
- Add R-bs
Fixes: de85488138 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241029152912.270346-2-boris.brezillon@collabora.com
The system and GPU MMU page size might differ, which becomes a
problem for FW sections that need to be mapped at explicit addresses
since our PAGE_SIZE alignment might cover a VA range that's
expected to be used for another section.
Make sure we never map more than we need.
Changes in v3:
- Add R-bs
Changes in v2:
- Plan for per-VM page sizes so the MCU VM and user VM can
have different pages sizes
Fixes: 2718d91816 ("drm/panthor: Add the FW logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241030150231.768949-1-boris.brezillon@collabora.com
This was previously fixed with commit 1147dd0503
("nvme: fix error-handling for io_uring nvme-passthrough"), but the
change was mistakenly undone in a later commit.
Fixes: d6aacee925 ("nvme: use bio_integrity_map_user")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup()
for the same controller. So it's better to nullify it after release on
error path in order to avoid double free later in nvmet_destroy_auth().
Found by Linux Verification Center (linuxtesting.org) with Svace.
Fixes: 7a277c37d3 ("nvmet-auth: Diffie-Hellman key exchange support")
Cc: stable@vger.kernel.org
Signed-off-by: Vitaliy Shevtsov <v.shevtsov@maxima.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
A recent commit enables integrity checks for formats the previous kernel
versions registered with the "nop" integrity profile. This means
namespaces using that format become unreadable when upgrading the kernel
past that commit.
Introduce a module parameter to restore the "nop" integrity profile so
that storage can be readable once again. This could be a boot device, so
the setting needs to happen at module load time.
Fixes: 921e81db52 ("nvme: allow integrity when PI is not in first bytes")
Reported-by: David Wei <dw@davidwei.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The WD19 family of docks has the same audio chipset as the WD15. This
change enables jack detection on the WD19.
We don't need the dell_dock_mixer_init quirk for the WD19. It is only
needed because of the dell_alc4020_map quirk for the WD15 in
mixer_maps.c, which disables the volume controls. Even for the WD15,
this quirk was apparently only needed when the dock firmware was not
updated.
Signed-off-by: Jan Schär <jan@jschaer.ch>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20241029221249.15661-1-jan@jschaer.ch
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The biggest set of changes here is Hans' fixes and quirks for various
Baytrail based platforms with RT5640 CODECs, and there's one core fix
for a missed length assignment for __counted_by() checking. Otherwise
it's small device specific fixes, several of them in the DT bindings.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAmciNo8ACgkQJNaLcl1U
h9CNfAf/bw39l2d16DVOdzq1gv4I0BUX7M/wTsLRujnDCv8F7qZn3BhhPEeEiLDP
3wa8MzwcnXGI7rM5kzPKUERI352N7FzWUpSz6r7QtszPpttzx8HSxcHuuU68msSo
oqrUmEqA+1sFuDwsMMm85uUpeHHFQtgEhtMxMafz9VxWhTqSQCfIoM62pAns2Xdq
X3mSaovGIofWXszMjzf7tWrWfAAnzgvYjmOYNd7QwIpi/HZL9iAxw/orbLW6AZCm
ZnrhiGxf5ZZeeaiZhEzdH7iktM1+WpvLihl1PUD8JcgCHW3CJud/OqtpXV+AeyHm
p7dI0bb71g0RJ08pt+i64wxS6HA+Qg==
=I/0i
-----END PGP SIGNATURE-----
Merge tag 'asoc-fix-v6.12-rc5' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.12
The biggest set of changes here is Hans' fixes and quirks for various
Baytrail based platforms with RT5640 CODECs, and there's one core fix
for a missed length assignment for __counted_by() checking. Otherwise
it's small device specific fixes, several of them in the DT bindings.
ip6table_nat module unload has refcnt warning for UAF. call trace is:
WARNING: CPU: 1 PID: 379 at kernel/module/main.c:853 module_put+0x6f/0x80
Modules linked in: ip6table_nat(-)
CPU: 1 UID: 0 PID: 379 Comm: ip6tables Not tainted 6.12.0-rc4-00047-gc2ee9f594da8-dirty #205
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:module_put+0x6f/0x80
Call Trace:
<TASK>
get_info+0x128/0x180
do_ip6t_get_ctl+0x6a/0x430
nf_getsockopt+0x46/0x80
ipv6_getsockopt+0xb9/0x100
rawv6_getsockopt+0x42/0x190
do_sock_getsockopt+0xaa/0x180
__sys_getsockopt+0x70/0xc0
__x64_sys_getsockopt+0x20/0x30
do_syscall_64+0xa2/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Concurrent execution of module unload and get_info() trigered the warning.
The root cause is as follows:
cpu0 cpu1
module_exit
//mod->state = MODULE_STATE_GOING
ip6table_nat_exit
xt_unregister_template
kfree(t)
//removed from templ_list
getinfo()
t = xt_find_table_lock
list_for_each_entry(tmpl, &xt_templates[af]...)
if (strcmp(tmpl->name, name))
continue; //table not found
try_module_get
list_for_each_entry(t, &xt_net->tables[af]...)
return t; //not get refcnt
module_put(t->me) //uaf
unregister_pernet_subsys
//remove table from xt_net list
While xt_table module was going away and has been removed from
xt_templates list, we couldnt get refcnt of xt_table->me. Check
module in xt_net->tables list re-traversal to fix it.
Fixes: fdacd57c79 ("netfilter: x_tables: never register tables by default")
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Directly return the error from xfs_bmap_longest_free_extent instead
of breaking from the loop and handling it there, and use a done
label to directly jump to the exist when we found a suitable perag
structure to reduce the indentation level and pag/max_pag check
complexity in the tail of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the main loop in xfs_filestream_pick_ag fails to find a suitable
AG it tries to just pick the online AG. But the loop for that uses
args->pag as loop iterator while the later code expects pag to be
set. Fix this by reusing the max_pag case for this last resort, and
also add a check for impossible case of no AG just to make sure that
the uninitialized pag doesn't even escape in theory.
Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Fixes: f8f1ed1ab3 ("xfs: return a referenced perag from filestreams allocator")
Cc: <stable@vger.kernel.org> # v6.3
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Recently, we found that the CPU spent a lot of time in
xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented
spaces.
The reason is that we conducted much extra searching for extents that
could not yield a better result, and these searches would cost a lot of
time when there were millions of extents to search through. Even if we
get the same result length, we don't switch our choice to the new one,
so we can definitely terminate the search early.
Since the result length cannot exceed the found length, when the found
length equals the best result length we already have, we can conclude
the search.
We did a test in that filesystem:
[root@localhost ~]# xfs_db -c freesp /dev/vdb
from to extents blocks pct
1 1 215 215 0.01
2 3 994476 1988952 99.99
Before this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) * 15597.94 us | }
After this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) 19.176 us | }
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Extsize should only be allowed to be set on files with no data in it.
For this, we check if the files have extents but miss to check if
delayed extents are present. This patch adds that check.
While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags()
**Without the patch (SUCCEEDS)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
**With the patch (FAILS as expected)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument
Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The NOC firewall interrupt means that the HW prevented
unauthorized access to a protected resource, so there
is no need to trigger device reset in such case.
To facilitate security testing add firewall_irq_counter
debugfs file that tracks firewall interrupts.
Fixes: 8a27ad81f7 ("accel/ivpu: Split IP and buttress code")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Andrzej Kacprowski <Andrzej.Kacprowski@intel.com>
Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017144958.79327-1-jacek.lawrynowicz@linux.intel.com
- cgroup_bpf_release_fn() could saturate system_wq with
cgrp->bpf.release_work which can then form a circular dependency leading
to deadlocks. Fix by using a dedicated workqueue. The system_wq's max
concurrency limit is being increased separately.
- Fix theoretical off-by-one bug when enforcing max cgroup hierarchy depth.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZyGCPA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGS2MAQDmtRNBlDYl36fiLAsylU4Coz5P0Y4ISmtSWT+c
zrEUZAD/WKSlCfy4RFngmnfkYbrJ+tWOVTMtsDqby8IzYLDGBw8=
=glRQ
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cgroup_bpf_release_fn() could saturate system_wq with
cgrp->bpf.release_work which can then form a circular dependency
leading to deadlocks. Fix by using a dedicated workqueue. The
system_wq's max concurrency limit is being increased separately.
- Fix theoretical off-by-one bug when enforcing max cgroup hierarchy
depth
* tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix potential overflow issue when checking max_depth
cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZyF/5Q4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGRi+AP4+jGUz+O1LS0bCNj44Xlr0v6kci5dfJR7TlBv5
hwROcgEA84i7nRq6oJ1IkK7ItLbZYwgZyxqdn0Pgsq+oMWhgAwE=
=R766
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes
* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix enq_last_no_enq_fails selftest
sched_ext: Make cast_mask() inline
scx: Fix raciness in scx_ops_bypass()
scx: Fix exit selftest to use custom DSQ
sched_ext: Fix function pointer type mismatches in BPF selftests
selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmcgrxcACgkQu+CwddJF
iJrq9ggAiZ/2c7p23s52LdVhT9GTyV5omVOh2kDztVx4w6RM3RbkhkLWdqt0XUag
uf1TJe6kOvnCeHEFEEo3sqPj820XebxKDf0GGCdI6a9f4n30ipKH+vWSQ0iutKO/
dOBdArxr0FGOV5VZR9i3xQ6sUqZXXUbJdte0c0ovp6Q6HDHTeQeKNhOQ2fv33TG/
7jBh5HVyhI6JE/+TOxrMaklH0IqYBb6z49wdbaN7XBvXVXlb5MtOZy109gfUHDwe
tfktifyE45VtmF0WdHfxDbCnqyDSG1Jm3wsLDbMq+voJ1BQlUvIZ5Dv4kucYqffm
VN5HkH6uQ09aoounBoU4g50UYeNpiQ==
=xAw8
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fix for a slub_kunit test warning with MEM_ALLOC_PROFILING_DEBUG (Pei
Xiao)
- Fix for a MTE-based KASAN BUG in krealloc() (Qun-Wei Lin)
* tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm: krealloc: Fix MTE false alarm in __do_krealloc
slub/kunit: fix a WARNING due to unwrapped __kmalloc_cache_noprof
No particular theme here - mainly singletons, a couple of doubletons.
Please see the changelogs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyBpwAAKCRDdBJ7gKXxA
jt9XAPsEfjtMc6wtcII5zXLXbLbznnCenaX0bSOmAHMQsQS63QEAp/JTyjN1rBjm
DExd7kbYx9ya61fnBLZ2WfEMm0Sbigc=
=PIza
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes. 13 are cc:stable. 13 are MM and 8 are non-MM.
No particular theme here - mainly singletons, a couple of doubletons.
Please see the changelogs"
* tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
mm: avoid unconditional one-tick sleep when swapcache_prepare fails
mseal: update mseal.rst
mm: split critical region in remap_file_pages() and invoke LSMs in between
selftests/mm: fix deadlock for fork after pthread_create with atomic_bool
Revert "selftests/mm: replace atomic_bool with pthread_barrier_t"
Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
tools: testing: add expand-only mode VMA test
mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()
resource,kexec: walk_system_ram_res_rev must retain resource flags
nilfs2: fix kernel bug due to missing clearing of checked flag
mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id
ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
mm: shmem: fix data-race in shmem_getattr()
mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL
x86/traps: move kmsan check after instrumentation_begin
resource: remove dependency on SPARSEMEM from GET_FREE_REGION
mm/mmap: fix race in mmap_region() with ftruncate()
mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
fork: only invoke khugepaged, ksm hooks if no error
fork: do not invoke uffd on fork if error occurs
...
Another set of fixes, mostly iwlwifi:
* fix infinite loop in 6 GHz scan if more than
255 colocated APs were reported
* revert removal of retry loops for now to work
around issues with firmware initialization on
some devices/platforms
* fix SAR table issues with some BIOSes
* fix race in suspend/debug collection
* fix memory leak in fw recovery
* fix link ID leak in AP mode for older devices
* fix sending TX power constraints
* fix link handling in FW restart
And also the stack:
* fix setting TX power from userspace with the new
chanctx emulation code for old-style drivers
* fix a memory corruption bug due to structure
embedding
* fix CQM configuration double-free when moving
between net namespaces
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmcgnyMACgkQ10qiO8sP
aAApKhAAgnQgnDXmtFYyP6qKr5tkHnY+4VcLOOUI13lGAuhIJwfMqct4204TUS+4
/3qBajSsr3n/pKjI/RQ9LtRdUHoZS+O6zVmSjAEFq8JU+Z8cZKo4teY8SfXY9pXm
I/6X93o9hMCeL1+ebr4qgZVJFW9jdQnM9xJgR9lwNDViJUQT0Bn6Xfy2Crvkl2hn
cnr1wU3lH261vf22X8ON+giWToyCvsQmKExGQzenSLJ1UK8bwcu9akWA7GZsjFYK
Skp9q95EEFJ9V2qGAHWCivD3//rJlsXZPfwgT2QhHvvMawrb/l9pva9fk543y8Xk
9IpLa8WEiLivYk7lW4qzykrfWhNM+ubdOJVUIl23qbpI8aP/ZU3QHzIn9ZOGrBVJ
yvu647OC0P/xNMHJ7jhOCAxmduqwsxKsRcnES1ZwWdQWAmHO97bSTh95+YfLsAwD
6l/PQVzeUcypeYG9rXnBkA/k8obFh8dJnIU4d96rxJdrtF7ixRiedfvx9V9tGEx4
dWu+4KVkh2W7iSXAWI/cammpU06ptzFNcjEYIGs/7xttu/riXH4fNcgm2wl+oPe7
SMaBwbGf2nTXtrqLPaVSkyM16MtSVHh9iqzYH//N357PCd1Kr9L+5xUKx1hfJ7X+
7Z9+h9azBCQXeXz2AuWQWWynZPTSrdVLYuMNsdLJ17HAerDZJc8=
=4M1T
-----END PGP SIGNATURE-----
Merge tag 'wireless-2024-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
wireless fixes for v6.12-rc6
Another set of fixes, mostly iwlwifi:
* fix infinite loop in 6 GHz scan if more than
255 colocated APs were reported
* revert removal of retry loops for now to work
around issues with firmware initialization on
some devices/platforms
* fix SAR table issues with some BIOSes
* fix race in suspend/debug collection
* fix memory leak in fw recovery
* fix link ID leak in AP mode for older devices
* fix sending TX power constraints
* fix link handling in FW restart
And also the stack:
* fix setting TX power from userspace with the new
chanctx emulation code for old-style drivers
* fix a memory corruption bug due to structure
embedding
* fix CQM configuration double-free when moving
between net namespaces
* tag 'wireless-2024-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: ieee80211_i: Fix memory corruption bug in struct ieee80211_chanctx
wifi: iwlwifi: mvm: fix 6 GHz scan construction
wifi: cfg80211: clear wdev->cqm_config pointer on free
mac80211: fix user-power when emulating chanctx
Revert "wifi: iwlwifi: remove retry loops in start"
wifi: iwlwifi: mvm: don't add default link in fw restart flow
wifi: iwlwifi: mvm: Fix response handling in iwl_mvm_send_recovery_cmd()
wifi: iwlwifi: mvm: SAR table alignment
wifi: iwlwifi: mvm: Use the sync timepoint API in suspend
wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd
wifi: iwlwifi: mvm: don't leak a link on AP removal
====================
Link: https://patch.msgid.link/20241029093926.13750-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Config a small gso_max_size/gso_ipv4_max_size will lead to an underflow
in sk_dst_gso_max_size(), which may trigger a BUG_ON crash,
because sk->sk_gso_max_size would be much bigger than device limits.
Call Trace:
tcp_write_xmit
tso_segs = tcp_init_tso_segs(skb, mss_now);
tcp_set_skb_tso_segs
tcp_skb_pcount_set
// skb->len = 524288, mss_now = 8
// u16 tso_segs = 524288/8 = 65535 -> 0
tso_segs = DIV_ROUND_UP(skb->len, mss_now)
BUG_ON(!tso_segs)
Add check for the minimum value of gso_max_size and gso_ipv4_max_size.
Fixes: 46e6b992c2 ("rtnetlink: allow GSO maximums to be set on device creation")
Fixes: 9eefedd58a ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241023035213.517386-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mounting btrfs from two images (which have the same one fsid and two
different dev_uuids) in certain executing order may trigger an UAF for
variable 'device->bdev_file' in __btrfs_free_extra_devids(). And
following are the details:
1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs
devices by ioctl(BTRFS_IOC_SCAN_DEV):
/ btrfs_device_1 → loop0
fs_device
\ btrfs_device_2 → loop1
2. mount /dev/loop0 /mnt
btrfs_open_devices
btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0)
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
fail: btrfs_close_devices // -ENOMEM
btrfs_close_bdev(btrfs_device_1)
fput(btrfs_device_1->bdev_file)
// btrfs_device_1->bdev_file is freed
btrfs_close_bdev(btrfs_device_2)
fput(btrfs_device_2->bdev_file)
3. mount /dev/loop1 /mnt
btrfs_open_devices
btrfs_get_bdev_and_sb(&bdev_file)
// EIO, btrfs_device_1->bdev_file is not assigned,
// which points to a freed memory area
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
btrfs_free_extra_devids
if (btrfs_device_1->bdev_file)
fput(btrfs_device_1->bdev_file) // UAF !
Fix it by setting 'device->bdev_file' as 'NULL' after closing the
btrfs_device in btrfs_close_one_device().
Fixes: 1423881941 ("btrfs: do not background blkdev_put()")
CC: stable@vger.kernel.org # 4.19+
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a test for out-of-bounds write in trie_get_next_key() when a full
path from root to leaf exists and bpf_map_get_next_key() is called
with the leaf node. It may crashes the kernel on failure, so please
run in a VM.
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/Zxx4ep78tsbeWPVM@localhost.localdomain
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.
Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Tested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ensure the refcount and async_copies fields are initialized early.
cleanup_async_copy() will reference these fields if an error occurs
in nfsd4_copy(). If they are not correctly initialized, at the very
least, a refcount underflow occurs.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Merge series from Alexey Klimov <alexey.klimov@linaro.org>:
This sent as RFC because of the following:
- regarding the LO switch patch. I've got info about that from two persons
independently hence not sure what tags to put there and who should be
the author. Please let me know if that needs to be corrected.
- the wcd937x pdm watchdog is a problem for audio playback and needs to be
fixed. The minimal fix would be to at least increase timeout value but
it will still trigger in case of plenty of dbg messages or other
delay-generating things. Unfortunately, I can't test HPHL/R outputs hence
the patch is only for AUX. The other options would be introducing
module parameter for debugging and using HOLD_OFF bit for that or
adding Kconfig option.
Alexey Klimov (2):
ASoC: codecs: wcd937x: add missing LO Switch control
ASoC: codecs: wcd937x: relax the AUX PDM watchdog
sound/soc/codecs/wcd937x.c | 12 ++++++++++--
sound/soc/codecs/wcd937x.h | 4 ++++
2 files changed, 14 insertions(+), 2 deletions(-)
--
2.45.2
This command:
$ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
Error: block dev insert failed: -EBUSY.
fails because user space requests the same block index to be set for
both ingress and egress.
[ side note, I don't think it even failed prior to commit 913b47d342
("net/sched: Introduce tc block netdev tracking infra"), because this
is a command from an old set of notes of mine which used to work, but
alas, I did not scientifically bisect this ]
The problem is not that it fails, but rather, that the second time
around, it fails differently (and irrecoverably):
$ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
Error: dsa_core: Flow block cb is busy.
[ another note: the extack is added by me for illustration purposes.
the context of the problem is that clsact_init() obtains the same
&q->ingress_block pointer as &q->egress_block, and since we call
tcf_block_get_ext() on both of them, "dev" will be added to the
block->ports xarray twice, thus failing the operation: once through
the ingress block pointer, and once again through the egress block
pointer. the problem itself is that when xa_insert() fails, we have
emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the
offload never sees a corresponding FLOW_BLOCK_UNBIND. ]
Even correcting the bad user input, we still cannot recover:
$ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact
Error: dsa_core: Flow block cb is busy.
Basically the only way to recover is to reboot the system, or unbind and
rebind the net device driver.
To fix the bug, we need to fill the correct error teardown path which
was missed during code movement, and call tcf_block_offload_unbind()
when xa_insert() fails.
[ last note, fundamentally I blame the label naming convention in
tcf_block_get_ext() for the bug. The labels should be named after what
they do, not after the error path that jumps to them. This way, it is
obviously wrong that two labels pointing to the same code mean
something is wrong, and checking the code correctness at the goto site
is also easier ]
Fixes: 94e2557d08 ("net: sched: move block device tracking into tcf_block_get/put_ext()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241023100541.974362-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This was found by a static analyzer.
We should not forget the trailing zero after copy_from_user()
if we will further do some string operations, sscanf() in this
case. Adding a trailing zero will ensure that the function
performs properly.
Fixes: c6385c0b67 ("netdevsim: Allow reporting activity on nexthop buckets")
Signed-off-by: Zichen Xie <zichenxie0106@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241022171907.8606-1-zichenxie0106@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The test added is a simplified reproducer from syzbot report [1].
If verifier does not insert checkpoint somewhere inside the loop,
verification of the program would take a very long time.
This would happen because mark_chain_precision() for register r7 would
constantly trace jump history of the loop back, processing many
iterations for each mark_chain_precision() call.
[1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com
A specifically crafted program might trick verifier into growing very
long jump history within a single bpf_verifier_state instance.
Very long jump history makes mark_chain_precision() unreasonably slow,
especially in case if verifier processes a loop.
Mitigate this by forcing new state in is_state_visited() in case if
current state's jump history is too long.
Use same constant as in `skip_inf_loop_check`, but multiply it by
arbitrarily chosen value 2 to account for jump history containing not
only information about jumps, but also information about stack access.
For an example of problematic program consider the code below,
w/o this patch the example is processed by verifier for ~15 minutes,
before failing to allocate big-enough chunk for jmp_history.
0: r7 = *(u16 *)(r1 +0);"
1: r7 += 0x1ab064b9;"
2: if r7 & 0x702000 goto 1b;
3: r7 &= 0x1ee60e;"
4: r7 += r1;"
5: if r7 s> 0x37d2 goto +0;"
6: r0 = 0;"
7: exit;"
Perf profiling shows that most of the time is spent in
mark_chain_precision() ~95%.
The easiest way to explain why this program causes problems is to
apply the following patch:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0c216e71cec7..4b4823961abe 100644
\--- a/include/linux/bpf.h
\+++ b/include/linux/bpf.h
\@@ -1926,7 +1926,7 @@ struct bpf_array {
};
};
-#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */
+#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */
#define MAX_TAIL_CALL_CNT 33
/* Maximum number of loops for bpf_loop and bpf_iter_num.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f514247ba8ba..75e88be3bb3e 100644
\--- a/kernel/bpf/verifier.c
\+++ b/kernel/bpf/verifier.c
\@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
skip_inf_loop_check:
if (!force_new_state &&
env->jmps_processed - env->prev_jmps_processed < 20 &&
- env->insn_processed - env->prev_insn_processed < 100)
+ env->insn_processed - env->prev_insn_processed < 100) {
+ verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n",
+ env->insn_idx,
+ env->jmps_processed - env->prev_jmps_processed,
+ cur->jmp_history_cnt);
add_new_state = false;
+ }
goto miss;
}
/* If sl->state is a part of a loop and this loop's entry is a part of
\@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
if (!add_new_state)
return 0;
+ verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n",
+ env->insn_idx);
+
/* There were no equivalent states, remember the current one.
* Technically the current state is not proven to be safe yet,
* but it will either reach outer most bpf_exit (which means it's safe)
And observe verification log:
...
is_state_visited: new checkpoint at 5, resetting env->jmps_processed
5: R1=ctx() R7=ctx(...)
5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...)
6: (b7) r0 = 0 ; R0_w=0
7: (95) exit
from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0
6: R1=ctx() R7=ctx(...) R10=fp0
6: (b7) r0 = 0 ; R0_w=0
7: (95) exit
is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74
from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0
1: R1=ctx() R7_w=scalar(...) R10=fp0
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
... mark_precise 152 steps for r7 ...
2: R7_w=scalar(...)
is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
...
BPF program is too large. Processed 257 insn
The log output shows that checkpoint at label (1) is never created,
because it is suppressed by `skip_inf_loop_check` logic:
a. When 'if' at (2) is processed it pushes a state with insn_idx (1)
onto stack and proceeds to (3);
b. At (5) checkpoint is created, and this resets
env->{jmps,insns}_processed.
c. Verification proceeds and reaches `exit`;
d. State saved at step (a) is popped from stack and is_state_visited()
considers if checkpoint needs to be added, but because
env->{jmps,insns}_processed had been just reset at step (b)
the `skip_inf_loop_check` logic forces `add_new_state` to false.
e. Verifier proceeds with current state, which slowly accumulates
more and more entries in the jump history.
The accumulation of entries in the jump history is a problem because
of two factors:
- it eventually exhausts memory available for kmalloc() allocation;
- mark_chain_precision() traverses the jump history of a state,
meaning that if `r7` is marked precise, verifier would iterate
ever growing jump history until parent state boundary is reached.
(note: the log also shows a REG INVARIANTS VIOLATION warning
upon jset processing, but that's another bug to fix).
With this patch applied, the example above is rejected by verifier
under 1s of time, reaching 1M instructions limit.
The program is a simplified reproducer from syzbot report.
Previous discussion could be found at [1].
The patch does not cause any changes in verification performance,
when tested on selftests from veristat.cfg and cilium programs taken
from [2].
[1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/
[2] https://github.com/anakryiko/cilium
Changelog:
- v1 -> v2:
- moved patch to bpf tree;
- moved force_new_state variable initialization after declaration and
shortened the comment.
v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/
Fixes: 2589726d12 ("bpf: introduce bounded loops")
Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com
Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed
to be either root or ingress. This assumption is bogus since it's valid
to create egress qdiscs with major handle ffff:
Budimir Markovic found that for qdiscs like DRR that maintain an active
class list, it will cause a UAF with a dangling class pointer.
In 066a3b5b23, the concern was to avoid iterating over the ingress
qdisc since its parent is itself. The proper fix is to stop when parent
TC_H_ROOT is reached because the only way to retrieve ingress is when a
hierarchy which does not contain a ffff: major handle call into
qdisc_lookup with TC_H_MAJ(TC_H_ROOT).
In the scenario where major ffff: is an egress qdisc in any of the tree
levels, the updates will also propagate to TC_H_ROOT, which then the
iteration must stop.
Fixes: 066a3b5b23 ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
net/sched/sch_api.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241024165547.418570-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CI occasionaly encounters a failing test run. Example:
# PASS: ipsec tunnel mode for ns1/ns2
# re-run with random mtus: -o 10966 -l 19499 -r 31322
# PASS: flow offloaded for ns1/ns2
[..]
# FAIL: ipsec tunnel ... counter 1157059 exceeds expected value 878489
This script will re-exec itself, on the second run, random MTUs are
chosen for the involved links. This is done so we can cover different
combinations (large mtu on client, small on server, link has lowest
mtu, etc).
Furthermore, file size is random, even for the first run.
Rework this script and always use the same file size on initial run so
that at least the first round can be expected to have reproducible
behavior.
Second round will use random mtu/filesize.
Raise the failure limit to that of the file size, this should avoid all
errneous test errors. Currently, first fin will remove the offload, so if
one peer is already closing remaining data is handled by classic path,
which result in larger-than-expected counter and a test failure.
Given packet path also counts tcp/ip headers, in case offload is
completely broken this test will still fail (as expected).
The test counter limit could be made more strict again in the future
once flowtable can keep a connection in offloaded state until FINs
in both directions were seen.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241022152324.13554-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Existing user space applications maintained by the Osmocom project are
breaking since a recent fix that addresses incorrect error checking.
Restore operation for user space programs that specify -1 as file
descriptor to skip GTPv0 or GTPv1 only sockets.
Fixes: defd8b3c37 ("gtp: fix a potential NULL pointer dereference")
Reported-by: Pau Espin Pedrol <pespin@sysmocom.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Oliver Smith <osmith@sysmocom.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241022144825.66740-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
daddr can be NULL if there is no neighbour table entry present,
in that case the tx packet should be dropped.
saddr will usually be set by MCTP core, but check for NULL in case a
packet is transmitted by a different protocol.
Fixes: f5b8abf9fc ("mctp i2c: MCTP I2C binding driver")
Cc: stable@vger.kernel.org
Reported-by: Dung Cao <dung@os.amperecomputing.com>
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241022-mctp-i2c-null-dest-v3-1-e929709956c5@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The per-netns IP tunnel hash table is protected by the RTNL mutex and
ip_tunnel_find() is only called from the control path where the mutex is
taken.
Add a lockdep expression to hlist_for_each_entry_rcu() in
ip_tunnel_find() in order to validate that the mutex is held and to
silence the suspicious RCU usage warning [1].
[1]
WARNING: suspicious RCU usage
6.12.0-rc3-custom-gd95d9a31aceb #139 Not tainted
-----------------------------
net/ipv4/ip_tunnel.c:221 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/362:
#0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60
stack backtrace:
CPU: 12 UID: 0 PID: 362 Comm: ip Not tainted 6.12.0-rc3-custom-gd95d9a31aceb #139
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0xba/0x110
lockdep_rcu_suspicious.cold+0x4f/0xd6
ip_tunnel_find+0x435/0x4d0
ip_tunnel_newlink+0x517/0x7a0
ipgre_newlink+0x14c/0x170
__rtnl_newlink+0x1173/0x19c0
rtnl_newlink+0x6c/0xa0
rtnetlink_rcv_msg+0x3cc/0xf60
netlink_rcv_skb+0x171/0x450
netlink_unicast+0x539/0x7f0
netlink_sendmsg+0x8c1/0xd80
____sys_sendmsg+0x8f9/0xc20
___sys_sendmsg+0x197/0x1e0
__sys_sendmsg+0x122/0x1f0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241023123009.749764-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are code paths from which the function is called without holding
the RCU read lock, resulting in a suspicious RCU usage warning [1].
Fix by using l3mdev_master_upper_ifindex_by_index() which will acquire
the RCU read lock before calling
l3mdev_master_upper_ifindex_by_index_rcu().
[1]
WARNING: suspicious RCU usage
6.12.0-rc3-custom-gac8f72681cf2 #141 Not tainted
-----------------------------
net/core/dev.c:876 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/361:
#0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60
stack backtrace:
CPU: 3 UID: 0 PID: 361 Comm: ip Not tainted 6.12.0-rc3-custom-gac8f72681cf2 #141
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0xba/0x110
lockdep_rcu_suspicious.cold+0x4f/0xd6
dev_get_by_index_rcu+0x1d3/0x210
l3mdev_master_upper_ifindex_by_index_rcu+0x2b/0xf0
ip_tunnel_bind_dev+0x72f/0xa00
ip_tunnel_newlink+0x368/0x7a0
ipgre_newlink+0x14c/0x170
__rtnl_newlink+0x1173/0x19c0
rtnl_newlink+0x6c/0xa0
rtnetlink_rcv_msg+0x3cc/0xf60
netlink_rcv_skb+0x171/0x450
netlink_unicast+0x539/0x7f0
netlink_sendmsg+0x8c1/0xd80
____sys_sendmsg+0x8f9/0xc20
___sys_sendmsg+0x197/0x1e0
__sys_sendmsg+0x122/0x1f0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: db53cd3d88 ("net: Handle l3mdev in ip_tunnel_init_flow")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241022063822.462057-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reset POR_EL0 to "allow all" before writing the signal frame, preventing
spurious uaccess failures.
When POE is supported, the POR_EL0 register constrains memory
accesses based on the target page's POIndex (pkey). This raises the
question: what constraints should apply to a signal handler? The
current answer is that POR_EL0 is reset to POR_EL0_INIT when
invoking the handler, giving it full access to POIndex 0. This is in
line with x86's MPK support and remains unchanged.
This is only part of the story, though. POR_EL0 constrains all
unprivileged memory accesses, meaning that uaccess routines such as
put_user() are also impacted. As a result POR_EL0 may prevent the
signal frame from being written to the signal stack (ultimately
causing a SIGSEGV). This is especially concerning when an alternate
signal stack is used, because userspace may want to prevent access
to it outside of signal handlers. There is currently no provision
for that: POR_EL0 is reset after writing to the stack, and
POR_EL0_INIT only enables access to POIndex 0.
This patch ensures that POR_EL0 is reset to its most permissive
state before the signal stack is accessed. Once the signal frame has
been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up
to the signal handler to enable access to additional pkeys if
needed. As to sigreturn(), it expects having access to the stack
like any other syscall; we only need to ensure that POR_EL0 is
restored from the signal frame after all uaccess calls. This
approach is in line with the recent x86/pkeys series [1].
Resetting POR_EL0 early introduces some complications, in that we
can no longer read the register directly in preserve_poe_context().
This is addressed by introducing a struct (user_access_state)
and helpers to manage any such register impacting user accesses
(uaccess and accesses in userspace). Things look like this on signal
delivery:
1. Save original POR_EL0 into struct [save_reset_user_access_state()]
2. Set POR_EL0 to "allow all" [save_reset_user_access_state()]
3. Create signal frame
4. Write saved POR_EL0 value to the signal frame [preserve_poe_context()]
5. Finalise signal frame
6. If all operations succeeded:
a. Set POR_EL0 to POR_EL0_INIT [set_handler_user_access_state()]
b. Else reset POR_EL0 to its original value [restore_user_access_state()]
If any step fails when setting up the signal frame, the process will
be sent a SIGSEGV, which it may be able to handle. Step 6.b ensures
that the original POR_EL0 is saved in the signal frame when
delivering that SIGSEGV (so that the original value is restored by
sigreturn).
The return path (sys_rt_sigreturn) doesn't strictly require any change
since restore_poe_context() is already called last. However, to
avoid uaccess calls being accidentally added after that point, we
use the same approach as in the delivery path, i.e. separating
uaccess from writing to the register:
1. Read saved POR_EL0 value from the signal frame [restore_poe_context()]
2. Set POR_EL0 to the saved value [restore_user_access_state()]
[1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/
Fixes: 9160f7e909 ("arm64: add POE signal support")
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://lore.kernel.org/r/20241029144539.111155-2-kevin.brodsky@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
The tcp_bpf_recvmsg_parser() function, running in user context,
retrieves seq_copied from tcp_sk without holding the socket lock, and
stores it in a local variable seq. However, the softirq context can
modify tcp_sk->seq_copied concurrently, for example, n tcp_read_sock().
As a result, the seq value is stale when it is assigned back to
tcp_sk->copied_seq at the end of tcp_bpf_recvmsg_parser(), leading to
incorrect behavior.
Due to concurrency, the copied_seq field in tcp_bpf_recvmsg_parser()
might be set to an incorrect value (less than the actual copied_seq) at
the end of function: 'WRITE_ONCE(tcp->copied_seq, seq)'. This causes the
'offset' to be negative in tcp_read_sock()->tcp_recv_skb() when
processing new incoming packets (sk->copied_seq - skb->seq becomes less
than 0), and all subsequent packets will be dropped.
Signed-off-by: Jiayuan Chen <mrpre@163.com>
Link: https://lore.kernel.org/r/20241028065226.35568-1-mrpre@163.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
node_to_amd_nb() is defined to NULL in non-AMD configs:
drivers/platform/x86/amd/hsmp/plat.c: In function 'init_platform_device':
drivers/platform/x86/amd/hsmp/plat.c:165:68: error: dereferencing 'void *' pointer [-Werror]
165 | sock->root = node_to_amd_nb(i)->root;
| ^~
drivers/platform/x86/amd/hsmp/plat.c:165:68: error: request for member 'root' in something not a structure or union
Users of the interface who also allow COMPILE_TEST will cause the above build
error so provide an inline stub to fix that.
[ bp: Massage commit message. ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20241029092329.3857004-1-arnd@kernel.org