Commit Graph

1137335 Commits

Author SHA1 Message Date
Alex Hung
d39e2ad63d mailmap: update Alex Hung's email address
I am no longer at Canonical and add entry of my personal email address.

Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.com
Signed-off-by: Alex Hung <alexhung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:43 -08:00
Ian Cowan
4a42344081 mm: mmap: fix documentation for vma_mas_szero
When the struct_mm input, mm, was changed to a struct ma_state, mas, the
documentation for the function was never updated.  This updates that
documentation reference.

Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:42 -08:00
SeongJae Park
8468b48661 mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
A DAMON sysfs interface user can start DAMON with a scheme, remove the
sysfs directory for the scheme, and then ask update of the scheme's stats.
Because the schemes stats update logic isn't aware of the situation, it
results in an invalid memory access.  Fix the bug by checking if the
scheme sysfs directory exists.

Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
Fixes: 0ac32b8aff ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>	[v5.18]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:42 -08:00
Alistair Popple
4a955bed88 mm/memory: return vm_fault_t result from migrate_to_ram() callback
The migrate_to_ram() callback should always succeed, but in rare cases can
fail usually returning VM_FAULT_SIGBUS.  Commit 16ce101db8
("mm/memory.c: fix race when faulting a device private page") incorrectly
stopped passing the return code up the stack.  Fix this by setting the ret
variable, restoring the previous behaviour on migrate_to_ram() failure.

Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
Fixes: 16ce101db8 ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:42 -08:00
Li Liguang
cd08d80ecd mm: correctly charge compressed memory to its memcg
Kswapd will reclaim memory when memory pressure is high, the annonymous
memory will be compressed and stored in the zpool if zswap is enabled. 
The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
kernel thread and cause the compressed memory not be charged to its memory
cgroup.

Remove the memcg_kmem_bypass() call and properly charge compressed memory
to its corresponding memory cgroup.

Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
Fixes: f4840ccfca ("zswap: memcg accounting")
Signed-off-by: Li Liguang <liliguang@baidu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>	[5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:42 -08:00
Mike Kravetz
b6305049f3 ipc/shm: call underlying open/close vm_ops
Shared memory segments can be created that are backed by hugetlb pages. 
When this happens, the vmas associated with any mappings (shmat) are
marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
vm_ops, and this is done for most operations.  However, it is not done for
open and close.

This was not an issue until the introduction of the hugetlb vma_lock. 
This lock structure is pointed to by vm_private_data and the open/close
vm_ops help maintain this structure.  The special hugetlb routine called
at fork took care of structure updates at fork time.  However,
vma_splitting is not properly handled for ipc shared memory mappings
backed by hugetlb pages.  This can result in a "kernel NULL pointer
dereference" BUG or use after free as two vmas point to the same lock
structure.

Update the shm open and close routines to always call the underlying open
and close routines.

Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
Fixes: 8d9bfb2608 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:42 -08:00
Mukesh Ojha
a6f810efab gcov: clang: fix the buffer overflow issue
Currently, in clang version of gcov code when module is getting removed
gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
dst->functions and it result in the kernel panic in below crash report. 
Fix this by properly handling it.

[    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
[    8.899100][  T599] Mem abort info:
[    8.899102][  T599]   ESR = 0x9600004f
[    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
[    8.899105][  T599]   SET = 0, FnV = 0
[    8.899107][  T599]   EA = 0, S1PTW = 0
[    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
[    8.899110][  T599] Data abort info:
[    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
[    8.899113][  T599]   CM = 0, WnR = 1
[    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
[    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
[    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
[    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
....
..,
[    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
[    8.899547][  T599] Hardware name: XXX (DT)
[    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
[    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
[    8.899559][  T599] sp : ffffffc00e733b00
[    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
[    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
[    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
[    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
[    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
[    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
[    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
[    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
[    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
[    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
[    8.899589][  T599] Call trace:
[    8.899590][  T599]  gcov_info_add+0x9c/0xb8
[    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
[    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
[    8.899598][  T599]  do_init_module+0x2a8/0x33c
[    8.899600][  T599]  load_module+0x23cc/0x261c
[    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
[    8.899604][  T599]  invoke_syscall+0x94/0x2bc
[    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
[    8.899609][  T599]  do_el0_svc+0x40/0x54
[    8.899611][  T599]  el0_svc+0x94/0x2f0
[    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
[    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
[    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
[    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---

Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
Fixes: e178a5beb3 ("gcov: clang support")
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Gautam Menghani
045634ff1e mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call
Refactor the mm_khugepaged_scan_file tracepoint to move filename
dereference to the tracepoint definition, to maintain consistency with
other tracepoints[1].

[1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/

Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
Fixes: d41fd2016e ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Charan Teja Kalla
ed86b74874 mm/page_exit: fix kernel doc warning in page_ext_put()
Fix the below compiler warnings reported with 'make W=1 mm/'. 
mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
described in 'page_ext_put'.

[quic_pkondeti@quicinc.com: better patch title]
Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
Fixes: b1d5488a25 ("mm: fix use-after free of page_ext after race with memory-offline")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Yang Shi
e031ff96b3 mm: khugepaged: allow page allocation fallback to eligible nodes
Syzbot reported the below splat:

WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
 do_madvise mm/madvise.c:1432 [inline]
 __do_sys_madvise mm/madvise.c:1432 [inline]
 __se_sys_madvise mm/madvise.c:1430 [inline]
 __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
 </TASK>

The khugepaged code would pick up the node with the most hit as the preferred
node, and also tries to do some balance if several nodes have the same
hit record.  Basically it does conceptually:
    * If the target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 on default config)
    * If the max_value == node_load[nid], then target_node = nid

But there is a corner case, paritucularly for MADV_COLLAPSE, that the
non-existing node may be returned as preferred node.

Assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
be 0, then it may return 2 for target_node, but it is actually not
existing (offline), so the warn is triggered.

The node balance was introduced by commit 9f1b868a13 ("mm: thp:
khugepaged: add policy for finding target node") to satisfy
"numactl --interleave=all".  But interleaving is a mere hint rather than
something that has hard requirements.

So use nodemask to record the nodes which have the same hit record, the
hugepage allocation could fallback to those nodes.  And remove
__GFP_THISNODE since it does disallow fallback.  And if the nodemask
just has one node set, it means there is one single node has the most
hit record, the nodemask approach actually behaves like __GFP_THISNODE.

Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
Fixes: 7d8faaf155 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Johannes Weiner
f53af4285d mm: vmscan: fix extreme overreclaim and swap floods
During proactive reclaim, we sometimes observe severe overreclaim, with
several thousand times more pages reclaimed than requested.

This trace was obtained from shrink_lruvec() during such an instance:

    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]

While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
by swapping.  These requests take over a minute, during which the write()
to memory.reclaim is unkillably stuck inside the kernel.

Digging into the source, this is caused by the proportional reclaim
bailout logic.  This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly and
in accordance to their size, swappiness, refault rates etc.  The way it
attempts fairness is that once the reclaim goal has been reached, it stops
scanning the LRUs with the smaller remaining scan targets, and adjusts the
remainder of the bigger LRUs according to how much of the smaller LRUs was
scanned.  It then finishes scanning that remainder regardless of the
reclaim goal.

This works fine if priority levels are low and the LRU lists are
comparable in size.  However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've already
been squeezed out by proactive reclaim earlier - and the remaining anon
pages are hot.  Anon rotations cause the priority level to drop to 0,
which results in reclaim targeting all of anon (a lot) and all of file
(almost nothing).  By the time reclaim decides to bail, it has scanned
most or all of the file target, and therefor must also scan most or all of
the enormous anon target.  This target is thousands of times larger than
the reclaim goal, thus causing the overreclaim.

The bailout code hasn't changed in years, why is this failing now?  The
most likely explanations are two other recent changes in anon reclaim:

1. Before the series starting with commit 5df741963d ("mm: fix LRU
   balancing effect of new transparent huge pages"), the VM was
   overall relatively reluctant to swap at all, even if swap was
   configured. This means the LRU balancing code didn't come into play
   as often as it does now, and mostly in high pressure situations
   where pronounced swap activity wouldn't be as surprising.

2. For historic reasons, shrink_lruvec() loops on the scan targets of
   all LRU lists except the active anon one, meaning it would bail if
   the only remaining pages to scan were active anon - even if there
   were a lot of them.

   Before the series starting with commit ccc5dc6734 ("mm/vmscan:
   make active/inactive ratio as 1:1 for anon lru"), most anon pages
   would live on the active LRU; the inactive one would contain only a
   handful of preselected reclaim candidates. After the series, anon
   gets aged similarly to file, and the inactive list is the default
   for new anon pages as well, making it often the much bigger list.

   As a result, the VM is now more likely to actually finish large
   anon targets than before.

Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
larger LRU lists is made before bailing out on a met reclaim goal.

This fixes the extreme overreclaim problem.

Fairness is more subtle and harder to evaluate.  No obvious misbehavior
was observed on the test workload, in any case.  Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans.  Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat.  This is also
acknowledged by the myriad exceptions in get_scan_count().  This patch
makes fairness decrease gradually, as it keeps fairness work static over
increasing priority levels with growing scan targets.  This should make
more sense - although we may have to re-visit the exact values.

Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Alexander Potapenko
436fa4a699 docs: kmsan: fix formatting of "Example report"
Add a blank line to make the sentence before the list render as a separate
paragraph, not a definition.

Link: https://lkml.kernel.org/r/20221107142255.4038811-1-glider@google.com
Fixes: 93858ae70c ("kmsan: add ReST documentation")
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
SeongJae Park
1de09a7281 mm/damon/dbgfs: check if rm_contexts input is for a real context
A user could write a name of a file under 'damon/' debugfs directory,
which is not a user-created context, to 'rm_contexts' file.  In the case,
'dbgfs_rm_context()' just assumes it's the valid DAMON context directory
only if a file of the name exist.  As a result, invalid memory access
could happen as below.  Fix the bug by checking if the given input is for
a directory.  This check can filter out non-context inputs because
directories under 'damon/' debugfs directory can be created via only
'mk_contexts' file.

This bug has found by syzbot[1].

[1] https://lore.kernel.org/damon/000000000000ede3ac05ec4abf8e@google.com/

Link: https://lkml.kernel.org/r/20221107165001.5717-2-sj@kernel.org
Fixes: 75c1c2b53c ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: syzbot+6087eafb76a94c4ac9eb@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>	[5.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Liam Howlett
7dc5ba6254 maple_tree: don't set a new maximum on the node when not reusing nodes
In RCU mode, the node limits were being updated to the last pivot which
may not be correct and would cause the metadata to be set when it
shouldn't.  Fix this by not setting a new limit in this case.

Link: https://lkml.kernel.org/r/20221107163857.867377-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Liam Howlett
9bbba56334 maple_tree: fix depth tracking in maple_state
It is possible to confuse the depth tracking in the maple state by
searching the same node for values.  Fix the depth tracking by moving
where the depth is incremented closer to where the node changes level. 
Also change the initial depth setting when using the root node.

Link: https://lkml.kernel.org/r/20221107163814.866612-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Naoya Horiguchi
1fdbed657a arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging
The following bug is reported to be triggered when starting X on x86-32
system with i915:

  [  225.777375] kernel BUG at mm/memory.c:2664!
  [  225.777391] invalid opcode: 0000 [#1] PREEMPT SMP
  [  225.777405] CPU: 0 PID: 2402 Comm: Xorg Not tainted 6.1.0-rc3-bdg+ #86
  [  225.777415] Hardware name:  /8I865G775-G, BIOS F1 08/29/2006
  [  225.777421] EIP: __apply_to_page_range+0x24d/0x31c
  [  225.777437] Code: ff ff 8b 55 e8 8b 45 cc e8 0a 11 ec ff 89 d8 83 c4 28 5b 5e 5f 5d c3 81 7d e0 a0 ef 96 c1 74 ad 8b 45 d0 e8 2d 83 49 00 eb a3 <0f> 0b 25 00 f0 ff ff 81 eb 00 00 00 40 01 c3 8b 45 ec 8b 00 e8 76
  [  225.777446] EAX: 00000001 EBX: c53a3b58 ECX: b5c00000 EDX: c258aa00
  [  225.777454] ESI: b5c00000 EDI: b5900000 EBP: c4b0fdb4 ESP: c4b0fd80
  [  225.777462] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010202
  [  225.777470] CR0: 80050033 CR2: b5900000 CR3: 053a3000 CR4: 000006d0
  [  225.777479] Call Trace:
  [  225.777486]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.777684]  apply_to_page_range+0x21/0x27
  [  225.777694]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.777870]  remap_io_mapping+0x49/0x75 [i915]
  [  225.778046]  ? i915_memcpy_init_early+0x63/0x63 [i915]
  [  225.778220]  ? mutex_unlock+0xb/0xd
  [  225.778231]  ? i915_vma_pin_fence+0x6d/0xf7 [i915]
  [  225.778420]  vm_fault_gtt+0x2a9/0x8f1 [i915]
  [  225.778644]  ? lock_is_held_type+0x56/0xe7
  [  225.778655]  ? lock_is_held_type+0x7a/0xe7
  [  225.778663]  ? 0xc1000000
  [  225.778670]  __do_fault+0x21/0x6a
  [  225.778679]  handle_mm_fault+0x708/0xb21
  [  225.778686]  ? mt_find+0x21e/0x5ae
  [  225.778696]  exc_page_fault+0x185/0x705
  [  225.778704]  ? doublefault_shim+0x127/0x127
  [  225.778715]  handle_exception+0x130/0x130
  [  225.778723] EIP: 0xb700468a

Recently pud_huge() got aware of non-present entry by commit 3a194f3f8a
("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present
pud entry") to handle some special states of gigantic page.  However, it's
overlooked that pud_none() always returns false when running with 2-level
paging, and as a result pud_huge() can return true pointlessly.

Introduce "#if CONFIG_PGTABLE_LEVELS > 2" to pud_huge() to deal with this.

Link: https://lkml.kernel.org/r/20221107021010.2449306-1-naoya.horiguchi@linux.dev
Fixes: 3a194f3f8a ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Johannes Weiner
82e60d00b7 fs: fix leaked psi pressure state
When psi annotations were added to to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was
faulty.  A pressure state, once entered, is never left.  This results in
incorrectly elevated pressure, which triggers OOM kills.

pflags record the *previous* memstall state when we enter a new one.  The
code tried to initialize pflags to 1, and then optimize the leave call
when we either didn't enter a memstall, or were already inside a nested
stall.  However, there can be multiple PageWorkingset pages in the bio, at
which point it's that path itself that enters repeatedly and overwrites
pflags.  This causes us to miss the exit.

Enter the stall only once if needed, then unwind correctly.

erofs has the same problem, fix that up too.  And move the memstall exit
past submit_bio() to restore submit accounting originally added by
b8e24a9300 ("block: annotate refault stalls from IO submission").

Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org
Fixes: 4088a47e78 ("btrfs: add manual PSI accounting for compressed reads")
Fixes: 99486c511f ("erofs: add manual PSI accounting for the compressed address space")
Fixes: 118f3663fb ("block: remove PSI accounting from the bio layer")
Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:25 -08:00
Ryusuke Konishi
8cccf05fe8 nilfs2: fix use-after-free bug of ns_writer on remount
If a nilfs2 filesystem is downgraded to read-only due to metadata
corruption on disk and is remounted read/write, or if emergency read-only
remount is performed, detaching a log writer and synchronizing the
filesystem can be done at the same time.

In these cases, use-after-free of the log writer (hereinafter
nilfs->ns_writer) can happen as shown in the scenario below:

 Task1                               Task2
 --------------------------------    ------------------------------
 nilfs_construct_segment
   nilfs_segctor_sync
     init_wait
     init_waitqueue_entry
     add_wait_queue
     schedule
                                     nilfs_remount (R/W remount case)
				       nilfs_attach_log_writer
                                         nilfs_detach_log_writer
                                           nilfs_segctor_destroy
                                             kfree
     finish_wait
       _raw_spin_lock_irqsave
         __raw_spin_lock_irqsave
           do_raw_spin_lock
             debug_spin_lock_before  <-- use-after-free

While Task1 is sleeping, nilfs->ns_writer is freed by Task2.  After Task1
waked up, Task1 accesses nilfs->ns_writer which is already freed.  This
scenario diagram is based on the Shigeru Yoshida's post [1].

This patch fixes the issue by not detaching nilfs->ns_writer on remount so
that this UAF race doesn't happen.  Along with this change, this patch
also inserts a few necessary read-only checks with superblock instance
where only the ns_writer pointer was used to check if the filesystem is
read-only.

Link: https://syzkaller.appspot.com/bug?id=79a4c002e960419ca173d55e863bd09e8112df8b
Link: https://lkml.kernel.org/r/20221103141759.1836312-1-syoshida@redhat.com [1]
Link: https://lkml.kernel.org/r/20221104142959.28296-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+f816fa82f8783f7a02bb@syzkaller.appspotmail.com
Reported-by: Shigeru Yoshida <syoshida@redhat.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Alexander Potapenko
ba54d194f8 x86/traps: avoid KMSAN bugs originating from handle_bug()
There is a case in exc_invalid_op handler that is executed outside the
irqentry_enter()/irqentry_exit() region when an UD2 instruction is used to
encode a call to __warn().

In that case the `struct pt_regs` passed to the interrupt handler is never
unpoisoned by KMSAN (this is normally done in irqentry_enter()), which
leads to false positives inside handle_bug().

Use kmsan_unpoison_entry_regs() to explicitly unpoison those registers
before using them.

Link: https://lkml.kernel.org/r/20221102110611.1085175-5-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Alexander Potapenko
83d0edfa04 kmsan: make sure PREEMPT_RT is off
As pointed out by Peter Zijlstra, __msan_poison_alloca() does not play
well with IRQ code when PREEMPT_RT is on, because in that mode even
GFP_ATOMIC allocations cannot be performed.

Fixing this would require making stackdepot completely lockless, which is
quite challenging and may be excessive for the time being.

Instead, make sure KMSAN is incompatible with PREEMPT_RT, like other debug
configs are.

Link: https://lkml.kernel.org/r/20221102110611.1085175-4-glider@google.com
Link: https://lore.kernel.org/lkml/20221025221755.3810809-1-glider@google.com/
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Alexander Potapenko
ac66998df3 Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN
As pointed out by Masahiro Yamada, Kconfig picks up the first default
entry which has true 'if' condition.  Hence, the previously added check
for KMSAN was never used, because it followed the checks for 64BIT and
!64BIT.

Put KMSAN check before others to ensure it is always applied.

Link: https://lkml.kernel.org/r/20221102110611.1085175-3-glider@google.com
Link: https://github.com/google/kmsan/issues/89
Link: https://lore.kernel.org/linux-mm/20221024212144.2852069-3-glider@google.com/
Fixes: 921757bc9b ("Kconfig.debug: disable CONFIG_FRAME_WARN for KMSAN by default")
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Alexander Potapenko
11385b2612 x86/uaccess: instrument copy_from_user_nmi()
Make sure usercopy hooks from linux/instrumented.h are invoked for
copy_from_user_nmi().  This fixes KMSAN false positives reported when
dumping opcodes for a stack trace.

Link: https://lkml.kernel.org/r/20221102110611.1085175-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Alexander Potapenko
cbadaf71f7 kmsan: core: kmsan_in_runtime() should return true in NMI context
Without that, every call to __msan_poison_alloca() in NMI may end up
allocating memory, which is NMI-unsafe.

Link: https://lkml.kernel.org/r/20221102110611.1085175-1-glider@google.com
Link: https://lore.kernel.org/lkml/20221025221755.3810809-1-glider@google.com/
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Vasily Gorbik
db5e8d8431 mm: hugetlb_vmemmap: include missing linux/moduleparam.h
The kernel test robot reported build failures with a 'randconfig' on s390:
>> mm/hugetlb_vmemmap.c:421:11: error: a function declaration without a
prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
   core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
             ^

Link: https://lore.kernel.org/linux-mm/202210300751.rG3UDsuc-lkp@intel.com/
Link: https://lkml.kernel.org/r/patch.git-296b83ca939b.your-ad-here.call-01667411912-ext-5073@work.hours
Fixes: 30152245c6 ("mm: hugetlb_vmemmap: replace early_param() with core_param()")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
Peter Xu
93b0d91787 mm/shmem: use page_mapping() to detect page cache for uffd continue
mfill_atomic_install_pte() checks page->mapping to detect whether one page
is used in the page cache.  However as pointed out by Matthew, the page
can logically be a tail page rather than always the head in the case of
uffd minor mode with UFFDIO_CONTINUE.  It means we could wrongly install
one pte with shmem thp tail page assuming it's an anonymous page.

It's not that clear even for anonymous page, since normally anonymous
pages also have page->mapping being setup with the anon vma.  It's safe
here only because the only such caller to mfill_atomic_install_pte() is
always passing in a newly allocated page (mcopy_atomic_pte()), whose
page->mapping is not yet setup.  However that's not extremely obvious
either.

For either of above, use page_mapping() instead.

Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n
Fixes: 153132571f ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
Pankaj Gupta
867400af90 mm/memremap.c: map FS_DAX device memory as decrypted
virtio_pmem use devm_memremap_pages() to map the device memory.  By
default this memory is mapped as encrypted with SEV.  Guest reboot changes
the current encryption key and guest no longer properly decrypts the FSDAX
device meta data.

Mark the corresponding device memory region for FSDAX devices (mapped with
memremap_pages) as decrypted to retain the persistent memory property.

Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com
Fixes: b7b3c01b19 ("mm/memremap_pages: support multiple ranges per invocation")
Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
Peter Xu
624a2c94f5 Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"
Anatoly Pugachev reported sparc64 breakage on the patch:

https://lore.kernel.org/r/20221021160603.GA23307@u164.east.ru

The sparc64 impl of pte_mkdirty() is definitely slightly special in that
it leverages a code patching mechanism for sun4u/sun4v on relevant pgtable
entry operations.

Before having a clue of why the sparc64 is special and caused the patch to
SIGSEGV the processes, revert the patch for now.  The swap path of dirty
bit inheritage is kept because that's using the swap shared code so we
assume it'll not be affected.

Link: https://lkml.kernel.org/r/Y1Wbi4yyVvDtg4zN@x1n
Fixes: 0ccf7f168e ("mm/thp: carry over dirty bit when thp splits on pmd")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Anatoly Pugachev <matorola@gmail.com> 
Tested-by: Anatoly Pugachev <matorola@gmail.com> 
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
Ryusuke Konishi
8ac932a492 nilfs2: fix deadlock in nilfs_count_free_blocks()
A semaphore deadlock can occur if nilfs_get_block() detects metadata
corruption while locating data blocks and a superblock writeback occurs at
the same time:

task 1                               task 2
------                               ------
* A file operation *
nilfs_truncate()
  nilfs_get_block()
    down_read(rwsem A) <--
    nilfs_bmap_lookup_contig()
      ...                            generic_shutdown_super()
                                       nilfs_put_super()
                                         * Prepare to write superblock *
                                         down_write(rwsem B) <--
                                         nilfs_cleanup_super()
      * Detect b-tree corruption *         nilfs_set_log_cursor()
      nilfs_bmap_convert_error()             nilfs_count_free_blocks()
        __nilfs_error()                        down_read(rwsem A) <--
          nilfs_set_error()
            down_write(rwsem B) <--

                           *** DEADLOCK ***

Here, nilfs_get_block() readlocks rwsem A (= NILFS_MDT(dat_inode)->mi_sem)
and then calls nilfs_bmap_lookup_contig(), but if it fails due to metadata
corruption, __nilfs_error() is called from nilfs_bmap_convert_error()
inside the lock section.

Since __nilfs_error() calls nilfs_set_error() unless the filesystem is
read-only and nilfs_set_error() attempts to writelock rwsem B (=
nilfs->ns_sem) to write back superblock exclusively, hierarchical lock
acquisition occurs in the order rwsem A -> rwsem B.

Now, if another task starts updating the superblock, it may writelock
rwsem B during the lock sequence above, and can deadlock trying to
readlock rwsem A in nilfs_count_free_blocks().

However, there is actually no need to take rwsem A in
nilfs_count_free_blocks() because it, within the lock section, only reads
a single integer data on a shared struct with
nilfs_sufile_get_ncleansegs().  This has been the case after commit
aa474a2201 ("nilfs2: add local variable to cache the number of clean
segments"), that is, even before this bug was introduced.

So, this resolves the deadlock problem by just not taking the semaphore in
nilfs_count_free_blocks().

Link: https://lkml.kernel.org/r/20221029044912.9139-1-konishi.ryusuke@gmail.com
Fixes: e828949e5b ("nilfs2: call nilfs_error inside bmap routines")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+45d6ce7b7ad7ef455d03@syzkaller.appspotmail.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>	[2.6.38+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
Li Zetao
cc674ab3c0 mm/mmap: fix memory leak in mmap_region()
There is a memory leak reported by kmemleak:

  unreferenced object 0xffff88817231ce40 (size 224):
    comm "mount.cifs", pid 19308, jiffies 4295917571 (age 405.880s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      60 c0 b2 00 81 88 ff ff 98 83 01 42 81 88 ff ff  `..........B....
    backtrace:
      [<ffffffff81936171>] __alloc_file+0x21/0x250
      [<ffffffff81937051>] alloc_empty_file+0x41/0xf0
      [<ffffffff81937159>] alloc_file+0x59/0x710
      [<ffffffff81937964>] alloc_file_pseudo+0x154/0x210
      [<ffffffff81741dbf>] __shmem_file_setup+0xff/0x2a0
      [<ffffffff817502cd>] shmem_zero_setup+0x8d/0x160
      [<ffffffff817cc1d5>] mmap_region+0x1075/0x19d0
      [<ffffffff817cd257>] do_mmap+0x727/0x1110
      [<ffffffff817518b2>] vm_mmap_pgoff+0x112/0x1e0
      [<ffffffff83adf955>] do_syscall_64+0x35/0x80
      [<ffffffff83c0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

The root cause was traced to an error handing path in mmap_region() when
arch_validate_flags() or mas_preallocate() fails.  In the shared anonymous
mapping sence, vma will be setuped and mapped with a new shared anonymous
file via shmem_zero_setup().  So in this case, the file resource needs to
be released.

Fix it by calling fput(vma->vm_file) and unmap_region() when
arch_validate_flags() or mas_preallocate() returns an error in the shared
anonymous mapping sence.

Link: https://lkml.kernel.org/r/20221028073717.1179380-1-lizetao1@huawei.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Fixes: c462ac288f ("mm: Introduce arch_validate_flags()")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:23 -08:00
James Houghton
8625147caf hugetlbfs: don't delete error page from pagecache
This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache.  That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned.  As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache.  If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses.  For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.

[1]: commit a760542666 ("mm: shmem: don't truncate page if memory failure happens")

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:22 -08:00
Liam Howlett
120b116208 maple_tree: reorganize testing to restore module testing
Along the development cycle, the testing code support for module/in-kernel
compiles was removed.  Restore this functionality by moving any internal
API tests to the userspace side, as well as threading tests.  Fix the
lockdep issues and add a way to reduce memory usage so the tests can
complete with KASAN + memleak detection.  Make the tests work on 32 bit
hosts where possible and detect 32 bit hosts in the radix test suite.

[akpm@linux-foundation.org: fix module export]
[akpm@linux-foundation.org: fix it some more]
[liam.howlett@oracle.com: fix compile warnings on 32bit build in check_find()]
  Link: https://lkml.kernel.org/r/20221107203816.1260327-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20221028180415.3074673-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:22 -08:00
Liam Howlett
9a887877ef maple_tree: mas_anode_descend() clang-analyzer cleanup
clang-analyzer reported some Dead Stores in mas_anode_descend().  Upon
inspection, there were a few clean ups that would make the code cleaner:

The count variable was set from the mt_slots array and then updated but
never used again.  Just use the array reference directly.

Also stop updating the type since it isn't used after the update.

Stop setting the gaps pointer to NULL at the start since it is always
set before the loop begins.

Link: https://lkml.kernel.org/r/20221026151413.4032730-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:22 -08:00
Liam Howlett
c61b3a2b2d maple_tree: remove pointer to pointer use in mas_alloc_nodes()
There is a more direct and cleaner way of implementing the same functional
code.  Remove the confusing and unnecessary use of pointers here.

Link: https://lkml.kernel.org/r/20221026151241.4031117-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:22 -08:00
Linus Torvalds
f0c4d9fc9c Linux 6.1-rc4 2022-11-06 15:07:11 -08:00
Linus Torvalds
16c7a368c8 cxl fixes for 6.1-rc4
- Fix region creation crash with pass-through decoders
 
 - Fix region creation crash when no decoder allocation fails
 
 - Fix region creation crash when scanning regions to enforce the
   increasing physical address order constraint that CXL mandates
 
 - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
   1:1 memory-device-to-region associations.
 
 - Fix a memory leak for cxl_region objects when regions with active
   targets are deleted
 
 - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
   emulated proximity domains.
 
 - Fix region creation failure for switch attached devices downstream of
   a single-port host-bridge
 
 - Fix false positive memory leak of cxl_region objects by recycling
   recently used region ids rather than freeing them
 
 - Add regression test infrastructure for a pass-through decoder
   configuration
 
 - Fix some mailbox payload handling corner cases
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCY2f0dwAKCRDfioYZHlFs
 Z93zAQCHzy4qbEdw95SPQ/BpUJ2rxcWzruFZkaUTU1RHM5lApwEApP9Fjvdkgo9I
 dlQTRON1nSqqoEXqSxbt8RU0I9Z11ws=
 =pBN4
 -----END PGP SIGNATURE-----

Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dan Williams:
 "Several fixes for CXL region creation crashes, leaks and failures.

  This is mainly fallout from the original implementation of dynamic CXL
  region creation (instantiate new physical memory pools) that arrived
  in v6.0-rc1.

  Given the theme of "failures in the presence of pass-through decoders"
  this also includes new regression test infrastructure for that case.

  Summary:

   - Fix region creation crash with pass-through decoders

   - Fix region creation crash when no decoder allocation fails

   - Fix region creation crash when scanning regions to enforce the
     increasing physical address order constraint that CXL mandates

   - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
     1:1 memory-device-to-region associations.

   - Fix a memory leak for cxl_region objects when regions with active
     targets are deleted

   - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
     emulated proximity domains.

   - Fix region creation failure for switch attached devices downstream
     of a single-port host-bridge

   - Fix false positive memory leak of cxl_region objects by recycling
     recently used region ids rather than freeing them

   - Add regression test infrastructure for a pass-through decoder
     configuration

   - Fix some mailbox payload handling corner cases"

* tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region: Recycle region ids
  cxl/region: Fix 'distance' calculation with passthrough ports
  tools/testing/cxl: Add a single-port host-bridge regression config
  tools/testing/cxl: Fix some error exits
  cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
  cxl/region: Fix cxl_region leak, cleanup targets at region delete
  cxl/region: Fix region HPA ordering validation
  cxl/pmem: Use size_add() against integer overflow
  cxl/region: Fix decoder allocation crash
  ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
  cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
  cxl/region: Fix null pointer dereference due to pass through decoder commit
  cxl/mbox: Add a check on input payload size
2022-11-06 13:09:52 -08:00
Linus Torvalds
aa52994915 hwmon fixes for v6.1-rc4
Fix two regressions:
 
 - Commit 54cc3dbfc1 ("hwmon: (pmbus) Add regulator supply into macro")
   resulted in regulator undercount when disabling regulators. Revert it.
 
 - The thermal subsystem rework caused the scmi driver to no longer register
   with the thermal subsystem because index values no longer match.
   To fix the problem, the scmi driver now directly registers with the
   thermal subsystem, no longer through the hwmon core.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEiHPvMQj9QTOCiqgVyx8mb86fmYEFAmNnv0YACgkQyx8mb86f
 mYEAPQ/7BfM6TjP3L85oGHPUxhP+AEkEoCfD1gplgRD4JPxYZZlwv8F+YGTV/PPQ
 mp12Bh8AH+NSryu6PhuhHKZxIvKcgxg+FEedfnT4gPYcMKtevPF84TBIbbFmLpuo
 culgJm4w3w2qwqP480kF0UTow1pRXqwjZYd3/GILzcaCIjWS2JS55w//YMQrE5J+
 +OkZsJSGWJITAJPYhIb0ETFcqjp8GROWfMZos2IdpErzmcuS31tcrD0igSwLXFwP
 bylDDXb9O9AiLT/qzkc0kjZvaohC2zoP5we+7JyXj0sUSODf+wBCEClvM1DI00JN
 CcS0zmgrmscktkTkonZDIz95vqO2Il5UmKXlK6dkpdDSS78JAX8g78LU1J7aFJ/9
 3iZnmYg4J8NXsO01plr4P3cGdPRqDoZMEYojSWq4WQs2VaKNIxsk5Zzcl6XN6OT1
 MT4w55d8QSQEghDu8vUvnmMg1X5Pkzc50RFYcWyeyMDrtue3rP4uD8m+wl9Sjgvs
 tWdayF9JmTQBaVvh3dUO3OR+z+66T8Qk1YBE1xqb+ocptYPNA1+wGum/O5wT+3ew
 lE8Jfn/8q7DAJ7lVNXKMyXMvPhZJZ5afhs+gpIJgPB2+m8p3plKgpGdW6Ew36AKe
 PKwmGx8kgfzQ6RYqXmCMG6wCr+xiLrYDPNOVz5ewC5LvYVtibs4=
 =u6Ip
 -----END PGP SIGNATURE-----

Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
 "Fix two regressions:

   - Commit 54cc3dbfc1 ("hwmon: (pmbus) Add regulator supply into
     macro") resulted in regulator undercount when disabling regulators.
     Revert it.

   - The thermal subsystem rework caused the scmi driver to no longer
     register with the thermal subsystem because index values no longer
     match. To fix the problem, the scmi driver now directly registers
     with the thermal subsystem, no longer through the hwmon core"

* tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  Revert "hwmon: (pmbus) Add regulator supply into macro"
  hwmon: (scmi) Register explicitly with Thermal Framework
2022-11-06 12:59:12 -08:00
Linus Torvalds
727ea09e99 - Add Cooper Lake's stepping to the PEBS guest/host events isolation
fixed microcode revisions checking quirk
 
 - Update Icelake and Sapphire Rapids events constraints
 
 - Use the standard energy unit for Sapphire Rapids in RAPL
 
 - Fix the hw_breakpoint test to fail more graciously on !SMP configs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNnr/4ACgkQEsHwGGHe
 VUrFRg//dyB0lnQcdvIaPd7DWn3WGop+MeZv0NZI7uYk+SqjtJ3yJ/c4ktcaIgJV
 MhTk8Q/gxHvuT+MZarC/f1QYtTqzRQ//rKD2aO/l9Gr813Hu4R0z2AEwrNKDmzyd
 BYy3O5GXGeBAiLxtmKZ2bDlS5z8a9L3dlbLCWqjq6iGIVncljWmEDmNQmA3YPury
 v8f+V8EqfSE4iWcpnNsZOdrmkMkXEzA8X5vRswQ9l2y6qMmnEeUk9Hn9mFlG+QK4
 VDyxkQEB+vZVfWL2UjD3dpEaH5LVyfCQBwOaVdFfHhMmLhoTO2VmRMLza3Qd9ejZ
 RIE1hlRibqGMqyHDTjZvnkPgnz4QQqayDf8UIIwVdaMVdIaZmxcIQwfsbQS12E5b
 9EBzbaD6TJx42E56WuQHM+ZYt6nz0ktPz0IeBFJIwbU30gqJwdi0uIz2kXNpkthC
 eX4Bq/iM9C41A58mj9+uerF9jshi/DJU74KcMGUZiJ7IeGDJgL9CfViOTueMOjr2
 OI8nvLOtwBpj8X3AO1nEVkevSt4KPoTD+NVCNpXmjVm9DNFvMRo2EUsRHHrCkLJN
 EO7iF14rTlSI7IAE+qxNgRsmXPCyuVBhB3S3/3YmCqsH1kQXqlgxT/2eOJN6kCGz
 tlaWnD3TEaifH/DQQVGmv9nNFjS0C49MSxrZ7Oe7phnmSn3vaGY=
 =midC
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Add Cooper Lake's stepping to the PEBS guest/host events isolation
   fixed microcode revisions checking quirk

 - Update Icelake and Sapphire Rapids events constraints

 - Use the standard energy unit for Sapphire Rapids in RAPL

 - Fix the hw_breakpoint test to fail more graciously on !SMP configs

* tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
  perf/x86/intel: Fix pebs event constraints for SPR
  perf/x86/intel: Fix pebs event constraints for ICL
  perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
  perf/hw_breakpoint: test: Skip the test if dependencies unmet
2022-11-06 12:41:32 -08:00
Linus Torvalds
f6f5204727 - Add new Intel CPU models
- Enforce that TDX guests are successfully loaded only on TDX hardware
 where virtualization exception (#VE) delivery on kernel memory is
 disabled because handling those in all possible cases is "essentially
 impossible"
 
 - Add the proper include to the syscall wrappers so that BTF can see the
 real pt_regs definition and not only the forward declaration
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNnrUgACgkQEsHwGGHe
 VUoC2w//T6+5SlusY9uYIUpL/cGYj+888b/ysO0H0S37IVATUiI5m0eFAA+pcWON
 pzn81oBqk1Lstm7x/jT2mzxsZ2fIFbe6EA8hnLAexA4KY70oGhall9Q6O363CmFa
 DtUjd0LKjH6GkNH1RUcb5icJGVY3vZPCfSuxlYJUD66NBUx2pEF8l5hzZ0W20Yhq
 cHVY0i1HoCNNDRBOODrH7MEY/kWMSvhFybCYOfRMhoVd3aJhsLlq+7/7Ic5wabyy
 2mE8b0GU8or9mluU51OiCDjp+qnpB+BTFjV+88ji5jNEKLIarAXkoHDDD06xLhOK
 a2L44zZ55RAFxxCBm9L10OE0ta3kUqpq+YKQkh0gGGdDdAylUp8IF0zXRl/6jRDC
 T76jM1QOvC791HWD6kDf5XizY+PeaVD9LzAREezG6778mZbNNQwOtkECHZF0U3UP
 n/NIabDlZIncuQQbT0sSshrIyfwtkH5E+epcyLuuchYUYnDGkvNkVU31ndiwFhUG
 fW8I53XBnIlk5PunJ0jhaq4+Tugr7APipUs75y8IpFEINj6gxuoSdXyezlQVpmQ+
 tL1UXqxSlQaCoW295Fr19p3ZBBfqRKXSCS/toCluB/ekhP3ISzIZV7/cB1smmsIR
 JpgXQtcAMtXjIv9A1ZexQVlp2srk7Y6WrFocMNc47lKxmHZ78KY=
 =nqZp
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add new Intel CPU models

 - Enforce that TDX guests are successfully loaded only on TDX hardware
   where virtualization exception (#VE) delivery on kernel memory is
   disabled because handling those in all possible cases is "essentially
   impossible"

 - Add the proper include to the syscall wrappers so that BTF can see
   the real pt_regs definition and not only the forward declaration

* tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add several Intel server CPU model numbers
  x86/tdx: Panic on bad configs that #VE on "private" memory access
  x86/tdx: Prepare for using "INFO" call for a second purpose
  x86/syscall: Include asm/ptrace.h in syscall_wrapper header
2022-11-06 12:36:47 -08:00
Linus Torvalds
35697d81a7 Kbuild fixes for v6.1 (2nd)
- Use POSIX-compatible grep option.
 
  - Document git-related tips for reproducible builds.
 
  - Fix a typo in the modpost rule.
 
  - Suppress SIGPIPE error message from gcc-ar and llvm-ar.
 
  - Fix segmentation fault in the menuconfig search.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmNmPH8VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGZ3AQALBPaJ5OBpz8PzAUVdWJkVMAJYeu
 e0oPrRJPmxlvYZ4U4acAxxH9QGdAFopa+EBRWiCwb+L5lDQagvtb/boN5fyVHKWc
 aQKoNanmzzxNoO9w3bH6ApTeDxZ9O54V3G5I6xiM/cVy+HfFQePvfAuF1tnxGpYi
 RAftq2PhBo94ltpzhky00wnijYF8kU37RmTiZ/wUdSccOQ3cH/nhOduhnjXFpc+K
 JbwocFT9PtvqSy1gSMzZbBikQL4jktK2CIslhJEsG3Pn5zi0eL6UQcY9Drc3oIF5
 qOmtswtVJ6AiwJkdXb3/Vx5bS92wzIph3VOPpY2Vq8WkOA0t4gtByj13lzH2yJ0Q
 05OsqXu1v5nilQOjHSWoyFaw6x3Exh/qa1hLOcPrfTAC7vP8LHO7L0ujySqtlbxe
 pdmba/58YMIKdDPfZ3uFoMk4s3XuqDhBLkQl2ctoIfvX3KFWwcNE7oiyCkJXkE6v
 asyH0gWYz2hyM29ulm15yA/eDt+OKweldz17e/GIOlA5hr8kt/96E/lEHW9r/tSK
 Bw0u4HiWf92vlZWWKjDWkWD4T4FkM2n4Jn9zOU5fauS21BQG217LIHIh62bs1Luw
 5Rb1UF7cAPEQxJZsTMdkdmWudZsabjpPFV68p8IucmKSQeHpgH1naJxXWCtln6V6
 ZHnLnUNELUoMEg7F
 =1MFY
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Use POSIX-compatible grep options

 - Document git-related tips for reproducible builds

 - Fix a typo in the modpost rule

 - Suppress SIGPIPE error message from gcc-ar and llvm-ar

 - Fix segmentation fault in the menuconfig search

* tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: fix segmentation fault in menuconfig search
  kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar
  kbuild: fix typo in modpost
  Documentation: kbuild: Add description of git for reproducible builds
  kbuild: use POSIX-compatible grep option
2022-11-06 12:23:10 -08:00
Linus Torvalds
089d1c3122 ARM:
* Fix the pKVM stage-1 walker erronously using the stage-2 accessor
 
 * Correctly convert vcpu->kvm to a hyp pointer when generating
   an exception in a nVHE+MTE configuration
 
 * Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
 
 * Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
 
 * Document the boot requirements for FGT when entering the kernel
   at EL1
 
 x86:
 
 * Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
 
 * Make argument order consistent for kvcalloc()
 
 * Userspace API fixes for DEBUGCTL and LBRs
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmNncNEUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOKJQf9HhmONhrKaLQ1Ycp5R5qbwbj4zKZR
 3f78NxGaauG9MUHP96tSPWRSgLNQi36yUKI9FOFwfw/qsp79B+9KWkuqzWkYgXqj
 CagwjTtCbQsLzQvDrvBt8Zrw7IQPtGFBFQjwQfyxRipEQBHndJpip0oYr8hoze5O
 xICLmFsjMDtiHOjLwUhHJhaAh/qAg4xaoC6LsV855vkkqxd9Bhrj4z8QkcdUnjlt
 mrP2u/4iAQGubH+3YnAqdWFQUMYxmd0WsIUw3RTzdZJWei6mLjDaA+B3jAIUiXnv
 6UKrwlL56yQzUQxOt/v+d6J76FTDvjiqmUhgy7pINasJBoB5+xG4sJhOIA==
 =Gqfw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Fix the pKVM stage-1 walker erronously using the stage-2 accessor

   - Correctly convert vcpu->kvm to a hyp pointer when generating an
     exception in a nVHE+MTE configuration

   - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them

   - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

   - Document the boot requirements for FGT when entering the kernel at
     EL1

  x86:

   - Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()

   - Make argument order consistent for kvcalloc()

   - Userspace API fixes for DEBUGCTL and LBRs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Fix a typo about the usage of kvcalloc()
  KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
  KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL
  KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()
  KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs
  arm64: booting: Document our requirements for fine grained traps with SME
  KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
  KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them
  KVM: arm64: Fix bad dereference on MTE-enabled systems
  KVM: arm64: Use correct accessor to parse stage-1 PTEs
2022-11-06 10:46:59 -08:00
Linus Torvalds
6e8c78d32b xen: branch for v6.1-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCY2dMgQAKCRCAXGG7T9hj
 vtsjAQCajqsnrz+uzySSDRNJDUNPkh9x2vgVQFBwaQMJWSJBXgD+LbwYlCNPTg1R
 E5IzcY5bxMK/bFEkTOpJQ3wacVA0wA4=
 =64Hm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "One fix for silencing a smatch warning, and a small cleanup patch"

* tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: simplify sysenter and syscall setup
  x86/xen: silence smatch warning in pmu_msr_chk_emulated()
2022-11-06 10:42:29 -08:00
Linus Torvalds
9761070d14 Fix a number of bug fixes, including some regressions, the most
serious of which was one which would cause online resizes to fail with
 file systems with metadata checksums enabled.  Also fix a warning
 caused by the newly added fortify string checker, plus some bugs that
 were found using fuzzed file systems.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmNnSCYACgkQ8vlZVpUN
 gaNbBgf/QsOe7KCrr/X7mK7SFgbNY+jsmvagPV0SvAg9Uc0P3EkmXE0NcNcZOAUx
 mgNBYNNS+QGKtdqHBy8p1kNgcbFAR/OJZ7rFD3XUnB/N+XKZSgimhNUx+IaEX7Dx
 XidK5cPcKEZlbfuqxwkIfvaqC9v3XcpFpHicA/uDTPe4kZ8VhJQk294M5EuMA8lQ
 wumDFsf/1sN4osJH7eHMZk/e3iFN8fwrpCgvwJ56zzW7UWSl8jJrq9kxHo43iijY
 82DbRCdsVrdTPaD5gJSvcggLgMpUu+yoA1UbwiUlR1AtmaFfDg+rfIZs1ooyCdHl
 QLQ3RlXdkfHTwAYBFFApzR55MhPakQ==
 =zw2b
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of bugs, including some regressions, the most serious of
  which was one which would cause online resizes to fail with file
  systems with metadata checksums enabled.

  Also fix a warning caused by the newly added fortify string checker,
  plus some bugs that were found using fuzzed file systems"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fortify warning in fs/ext4/fast_commit.c:1551
  ext4: fix wrong return err in ext4_load_and_init_journal()
  ext4: fix warning in 'ext4_da_release_space'
  ext4: fix BUG_ON() when directory entry has invalid rec_len
  ext4: update the backup superblock's at the end of the online resize
2022-11-06 10:30:29 -08:00
Linus Torvalds
90153f928b 3 cifs/smb3 fixes, one for symlink handling and two fix multichannel issues with iterating channels, including for oplock breaks when leases are disabled
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmNnPqoACgkQiiy9cAdy
 T1FXpgv/fgkH48LGjRH+Pd86oZchCagEMF8Zfy59SKiNWwcuS43B+sjwrotbwT4r
 Nmq1LYHmGgy0b63L9L+DSwO6mxOEvm3ryJ2vxInsG1Rsebw0oSBxolPOHjYLWHKH
 +BsMGxLEVHWMHzFzrNJC1Fp9oGgmD6tNmdDBxBw471UK1AURfc7tg/70MmDm7lDx
 cTLE40Fu+ni3OZ22YL0jYIgHWkk0S1r+/lFNYLvxrZF+D7zRhnVCALbY60L/a4/T
 /nLViWlHAKp9UUlCJTJOXyfVV2PVkF2JEUCIPcfTvYNvDFMGLH/mLLNd6iSq4EgX
 HE811XfZ8HrfL+T2oTHcgNo6CkCZBtdw2zV/RivRDojxHYy/soYv0p0B54c37V3i
 x89/tc1KYxHNR27W+0dxT8D66cRkez0Sb4f/BKdhHfl0WaAmVpfl41XGQ3pihm/C
 Nb8nj/b16R9lqm27Zgu8Vy3p1LSF0d3tn/UDxIP3unoyQWEHHY7oiBlMJ6uc6qls
 faQSx8tv
 =ESTZ
 -----END PGP SIGNATURE-----

Merge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "One symlink handling fix and two fixes foir multichannel issues with
  iterating channels, including for oplock breaks when leases are
  disabled"

* tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix use-after-free on the link name
  cifs: avoid unnecessary iteration of tcp sessions
  cifs: always iterate smb sessions using primary channel
2022-11-06 10:19:39 -08:00
Linus Torvalds
8391aa4b4c Tracing fixes for 6.1-rc3:
- Fixed NULL pointer dereference in the ring buffer wait-waiters code for
   machines that have less CPUs than what nr_cpu_ids returns. The buffer
   array is of size nr_cpu_ids, but only the online CPUs get initialized.
 
 - Fixed use after free call in ftrace_shutdown.
 
 - Fix accounting of if a kprobe is enabled
 
 - Fix NULL pointer dereference on error path of fprobe rethook_alloc().
 
 - Fix unregistering of fprobe_kprobe_handler
 
 - Fix memory leak in kprobe test module
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY2bPChQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrOzAP95LEYzhi0pbxtuDHBv+HOTALi8Lttk
 4FOcdrSj7tXn5wD/ZtNbOhq3OxTonPrIkZTBqpOohElIoXRSlt+Og68QCQE=
 =4DN2
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull `lTracing fixes for 6.1-rc3:

 - Fixed NULL pointer dereference in the ring buffer wait-waiters code
   for machines that have less CPUs than what nr_cpu_ids returns.

   The buffer array is of size nr_cpu_ids, but only the online CPUs get
   initialized.

 - Fixed use after free call in ftrace_shutdown.

 - Fix accounting of if a kprobe is enabled

 - Fix NULL pointer dereference on error path of fprobe rethook_alloc().

 - Fix unregistering of fprobe_kprobe_handler

 - Fix memory leak in kprobe test module

* tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: kprobe: Fix memory leak in test_gen_kprobe/kretprobe_cmd()
  tracing/fprobe: Fix to check whether fprobe is registered correctly
  fprobe: Check rethook_alloc() return in rethook initialization
  kprobe: reverse kp->flags when arm_kprobe failed
  ftrace: Fix use-after-free for dynamic ftrace_ops
  ring-buffer: Check for NULL cpu_buffer in ring_buffer_wake_waiters()
2022-11-06 09:57:38 -08:00
Paolo Bonzini
f4298cac2b Merge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
* Fix the pKVM stage-1 walker erronously using the stage-2 accessor

* Correctly convert vcpu->kvm to a hyp pointer when generating
  an exception in a nVHE+MTE configuration

* Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them

* Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

* Document the boot requirements for FGT when entering the kernel
  at EL1
2022-11-06 03:30:49 -05:00
Paolo Bonzini
1462014966 Merge branch 'kvm-master' into HEAD
x86:
* Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()

* Make argument order consistent for kvcalloc()

* Userspace API fixes for DEBUGCTL and LBRs
2022-11-06 03:30:38 -05:00
Theodore Ts'o
0d043351e5 ext4: fix fortify warning in fs/ext4/fast_commit.c:1551
With the new fortify string system, rework the memcpy to avoid this
warning:

memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4)

Cc: stable@kernel.org
Fixes: 54d9469bc5 ("fortify: Add run-time WARN for cross-field memcpy()")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06 01:07:59 -04:00
Jason Yan
9f2a1d9fb3 ext4: fix wrong return err in ext4_load_and_init_journal()
The return value is wrong in ext4_load_and_init_journal(). The local
variable 'err' need to be initialized before goto out. The original code
in __ext4_fill_super() is fine because it has two return values 'ret'
and 'err' and 'ret' is initialized as -EINVAL. After we factor out
ext4_load_and_init_journal(), this code is broken. So fix it by directly
returning -EINVAL in the error handler path.

Cc: stable@kernel.org
Fixes: 9c1dd22d74 ("ext4: factor out ext4_load_and_init_journal()")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06 01:07:59 -04:00
Ye Bin
1b8f787ef5 ext4: fix warning in 'ext4_da_release_space'
Syzkaller report issue as follows:
EXT4-fs (loop0): Free/Dirty block details
EXT4-fs (loop0): free_blocks=0
EXT4-fs (loop0): dirty_blocks=0
EXT4-fs (loop0): Block reservation details
EXT4-fs (loop0): i_reserved_data_blocks=0
EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks
------------[ cut here ]------------
WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524
Modules linked in:
CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd6605b2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
Workqueue: writeback wb_workfn (flush-7:0)
RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528
RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296
RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00
RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000
RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5
R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000
R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740
FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461
 mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589
 ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852
 do_writepages+0x3c3/0x680 mm/page-writeback.c:2469
 __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587
 writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870
 wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044
 wb_do_writeback fs/fs-writeback.c:2187 [inline]
 wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227
 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289
 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436
 kthread+0x266/0x300 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
 </TASK>

Above issue may happens as follows:
ext4_da_write_begin
  ext4_create_inline_data
    ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
    ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
__ext4_ioctl
  ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag
ext4_da_write_begin
  ext4_da_convert_inline_data_to_extent
    ext4_da_write_inline_data_begin
      ext4_da_map_blocks
        ext4_insert_delayed_block
	  if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk))
	    if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk))
	      ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1
	       allocated = true;
          ext4_es_insert_delayed_block(inode, lblk, allocated);
ext4_writepages
  mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC
  mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1
    ext4_es_remove_extent
      ext4_da_release_space(inode, reserved);
        if (unlikely(to_free > ei->i_reserved_data_blocks))
	  -> to_free == 1  but ei->i_reserved_data_blocks == 0
	  -> then trigger warning as above

To solve above issue, forbid inode do migrate which has inline data.

Cc: stable@kernel.org
Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06 01:07:59 -04:00
Luís Henriques
17a0bc9bd6 ext4: fix BUG_ON() when directory entry has invalid rec_len
The rec_len field in the directory entry has to be a multiple of 4.  A
corrupted filesystem image can be used to hit a BUG() in
ext4_rec_len_to_disk(), called from make_indexed_dir().

 ------------[ cut here ]------------
 kernel BUG at fs/ext4/ext4.h:2413!
 ...
 RIP: 0010:make_indexed_dir+0x53f/0x5f0
 ...
 Call Trace:
  <TASK>
  ? add_dirent_to_buf+0x1b2/0x200
  ext4_add_entry+0x36e/0x480
  ext4_add_nondir+0x2b/0xc0
  ext4_create+0x163/0x200
  path_openat+0x635/0xe90
  do_filp_open+0xb4/0x160
  ? __create_object.isra.0+0x1de/0x3b0
  ? _raw_spin_unlock+0x12/0x30
  do_sys_openat2+0x91/0x150
  __x64_sys_open+0x6c/0xa0
  do_syscall_64+0x3c/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

The fix simply adds a call to ext4_check_dir_entry() to validate the
directory entry, returning -EFSCORRUPTED if the entry is invalid.

CC: stable@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06 01:07:49 -04:00