mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-12-12 21:44:06 +08:00
03c5f33123
1154268 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Matthew Wilcox (Oracle)
|
03c5f33123 |
buffer: use b_folio in touch_buffer()
Removes a call to compound_head() in this path. Link: https://lkml.kernel.org/r/20221215214402.3522366-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
abc8a8a2c7 |
buffer: replace obvious uses of b_page with b_folio
These cases just check if it's NULL, or use b_page to get to the page's address space. They are assumptions that b_page never points to a tail page. Link: https://lkml.kernel.org/r/20221215214402.3522366-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
d685c668b0 |
buffer: add b_folio as an alias of b_page
Patch series "Start converting buffer_heads to use folios". I was hoping that filesystems would convert from buffer_heads to iomap, but that's not happening particularly quickly. So the buffer_head infrastructure needs to be converted from being page-based to being folio-based. This patch (of 12): Buffer heads point to the allocation (ie the folio), not the page. This is currently the same thing for all filesystems that use buffer heads, so this is a safe transitional step. Link: https://lkml.kernel.org/r/20221215214402.3522366-1-willy@infradead.org Link: https://lkml.kernel.org/r/20221215214402.3522366-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
9c67a20704 |
mm/hugetlb: introduce hugetlb_walk()
huge_pte_offset() is the main walker function for hugetlb pgtables. The name is not really representing what it does, though. Instead of renaming it, introduce a wrapper function called hugetlb_walk() which will use huge_pte_offset() inside. Assert on the locks when walking the pgtable. Note, the vma lock assertion will be a no-op for private mappings. Document the last special case in the page_vma_mapped_walk() path where we don't need any more lock to call hugetlb_walk(). Taking vma lock there is not needed because either: (1) potential callers of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2) the caller will not walk a hugetlb vma at all so the hugetlb code path not reachable (e.g. in ksm or uprobe paths). It's slightly implicit for future page_vma_mapped_walk() callers on that lock requirement. But anyway, when one day this rule breaks, one will get a straightforward warning in hugetlb_walk() with lockdep, then there'll be a way out. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
dd361e5033 |
mm/hugetlb: make walk_hugetlb_range() safe to pmd unshare
Since walk_hugetlb_range() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155226.2043738-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
eefc7fa536 |
mm/hugetlb: make follow_hugetlb_page() safe to pmd unshare
Since follow_hugetlb_page() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155223.2043727-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
7d049f3a03 |
mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare
Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock to make sure the pgtable page will not be freed concurrently. Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
b8da2e4660 |
mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare
We can take the hugetlb walker lock, here taking vma lock directly. Link: https://lkml.kernel.org/r/20221216155217.2043700-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
fcd48540d1 |
mm/hugetlb: move swap entry handling into vma lock when faulted
In hugetlb_fault(), there used to have a special path to handle swap entry at the entrance using huge_pte_offset(). That's unsafe because huge_pte_offset() for a pmd sharable range can access freed pgtables if without any lock to protect the pgtable from being freed after pmd unshare. Here the simplest solution to make it safe is to move the swap handling to be after the vma lock being held. We may need to take the fault mutex on either migration or hwpoison entries now (also the vma lock, but that's really needed), however neither of them is hot path. Note that the vma lock cannot be released in hugetlb_fault() when the migration entry is detected, because in migration_entry_wait_huge() the pgtable page will be used again (by taking the pgtable lock), so that also need to be protected by the vma lock. Modify migration_entry_wait_huge() so that it must be called with vma read lock held, and properly release the lock in __migration_entry_wait_huge(). Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
fe7d4c6d5a |
mm/hugetlb: document huge_pte_offset usage
huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a hugetlb address. Normally, it's always safe to walk a generic pgtable as long as we're with the mmap lock held for either read or write, because that guarantees the pgtable pages will always be valid during the process. But it's not true for hugetlbfs, especially shared: hugetlbfs can have its pgtable freed by pmd unsharing, it means that even with mmap lock held for current mm, the PMD pgtable page can still go away from under us if pmd unsharing is possible during the walk. So we have two ways to make it safe even for a shared mapping: (1) If we're with the hugetlb vma lock held for either read/write, it's okay because pmd unshare cannot happen at all. (2) If we're with the i_mmap_rwsem lock held for either read/write, it's okay because even if pmd unshare can happen, the pgtable page cannot be freed from under us. Document it. Link: https://lkml.kernel.org/r/20221216155100.2043537-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
bb373dce2c |
mm/hugetlb: don't wait for migration entry during follow page
That's what the code does with !hugetlb pages, so we should logically do the same for hugetlb, so migration entry will also be treated as no page. This is probably also the last piece in follow_page code that may sleep, the last one should be removed in cf994dd8af27 ("mm/gup: remove FOLL_MIGRATION", 2022-11-16). Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
243b1f2d3b |
mm/hugetlb: let vma_offset_start() to return start
Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare", v4. Problem ======= huge_pte_offset() is a major helper used by hugetlb code paths to walk a hugetlb pgtable. It's used mostly everywhere since that's needed even before taking the pgtable lock. huge_pte_offset() is always called with mmap lock held with either read or write. It was assumed to be safe but it's actually not. One race condition can easily trigger by: (1) firstly trigger pmd share on a memory range, (2) do huge_pte_offset() on the range, then at the meantime, (3) another thread unshare the pmd range, and the pgtable page is prone to lost if the other shared process wants to free it completely (by either munmap or exit mm). The recent work from Mike on vma lock can resolve most of this already. It's achieved by forbidden pmd unsharing during the lock being taken, so no further risk of the pgtable page being freed. It means if we can take the vma lock around all huge_pte_offset() callers it'll be safe. There're already a bunch of them that we did as per the latest mm-unstable, but also quite a few others that we didn't for various reasons especially on huge_pte_offset() usage. One more thing to mention is that besides the vma lock, i_mmap_rwsem can also be used to protect the pgtable page (along with its pgtable lock) from being freed from under us. IOW, huge_pte_offset() callers need to either hold the vma lock or i_mmap_rwsem to safely walk the pgtables. A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is very hard to trigger, one needs to apply another kernel delay patch too, see below): ======8<======= #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <errno.h> #include <unistd.h> #include <sys/mman.h> #include <fcntl.h> #include <linux/memfd.h> #include <assert.h> #include <pthread.h> #define MSIZE (1UL << 30) /* 1GB */ #define PSIZE (2UL << 20) /* 2MB */ #define HOLD_SEC (1) int pipefd[2]; void *buf; void *do_map(int fd) { unsigned char *tmpbuf, *p; int ret; ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE); if (ret) { perror("posix_memalign() failed"); return NULL; } tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0); if (tmpbuf == MAP_FAILED) { perror("mmap() failed"); return NULL; } printf("mmap() -> %p\n", tmpbuf); for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) { *p = 1; } return tmpbuf; } void do_unmap(void *buf) { munmap(buf, MSIZE); } void proc2(int fd) { unsigned char c; buf = do_map(fd); if (!buf) return; read(pipefd[0], &c, 1); /* * This frees the shared pgtable page, causing use-after-free in * proc1_thread1 when soft walking hugetlb pgtable. */ do_unmap(buf); printf("Proc2 quitting\n"); } void *proc1_thread1(void *data) { /* * Trigger follow-page on 1st 2m page. Kernel hack patch needed to * withhold this procedure for easier reproduce. */ madvise(buf, PSIZE, MADV_POPULATE_WRITE); printf("Proc1-thread1 quitting\n"); return NULL; } void *proc1_thread2(void *data) { unsigned char c; /* Wait a while until proc1_thread1() start to wait */ sleep(0.5); /* Trigger pmd unshare */ madvise(buf, PSIZE, MADV_DONTNEED); /* Kick off proc2 to release the pgtable */ write(pipefd[1], &c, 1); printf("Proc1-thread2 quitting\n"); return NULL; } void proc1(int fd) { pthread_t tid1, tid2; int ret; buf = do_map(fd); if (!buf) return; ret = pthread_create(&tid1, NULL, proc1_thread1, NULL); assert(ret == 0); ret = pthread_create(&tid2, NULL, proc1_thread2, NULL); assert(ret == 0); /* Kick the child to share the PUD entry */ pthread_join(tid1, NULL); pthread_join(tid2, NULL); do_unmap(buf); } int main(void) { int fd, ret; fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB); if (fd < 0) { perror("open failed"); return -1; } ret = ftruncate(fd, MSIZE); if (ret) { perror("ftruncate() failed"); return -1; } ret = pipe(pipefd); if (ret) { perror("pipe() failed"); return -1; } if (fork()) { proc1(fd); } else { proc2(fd); } close(pipefd[0]); close(pipefd[1]); close(fd); return 0; } ======8<======= The kernel patch needed to present such a race so it'll trigger 100%: ======8<======= : diff --git a/mm/hugetlb.c b/mm/hugetlb.c : index 9d97c9a2a15d..f8d99dad5004 100644 : --- a/mm/hugetlb.c : +++ b/mm/hugetlb.c : @@ -38,6 +38,7 @@ : #include <asm/page.h> : #include <asm/pgalloc.h> : #include <asm/tlb.h> : +#include <asm/delay.h> : : #include <linux/io.h> : #include <linux/hugetlb.h> : @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, : bool unshare = false; : int absent; : struct page *page; : + unsigned long c = 0; : : /* : * If we have a pending SIGKILL, don't keep faulting pages and : @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, : */ : pte = huge_pte_offset(mm, vaddr & huge_page_mask(h), : huge_page_size(h)); : + : + pr_info("%s: withhold 1 sec...\n", __func__); : + for (c = 0; c < 100; c++) { : + udelay(10000); : + } : + pr_info("%s: withhold 1 sec...done\n", __func__); : + : if (pte) : ptl = huge_pte_lock(h, mm, pte); : absent = !pte || huge_pte_none(huge_ptep_get(pte)); : ======8<======= It'll trigger use-after-free of the pgtable spinlock: ======8<======= [ 16.959907] follow_hugetlb_page: withhold 1 sec... [ 17.960315] follow_hugetlb_page: withhold 1 sec...done [ 17.960550] ------------[ cut here ]------------ [ 17.960742] DEBUG_LOCKS_WARN_ON(1) [ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0 [ 17.961264] Modules linked in: [ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46 [ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0 [ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9 [ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096 [ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000 [ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff [ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58 [ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000 [ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001 [ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 [ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0 [ 17.965956] PKRU: 55555554 [ 17.966068] Call Trace: [ 17.966172] <TASK> [ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30 [ 17.966455] lock_acquire+0xbf/0x2b0 [ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.966799] ? _printk+0x48/0x4e [ 17.966934] _raw_spin_lock+0x2f/0x40 [ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4 [ 17.967473] __get_user_pages+0xbb/0x620 [ 17.967635] faultin_vma_page_range+0x9a/0x100 [ 17.967817] madvise_vma_behavior+0x3c0/0xbd0 [ 17.967998] ? mas_prev+0x11/0x290 [ 17.968141] ? find_vma_prev+0x5e/0xa0 [ 17.968304] ? madvise_vma_anon_name+0x70/0x70 [ 17.968486] madvise_walk_vmas+0xa9/0x120 [ 17.968650] do_madvise.part.0+0xfa/0x270 [ 17.968813] __x64_sys_madvise+0x5a/0x70 [ 17.968974] do_syscall_64+0x37/0x90 [ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 17.969329] RIP: 0033:0x7f1840f0efdb [ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68 [ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c [ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb [ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000 [ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f [ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80 [ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000 [ 17.972083] </TASK> [ 17.972199] irq event stamp: 2353 [ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70 [ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70 [ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160 [ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160 [ 17.974341] ---[ end trace 0000000000000000 ]--- [ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8 [ 17.975012] #PF: supervisor read access in kernel mode [ 17.975314] #PF: error_code(0x0000) - not-present page [ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0 [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46 [ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0 [ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f [ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046 [ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000 [ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff [ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58 [ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000 [ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001 [ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 [ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0 [ 17.986674] PKRU: 55555554 [ 17.986832] Call Trace: [ 17.987012] <TASK> [ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30 [ 17.987770] lock_acquire+0xbf/0x2b0 [ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.988575] ? _printk+0x48/0x4e [ 17.988889] _raw_spin_lock+0x2f/0x40 [ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4 [ 17.990119] __get_user_pages+0xbb/0x620 [ 17.990500] faultin_vma_page_range+0x9a/0x100 [ 17.990928] madvise_vma_behavior+0x3c0/0xbd0 [ 17.991354] ? mas_prev+0x11/0x290 [ 17.991678] ? find_vma_prev+0x5e/0xa0 [ 17.992024] ? madvise_vma_anon_name+0x70/0x70 [ 17.992421] madvise_walk_vmas+0xa9/0x120 [ 17.992793] do_madvise.part.0+0xfa/0x270 [ 17.993166] __x64_sys_madvise+0x5a/0x70 [ 17.993539] do_syscall_64+0x37/0x90 [ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd ======8<======= Resolution ========== This patchset protects all the huge_pte_offset() callers to also take the vma lock properly. Patch Layout ============ Patch 1-2: cleanup, or dependency of the follow up patches Patch 3: before fixing, document huge_pte_offset() on lock required Patch 4-8: each patch resolves one possible race condition Patch 9: introduce hugetlb_walk() to replace huge_pte_offset() Tests ===== The series is verified with the above reproducer so the race cannot trigger anymore. It also passes all hugetlb kselftests. This patch (of 9): Even though vma_offset_start() is named like that, it's not returning "the start address of the range" but rather the offset we should use to offset the vma->vm_start address. Make it return the real value of the start vaddr, and it also helps for all the callers because whenever the retval is used, it'll be ultimately added into the vma->vm_start anyway, so it's better. Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
379c2e60e8 |
hugetlb: update vma flag check for hugetlb vma lock
The check for whether a hugetlb vma lock exists partially depends on the vma's flags. Currently, it checks for either VM_MAYSHARE or VM_SHARED. The reason both flags are used is because VM_MAYSHARE was previously cleared in hugetlb vmas as they are tore down. This is no longer the case, and only the VM_MAYSHARE check is required. Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jeff Xu
|
11f75a0144 |
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
Tests to verify MFD_NOEXEC, MFD_EXEC and vm.memfd_noexec sysctl. Link: https://lkml.kernel.org/r/20221215001205.51969-6-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jeff Xu
|
c4f75bc8bd |
mm/memfd: add write seals when apply SEAL_EXEC to executable memfd
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to an executable memfd, so W^X from start. This implys application need to fill the content of the memfd first, after F_SEAL_EXEC is applied, application can no longer modify the content of the memfd. Typically, application seals the memfd right after writing to it. For example: 1. memfd_create(MFD_EXEC). 2. write() code to the memfd. 3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X. 4. call exec() on the memfd. Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jeff Xu
|
105ff5339f |
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create). When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation. when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create. The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default. Also, the value of memfd_noexec is passed to child namespace at creation time. For example, if the init namespace has vm.memfd_noexec=2, all its children namespaces will be created with 2. [akpm@linux-foundation.org: add stub functions to fix build] [akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff] [akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review] [akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning] Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@google.com> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Daniel Verkamp
|
32d118ad50 |
selftests/memfd: add tests for F_SEAL_EXEC
Basic tests to ensure that user/group/other execute bits cannot be changed after applying F_SEAL_EXEC to a memfd. Link: https://lkml.kernel.org/r/20221215001205.51969-3-jeffxu@google.com Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Co-developed-by: Jeff Xu <jeffxu@google.com> Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Daniel Verkamp
|
6fd7353829 |
mm/memfd: add F_SEAL_EXEC
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8. Since Linux introduced the memfd feature, memfd have always had their execute bit set, and the memfd_create() syscall doesn't allow setting it differently. However, in a secure by default system, such as ChromeOS, (where all executables should come from the rootfs, which is protected by Verified boot), this executable nature of memfd opens a door for NoExec bypass and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm process created a memfd to share the content with an external process, however the memfd is overwritten and used for executing arbitrary code and root escalation. [2] lists more VRP in this kind. On the other hand, executable memfd has its legit use, runc uses memfd’s seal and executable feature to copy the contents of the binary then execute them, for such system, we need a solution to differentiate runc's use of executable memfds and an attacker's [3]. To address those above, this set of patches add following: 1> Let memfd_create() set X bit at creation time. 2> Let memfd to be sealed for modifying X bit. 3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of X bit.For example, if a container has vm.memfd_noexec=2, then memfd_create() without MFD_NOEXEC_SEAL will be rejected. 4> A new security hook in memfd_create(). This make it possible to a new LSM, which rejects or allows executable memfd based on its security policy. This patch (of 5): The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM. This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable. Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Co-developed-by: Jeff Xu <jeffxu@google.com> Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
f1eb1bacfb |
mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths. The reasons I still think this patch is worthwhile, are: (1) It is a cleanup already; diffstat tells. (2) It just feels natural after I thought about this, if the pte is uffd protected, let's remove the write bit no matter what it was. (2) Since x86 is the only arch that supports uffd-wp, it also redefines pte|pmd_mkuffd_wp() in that it should always contain removals of write bits. It means any future arch that want to implement uffd-wp should naturally follow this rule too. It's good to make it a default, even if with vm_page_prot changes on VM_UFFD_WP. (3) It covers more than vm_page_prot. So no chance of any potential future "accident" (like pte_mkdirty() sparc64 or loongarch, even though it just got its pte_mkdirty fixed <1 month ago). It'll be fairly clear when reading the code too that we don't worry anything before a pte_mkuffd_wp() on uncertainty of the write bit. We may call pte_wrprotect() one more time in some paths (e.g. thp split), but that should be fully local bitop instruction so the overhead should be negligible. Although this patch should logically also fix all the known issues on uffd-wp too recently on page migration (not for numa hint recovery - that may need another explcit pte_wrprotect), but this is not the plan for that fix. So no fixes, and stable doesn't need this. Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ives van Hoorne <ives@codesandbox.io> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Sidhartha Kumar
|
04a42e72d7 |
mm: move folio_set_compound_order() to mm/internal.h
folio_set_compound_order() is moved to an mm-internal location so external
folio users cannot misuse this function. Change the name of the function
to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON. Also,
handle the case if a non-large folio is passed and add clarifying comments
to the function.
Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
Fixes:
|
||
Andrew Morton
|
1301f93134 |
Pull mm-hotfixes-stable dependencies into mm-stable.
Merge branch 'mm-hotfixes-stable' into mm-stable |
||
Peter Xu
|
7e3ce3f8d2 |
mm: fix a few rare cases of using swapin error pte marker
This patch should harden commit |
||
Peter Xu
|
49d6d7fb63 |
mm/uffd: fix pte marker when fork() without fork event
Patch series "mm: Fixes on pte markers".
Patch 1 resolves the syzkiller report from Pengfei.
Patch 2 further harden pte markers when used with the recent swapin error
markers. The major case is we should persist a swapin error marker after
fork(), so child shouldn't read a corrupted page.
This patch (of 2):
When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
have it and has pte marker installed. The warning is improper along with
the comment. The right thing is to inherit the pte marker when needed, or
keep the dst pte empty.
A vague guess is this happened by an accident when there's the prior patch
to introduce src/dst vma into this helper during the uffd-wp feature got
developed and I probably messed up in the rebase, since if we replace
dst_vma with src_vma the warning & comment it all makes sense too.
Hugetlb did exactly the right here (copy_hugetlb_page_range()). Fix the
general path.
Reproducer:
https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c
Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808
Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
Fixes:
|
||
Andrew Morton
|
bd86d2ea36 |
Sync with v6.2-rc4
Merge branch 'master' into mm-hotfixes-stable |
||
Andrew Morton
|
0e18a6b49b |
Sync with v6.2-rc4
Merge branch 'master' into mm-stable |
||
Linus Torvalds
|
5dc4c995db | Linux 6.2-rc4 | ||
Linus Torvalds
|
f0f70ddb8f |
- Make sure the poking PGD is pinned for Xen PV as it requires it this way
- Fixes for two resctrl races when moving a task or creating a new monitoring group - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to not return a UC- type mapping type on memremap() and thus cause a serious slowdown - Fix insn mnemonics in bioscall.S now that binutils is starting to fix confusing insn suffixes -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPD5xsACgkQEsHwGGHe VUr65g/8CkfKQKIQ/kPn1B+M/PI4S8DBmz7CdufQTbB66GSfDwRpGnxIKJKZM1UG pyOmP1kHVXGGCsvFQimalxnBtFx6t3wFS+R+c/l5pRtgU63bwQpNjJ+vBcBpb4xs J7f06VgF6jB8qk0NqXBKNJt4kauZA50wiErYm8PU/yf0tmHratjrvDfKQmus9pI4 AMcQEudRhDDo8xqn8fvIC/S7xm9/TgNzKxYP+HciPa74HZm5vrjS0hIyZkIsh5d7 Q4MsP4VaBH12W6MbpF9TBw5VTDomwY8oS4xTNfkhyOTcHi8uLrMjA66VmaJwWXpe EZjOk4+KhtxYigaI5oQO9M5e9IINK6ZfbzU65P74wTvP3orszsBPG7QvodIzLc76 YI1GEea/bXgxYgPuMR6spZoGMS58K7BXgViVMVRwZ9bZk17+5pfbLkzKqZsfw/s/ nj3n8z4ayoc+ffDPllh0471Dn16ugKMf+EvW2Su1Q1QoaA7icNNkZrKaM6eSuoFr bClsrHeglQFadwy4kmY0fi7BUfiKTp+c5Ur7lM7VojrjH0XcoZreThVg0tNrk3ZR IAyBjtCdtg1rzrRfb7qBf6B+WNGdxYWGuDgOPiH1Xh+/plvyCqQ2/3AGWQAbu/qP to+IYmZg09mRpVBDYpCY4z6IzRbMJ9rRho4rNXCQUeAzSVMCrZE= =W2zR -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure the poking PGD is pinned for Xen PV as it requires it this way - Fixes for two resctrl races when moving a task or creating a new monitoring group - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to not return a UC- type mapping type on memremap() and thus cause a serious slowdown - Fix insn mnemonics in bioscall.S now that binutils is starting to fix confusing insn suffixes * tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: fix poking_init() for Xen PV guests x86/resctrl: Fix event counts regression in reused RMIDs x86/resctrl: Fix task CLOSID/RMID update race x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case x86/boot: Avoid using Intel mnemonics in AT&T syntax asm |
||
Linus Torvalds
|
8aa9761223 |
- Fix the EDAC device's confusion in the polling setting units
- Fix a memory leak in highbank's probing function -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPD41kACgkQEsHwGGHe VUqYOQ//bsKMgAbPDbvgSm8OkFvtLd+r38RE+dcg84QrBaeQ7z2x269i1bUhqkQV baatecdz+VoNQK3zmrIRshwgCGuwDEpRWAuPKJxoCfSU1eNzEipxHCoctNcp2q+Y 7nP65JtkQ6y1nEnBYh5h1snwbM3yuLIvmmuhYhPpd204k00L16Tzsu1uGiz7nDsZ 9ohBtS6laHztJnVd9gKIHOcgBCylPjNYpRecsgZk2COK/O9uTo8drQ46gKxwHQ4l P4RsvDligoaFO8rRLAzg2sfsOt9O2runkdWhYnVKSLdXOOA+wQ+eacOuzWs5dNpQ BQfWZB3GwxQ3+G1c2WOVrq/15YNrxpqMA9/060vPzGqW9VDCnOnoS3w5nHCDw5Ep YB+3OVn31Z4Hw+nP/WCR/0QYKDSBmqvGd4wKmHcUnL22rRZSl6UZTSqnBw3yrhV9 YzmZ+r3B9b7AtvDs3qWMjQwFYnxmeU7DSXsxjbUpDcMDuy/Y8/P9CSzE2ruC26L1 TdK/Vy6igoDCThzv0RMFEUfmpojIYoTtGNuyMrA90ZNKwXKVOcf9mM3TQm4vLH7m P3Q9UvBlApoaKbXSZaU0cV+klnba2prAnu4LTVeBWoyjPlmFZkv9ZxDHId8n2rie dTe/KZJRlScfwVtp83xh428ixdYWNs03R6P5tQAMKoZNLnixqXQ= =TiMH -----END PGP SIGNATURE----- Merge tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fixes from Borislav Petkov: - Fix the EDAC device's confusion in the polling setting units - Fix a memory leak in highbank's probing function * tag 'edac_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/highbank: Fix memory leak in highbank_mc_probe() EDAC/device: Fix period calculation in edac_device_reset_delay_period() |
||
Linus Torvalds
|
b1d63f0c77 |
powerpc fixes for 6.2 #3
- Fix a build failure with some versions of ld that have an odd version string. - Fix incorrect use of mutex in the IMC PMU driver. Thanks to: Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, Yang Yingliang. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmPD2ZoTHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgIJTD/9HeqFTpviUPd0QCW2HWVv2jwLA97Qy BGsMVz6eSk+f3YBFDFJ0n5KqYR4Bry5mgwPm1nQe86cKr+ZCdtnF+10vPL0TEdK5 M4a+u9PA+xqq9ukIN96ZP7QGi80YY0RRIrfkOwK6iQJVLSxT+TUcl1ko/iPalbI7 /5c2BCnkIoDbwU1ux3/6+uJCDFE3oLy5v1nt/mbzuekFTRvPGRCt+DWBywrYKYi8 XfBBVTz6F+PoBZ7vZa6Kv5IQANiuU4COTjpm9AjVvjp0oKfYskBmtZvHQhr3v6v4 HZsms49w0r3D7sZgEB2hmKFM2/QijSDeyBxnmt/hHeNYMMiFg+0lhoxoNoykzcN9 UfFU7NID3uqruYMkhAxiCIvyul9Vzcr+pe3GgooY+AtuokhuMUEUXJjEDIyXtoci 2VnEsdbl0/gihdHedfhLRXlEn8xz6fQvxDcpYZClSIDeS/nL7cuPd/9+JHn1hilq aJ0MX6VUKnwkSBb9Gkd1bt09jS6lqDUQS5+88IMvoJo53xHVlHF+5kqXAJRkpt1w XsDMBuKqT/aEC+rI5GyHXNglGuBYqMvEmbdEGtIVFCbUkcI35Xe8RXKmqDbO2U7N qqfYj53dgaptJF/sllcGuCTj6JrdJvKEVnuzsIwoE+XyXzr1e9UBmyZ+I//VjTjL Mhy9rXp7tJYnkw== =BaFo -----END PGP SIGNATURE----- Merge tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a build failure with some versions of ld that have an odd version string - Fix incorrect use of mutex in the IMC PMU driver Thanks to Kajol Jain, Michael Petlan, Ojaswin Mujoo, Peter Zijlstra, and Yang Yingliang. * tag 'powerpc-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s/hash: Make stress_hpt_timer_fn() static powerpc/imc-pmu: Fix use of mutex in IRQs disabled section powerpc/boot: Fix incorrect version calculation issue in ld_version |
||
Linus Torvalds
|
7c69844052 |
IOMMU Fixes for Linux v6.2-rc3
Including: - Core: Fix an iommu-group refcount leak - Fix overflow issue in IOVA alloc path - ARM-SMMU fixes from Will: - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY - Fix SMMU shutdown paths to avoid device unregistration race - Error handling fix for Mediatek IOMMU driver -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmPC1nsACgkQK/BELZcB GuN+aw//c4rOO4buNaG/T00HfSdSGq1VwG1aIicslC82IDnh28R4A0iGoGtlmXJo +2qC2VPQaH7SpU7WEjhwIjuXUyuYQF5gvrZFnrumHGRSYI7IYze793TAbsGA9bLV Wn20rygyLlptu+wnGYHIG9PkB041ysjqJtQpRvT5AvUYW3Z9BoNDWs5YwJ9Qfm+W pm781ctgURPSmNK+wKKkRh5CCteWRxhKh8FKMvQ9o6lAoJNB/dcPpyE2oJ+lMojm kKhONbvQe3DdRm/zNY3gV1chTDPNeyIHhDGc6/NA1oAjuETlhzOG3JIrroijzsnA dZOJSJ6/jzqA6ZBh5hhuyUSbB0rRAN2URnrO2eFfJaVw7GJH60pdA7asxu37gNuF umbtsdzBZW0xba3qL7tvASZnKZCVeEsR4D6Apb36eaR7h6U7X1kKXOAK1PqHVS7+ LjT7RCMBx+UbKSpvT2ETMlLHpSDNA81X9yzssA4H7Cyk17NguB/L9Hd/I6uzbb26 ZHI/mZRJ0d4DvXCzmQK4760A2TSfAPA9UiseevDNQHgb8ZYw0hWqJbTaKsF5UfUe MEi0Jd0djQek++oqX7lQar04hbWC6BJ/aY12eATNVVVcuPqJPluysczcQdat342s fZzNx5ghRsSLxCVyuu1pigmoikSCI/SdoPnO69Dw/86bWvS+Djs= =THhI -----END PGP SIGNATURE----- Merge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Core: Fix an iommu-group refcount leak - Fix overflow issue in IOVA alloc path - ARM-SMMU fixes from Will: - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY - Fix SMMU shutdown paths to avoid device unregistration race - Error handling fix for Mediatek IOMMU driver * tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe() iommu/iova: Fix alloc iova overflows issue iommu: Fix refcount leak in iommu_device_claim_dma_owner iommu/arm-smmu-v3: Don't unregister on shutdown iommu/arm-smmu: Don't unregister on shutdown iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer |
||
Linus Torvalds
|
4f43ade45d |
memblock: always release pages to the buddy allocator in memblock_free_late()
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happens to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 # +43,949 pages -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmPCrI8QHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kT1lB/wPbLpePLzZfDGyV/NR9gi4FuJiaRfhlklV rbxnJce050GERbSQoF/r4zrxn2pzvIWGMh1xWZBGi/q8mT2rOIYtVqUahY9YuL/Z 7+xqdCOALIxEj+cXqYocqp8/NFgUWLGuMoomc9lWvEkUs+zOvkD8Z/bRecfPYvOa BftPALmtXgx46Ecce0gZvvh4YULpVLNdDPPiwZTabV+47Cl8+cJ0Y+iEHsUfOesU hQG0unWJH77O3IU4QxiirLekLP/6a5O5f0W7u3PZmNNv7N+UdwE+De+QF0aamfgA LZDO1qOakflegFZvK0JchCzS4hc6dtRKqIvNM3cCBMXLvV4REHKP =geNh -----END PGP SIGNATURE----- Merge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fix from Mike Rapoport: "memblock: always release pages to the buddy allocator in memblock_free_late() If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happens to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 # +43,949 pages" * tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm: Always release pages to the buddy allocator in memblock_free_late(). |
||
Linus Torvalds
|
880ca43e5c |
kernel hardening fixes for v6.2-rc4
- Fix CFI hash randomization with KASAN (Sami Tolvanen) - Check size of coreboot table entry and use flex-array -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmPB6IwWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJgqhEACZc2ehC6DNc3JSQEbZ9H47FL6Z pnyNvi+ZkC+vxENjH6WMMjtwehWHiQJVcHNaK9eF9/7A3pq58axw3RyeeVbPmC3B E0bDjJqaknAKa9FtFdyCTXD0V1TmY/s+oHTZHUXohq9ctI+hJT3reTJ55Uo5jlyV 8aB2lvbg8Bch4BAmg7z8gd3208VL30Q3Go0mspmovYUXVCvnwe08SyROIoJZnE9+ m5IIRfVCNFrAda1DPfiNeqQcE2EnKhTT0ESwtZbQ0HS5z1zJRYjs8gaeY63iQTNn tR1mpP97RngzQ1jCfZP3dZIuYA1TLgz/px0WraYflrpnYpzJOl0XLiigXefU5lyL 7YtGb9xuu8TXMI2D+n52DlYXGRjc9I7zUMPg03y7sC4BnKX5eA6Qda4plP5kvxxp K9PSO91RkS+01nwvXCNs7ISkQ1YpayDyNxsiDIqmHx3po9QB5QniceAa5mIYR/ld v9QKzRhLELiq8cYdu+fgfSOEaY8q9+/k+kEHakfsrXoLaiK2RVw4Y++S6Fh1QIy4 R8DHdhd8j33Yws96xRhI2P+g5mVzDpdEN1TtskdO5WjefCT83R84qqJsEaklVTrI AQDSweQfF+hc+B1PkDRbCgiLeSUnPfxzwdSoy35fc9/qg/JnoQMHFrkJB2Xn2+hv KaFfgM93f1CbCW/KDg== =KwzM -----END PGP SIGNATURE----- Merge tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening fixes from Kees Cook: - Fix CFI hash randomization with KASAN (Sami Tolvanen) - Check size of coreboot table entry and use flex-array * tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kbuild: Fix CFI hash randomization with KASAN firmware: coreboot: Check size of table entry and use flex-array |
||
Linus Torvalds
|
8b7be52f3f |
modules-6.2-rc4
Just one fix for modules by Nick. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmPB5S4SHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoin0OUQALx+uch7cZvl3aznH4CAkF1U2mKuESLT unQsnSlxIF0BETTAdPbhcCVEtKi102Xq4R6ffCSilB33b0IGM6qY6ZUQb/GMGhg5 qIur7EDiFksjCBmBuE1iNaAknkiKeirkJTEC4lEOl8OLXprmnuMeeR+E0BDi5sH3 YIxexLuhyKeF9Ke4bjAJ7A4WKvh2R7yamISCUcP6GL3w7U9urSjo9FHKcUmFH9nr qCX/1zE0fz7iJzb9YtBNVhdgKNOYNOxa7TDNXQCVuLcZQfmQqDMBVgKdOkj6TAWX 6L2CHT4N9IjPt9+DrlEUf6bSSwP4N4aFdyMAo6UlVXcgvEbTS/kdoMSqFOAQSAdb G/lzsvS3fD76VcCdZkwAFYnEhUTJ4xWTS0oaI++tu0EFX5lvRuQv2DRt0WULlD/u L0paUwmjtVajcSIATxRZkjoMiVD4btDRz30kaIUU/xoc1Gg/EADrSLHESaZ9eZVL EJ40aqLLIRBXGZrVEzvf97HIzuQiKfaPzywNvbMpxG3m0tV2pn3Z4ts/A8aO7c+O mBDnTURiZN6pT+xsnJBvqWrlXwPRUGwI+NjRcdPZhUyfgj5MHpEI8PAcUWy6TTUn H2P6x2iC3/nypqhnwjoixSptjaUWcak3R6UgwVS2YqfjePCqaq0wg9DAFDQH0Yx3 awAOoum0Ubin =aa3U -----END PGP SIGNATURE----- Merge tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module fix from Luis Chamberlain: "Just one fix for modules by Nick" * tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: kallsyms: Fix scheduling with interrupts disabled in self-test |
||
Linus Torvalds
|
b35ad63eec |
7 smb3 client fixes
-----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmPB/t0ACgkQiiy9cAdy T1Gvbwv+LIXF5dHNGHDuezecbD8T9sRF2v15Mh7i6SSg7BWeXXebYY4dyrSQ5SHu KqsN2Y3P3A0ZQ3ApzFryImM6BwSpOeyCsVRhl7VCWnMgXcroqc/O6F6/YRFVkUAi iWWZLXM7WFBQGXUbPXaiPc+wCRARrnul9p+48Teyy0CJWiWormQmkznVxeihErDX /pWdQdvJeFcUrIj1H3e4cyJF2hVzRiUGI/eZmBGlDyaK192vYgGYO2AhHnTfd7fU dUJ+/trVw0koyC5/86veHRqCcXzFD44ORkAB46NaCic1K8t+RPhmxgtriiNLcCrQ kEmeub6ayPkuniV88NBEPXaDy0S/cEYHr7GEuZTn4sq+hw//y5KbKN3sa6aLHsMH 46BIHcyTXU59eNJ4lWOjoqD2NiqP2GYFY4PZftB85H1EW8Fchwcsw4WzW0ENpzmi qcWslXDKYjJIZ6NHnBiR/FYI1VdsmoGbDzMhrHreCrWCuIuVaqfrFPMnvE8f5dqY fkfnkqT4 =SdZA -----END PGP SIGNATURE----- Merge tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: - memory leak and double free fix - two symlink fixes - minor cleanup fix - two smb1 fixes * tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix uninitialized memory read for smb311 posix symlink create cifs: fix potential memory leaks in session setup cifs: do not query ifaces on smb1 mounts cifs: fix double free on failed kerberos auth cifs: remove redundant assignment to the variable match cifs: fix file info setting in cifs_open_file() cifs: fix file info setting in cifs_query_path_info() |
||
Linus Torvalds
|
8e76813085 |
SCSI fixes on 20230114
Two minor fixes in the hisi_sas driver which only impact enterprise style multi-expander and shared disk situations and no core changes. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCY8KxpSYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishRAlAP9cQmiq M8WGimymqulRFEpkWpDM1R06eqHDI2K0h5/rfAD+KlgS0cOVvx0nenWyhmprYvDX 2Z239J+WXOjSJq8UJL4= =gMQX -----END PGP SIGNATURE----- Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two minor fixes in the hisi_sas driver which only impact enterprise style multi-expander and shared disk situations and no core changes" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id scsi: hisi_sas: Use abort task set to reset SAS disks when discovered |
||
Linus Torvalds
|
34cbf89afc |
ATA fixes for 6.2-rc4
A single fix for rc4 to prevent building the pata_cs5535 driver with user mode linux as it uses msr operations that are not defined with UML. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY8HnowAKCRDdoc3SxdoY dmDWAPwKMUiDzFSkeD7zfKGwCd72HWUa1298yL+XnD8Y7vLBtgEAqhYi9fAnVnN0 dUHm12rEwFOay+lVwxWuQowFaVmzGAs= =RXSo -----END PGP SIGNATURE----- Merge tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ATA fix from Damien Le Moal: "A single fix to prevent building the pata_cs5535 driver with user mode linux as it uses msr operations that are not defined with UML" * tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: pata_cs5535: Don't build on UML |
||
Linus Torvalds
|
97ec4d559d |
block-6.2-2023-01-13
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPBsFAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqgnEAC0OqxnMsOPNbkLO7k6FsSrG7ZoENkOIMCt Grk3D1cPkM13I0xc+WiaOBezMriPzfdXvt5AGDn9fd53Ih47qpSY4eU6pCqoCk5y HWdn8KXZvhJGZsSy0Nz+cfPDW/8diJON8YBpJwWM/DfDdP8XibtjlIMTVTtJab6h aGWjmy3leNfghOJ0cZ1wjL6maWFoowQASs52PZfajSc0mQ5X0i8BgQb1WOHNu89C vEir9PYlTmdMnYlAKLsyEL3KoGUPm++zSLtJeyWYavlCMGK5WTyNkzmeXqsQhAGf b1LjovQASe//1t2wvCzQviRf4cae0pE9JhiaYt2oxoDdHrfQj/WPndVS4yE9c+0O BnLVTCFHNv86TRXNCbEUzI+Ftj6m9qt4MrHz8YpstX7FxGxYC+T5RqTwYClWZQ0j llBuJUHj+kkAv6kBMJCHTyat6pxIDgcb52QMJr5mFWuEaTloraBIJC70hMtxBQV/ j5mrBYqCngCHVs+hAl9UQ4zqQVSvkeT11QFvwFolxIfs7qtfLqeGzYxvaeomqO3V sA+H5NY50OEuPfFFmCpcNUJXeUKg7wP39iNHdz6P5cCDBCfUwbNbgKKKNmBovaC+ KhPd8Xo1MmzDuF+cylvTcjOBDte4425GN7PBj4vP1xbuHYcjg6AEFLawgqE9Y4XX xyNlgJXPOg== =ujiw -----END PGP SIGNATURE----- Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Nothing major in here, just a collection of NVMe fixes and dropping a wrong might_sleep() that static checkers tripped over but which isn't valid" * tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux: MAINTAINERS: stop nvme matching for nvmem files nvme: don't allow unprivileged passthrough on partitions nvme: replace the "bool vec" arguments with flags in the ioctl path nvme: remove __nvme_ioctl nvme-pci: fix error handling in nvme_pci_enable() nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression block: Drop spurious might_sleep() from blk_put_queue() |
||
Linus Torvalds
|
2ce7592df9 |
io_uring-6.2-2023-01-13
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPBsL8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplLXD/9kmbwSseR8sm8s6rB3mZcfgN4vvkRDa5Kg r5bj32sOl2o37szObGI53MneFcCpdOszA2R3AHqUoj7ZQsyPlt59OIpKtdBlp6xv ACX1weBnxMTtdMZ90Mp/GxGLZuvSifbfL4z5YgbRRKnGorz3prmfRkKuO51DcJes mhoPGTDUsAmxoU261LzuHI7DtD69We8yBzj58p21NF8DvBI3NtuWWhOI+EGs2CDr aqilG5LxOYBKuEY660KhWKMAXwVbVkyLak4eIh4iax0R0Um/SKXnRMIRI3L1QZoP eqGtF4zGJGM+47RUnl6GF59/IsyRZR04GJlZ5ma8aqqZNh4oj8E6A+hpgczJ0025 QM4hG+NwqXeNpMjN0PwDypo3WoYjyIhaMoEyCT7dmVzj62y3pm3DDaZDnzUlQF5w 8YvoSPwyCzWHWFDGEZp6WRYn2YfoiE+L48euknADmTL5FwUOs1J5OUK0v8BcAYLO tqJ8Q8emx15ZTzHeI+Z9lDYIuYKBU9XuO8ugfbss/Xx2tQMDeHe6BIaujhwam6c4 BWyAIViXgWKpIIDB3Emsu3lStc3PJ1WLbBdw4ja0nwhCRB7IeclzZf1IZHTT4xsQ eg5EK2QORLrlY9keCoFhqfT4guSYQptBlwPuxhHM2gcrxXegR6294hBoPVSd1be5 g5NK0rrSWA== =9C0J -----END PGP SIGNATURE----- Merge tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux Pull io_uring fixes from Jens Axboe: "A fix for a regression that happened last week, rest is fixes that will be headed to stable as well. In detail: - Fix for a regression added with the leak fix from last week (me) - In writing a test case for that leak, inadvertently discovered a case where we a poll request can race. So fix that up and mark it for stable, and also ensure that fdinfo covers both the poll tables that we have. The latter was an oversight when the split poll table were added (me) - Fix for a lockdep reported issue with IOPOLL (Pavel)" * tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux: io_uring: lock overflowing for IOPOLL io_uring/poll: attempt request issue after racy poll wakeup io_uring/fdinfo: include locked hash table in fdinfo output io_uring/poll: add hash if ready poll request can't complete inline io_uring/io-wq: only free worker if it was allocated for creation |
||
Linus Torvalds
|
9e058c2952 |
pci-v6.2-fixes-1
-----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmPBniAUHGJoZWxnYWFz QGdvb2dsZS5jb20ACgkQWYigwDrT+vwOjRAAhjyRAgyiZV2rWS4pyvpQpqcpZWD9 796ZSqnzLJjVYCymGvUTX23FEA48n59+bCM/WpfEGUPrBf8LZQxC9YOCm6ltuM8+ FoSBykW/tHPq5IWaLzgrWpHeDOgEnZu/WFGGvrV3tl1mLpM1SJT8bGDsjHXlo+FM qkTEiA3nUEKQs5x9r2TTLCeUWGPNTIHNd2VfuxOqM3qC/nVCOfTTxU8nm6Lk7Eix nboAugAIADJIjs/+ZGekLBuzZYPkLYuDTyMYJ5hdo1p7wWCLc9gArEqvXKwVgmD3 ptenZeOlQi9Ay45HmkfIgfgKeeQ7REJj3dx04vf67neAianyUrB0EZDqDjR7LmgM ozlNt0XjyoeEhu6AQS0s1LZtbDiED1R/00P6Gb+YEjUCVipW2lEYYwP0v9dsnNoh 6wblgnkQoxLFM+5CAXRmCmpaoQn0Uam7okfVeohtsz8/kNQF2St0hjzr4Dmws+O3 k9PUqnnUl4ByElzpEDesVGZMJ3pxFVH15ufu8VnRqN60pLTvNrsPyU4cVnG176Rc 3RSDN3zMtPxnHJVy4r3bTNEZsX/7RUrOb4xScOXMmRDBMUc8QdscF8Oj1ucKlj5j mp7vB/7+VjU96uRarRyqUxGeQc77DCTcvOa1IGh/cuYom8ZJ6vpSCpKy6f6SFGuf i8iTTUcQKCdqVW4= =Fv2v -----END PGP SIGNATURE----- Merge tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci fixes from Bjorn Helgaas: - Work around apparent firmware issue that made Linux reject MMCONFIG space, which broke PCI extended config space (Bjorn Helgaas) - Fix CONFIG_PCIE_BT1 dependency due to mid-air collision between a PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and addition of PCIE_BT1 (Lukas Bulwahn) * tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space x86/pci: Simplify is_mmconf_reserved() messages PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN |
||
Sami Tolvanen
|
42633ed852 |
kbuild: Fix CFI hash randomization with KASAN
Clang emits a asan.module_ctor constructor to each object file when KASAN is enabled, and these functions are indirectly called in do_ctors. With CONFIG_CFI_CLANG, the compiler also emits a CFI type hash before each address-taken global function so they can pass indirect call checks. However, in commit |
||
Kees Cook
|
3b293487b8 |
firmware: coreboot: Check size of table entry and use flex-array
The memcpy() of the data following a coreboot_table_entry couldn't be evaluated by the compiler under CONFIG_FORTIFY_SOURCE. To make it easier to reason about, add an explicit flexible array member to struct coreboot_device so the entire entry can be copied at once. Additionally, validate the sizes before copying. Avoids this run-time false positive warning: memcpy: detected field-spanning write (size 168) of single field "&device->entry" at drivers/firmware/google/coreboot_table.c:103 (size 8) Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/all/03ae2704-8c30-f9f0-215b-7cdf4ad35a9a@molgen.mpg.de/ Cc: Jack Rosenthal <jrosenth@chromium.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Julius Werner <jwerner@chromium.org> Cc: Brian Norris <briannorris@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Julius Werner <jwerner@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Link: https://lore.kernel.org/r/20230107031406.gonna.761-kees@kernel.org Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Jack Rosenthal <jrosenth@chromium.org> Link: https://lore.kernel.org/r/20230112230312.give.446-kees@kernel.org |
||
Nicholas Piggin
|
da35048f26 |
kallsyms: Fix scheduling with interrupts disabled in self-test
kallsyms_on_each* may schedule so must not be called with interrupts
disabled. The iteration function could disable interrupts, but this
also changes lookup_symbol() to match the change to the other timing
code.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Link: https://lore.kernel.org/all/bug-216902-206035@https.bugzilla.kernel.org%2F/
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202212251728.8d0872ff-oliver.sang@intel.com
Fixes:
|
||
Peter Foley
|
22eebaa631 |
ata: pata_cs5535: Don't build on UML
This driver uses MSR functions that aren't implemented under UML. Avoid building it to prevent tripping up allyesconfig. e.g. /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3a3): undefined reference to `__tracepoint_read_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3d2): undefined reference to `__tracepoint_write_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x457): undefined reference to `__tracepoint_write_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x481): undefined reference to `do_trace_write_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4d5): undefined reference to `do_trace_write_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4f5): undefined reference to `do_trace_read_msr' /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x51c): undefined reference to `do_trace_write_msr' Signed-off-by: Peter Foley <pefoley2@pefoley.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> |
||
Linus Torvalds
|
92783a90bc |
ARM:
* Fix the PMCR_EL0 reset value after the PMU rework * Correctly handle S2 fault triggered by a S1 page table walk by not always classifying it as a write, as this breaks on R/O memslots * Document why we cannot exit with KVM_EXIT_MMIO when taking a write fault from a S1 PTW on a R/O memslot * Put the Apple M2 on the naughty list for not being able to correctly implement the vgic SEIS feature, just like the M1 before it * Reviewer updates: Alex is stepping down, replaced by Zenghui x86: * Fix various rare locking issues in Xen emulation and teach lockdep to detect them * Documentation improvements * Do not return host topology information from KVM_GET_SUPPORTED_CPUID -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmPAT3EUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPmDAf+ICCVMwgm+PjAc6NuXzaUk6BFGWKF 1lzMvnKb6ARnhMKwyjl/Sf5EgnTuucnSTBHuE1kjaLkPUDNJvi4oRXVdDwKjtXnZ Zxk4dpsNLWVfALHTk1KweIkR5KNif0kugUh9RNp6zOBnoTVRh8XdCHpeDv73tJaG R1gCAreVTDbp+wNrVpiImUfYAZ4GrGpwwWRH/xLAGDWoTL9Z9J5tQygf+0C429n/ eJoTrToLjESbYadDgCNDD+TUkHbeDVg8aeio2JZga9SvH3RBhwriLqz26v9yvikL UoY96AySMaiox4pgCUYUl8nng8MR8AG4C4vpNnLalj7tfHxRfhtAwD0EYw== =gDOV -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM: - Fix the PMCR_EL0 reset value after the PMU rework - Correctly handle S2 fault triggered by a S1 page table walk by not always classifying it as a write, as this breaks on R/O memslots - Document why we cannot exit with KVM_EXIT_MMIO when taking a write fault from a S1 PTW on a R/O memslot - Put the Apple M2 on the naughty list for not being able to correctly implement the vgic SEIS feature, just like the M1 before it - Reviewer updates: Alex is stepping down, replaced by Zenghui x86: - Fix various rare locking issues in Xen emulation and teach lockdep to detect them - Documentation improvements - Do not return host topology information from KVM_GET_SUPPORTED_CPUID" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest() KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking Documentation: kvm: fix SRCU locking order docs KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID KVM: nSVM: clarify recalc_intercepts() wrt CR8 MAINTAINERS: Remove myself as a KVM/arm64 reviewer MAINTAINERS: Add Zenghui Yu as a KVM/arm64 reviewer KVM: arm64: vgic: Add Apple M2 cpus to the list of broken SEIS implementations KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_* KVM: arm64: Document the behaviour of S1PTW faults on RO memslots KVM: arm64: Fix S1PTW handling on RO memslots KVM: arm64: PMU: Fix PMCR_EL0 reset value |
||
Mateusz Guzik
|
f5fe24ef17 |
lockref: stop doing cpu_relax in the cmpxchg loop
On the x86-64 architecture even a failing cmpxchg grants exclusive access to the cacheline, making it preferable to retry the failed op immediately instead of stalling with the pause instruction. To illustrate the impact, below are benchmark results obtained by running various will-it-scale tests on top of the 6.2-rc3 kernel and Cascade Lake (2 sockets * 24 cores * 2 threads) CPU. All results in ops/s. Note there is some variance in re-runs, but the code is consistently faster when contention is present. open3 ("Same file open/close"): proc stock no-pause 1 805603 814942 (+%1) 2 1054980 1054781 (-0%) 8 1544802 1822858 (+18%) 24 1191064 2199665 (+84%) 48 851582 1469860 (+72%) 96 609481 1427170 (+134%) fstat2 ("Same file fstat"): proc stock no-pause 1 3013872 3047636 (+1%) 2 4284687 4400421 (+2%) 8 3257721 5530156 (+69%) 24 2239819 5466127 (+144%) 48 1701072 5256609 (+209%) 96 |
||
Bjorn Helgaas
|
fd3a8cff4d |
x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
Normally we reject ECAM space unless it is reported as reserved in the E820 table or via a PNP0C02 _CRS method (PCI Firmware, r3.3, sec 4.1.2). |
||
Linus Torvalds
|
0bf913e07b |
First batch of EFI fixes for v6.2:
- avoid a potential crash on the efi_subsys_init() error path - use more appropriate error code for runtime services calls issued after a crash in the firmware occurred - avoid READ_ONCE() for accessing firmware tables that may appear misaligned in memory -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmPBg68ACgkQw08iOZLZ jyQs5Qv+PVg06BhEqN+vwNQy6vd4ezTxmDAy7yx751mo3HIw0qT0ohsCIpRydq0c +qlCXa+Uu/yr/IQplfDT9vY+MEwD9iuwJha8ltGRWM3++yEF4uQXowHDoEKsO84l 5PaC37EfOvHmV6UdFdIF0OYDOcRvX2FsIbmUKRyvIav1e+QRLvUWWKKEmAh04c7G yNc0837kmoOpjKrYPc8j2n3dVUbhrFUW5eLIFmd8yrR+GRu6Ae5RH3J7iF7Nqtrq oReYYq3XpmYg8c00WV0NKVuB0DK7fhGY7jcbDfLmTrPwqVzLjxQGecxsQPYnqrJd mZywkm2fM8KIJy2LQDJOVOZaDAzaC2SkrpELHX/MnPK1UrP561AIv/sXK+3+UBEm b6m5dHbJgaifKP3kkbc9Cy4f9avLJOdjdXH5f5zPe7it54yHLsacEvjT6M2oiunx zIvTd/MXi24J+tzgxr08KM5wHXgLGh+fUM7BfZTvEVQmUjY8TnIPjsaAJhTS3jzV TN3/XAWi =4LbF -----END PGP SIGNATURE----- Merge tag 'efi-fixes-for-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - avoid a potential crash on the efi_subsys_init() error path - use more appropriate error code for runtime services calls issued after a crash in the firmware occurred - avoid READ_ONCE() for accessing firmware tables that may appear misaligned in memory * tag 'efi-fixes-for-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: tpm: Avoid READ_ONCE() for accessing the event log efi: rt-wrapper: Add missing include efi: fix userspace infinite retry read efivars after EFI runtime services page fault efi: fix NULL-deref in init error path |
||
Linus Torvalds
|
40d92fc4fa |
Three documentation fixes (or rather two and one warning):
- Sphinx 6.0 broke our configuration mechanism, so fix it. - I broke our configuration for non-Alabaster themes; Akira fixed it. - Deprecate Sphinx < 2.4 with an eye toward future removal -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmPBhWwPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5Y+i8IAJCd8qgopxIcmzif8ncsrZFIdk3FBd4INCU3 gEr5IBQN10Fm3es8FcWQPhX8nqFzlyG9GyjNSEfZpKYF9y3zWx9l5xOD0f6Ki6F4 HEaBcP11zQCSbrdZMR2in7fW+SNqIjJ+srDLrLkG2d4il6IbbSwx121pjPxJgHkK Y4Sj4Aa3fm5m5JzqArc8/IQRl6ewrfCuGXGh2RdunMzdf22Q2vMdIzEfhyilV1Cg FSXttLDTq5huRSBv8PYaMJnpx2mMn+si8c5mFcNV6oDP+VG4m2rBw4kYQk6q0rU2 xJFnbh7oThKyQ955k+sxJYoSxq9Fd5lXX/3d+HtqSvvC/WAP8gY= =ttzZ -----END PGP SIGNATURE----- Merge tag 'docs-6.2-fixes' of git://git.lwn.net/linux Pull documentation fixes from Jonathan Corbet: "Three documentation fixes (or rather two and one warning): - Sphinx 6.0 broke our configuration mechanism, so fix it - I broke our configuration for non-Alabaster themes; Akira fixed it - Deprecate Sphinx < 2.4 with an eye toward future removal" * tag 'docs-6.2-fixes' of git://git.lwn.net/linux: docs/conf.py: Use about.html only in sidebar of alabaster theme docs: Deprecate use of Sphinx < 2.4.x docs: Fix the docs build with Sphinx 6.0 |
||
Ard Biesheuvel
|
d3f450533b |
efi: tpm: Avoid READ_ONCE() for accessing the event log
Nathan reports that recent kernels built with LTO will crash when doing EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a misaligned load from the TPM event log, which is annotated with READ_ONCE(), and under LTO, this gets translated into a LDAR instruction which does not tolerate misaligned accesses. Interestingly, this does not happen when booting the same kernel straight from the UEFI shell, and so the fact that the event log may appear misaligned in memory may be caused by a bug in GRUB or SHIM. However, using READ_ONCE() to access firmware tables is slightly unusual in any case, and here, we only need to ensure that 'event' is not dereferenced again after it gets unmapped, but this is already taken care of by the implicit barrier() semantics of the early_memunmap() call. Cc: <stable@vger.kernel.org> Cc: Peter Jones <pjones@redhat.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1782 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> |
||
Pavel Begunkov
|
544d163d65 |
io_uring: lock overflowing for IOPOLL
syzbot reports an issue with overflow filling for IOPOLL: WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734 CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0 Workqueue: events_unbound io_ring_exit_work Call trace: io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734 io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773 io_fill_cqe_req io_uring/io_uring.h:168 [inline] io_do_iopoll+0x474/0x62c io_uring/rw.c:1065 io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513 io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056 io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863 There is no real problem for normal IOPOLL as flush is also called with uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL, for which __io_cqring_overflow_flush() happens from the CQ waiting path. Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |