linux/arch/x86/mm
Linus Torvalds 02b670c1f8 x86/mm: Remove broken vsyscall emulation code from the page fault code
The syzbot-reported stack trace from hell in this discussion thread
actually has three nested page faults:

  https://lore.kernel.org/r/000000000000d5f4fc0616e816d4@google.com

... and I think that's actually the important thing here:

 - the first page fault is from user space, and triggers the vsyscall
   emulation.

 - the second page fault is from __do_sys_gettimeofday(), and that should
   just have caused the exception that then sets the return value to
   -EFAULT

 - the third nested page fault is due to _raw_spin_unlock_irqrestore() ->
   preempt_schedule() -> trace_sched_switch(), which then causes a BPF
   trace program to run, which does that bpf_probe_read_compat(), which
   causes that page fault under pagefault_disable().

It's quite the nasty backtrace, and there's a lot going on.

The problem is literally the vsyscall emulation, which sets

        current->thread.sig_on_uaccess_err = 1;

and that causes the fixup_exception() code to send the signal *despite* the
exception being caught.

And I think that is in fact completely bogus.  It's completely bogus
exactly because it sends that signal even when it *shouldn't* be sent -
like for the BPF user mode trace gathering.

In other words, I think the whole "sig_on_uaccess_err" thing is entirely
broken, because it makes any nested page-faults do all the wrong things.

Now, arguably, I don't think anybody should enable vsyscall emulation any
more, but this test case clearly does.

I think we should just make the "send SIGSEGV" be something that the
vsyscall emulation does on its own, not this broken per-thread state for
something that isn't actually per thread.

The x86 page fault code actually tried to deal with the "incorrect nesting"
by having that:

                if (in_interrupt())
                        return;

which ignores the sig_on_uaccess_err case when it happens in interrupts,
but as shown by this example, these nested page faults do not need to be
about interrupts at all.

IOW, I think the only right thing is to remove that horrendously broken
code.

The attached patch looks like the ObviouslyCorrect(tm) thing to do.

NOTE! This broken code goes back to this commit in 2011:

  4fc3490114 ("x86-64: Set siginfo and context on vsyscall emulation faults")

... and back then the reason was to get all the siginfo details right.
Honestly, I do not for a moment believe that it's worth getting the siginfo
details right here, but part of the commit says:

    This fixes issues with UML when vsyscall=emulate.

... and so my patch to remove this garbage will probably break UML in this
situation.

I do not believe that anybody should be running with vsyscall=emulate in
2024 in the first place, much less if you are doing things like UML. But
let's see if somebody screams.

Reported-and-tested-by: syzbot+83e7f982ca045ab4405c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/CAHk-=wh9D6f7HUkDgZHKmDCHUQmp+Co89GP+b8+z+G56BKeyNg@mail.gmail.com
2024-05-01 09:41:43 +02:00
..
pat x86/mm/pat: fix VM_PAT handling in COW mappings 2024-04-05 11:21:31 -07:00
amdtopology.c x86/mm/numa: Move early mptable evaluation into common code 2024-02-15 22:07:41 +01:00
cpu_entry_area.c x86/mm: Do not shuffle CPU entry areas without KASLR 2023-03-22 10:42:47 -07:00
debug_pagetables.c x86/bugs: Rename CONFIG_PAGE_TABLE_ISOLATION => CONFIG_MITIGATION_PAGE_TABLE_ISOLATION 2024-01-10 10:52:28 +01:00
dump_pagetables.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
extable.c x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user 2024-01-31 22:03:04 +01:00
fault.c x86/mm: Remove broken vsyscall emulation code from the page fault code 2024-05-01 09:41:43 +02:00
highmem_32.c x86/mm: Include asm/numa.h for set_highmem_pages_init() 2023-05-18 11:56:18 -07:00
hugetlbpage.c arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging 2022-11-08 15:57:25 -08:00
ident_map.c Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped." 2024-03-25 11:54:35 +01:00
init_32.c mm/treewide: replace pmd_large() with pmd_leaf() 2024-03-06 13:04:19 -08:00
init_64.c mm/treewide: replace pud_large() with pud_leaf() 2024-03-06 13:04:19 -08:00
init.c - Remove unnecessary "INVPCID single" feature tracking 2023-08-30 09:54:00 -07:00
iomap_32.c io-mapping: Cleanup atomic iomap 2020-11-06 23:14:58 +01:00
ioremap.c x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM 2023-03-26 23:42:40 +02:00
kasan_init_64.c mm/treewide: replace pud_large() with pud_leaf() 2024-03-06 13:04:19 -08:00
kaslr.c x86/mm: Avoid using set_pgd() outside of real PGD pages 2023-06-16 11:46:42 -07:00
kmmio.c x86/mm/kmmio: Remove redundant preempt_disable() 2022-12-12 10:54:48 -05:00
kmsan_shadow.c x86: kmsan: handle CPU entry area 2022-10-03 14:03:26 -07:00
maccess.c x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() 2024-02-15 19:21:39 -08:00
Makefile Core x86 changes for v6.9: 2024-03-11 19:53:15 -07:00
mem_encrypt_amd.c x86/sev: Skip ROM range scans and validation for SEV-SNP guests 2024-03-26 15:22:35 +01:00
mem_encrypt_boot.S x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros 2022-12-15 10:37:27 -08:00
mem_encrypt_identity.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
mem_encrypt.c x86/sev: Add callback to apply RMP table fixups for kexec 2024-04-29 11:21:09 +02:00
mm_internal.h x86/mm: thread pgprot_t through init_memory_mapping() 2020-04-10 15:36:21 -07:00
mmap.c x86/mm/mmap: Fix -Wmissing-prototypes warnings 2020-04-22 20:19:48 +02:00
mmio-mod.c x86: Replace cpumask_weight() with cpumask_empty() where appropriate 2022-04-10 22:35:38 +02:00
numa_32.c x86/numa/32: Include missing <asm/pgtable_areas.h> 2024-04-04 09:39:38 +02:00
numa_64.c
numa_emulation.c x86/mm: Replace nodes_weight() with nodes_empty() where appropriate 2022-04-10 22:35:38 +02:00
numa_internal.h
numa.c x86/numa: Fix the sort compare func used in numa_fill_memblks() 2024-02-16 23:20:34 -08:00
pf_in.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
pf_in.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
pgprot.c x86/mm: move protection_map[] inside the platform 2022-07-17 17:14:38 -07:00
pgtable_32.c mm: remove unneeded includes of <asm/pgalloc.h> 2020-08-07 11:33:26 -07:00
pgtable.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
physaddr.c mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions 2019-12-10 10:12:55 +01:00
physaddr.h
pkeys.c x86/pkeys: Clarify PKRU_AD_KEY macro 2022-06-07 16:06:33 -07:00
pti.c mm/treewide: replace pud_large() with pud_leaf() 2024-03-06 13:04:19 -08:00
srat.c x86/apic: Wrap APIC ID validation into an inline 2023-08-09 11:58:30 -07:00
testmmiotrace.c remove ioremap_nocache and devm_ioremap_nocache 2020-01-06 09:45:59 +01:00
tlb.c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00