2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-26 23:55:40 +08:00
linux-next/arch/s390
Linus Torvalds 5df397dec7 mm: delay page_remove_rmap() until after the TLB has been flushed
When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with the
page table lock because the page had already been made invisible to rmap.

And that is a problem, because while the TLB entry exists, we could end up
with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the
     page

 (b) another CPU could continue to use the stale and dirty TLB entry and
     continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing the
TLB".  It uses the newly introduced 'encoded page pointer' to do that
without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - we want to delay the rmap removal, but not past the page table lock,
   because that simplifies the memcg accounting

 - only SMP configurations want to delay TLB flushing, since on UP
   there are obviously no remote TLBs to worry about, and the page
   table lock means there are no preemption issues either

 - s390 has its own mmu_gather model that doesn't delay TLB flushing,
   and as a result also does not want the delayed rmap. As such, we can
   treat S390 like the UP case and use a common fallback for the "no
   delays" case.

 - we can track an enormous number of pages in our mmu_gather structure,
   with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
   all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then
   need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps pending,
in such a way that only the last batch (the so-called "active batch") in
the mmu_gather structure can have any delayed entries.

NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly even
when user space goes off the rails.

[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:58:50 -08:00
..
appldata s390/appldata: use struct_size() helper 2020-06-29 16:32:34 +02:00
boot s390/boot: add secure boot trailer 2022-10-26 14:47:31 +02:00
configs s390: update defconfigs 2022-08-30 21:57:07 +02:00
crypto crypto: Kconfig - simplify cipher entries 2022-08-26 18:50:43 +08:00
hypfs s390/hypfs: avoid error message under KVM 2022-08-15 17:19:51 +02:00
include mm: delay page_remove_rmap() until after the TLB has been flushed 2022-11-30 15:58:50 -08:00
kernel s390/pai: fix raw data collection for PMU pai_ext 2022-10-26 14:47:31 +02:00
kvm The first batch of KVM patches, mostly covering x86, which I 2022-10-09 09:39:55 -07:00
lib s390/uaccess: add missing EX_TABLE entries to __clear_user() 2022-10-26 14:47:30 +02:00
mm treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
net s390/bpf: Fix typo in comment 2022-05-23 11:25:53 -07:00
pci s390/pci: add missing EX_TABLE entries to __pcistg_mio_inuser()/__pcilg_mio_inuser() 2022-10-26 14:47:31 +02:00
purgatory s390/purgatory: remove duplicated build rule of kexec-purgatory.o 2022-06-30 14:18:16 +02:00
tools KVM: s390: Add facility 197 to the allow list 2022-07-13 15:25:25 +02:00
Kbuild kbuild: use more subdir- for visiting subdirectories while cleaning 2021-10-24 13:49:46 +09:00
Kconfig Random number generator updates for Linux 6.0-rc1. 2022-08-02 17:31:35 -07:00
Kconfig.debug s390/Kconfig.debug: fix indentation 2022-06-01 12:03:15 +02:00
Makefile kbuild: remove head-y syntax 2022-10-02 18:06:03 +09:00