From 2f406263e3e954aa24c1248edcfa9be0c1bb30fa Mon Sep 17 00:00:00 2001
From: Yin Fengwei
Date: Tue, 8 Aug 2023 10:09:15 +0800
Subject: [PATCH 1/7] madvise:madvise_cold_or_pageout_pte_range(): don't use
 mapcount() against large folio for sharing check

Patch series "don't use mapcount() to check large folio sharing", v2.

In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared.  But it's
not correct, as folio_mapcount() returns the total mapcount of the large
folio.

Use folio_estimated_sharers() here, as the estimated number is enough.

This patchset fixes the following case: a user-space application calls
madvise() with MADV_FREE, MADV_COLD or MADV_PAGEOUT on an address range
that has THPs mapped into it.  Without the patchset, the THPs are
skipped.  With the patchset, each THP is split and handled accordingly.

David reported that the cow selftest skips some cases because
MADV_PAGEOUT skips THPs:

https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.com/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81

and I confirmed that this patchset makes it work again.


This patch (of 3):

Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range()
to use folios") replaced page_mapcount() with folio_mapcount() to check
whether the folio is shared by another mapping.

That's not correct for large folios.  folio_mapcount() returns the total
mapcount of a large folio, which is not suitable for detecting whether
the folio is shared.

Use folio_estimated_sharers(), which returns an estimated number of
sharers.  That means it's not 100% correct, but it should be OK for the
madvise case here.

The user-visible effect is that the THP is skipped when the user calls
madvise(); the correct behavior is for the THP to be split and then
processed.

NOTE: this change is a temporary fix to reduce the user-visible effects
until the long-term fix from David is ready.
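For reference, a sketch of the helper this series switches to.  At the
time of this series, folio_estimated_sharers() derives its estimate from
the precise mapcount of the first subpage (simplified from
include/linux/mm.h):

	static inline int folio_estimated_sharers(struct folio *folio)
	{
		/*
		 * Cheap estimate: read the precise mapcount of the first
		 * subpage and assume the other subpages match.  Not exact
		 * for large folios, but good enough for the madvise
		 * checks below.
		 */
		return page_mapcount(folio_page(folio, 0));
	}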
Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
Signed-off-by: Yin Fengwei
Reviewed-by: Yu Zhao
Reviewed-by: Ryan Roberts
Cc: David Hildenbrand
Cc: Kefeng Wang
Cc: Matthew Wilcox
Cc: Minchan Kim
Cc: Vishal Moola (Oracle)
Cc: Yang Shi
Cc:
Signed-off-by: Andrew Morton
---
 mm/madvise.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index bfe0e06427bd..46802b4cf65a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -384,7 +384,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		folio = pfn_folio(pmd_pfn(orig_pmd));
 
 		/* Do not interfere with other mappings of this folio */
-		if (folio_mapcount(folio) != 1)
+		if (folio_estimated_sharers(folio) != 1)
 			goto huge_unlock;
 
 		if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -458,7 +458,7 @@ regular_folio:
 		if (folio_test_large(folio)) {
 			int err;
 
-			if (folio_mapcount(folio) != 1)
+			if (folio_estimated_sharers(folio) != 1)
 				break;
 			if (pageout_anon_only_filter && !folio_test_anon(folio))
 				break;

From 20b18aada1856b2ce0512b087a8681342af73e60 Mon Sep 17 00:00:00 2001
From: Yin Fengwei
Date: Tue, 8 Aug 2023 10:09:16 +0800
Subject: [PATCH 2/7] madvise:madvise_free_huge_pmd(): don't use mapcount()
 against large folio for sharing check

Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to
use a folio") replaced page_mapcount() with folio_mapcount() to check
whether the folio is shared by another mapping.

That's not correct for large folios.  folio_mapcount() returns the total
mapcount of a large folio, which is not suitable for detecting whether
the folio is shared.

Use folio_estimated_sharers(), which returns an estimated number of
sharers.  That means it's not 100% correct, but it should be OK for the
madvise case here.

The user-visible effect is that the THP is skipped when the user calls
madvise(); the correct behavior is for the THP to be split and then
processed.

NOTE: this change is a temporary fix to reduce the user-visible effects
until the long-term fix from David is ready.
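To make the user-visible scenario from the cover letter concrete, here
is a minimal sketch of a single-sharer THP that the old folio_mapcount()
check would wrongly skip.  It assumes a 2 MiB PMD size, a THP-enabled
kernel, and a libc exposing MADV_PAGEOUT; error handling and careful
alignment of the mapping are omitted:

	#include <string.h>
	#include <sys/mman.h>

	#define THP_SIZE	(2UL << 20)	/* PMD size on x86-64 */

	int main(void)
	{
		char *buf = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		madvise(buf, THP_SIZE, MADV_HUGEPAGE);	/* request a THP */
		memset(buf, 1, THP_SIZE);		/* fault it in */

		/*
		 * mprotect() on one subpage splits the PMD mapping into
		 * PTEs; the large folio stays intact, so folio_mapcount()
		 * now sums all subpage mapcounts even though there is a
		 * single sharer.
		 */
		mprotect(buf, 4096, PROT_READ);

		/*
		 * Before this series, the folio_mapcount() != 1 check made
		 * this call skip the THP; with it, the THP is split and
		 * paged out.
		 */
		madvise(buf, THP_SIZE, MADV_PAGEOUT);
		return 0;
	}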
Link: https://lkml.kernel.org/r/20230808020917.2230692-3-fengwei.yin@intel.com
Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
Signed-off-by: Yin Fengwei
Reviewed-by: Yu Zhao
Reviewed-by: Ryan Roberts
Cc: David Hildenbrand
Cc: Kefeng Wang
Cc: Matthew Wilcox
Cc: Minchan Kim
Cc: Vishal Moola (Oracle)
Cc: Yang Shi
Signed-off-by: Andrew Morton
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f15d557e5708..164d22365bde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1612,7 +1612,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * If other processes are mapping this folio, we couldn't discard
 	 * the folio unless they all do MADV_FREE so let's skip the folio.
 	 */
-	if (folio_mapcount(folio) != 1)
+	if (folio_estimated_sharers(folio) != 1)
 		goto out;
 
 	if (!folio_trylock(folio))

From 0e0e9bd5f7b9d40fd03b70092367247d52da1db0 Mon Sep 17 00:00:00 2001
From: Yin Fengwei
Date: Tue, 8 Aug 2023 10:09:17 +0800
Subject: [PATCH 3/7] madvise:madvise_free_pte_range(): don't use mapcount()
 against large folio for sharing check

Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
folio") replaced page_mapcount() with folio_mapcount() to check whether
the folio is shared by another mapping.

That's not correct for large folios.  folio_mapcount() returns the total
mapcount of a large folio, which is not suitable for detecting whether
the folio is shared.

Use folio_estimated_sharers(), which returns an estimated number of
sharers.  That means it's not 100% correct, but it should be OK for the
madvise case here.

The user-visible effect is that the THP is skipped when the user calls
madvise(); the correct behavior is for the THP to be split and then
processed.

NOTE: this change is a temporary fix to reduce the user-visible effects
until the long-term fix from David is ready.

Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Signed-off-by: Yin Fengwei
Reviewed-by: Yu Zhao
Reviewed-by: Ryan Roberts
Cc: David Hildenbrand
Cc: Kefeng Wang
Cc: Matthew Wilcox
Cc: Minchan Kim
Cc: Vishal Moola (Oracle)
Cc: Yang Shi
Cc:
Signed-off-by: Andrew Morton
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 46802b4cf65a..ec30f48f8f2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -680,7 +680,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (folio_test_large(folio)) {
 			int err;
 
-			if (folio_mapcount(folio) != 1)
+			if (folio_estimated_sharers(folio) != 1)
 				break;
 			if (!folio_trylock(folio))
 				break;

From cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett"
Date: Fri, 18 Aug 2023 20:43:55 -0400
Subject: [PATCH 4/7] maple_tree: disable mas_wr_append() when other readers
 are possible

The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update.  Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.

During the analysis of mas_next_slot(), the following interleaving was
artificially created by separating the writer and reader code:

Writer:                                 reader:
mas_wr_append
    set end pivot
    updates end metadata
    Detects write to last slot
    last slot write is to start of slot
    store current contents in slot
    overwrite old end pivot
                                        mas_next_slot():
                                                read end metadata
                                                read old end pivot
                                                return with incorrect range
    store new value

Alternatively:

Writer:                                 reader:
mas_wr_append
    set end pivot
    updates end metadata
    Detects write to last slot
    last slot write is to end of slot
    store value
                                        mas_next_slot():
                                                read end metadata
                                                read old end pivot
                                                read new end pivot
                                                return with incorrect range
    set old end pivot

There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.
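To complement the diagrams, a minimal C illustration (not maple_tree
code; the names are made up) of why two separate stores let a lockless
reader observe a mixed state:

	/* Stand-ins for the node's end pivot and end metadata. */
	struct demo_node {
		unsigned long end_pivot;
		unsigned char end_meta;
	};

	static void writer_append(struct demo_node *n, unsigned long pivot)
	{
		n->end_pivot = pivot;	/* store 1 */
		/*
		 * An RCU reader running here sees the new pivot paired
		 * with the old metadata and computes an incorrect range.
		 */
		n->end_meta++;		/* store 2 */
	}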
Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett
Cc:
Signed-off-by: Andrew Morton
---
 lib/maple_tree.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4dd73cf936a6..f723024e1426 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4265,6 +4265,10 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
  * mas_wr_append: Attempt to append
  * @wr_mas: the maple write state
  *
+ * This is currently unsafe in rcu mode since the end of the node may be cached
+ * by readers while the node contents may be updated which could result in
+ * inaccurate information.
+ *
  * Return: True if appended, false otherwise
  */
 static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
@@ -4274,6 +4278,9 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
 	struct ma_state *mas = wr_mas->mas;
 	unsigned char node_pivots = mt_pivots[wr_mas->type];
 
+	if (mt_in_rcu(mas->tree))
+		return false;
+
 	if (mas->offset != wr_mas->node_end)
 		return false;

From 5e56982dd0759d8b345fe1468297dd4f630db5d7 Mon Sep 17 00:00:00 2001
From: Andre Przywara
Date: Mon, 21 Aug 2023 17:05:33 +0100
Subject: [PATCH 5/7] selftests: cachestat: test for cachestat availability

Patch series "selftests: cachestat: fix run on older kernels", v2.

I ran all kernel selftests on some test machine, and stumbled upon
cachestat failing (among others).  These patches fix the run on older
kernels and when the current directory is on a tmpfs instance.


This patch (of 2):

As cachestat is a new syscall, it won't be available on older kernels,
for instance those running on a development machine.  At the moment the
test reports all tests as "not ok" in this case.

Test for the cachestat syscall availability first, before doing further
tests, and bail out early with a TAP SKIP comment.

This also uses the opportunity to add the proper TAP headers, and to add
one check for proper error handling (an illegal file descriptor).
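The probe pattern generalizes to any newly added syscall; a hedged
sketch (the helper name is made up):

	#include <errno.h>
	#include <stddef.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* Returns 1 if the kernel implements syscall nr, 0 on ENOSYS. */
	static int syscall_available(long nr)
	{
		/*
		 * Deliberately invalid arguments: any error other than
		 * ENOSYS still proves the syscall entry point exists.
		 */
		return !(syscall(nr, -1, NULL, NULL, 0) == -1 &&
			 errno == ENOSYS);
	}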
Link: https://lkml.kernel.org/r/20230821160534.3414911-1-andre.przywara@arm.com
Link: https://lkml.kernel.org/r/20230821160534.3414911-2-andre.przywara@arm.com
Signed-off-by: Andre Przywara
Acked-by: Nhat Pham
Cc: Johannes Weiner
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 .../selftests/cachestat/test_cachestat.c | 20 ++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
index 54d09b820ed4..b0c06393bcad 100644
--- a/tools/testing/selftests/cachestat/test_cachestat.c
+++ b/tools/testing/selftests/cachestat/test_cachestat.c
@@ -15,6 +15,8 @@
 
 #include "../kselftest.h"
 
+#define NR_TESTS	8
+
 static const char * const dev_files[] = {
 	"/dev/zero", "/dev/null", "/dev/urandom",
 	"/proc/version", "/proc"
@@ -236,7 +238,23 @@ out:
 
 int main(void)
 {
-	int ret = 0;
+	int ret;
+
+	ksft_print_header();
+
+	ret = syscall(__NR_cachestat, -1, NULL, NULL, 0);
+	if (ret == -1 && errno == ENOSYS)
+		ksft_exit_skip("cachestat syscall not available\n");
+
+	ksft_set_plan(NR_TESTS);
+
+	if (ret == -1 && errno == EBADF) {
+		ksft_test_result_pass("bad file descriptor recognized\n");
+		ret = 0;
+	} else {
+		ksft_test_result_fail("bad file descriptor ignored\n");
+		ret = 1;
+	}
 
 	for (int i = 0; i < 5; i++) {
 		const char *dev_filename = dev_files[i];

From f84f62e69963d7742acec4340ec1c4c7ef22b887 Mon Sep 17 00:00:00 2001
From: Andre Przywara
Date: Mon, 21 Aug 2023 17:05:34 +0100
Subject: [PATCH 6/7] selftests: cachestat: catch failing fsync test on tmpfs

The cachestat kselftest runs a test on a normal file, which is created
temporarily in the current directory.  Among the tests it runs there is
a call to fsync(), which is expected to clean all dirty pages used by
the file.

However the tmpfs filesystem implements fsync() as noop_fsync(), so the
call will not even attempt to clean anything when this test file happens
to live on a tmpfs instance.  This happens in an initramfs, or when the
current directory is in /dev/shm or sometimes /tmp.

To avoid this test failing wrongly, use statfs() to check which
filesystem the test file lives on.  If that is "tmpfs", we skip the
fsync() test.

Since the fsync test is only one part of the "normal file" test, we now
execute this test twice, skipping the fsync part on the first call.
This way only the second test, including the fsync part, would be
skipped.
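For context, tmpfs wires its fsync file operation to noop_fsync(), which
reports success without starting any writeback; a sketch of its shape
(as in fs/libfs.c):

	int noop_fsync(struct file *file, loff_t start, loff_t end,
		       int datasync)
	{
		/*
		 * Nothing to sync: dirty pages stay dirty, so the
		 * nr_dirty check after fsync() cannot pass on tmpfs.
		 */
		return 0;
	}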
Link: https://lkml.kernel.org/r/20230821160534.3414911-3-andre.przywara@arm.com
Signed-off-by: Andre Przywara
Cc: Johannes Weiner
Cc: Nhat Pham
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 .../selftests/cachestat/test_cachestat.c | 62 ++++++++++++++-----
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
index b0c06393bcad..c037553cfd92 100644
--- a/tools/testing/selftests/cachestat/test_cachestat.c
+++ b/tools/testing/selftests/cachestat/test_cachestat.c
@@ -4,10 +4,12 @@
 #include <stdio.h>
 #include <stdbool.h>
 #include <linux/kernel.h>
+#include <linux/magic.h>
 #include <linux/mman.h>
 #include <sys/mman.h>
 #include <sys/shm.h>
 #include <sys/syscall.h>
+#include <sys/statfs.h>
 #include <unistd.h>
 #include <string.h>
 #include <fcntl.h>
@@ -15,7 +17,7 @@
 
 #include "../kselftest.h"
 
-#define NR_TESTS	8
+#define NR_TESTS	9
 
 static const char * const dev_files[] = {
 	"/dev/zero", "/dev/null", "/dev/urandom",
@@ -92,6 +94,20 @@ out:
 	return ret;
 }
 
+/*
+ * fsync() is implemented via noop_fsync() on tmpfs. This makes the fsync()
+ * test fail below, so we need to check for test file living on a tmpfs.
+ */
+static bool is_on_tmpfs(int fd)
+{
+	struct statfs statfs_buf;
+
+	if (fstatfs(fd, &statfs_buf))
+		return false;
+
+	return statfs_buf.f_type == TMPFS_MAGIC;
+}
+
 /*
  * Open/create the file at filename, (optionally) write random data to it
  * (exactly num_pages), then test the cachestat syscall on this file.
@@ -99,13 +115,13 @@ out:
  * If test_fsync == true, fsync the file, then check the number of dirty
  * pages.
  */
-bool test_cachestat(const char *filename, bool write_random, bool create,
-		bool test_fsync, unsigned long num_pages, int open_flags,
-		mode_t open_mode)
+static int test_cachestat(const char *filename, bool write_random, bool create,
+			  bool test_fsync, unsigned long num_pages,
+			  int open_flags, mode_t open_mode)
 {
 	size_t PS = sysconf(_SC_PAGESIZE);
 	int filesize = num_pages * PS;
-	bool ret = true;
+	int ret = KSFT_PASS;
 	long syscall_ret;
 	struct cachestat cs;
 	struct cachestat_range cs_range = { 0, filesize };
@@ -114,7 +130,7 @@ bool test_cachestat(const char *filename, bool write_random, bool create,
 
 	if (fd == -1) {
 		ksft_print_msg("Unable to create/open file.\n");
-		ret = false;
+		ret = KSFT_FAIL;
 		goto out;
 	} else {
 		ksft_print_msg("Create/open %s\n", filename);
@@ -123,7 +139,7 @@ bool test_cachestat(const char *filename, bool write_random, bool create,
 	if (write_random) {
 		if (!write_exactly(fd, filesize)) {
 			ksft_print_msg("Unable to access urandom.\n");
-			ret = false;
+			ret = KSFT_FAIL;
 			goto out1;
 		}
 	}
@@ -134,7 +150,7 @@ bool test_cachestat(const char *filename, bool write_random, bool create,
 
 	if (syscall_ret) {
 		ksft_print_msg("Cachestat returned non-zero.\n");
-		ret = false;
+		ret = KSFT_FAIL;
 		goto out1;
 
 	} else {
@@ -144,15 +160,17 @@ bool test_cachestat(const char *filename, bool write_random, bool create,
 		if (cs.nr_cache + cs.nr_evicted != num_pages) {
 			ksft_print_msg(
 				"Total number of cached and evicted pages is off.\n");
-			ret = false;
+			ret = KSFT_FAIL;
 		}
 	}
 
 	if (test_fsync) {
-		if (fsync(fd)) {
+		if (is_on_tmpfs(fd)) {
+			ret = KSFT_SKIP;
+		} else if (fsync(fd)) {
 			ksft_print_msg("fsync fails.\n");
-			ret = false;
+			ret = KSFT_FAIL;
 		} else {
 			syscall_ret = syscall(cachestat_nr, fd, &cs_range, &cs, 0);
@@ -163,13 +181,13 @@ bool test_cachestat(const char *filename, bool write_random, bool create,
 				print_cachestat(&cs);
 
 				if (cs.nr_dirty) {
-					ret = false;
+					ret = KSFT_FAIL;
 					ksft_print_msg(
 						"Number of dirty should be zero after fsync.\n");
 				}
 			} else {
 				ksft_print_msg("Cachestat (after fsync) returned non-zero.\n");
-				ret = false;
+				ret = KSFT_FAIL;
 				goto out1;
 			}
 		}
@@ -260,7 +278,7 @@ int main(void)
 		const char *dev_filename = dev_files[i];
 
 		if (test_cachestat(dev_filename, false, false, false,
-			4, O_RDONLY, 0400))
+			4, O_RDONLY, 0400) == KSFT_PASS)
 			ksft_test_result_pass("cachestat works with %s\n", dev_filename);
 		else {
 			ksft_test_result_fail("cachestat fails with %s\n", dev_filename);
@@ -269,13 +287,27 @@ int main(void)
 			ret = 1;
 		}
 	}
 
 	if (test_cachestat("tmpfilecachestat", true, true,
-		true, 4, O_CREAT | O_RDWR, 0400 | 0600))
+		false, 4, O_CREAT | O_RDWR, 0600) == KSFT_PASS)
 		ksft_test_result_pass("cachestat works with a normal file\n");
 	else {
 		ksft_test_result_fail("cachestat fails with normal file\n");
 		ret = 1;
 	}
 
+	switch (test_cachestat("tmpfilecachestat", true, true,
+		true, 4, O_CREAT | O_RDWR, 0600)) {
+	case KSFT_FAIL:
+		ksft_test_result_fail("cachestat fsync fails with normal file\n");
+		ret = KSFT_FAIL;
+		break;
+	case KSFT_PASS:
+		ksft_test_result_pass("cachestat fsync works with a normal file\n");
+		break;
+	case KSFT_SKIP:
+		ksft_test_result_skip("tmpfilecachestat is on tmpfs\n");
+		break;
+	}
+
 	if (test_cachestat_shmem())
 		ksft_test_result_pass("cachestat works with a shmem file\n");
 	else {
tmpfs\n"); + break; + } + if (test_cachestat_shmem()) ksft_test_result_pass("cachestat works with a shmem file\n"); else { From e5548f85b4527c4c803b7eae7887c10bf8f90c97 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 22 Aug 2023 22:14:47 -0700 Subject: [PATCH 7/7] shmem: fix smaps BUG sleeping while atomic smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if need_resched(): "BUG: sleeping function called from invalid context". Since shmem_partial_swap_usage() is designed to count across a range, but smaps_pte_hole_lookup() only calls it for a single page slot, just break out of the loop on the last or only page, before checking need_resched(). Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com Fixes: 230100321518 ("mm/smaps: simplify shmem handling of pte holes") Signed-off-by: Hugh Dickins Acked-by: Peter Xu Cc: [5.16+] Signed-off-by: Andrew Morton --- mm/shmem.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index f5af4b943e42..d963c747dabc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -806,14 +806,16 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping, XA_STATE(xas, &mapping->i_pages, start); struct page *page; unsigned long swapped = 0; + unsigned long max = end - 1; rcu_read_lock(); - xas_for_each(&xas, page, end - 1) { + xas_for_each(&xas, page, max) { if (xas_retry(&xas, page)) continue; if (xa_is_value(page)) swapped++; - + if (xas.xa_index == max) + break; if (need_resched()) { xas_pause(&xas); cond_resched_rcu();