Let's move show_mem.c from lib to mm, as it belongs memory subsystem, also
split some memory statistic related functions from page_alloc.c to
show_mem.c, and we cleanup some unneeded include.
There is no functional change.
Link: https://lkml.kernel.org/r/20230516063821.121844-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
set_zone_contiguous() is only used in mm init/hotplug, and
clear_zone_contiguous() only used in hotplug, move them from page_alloc.c
to the more appropriate file.
Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit f2fc4b44ec ("mm: move init_mem_debugging_and_hardening() to
mm/mm_init.c"), the init_on_alloc() and init_on_free() define is better to
move there too.
Link: https://lkml.kernel.org/r/20230516063821.121844-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: page_alloc: misc cleanup and refactor", v2.
This aims to reduce more space in page_alloc.c, also do some cleanup, no
functional changes intended.
This patch (of 13):
Since commit 9420f89db2 ("mm: move most of core MM initialization to
mm/mm_init.c"), mirrored_kernelcore should be moved into mm_init.c, as
most related codes are already there.
Link: https://lkml.kernel.org/r/20230516063821.121844-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20230516063821.121844-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Following the discussion about direct map fragmentaion at LSF/MM [1], it
appears that direct map fragmentation has a negligible effect on kernel
data accesses. Since the only reason that warranted secretmem to be
disabled by default was concern about performance regression caused by the
direct map fragmentation, it makes perfect sense to lift this restriction
and make secretmem enabled.
secretmem obeys RLIMIT_MEMBLOCK and as such it is not expected to cause
large fragmentation of the direct map or meaningfull increase in page
tables allocated during split of the large mappings in the direct map.
The secretmem.enable parameter is retained to allow system administrators
to disable secretmem at boot.
Switch the default setting of secretmem.enable parameter to 1.
Link: https://lwn.net/Articles/931406/ [1]
Link: https://lkml.kernel.org/r/20230515083400.3563974-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 95e7a450b8 ("Revert "mm/compaction: fix set skip in
fast_find_migrateblock"").
Commit 7efc3b7261 ("mm/compaction: fix set skip in
fast_find_migrateblock") was reverted due to bug reports about khugepaged
consuming large amounts of CPU without making progress. The underlying
bug was partially fixed by commit cfccd2e63e ("mm, compaction: finish
pageblocks on complete migration failure") but it only mitigated the
problem and Vlastimil Babka pointing out the same issue could
theoretically happen to kcompactd.
As pageblocks containing pages that fail to migrate should now be forcibly
rescanned to set the skip hint if skip hints are used,
fast_find_migrateblock() should no longer loop on a small subset of
pageblocks for prolonged periods of time. Revert the revert so
fast_find_migrateblock() is effective again.
Using the mmtests config workload-usemem-stress-numa-compact, the number
of unique ranges scanned was analysed for both kcompactd and !kcompactd
activity.
6.4.0-rc1-vanilla
kcompactd
7 range=(0x10d600~0x10d800)
7 range=(0x110c00~0x110e00)
7 range=(0x110e00~0x111000)
7 range=(0x111800~0x111a00)
7 range=(0x111a00~0x111c00)
!kcompactd
1 range=(0x113e00~0x114000)
1 range=(0x114000~0x114020)
1 range=(0x114400~0x114489)
1 range=(0x114489~0x1144aa)
1 range=(0x1144aa~0x114600)
6.4.0-rc1-mm-revertfastmigrate
kcompactd
17 range=(0x104200~0x104400)
17 range=(0x104400~0x104600)
17 range=(0x104600~0x104800)
17 range=(0x104800~0x104a00)
17 range=(0x104a00~0x104c00)
!kcompactd
1793 range=(0x15c200~0x15c400)
5436 range=(0x105800~0x105a00)
19826 range=(0x150a00~0x150c00)
19833 range=(0x150800~0x150a00)
19834 range=(0x11ce00~0x11d000)
6.4.0-rc1-mm-follupfastfind
kcompactd
22 range=(0x107200~0x107400)
23 range=(0x107400~0x107600)
23 range=(0x107600~0x107800)
23 range=(0x107c00~0x107e00)
23 range=(0x107e00~0x108000)
!kcompactd
3 range=(0x890240~0x890400)
5 range=(0x886e00~0x887000)
5 range=(0x88a400~0x88a600)
6 range=(0x88f800~0x88fa00)
9 range=(0x88a400~0x88a420)
Note that the vanilla kernel and the full series had some duplication of
ranges scanned but it was not severe and would be in line with compaction
resets when the skip hints are cleared. Just a revert of commit
7efc3b7261 ("mm/compaction: fix set skip in fast_find_migrateblock")
showed excessive rescans of the same ranges so the series should not
reintroduce bug 1206848.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lkml.kernel.org/r/20230515113344.6869-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
isolate_migratepages_block should mark a pageblock as skip if scanning
started on an aligned pageblock boundary but it only updates the skip flag
if the first migration candidate is also aligned. Tracing during a
compaction stress load (mmtests: workload-usemem-stress-numa-compact) that
many pageblocks are not marked skip causing excessive scanning of blocks
that had been recently checked. Update pageblock skip based on
"valid_page" which is set if scanning started on a pageblock boundary.
[mgorman@techsingularity.net: fix handling of skip bit]
Link: https://lkml.kernel.org/r/20230602111622.swtxhn6lu2qwgrwq@techsingularity.net
Link: https://lkml.kernel.org/r/20230515113344.6869-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fast_find_migrateblock relies on skip hints to avoid rescanning a recently
selected pageblock but compact_zone() only forces the pageblock scan
completion to set the skip hint if in direct compaction. While this
prevents direct compaction repeatedly scanning a subset of blocks due to
fast_find_migrateblock(), it does not prevent proactive compaction, node
compaction and kcompactd encountering the same problem described in commit
cfccd2e63e ("mm, compaction: finish pageblocks on complete migration
failure").
Force the scan completion of a pageblock to set the skip hint if skip
hints are obeyed to prevent fast_find_migrateblock() repeatedly selecting
a subset of pageblocks.
Link: https://lkml.kernel.org/r/20230515113344.6869-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Follow-up "Fix excessive CPU usage during compaction"".
The series "Fix excessive CPU usage during compaction" [1] attempted to
fix a bug [2] but Vlastimil noted that the fix was incomplete [3]. While
the series was merged, fast_find_migrateblock was still disabled. This
series should fix the corner cases and allow 95e7a450b8 ("Revert
"mm/compaction: fix set skip in fast_find_migrateblock"") to be safely
reverted. Details on how many pageblocks are rescanned are in the
changelog of the last patch.
"Raghavendra K T" tested this and reported "decent improvement from perf
perspective as well as compaction related data [4]
[1] https://lore.kernel.org/r/20230125134434.18017-1-mgorman@techsingularity.net
[2] https://bugzilla.suse.com/show_bug.cgi?id=1206848
[3] https://lore.kernel.org/r/a55cf026-a2f9-ef01-9a4c-398693e048ea@suse.cz
[4] https://lkml.kernel.org/r/6d62686f-964d-342c-e085-0eae2555cc54@amd.com
This patch (of 4):
compact_zone() intends to rescan pageblocks if there is a failure to
migrate "within the current order-aligned block". However, the pageblock
scan may already be complete and moved to the next block causing the next
pageblock to be "rescanned". Ensure only the most recent pageblock is
rescanned.
Link: https://lkml.kernel.org/r/20230515113344.6869-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20230515113344.6869-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 60e2793d44 ("mm, oom: do not trigger out_of_memory from the
#PF"), no user sets gfp_mask to 0. Remove the 0 mask check and update the
comments.
Link: https://lkml.kernel.org/r/20230508073538.1168-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is already a memory_failure_init(), don't add a new initcall, move
register_sysctl_init() into it to cleanup a bit.
Link: https://lkml.kernel.org/r/20230508114128.37081-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
HugeTLB pages have a struct page optimizations where struct pages for tail
pages are freed. However, when HugeTLB pages are destroyed, the memory
for struct pages (vmemmap) need to be allocated again.
Currently, __GFP_NORETRY flag is used to allocate the memory for vmemmap,
but given that this flag makes very little effort to actually reclaim
memory the returning of huge pages back to the system can be problem.
Lets use __GFP_RETRY_MAYFAIL instead. This flag is also performs graceful
reclaim without causing ooms, but at least it may perform a few retries,
and will fail only when there is genuinely little amount of unused memory
in the system.
Freeing a 1G page requires 16M of free memory. A machine might need to be
reconfigured from one task to another, and release a large number of 1G
pages back to the system if allocating 16M fails, the release won't work.
Link: https://lkml.kernel.org/r/20230508234059.2529638-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Suggested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gcc-13 warns about function definitions for builtin interfaces that have a
different prototype, e.g.:
In file included from kasan_test.c:31:
kasan.h:574:6: error: conflicting types for built-in function '__asan_register_globals'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
574 | void __asan_register_globals(struct kasan_global *globals, size_t size);
kasan.h:577:6: error: conflicting types for built-in function '__asan_alloca_poison'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
577 | void __asan_alloca_poison(unsigned long addr, size_t size);
kasan.h:580:6: error: conflicting types for built-in function '__asan_load1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
580 | void __asan_load1(unsigned long addr);
kasan.h:581:6: error: conflicting types for built-in function '__asan_store1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
581 | void __asan_store1(unsigned long addr);
kasan.h:643:6: error: conflicting types for built-in function '__hwasan_tag_memory'; expected 'void(void *, unsigned char, long int)' [-Werror=builtin-declaration-mismatch]
643 | void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size);
The two problems are:
- Addresses are passes as 'unsigned long' in the kernel, but gcc-13
expects a 'void *'.
- sizes meant to use a signed ssize_t rather than size_t.
Change all the prototypes to match these. Using 'void *' consistently for
addresses gets rid of a couple of type casts, so push that down to the
leaf functions where possible.
This now passes all randconfig builds on arm, arm64 and x86, but I have
not tested it on the other architectures that support kasan, since they
tend to fail randconfig builds in other ways. This might fail if any of
the 32-bit architectures expect a 'long' instead of 'int' for the size
argument.
The __asan_allocas_unpoison() function prototype is somewhat weird, since
it uses a pointer for 'stack_top' and an size_t for 'stack_bottom'. This
looks like it is meant to be 'addr' and 'size' like the others, but the
implementation clearly treats them as 'top' and 'bottom'.
Link: https://lkml.kernel.org/r/20230509145735.9263-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kasan sw-tags implementation contains one function that is only called
from assembler and has no prototype in a header. This causes a W=1
warning:
mm/kasan/sw_tags.c:171:6: warning: no previous prototype for 'kasan_tag_mismatch' [-Wmissing-prototypes]
171 | void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
Add a prototype in the local header to get a clean build.
Link: https://lkml.kernel.org/r/20230509145735.9263-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After recent changes to the retrying and failure counting in
migrate_pages_batch(), it was found that it's unnecessary to count
retrying and failure for normal, large, and THP folios separately.
Because we don't use retrying and failure number of large folios directly.
So, in this patch, we simplified retrying and failure counting of large
folios via counting retrying and failure of normal and large folios
together. This results in the reduced line number.
Previously, in migrate_pages_batch we need to track whether the source
folio is large/THP before splitting. So is_large is used to cache
folio_test_large() result. Now, we don't need that variable any more
because we don't count retrying and failure of large folios (only counting
that of THP folios). So, in this patch, is_large is removed to simplify
the code.
This is just code cleanup, no functionality changes are expected.
Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The format string in __add_pages and __remove_pages has a typo and prints
e.g., "Misaligned __add_pages start: 0xfc605 end: #fc609" instead of
"Misaligned __add_pages start: 0xfc605 end: 0xfc609" Fix "#%lx" => "%#lx"
Link: https://lkml.kernel.org/r/20230510090758.3537242-1-rick.wertenbroek@gmail.com
Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page_endio() is not used anymore. Remove it.
Link: https://lkml.kernel.org/r/20230510124716.73655-1-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All other instances of gup_huge_pXd() perform the unshare check, so update
the PGD-specific function to do so as well.
While checking pgd_write() might seem unusual, this function already
performs such a check via pgd_access_permitted() so this is in line with
the existing implementation.
David said:
: This change makes unshare handling across all GUP-fast variants
: consistent, which is desirable as GUP-fast is complicated enough
: already even when consistent.
:
: This function was the only one I seemed to have missed (or left out and
: forgot why -- maybe because it's really dead code for now). The COW
: selftest would identify the problem, so far there was no report.
: Either the selftest wasn't run on corresponding architectures with that
: hugetlb size, or that code is still dead code and unused by
: architectures.
:
: the original commit(s) that added unsharing explain why we care about
: these checks:
:
: a7f2266041 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
: 84209e87c6 ("mm/gup: reliable R/O long-term pinning in COW mappings")
Link: https://lkml.kernel.org/r/cb971ac8dd315df97058ea69442ecc007b9a364a.1683381545.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Set the 'empty' bool directly from the result of the function that
determines its value instead of adding additional logic.
Link: https://lkml.kernel.org/r/20230126215125.4069751-13-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that is previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_args points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or direct
table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also include a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.htmlhttps://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. User can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta's server machine, each for five
runs:
Using cachestat
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) is suggested by Jens
Axboe. It allows for 64-bit length argument, even on 32-bit architecture
(which is previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
would test if an evicted page is recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before commit 29ef680ae7 ("memcg, oom: move out_of_memory back to the
charge path"), all memcg oom killers were delayed to page fault path. And
the explicit wakeup is used in this case:
thread A:
...
if (locked) { // complete oom-kill, hold the lock
mem_cgroup_oom_unlock(memcg);
...
}
...
thread B:
...
if (locked && !memcg->oom_kill_disable) {
...
} else {
schedule(); // can't acquire the lock
...
}
...
The reason is that thread A kicks off the OOM-killer, which leads to
wakeups from the uncharges of the exiting task. But thread B is not
guaranteed to see them if it enters the OOM path after the OOM kills but
before thread A releases the lock.
Now only oom_kill_disable case is handled from the #PF path. In that case
it is userspace to trigger the wake up not the #PF path itself. All
potential paths to free some charges are responsible to call
memcg_oom_recover() , so the explicit wakeup is not needed in the
mem_cgroup_oom_synchronize() path which doesn't release any memory itself.
Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mem_cgroup_oom_synchronize() is only used when the memcg oom handling is
handed over to the edge of the #PF path. Since commit 29ef680ae7
("memcg, oom: move out_of_memory back to the charge path") this is the
case only when the kernel memcg oom killer is disabled
(current->memcg_in_oom is only set if memcg->oom_kill_disable). Therefore
a check for oom_kill_disable in mem_cgroup_oom_synchronize() is not
required.
Link: https://lkml.kernel.org/r/20230419030739.115845-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previous patches removed all callers of mem_cgroup_flush_stats_atomic().
Remove the function and simplify the code.
Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, we approximate the root usage by adding the memcg stats for
anon, file, and conditionally swap (for memsw). To read the memcg stats
we need to invoke an rstat flush. rstat flushes can be expensive, they
scale with the number of cpus and cgroups on the system.
mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold()
with irqs disabled, so such an expensive operation with irqs disabled can
cause problems.
Instead, approximate the root usage from global state. This is not 100%
accurate, but the root usage has always been ill-defined anyway.
Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
code path in wb_writeback() outside the lock section. We no longer need
to flush the stats atomically. Flush the stats non-atomically.
Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__pageblock_pfn_to_page() currently performs both pfn_valid check and
pfn_to_online_page(). The former one is redundant because the latter is a
stronger check. Drop pfn_valid().
Link: https://lkml.kernel.org/r/c3868b58c6714c09a43440d7d02c7b4eed6e03f6.1682342634.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1
is written to the file, all zones are compacted such that free memory is
available in contiguous blocks where possible. This can be important for
example in the allocation of huge pages although processes will also
directly compact memory as required
But it was not strictly followed, writing any value would cause all zones
to be compacted.
It has been slightly optimized to comply with the admin-guide. Enforce
the 1 on the unlikely chance that the sysctl handler is ever extended to
do something different.
Commit ef49843841 ("mm/compaction: remove unused variable
sysctl_compact_memory") has also been optimized a bit here, as the
declaration in the external header file has been eliminated, and
sysctl_compact_memory also needs to be verified.
[akpm@linux-foundation.org: add __read_mostly, per Mel]
Link: https://lkml.kernel.org/r/tencent_DFF54DB2A60F3333F97D3F6B5441519B050A@qq.com
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: William Lam <william.lam@bytedance.com>
Cc: Pintu Kumar <pintu@codeaurora.org>
Cc: Fu Wei <wefu@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg: OOM log improvements", v2.
This short patch series brings back some cgroup v1 stats in OOM logs
that were unnecessarily changed before. It also makes memcg OOM logs
less reliant on printk() internals.
This patch (of 2):
Commit c8713d0b23 ("mm: memcontrol: dump memory.stat during cgroup OOM")
made sure we dump all the stats in memory.stat during a cgroup OOM, but it
also introduced a slight behavioral change. The code used to print the
non-hierarchical v1 cgroup stats for the entire cgroup subtree, now it
only prints the v2 cgroup stats for the cgroup under OOM.
For cgroup v1 users, this introduces a few problems:
(a) The non-hierarchical stats of the memcg under OOM are no longer
shown.
(b) A couple of v1-only stats (e.g. pgpgin, pgpgout) are no longer
shown.
(c) We show the list of cgroup v2 stats, even in cgroup v1. This list
of stats is not tracked with v1 in mind. While most of the stats seem
to be working on v1, there may be some stats that are not fully or
correctly tracked.
Although OOM log is not set in stone, we should not change it for no
reason. When upgrading the kernel version to a version including commit
c8713d0b23 ("mm: memcontrol: dump memory.stat during cgroup OOM"), these
behavioral changes are noticed in cgroup v1.
The fix is simple. Commit c8713d0b23 ("mm: memcontrol: dump memory.stat
during cgroup OOM") separated stats formatting from stats display for v2,
to reuse the stats formatting in the OOM logs. Do the same for v1.
Move the v2 specific formatting from memory_stat_format() to
memcg_stat_format(), add memcg1_stat_format() for v1, and make
memory_stat_format() select between them based on cgroup version. Since
memory_stat_show() now works for both v1 & v2, drop memcg_stat_show().
Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, we format all the memcg stats into a buffer in
mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs.
However, this buffer is large in size. Although it is currently working
as intended, ther is a dependency between the memcg stats buffer and the
printk record size limit.
If we add more stats in the future and the buffer becomes larger than the
printk record size limit, or if the prink record size limit is reduced,
the logs may be truncated.
It is safer to use seq_buf_do_printk(), which will automatically break up
the buffer at line breaks and issue small printk() calls.
Refactor the code to move the seq_buf from memory_stat_format() to its
callers, and use seq_buf_do_printk() to print the seq_buf in
mem_cgroup_print_oom_meminfo().
Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will
finish quickly but not for things that will take a long time. Exactly how
long is too long is not well defined, but waits of tens of milliseconds is
likely non-ideal.
When putting a Chromebook under memory pressure (opening over 90 tabs on a
4GB machine) it was fairly easy to see delays waiting for some locks in
the kcompactd code path of > 100 ms. While the laptop wasn't amazingly
usable in this state, it was still limping along and this state isn't
something artificial. Sometimes we simply end up with a lot of memory
pressure.
Putting the same Chromebook under memory pressure while it was running
Android apps (though not stressing them) showed a much worse result (NOTE:
this was on a older kernel but the codepaths here are similar). Android
apps on ChromeOS currently run from a 128K-block, zlib-compressed,
loopback-mounted squashfs disk. If we get a page fault from something
backed by the squashfs filesystem we could end up holding a folio lock
while reading enough from disk to decompress 128K (and then decompressing
it using the somewhat slow zlib algorithms). That reading goes through
the ext4 subsystem (because it's a loopback mount) before eventually
ending up in the block subsystem. This extra jaunt adds extra overhead.
Without much work I could see cases where we ended up blocked on a folio
lock for over a second. With more extreme memory pressure I could see up
to 25 seconds.
We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the
two locks that were seen to be slow [1] and that generated much
discussion. After discussion, it was decided that we should avoid waiting
for the two locks during MIGRATE_SYNC_LIGHT if they were being held for
IO. We'll continue with the unbounded wait for the more full SYNC modes.
With this change, I couldn't see any slow waits on these locks with my
previous testcases.
NOTE: The reason I stated digging into this originally isn't because some
benchmark had gone awry, but because we've received in-the-field crash
reports where we have a hung task waiting on the page lock (which is the
equivalent code path on old kernels). While the root cause of those
crashes is likely unrelated and won't be fixed by this patch, analyzing
those crash reports did point out these very long waits seemed like
something good to fix. With this patch we should no longer hang waiting
on these locks, but presumably the system will still be in a bad shape and
hang somewhere else.
[1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid
Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A memcg pointer in the percpu stock can be accessed by drain_all_stock()
from another cpu in a lockless way. In theory it might lead to an issue,
similar to the one which has been discovered with stock->cached_objcg,
where the pointer was zeroed between the check for being NULL and
dereferencing. In this case the issue is unlikely a real problem, but to
make it bulletproof and similar to stock->cached_objcg, let's annotate all
accesses to stock->cached with READ_ONCE()/WTRITE_ONCE().
Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The zswap writeback mechanism can cause a race condition resulting in
memory corruption, where a swapped out page gets swapped in with data that
was written to a different page.
The race unfolds like this:
1. a page with data A and swap offset X is stored in zswap
2. page A is removed off the LRU by zpool driver for writeback in
zswap-shrink work, data for A is mapped by zpool driver
3. user space program faults and invalidates page entry A, offset X is
considered free
4. kswapd stores page B at offset X in zswap (zswap could also be
full, if so, page B would then be IOed to X, then skip step 5.)
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
local reference held by zswap-shrink work
6. zswap-shrink work writes back A at X, and frees zswap entry A
7. swapin of slot X brings A in memory instead of B
The fix:
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
zswap-shrink work just checks that the local zswap_entry reference is
still the same as the one in the tree. If it's not the same it means that
it's either been invalidated or replaced, in both cases the writeback is
aborted because the local entry contains stale data.
Reproducer:
I originally found this by running `stress` overnight to validate my work
on the zswap writeback mechanism, it manifested after hours on my test
machine. The key to make it happen is having zswap writebacks, so
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
the trick.
In order to reproduce this faster on a vm, I setup a system with ~100M of
available memory and a 500M swap file, then running `stress --vm 1
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
of minutes. One can speed things up even more by swinging
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
and 1; this makes it reproduce in tens of seconds. It's crucial to set
`--vm-stride` to something other than 4096 otherwise `stress` won't
realize that memory has been corrupted because all pages would have the
same data.
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 1ba3cbf3ec ("mm: kfence: improve the performance of
__kfence_alloc() and __kfence_free()"), kfence reports failures in random
places at boot on big endian machines.
The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the address
of each byte in its value, so it needs to be byte swapped on big endian
machines.
The compiler is smart enough to do the le64_to_cpu() at compile time, so
there is no runtime overhead.
Link: https://lkml.kernel.org/r/20230505035127.195387-1-mpe@ellerman.id.au
Fixes: 1ba3cbf3ec ("mm: kfence: improve the performance of __kfence_alloc() and __kfence_free()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Under memory pressure, we sometimes observe the following crash:
[ 5694.832838] ------------[ cut here ]------------
[ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
[ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
[ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
[ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
[ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
[ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
[ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
[ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
[ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
[ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
[ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
[ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
[ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
[ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
[ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5695.207197] PKRU: 55555554
[ 5695.212602] Call Trace:
[ 5695.217486] <TASK>
[ 5695.221674] zs_map_object+0x91/0x270
[ 5695.229000] zswap_frontswap_store+0x33d/0x870
[ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0
[ 5695.245899] __frontswap_store+0x51/0xb0
[ 5695.253742] swap_writepage+0x3c/0x60
[ 5695.261063] shrink_page_list+0x738/0x1230
[ 5695.269255] shrink_lruvec+0x5ec/0xcd0
[ 5695.276749] ? shrink_slab+0x187/0x5f0
[ 5695.284240] ? mem_cgroup_iter+0x6e/0x120
[ 5695.292255] shrink_node+0x293/0x7b0
[ 5695.299402] do_try_to_free_pages+0xea/0x550
[ 5695.307940] try_to_free_pages+0x19a/0x490
[ 5695.316126] __folio_alloc+0x19ff/0x3e40
[ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.332681] ? walk_component+0x2a8/0xb50
[ 5695.340697] ? generic_permission+0xda/0x2a0
[ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.357940] ? walk_component+0x2a8/0xb50
[ 5695.365955] vma_alloc_folio+0x10e/0x570
[ 5695.373796] ? walk_component+0x52/0xb50
[ 5695.381634] wp_page_copy+0x38c/0xc10
[ 5695.388953] ? filename_lookup+0x378/0xbc0
[ 5695.397140] handle_mm_fault+0x87f/0x1800
[ 5695.405157] do_user_addr_fault+0x1bd/0x570
[ 5695.413520] exc_page_fault+0x5d/0x110
[ 5695.421017] asm_exc_page_fault+0x22/0x30
After some investigation, I have found the following issue: unlike other
zswap backends, zsmalloc performs the LRU list update at the object
mapping time, rather than when the slot for the object is allocated.
This deviation was discussed and agreed upon during the review process
of the zsmalloc writeback patch series:
https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
Unfortunately, this introduces a subtle bug that occurs when there is a
concurrent store and reclaim, which interleave as follows:
zswap_frontswap_store() shrink_worker()
zs_malloc() zs_zpool_shrink()
spin_lock(&pool->lock) zs_reclaim_page()
zspage = find_get_zspage()
spin_unlock(&pool->lock)
spin_lock(&pool->lock)
zspage = list_first_entry(&pool->lru)
list_del(&zspage->lru)
zspage->lru.next = LIST_POISON1
zspage->lru.prev = LIST_POISON2
spin_unlock(&pool->lock)
zs_map_object()
spin_lock(&pool->lock)
if (!list_empty(&zspage->lru))
list_del(&zspage->lru)
CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */
With the current upstream code, this issue rarely happens. zswap only
triggers writeback when the pool is already full, at which point all
further store attempts are short-circuited. This creates an implicit
pseudo-serialization between reclaim and store. I am working on a new
zswap shrinking mechanism, which makes interleaving reclaim and store
more likely, exposing this bug.
zbud and z3fold do not have this problem, because they perform the LRU
list update in the alloc function, while still holding the pool's lock.
This patch fixes the aforementioned bug by moving the LRU update back to
zs_malloc(), analogous to zbud and z3fold.
Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com
Fixes: 64f768c6b3 ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order")
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When something registers and unregisters many shrinkers, such as:
for x in $(seq 10000); do unshare -Ui true; done
Sometimes the following error is printed to the kernel log:
debugfs: Directory '...' with parent 'shrinker' already present!
This occurs since commit badc28d492 ("mm: shrinkers: fix deadlock in
shrinker debugfs") / v6.2: Since the call to `debugfs_remove_recursive`
was moved outside the `shrinker_rwsem`/`shrinker_mutex` lock, but the call
to `ida_free` stayed inside, a newly registered shrinker can be
re-assigned that ID and attempt to create the debugfs directory before the
directory from the previous shrinker has been removed.
The locking changes in commit f95bdb700b ("mm: vmscan: make global slab
shrink lockless") made the race condition more likely, though it existed
before then.
Commit badc28d492 ("mm: shrinkers: fix deadlock in shrinker debugfs")
could be reverted since the issue is addressed should no longer occur
since the count and scan operations are lockless since commit 20cd1892fc
("mm: shrinkers: make count and scan in shrinker debugfs lockless").
However, since this is a contended lock, prefer instead moving `ida_free`
outside the lock to avoid the race.
Link: https://lkml.kernel.org/r/20230503013232.299211-1-joanbrugueram@gmail.com
Fixes: badc28d492 ("mm: shrinkers: fix deadlock in shrinker debugfs")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2d55c16c0c ("dmapool: create/destroy cleanup").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaTPAAKCRDdBJ7gKXxA
ji1zAP4s/mWPQOJkoTR8cn9wRdGUB4FKgDiJwwYeOdcHRuylvwEAlgOD8qlizXYu
z4tGx6J8p3X+IdEApqH6Up56wjW+Wgk=
=X19M
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-05-06-10-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull dmapool updates - again - from Andrew Morton:
"Reinstate the dmapool changes which were accidentally removed by a
mishap on the last commit in the previous attempt at the series"
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup").
[ The whole old series: def8574308ed..2d55c16c0c54 results in an empty
diff because that last commit ended up being just a revert of all that
came everything before it. - Linus ]
* tag 'mm-stable-2023-05-06-10-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
dmapool: link blocks across pages
dmapool: don't memset on free twice
dmapool: simplify freeing
dmapool: consolidate page initialization
dmapool: rearrange page alloc failure handling
dmapool: move debug code to own functions
dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
dmapool: cleanup integer types
dmapool: use sysfs_emit() instead of scnprintf()
dmapool: remove checks for dev == NULL
changes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaSnAAKCRDdBJ7gKXxA
jskTAPwIkL6xVMIEdtWfQNIy751o0onzfOQAIWgM3dSL83cMJQEA5OXxXt4LjrJw
4q0zf4GLYXjMjhnfYr8zXF39y05iyQA=
=nIhV
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Five hotfixes.
Three are cc:stable, two pertain to merge window changes"
* tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
afs: fix the afs_dir_get_folio return value
nilfs2: do not write dirty data after degenerating to read-only
mm: do not reclaim private data from pinned page
nilfs2: fix infinite loop in nilfs_mdt_get_block()
mm/mmap/vma_merge: always check invariants
The allocated dmapool pages are never freed for the lifetime of the pool.
There is no need for the two level list+stack lookup for finding a free
block since nothing is ever removed from the list. Just use a simple
stack, reducing time complexity to constant.
The implementation inserts the stack linking elements and the dma handle
of the block within itself when freed. This means the smallest possible
dmapool block is increased to at most 16 bytes to accommodate these
fields, but there are no exisiting users requesting a dma pool smaller
than that anyway.
Removing the list has a significant change in performance. Using the
kernel's micro-benchmarking self test:
Before:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:57282
dmapool test: size:64 blocks:8192 time:172562
dmapool test: size:256 blocks:8192 time:789247
dmapool test: size:1024 blocks:2048 time:371823
dmapool test: size:4096 blocks:1024 time:362237
After:
# modprobe dmapool_test
dmapool test: size:16 blocks:8192 time:24997
dmapool test: size:64 blocks:8192 time:26584
dmapool test: size:256 blocks:8192 time:33542
dmapool test: size:1024 blocks:2048 time:9022
dmapool test: size:4096 blocks:1024 time:6045
The module test allocates quite a few blocks that may not accurately
represent how these pools are used in real life. For a more marco level
benchmark, running fio high-depth + high-batched on nvme, this patch shows
submission and completion latency reduced by ~100usec each, 1% IOPs
improvement, and perf record's time spent in dma_pool_alloc/free were
reduced by half.
[kbusch@kernel.org: push new blocks in ascending order]
Link: https://lkml.kernel.org/r/20230221165400.1595247-1-kbusch@meta.com
Link: https://lkml.kernel.org/r/20230126215125.4069751-12-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If debug is enabled, dmapool will poison the range, so no need to clear it
to 0 immediately before writing over it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-11-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The actions for busy and not busy are mostly the same, so combine these
and remove the unnecessary function. Also, the pool is about to be freed
so there's no need to poison the page data since we only check for poison
on alloc, which can't be done on a freed pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-10-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Various fields of the dma pool are set in different places. Move it all
to one function.
Link: https://lkml.kernel.org/r/20230126215125.4069751-9-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Handle the error in a condition so the good path can be in the normal
flow.
Link: https://lkml.kernel.org/r/20230126215125.4069751-8-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Clean up the normal path by moving the debug code outside it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-7-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Avoid double-memset of the same allocated memory in dma_pool_alloc() when
both DMAPOOL_DEBUG is enabled and init_on_alloc=1.
Link: https://lkml.kernel.org/r/20230126215125.4069751-6-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places. Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-5-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use sysfs_emit instead of scnprintf, snprintf or sprintf.
Link: https://lkml.kernel.org/r/20230126215125.4069751-4-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device. But nobody
ended up using dma pools without a device, and trying to do so will result
in an oops. So remove the checks for pool->dev == NULL since they are
unneeded bloat.
[kbusch@kernel.org: add check for null dev on create]
Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com
Fixes: 2d55c16c0c ("dmapool: create/destroy cleanup")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>