Commit Graph

1266201 Commits

Author SHA1 Message Date
SeongJae Park
5965d5615b selftests/damon: classify tests for functionalities and regressions
DAMON selftests can be classified into two categories: functionalities and
regressions.  Functionality tests are for checking if the function is
working as specified, while the regression tests are basically reproducers
of previously reported and fixed bugs.  The tests of the categories are
mixed in the selftests Makefile.  Separate those for easier understanding
of the types of tests.

Link: https://lkml.kernel.org/r/20240503180318.72798-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:34 -07:00
SeongJae Park
06cf8ce12c selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
_damon_sysfs.py is using '==' or '!=' for 'None'.  Since 'None' is a
singleton, using 'is' or 'is not' is more efficient.  Use the more
efficient one.

Link: https://lkml.kernel.org/r/20240503180318.72798-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:34 -07:00
SeongJae Park
e799fda692 selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
_damon_sysfs.py assumes sysfs is mounted at /sys.  In some systems, that
might not be true.  Find the mount point from /proc/mounts file content.

Link: https://lkml.kernel.org/r/20240503180318.72798-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:33 -07:00
SeongJae Park
732b8815c0 selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
DAMON context staging method in _damon_sysfs.py is not checking the
returned error from nr_schemes file read.  Check it.

Link: https://lkml.kernel.org/r/20240503180318.72798-3-sj@kernel.org
Fixes: f5f0e5a2be ("selftests/damon/_damon_sysfs: implement kdamonds start function")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:33 -07:00
SeongJae Park
b96a303b68 mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
Patch series "mm/damon: misc fixes and improvements".

Add miscelleneous and non-urgent fixes and improvements for DAMON code,
selftests, and documents.


This patch (of 10):

damos_quota_init_priv() function should initialize all private fields of
struct damos_quota.  However, it is not initializing ->esz_bp field.  This
could result in use of uninitialized variable from
damon_feed_loop_next_input() function.  There is no such issue at the
moment because every caller of the function is passing damos_quota object
that already having the field zero value.  But we cannot guarantee the
future, and the function is not doing what it is promising.  A bug is a
bug.  This fix is for preventing possible future issues.

Link: https://lkml.kernel.org/r/20240503180318.72798-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240503180318.72798-2-sj@kernel.org
Fixes: 9294a037c0 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:33 -07:00
SeongJae Park
f1c07c0a16 selftests/damon: add a test for DAMOS quota goal
Add a selftest for DAMOS quota goal.  It tests the feature by setting a
user_input metric based goal, change the current feedback, and check if
the effective quota size is increased and decreased as expected.

Link: https://lkml.kernel.org/r/20240502172718.74166-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:33 -07:00
SeongJae Park
d14d6b0e7d selftests/damon/_damon_sysfs: support quota goals
Patch series "selftests/damon: add DAMOS quota goal test".

Extend DAMON selftest-purpose sysfs wrapper to support DAMOS quota goal,
and implement a simple selftest for the feature using it.


This patch (of 2):

The DAMON sysfs test purpose wrapper, _damon_sysfs.py, is not supporting
quota goals.  Implement the support for testing the feature.  The test
will be implemented and added by the following commit.

Link: https://lkml.kernel.org/r/20240502172718.74166-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240502172718.74166-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11 15:41:32 -07:00
Matthew Wilcox (Oracle)
447bac3d29 thp: remove HPAGE_PMD_ORDER minimum assertion
We now handle order-1 folios correctly, so we don't need this assertion
any more.

Link: https://lkml.kernel.org/r/20240429190114.3126789-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:02 -07:00
SeongJae Park
c961bddb7d mm/vmscan: remove ignore_references argument of reclaim_folio_list()
All reclaim_folio_list() callers are passing 'true' for
'ignore_references' parameter.  In other words, the parameter is not
really being used.  Simplify the code by removing the parameter.

Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:02 -07:00
SeongJae Park
14f5be2a2d mm/vmscan: remove ignore_references argument of reclaim_pages()
All reclaim_pages() callers are setting 'ignore_references' parameter
'true'.  In other words, the parameter is not really being used.  Remove
the argument to make it simple.

Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:02 -07:00
SeongJae Park
ebd3f70c63 mm/damon/paddr: do page level access check for pageout DAMOS action on its own
'pageout' DAMOS action implementation of 'paddr' DAMON operations set asks
reclaim_pages() to do page level access check if the user is not asking
DAMOS to do that on its own.  Simplify the logic by making the check
always be done by 'paddr'.

Link: https://lkml.kernel.org/r/20240429224451.67081-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
SeongJae Park
69a5f99917 mm/damon/paddr: avoid unnecessary page level access check for pageout DAMOS action
Patch series "mm/damon/paddr: simplify page level access re-check for
pageout.

The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages()
to do page level access check again.  But the user can ask 'paddr' to do
the page level access check on its own, using DAMOS filter of 'young page'
type.  Meanwhile, 'paddr' is the only user of reclaim_pages() that asks
the page level access check.

Make 'paddr' does the page level access check on its own always, and
simplify reclaim_pages() by removing the page level access check request
handling logic.  As a result of the change for reclaim_pages(),
reclaim_folio_list(), which is called by reclaim_pages(), also no more
need to do the page level access check.  Simplify the function, too.


This patch (of 4):

'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to
do the page level access check.  User could ask DAMOS to do the page level
access check on its own using 'young page' type DAMOS filter.  In the
case, pageout DAMOS action unnecessarily asks reclaim_pages() to do the
check again.  Ask the page level access check only if the scheme is not
having the filter.

Link: https://lkml.kernel.org/r/20240429224451.67081-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240429224451.67081-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
Peter Xu
01d89b93e1 mm/gup: fix hugepd handling in hugetlb rework
Commit a12083d721 added hugepd handling for gup-slow, reusing gup-fast
functions.  follow_hugepd() correctly took the vma pointer in, however
didn't pass it over into the lower functions, which was overlooked.

The issue is gup_fast_hugepte() uses the vma pointer to make the correct
decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM. 
Now without vma ponter it will constantly return "true" (needs an unshare)
for a page cache, even though in the SHARED case it will be wrong to
unshare.

The other problem is, even if an unshare is needed, it now returns 0
rather than -EMLINK, which will not trigger a follow up FAULT_FLAG_UNSHARE
fault.  That will need to be fixed too when the unshare is wanted.

gup_longterm test didn't expose this issue in the past because it didn't
yet test R/O unshare in this case, another separate patch will enable that
in future tests.

Fix it by passing vma correctly to the bottom, rename gup_fast_hugepte()
back to gup_hugepte() as it is shared between the fast/slow paths, and
also allow -EMLINK to be returned properly by gup_hugepte() even though
gup-fast will take it the same as zero.

Link: https://lkml.kernel.org/r/20240430131303.264331-1-peterx@redhat.com
Fixes: a12083d721 ("mm/gup: handle hugepd for follow_page()")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
David Hildenbrand
67f4c91a44 selftests: mm: gup_longterm: test unsharing logic when R/O pinning
In our FOLL_LONGTERM tests, we prefault the page tables for the GUP-fast
test cases to be able to find a PTE and exercise the "longterm pinning
allowed" logic on the GUP-fast path where possible.

For now, we always prefault the page tables writable, resulting in PTEs
that are writable.

Let's cover more cases to also test if our unsharing logic works as
expected (and is able to make progress when there is nothing to unshare)
by mprotect'ing the range R/O when R/O-pinning, so we don't get PTEs that
are writable.

This change would have found an issue introduced by commit a12083d721
("mm/gup: handle hugepd for follow_page()"), whereby R/O pinning was not
able to make progress in all cases, because unsharing logic was not
provided with the VMA to decide at some point that long-term R/O pinning a
!anon page is fine.

Link: https://lkml.kernel.org/r/20240430131508.86924-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
Frank van der Linden
cc48be374b mm/hugetlb: align cma on allocation order, not demotion order
Align the CMA area for hugetlb gigantic pages to their size, not the size
that they can be demoted to.  Otherwise there might be misaligned sections
at the start and end of the CMA area that will never be used for hugetlb
page allocations.

Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com
Fixes: a01f43901c ("hugetlb: be sure to free demoted CMA pages to CMA")
Signed-off-by: Frank van der Linden <fvdl@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
Vishal Verma
2acf04532d dax/bus.c: use the right locking mode (read vs write) in size_show
In size_show(), the dax_dev_rwsem only needs a read lock, but was
acquiring a write lock.  Change it to down_read_interruptible() so it
doesn't unnecessarily hold a write lock.

Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-4-e3dcd755774c@intel.com
Fixes: c05ae9d85b ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
Vishal Verma
e39dbcfba7 dax/bus.c: don't use down_write_killable for non-user processes
Change an instance of down_write_killable() to a simple down_write() where
there is no user process that might want to interrupt the operation.

Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-3-e3dcd755774c@intel.com
Fixes: c05ae9d85b ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:01 -07:00
Vishal Verma
6f6544f27e dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Commit c05ae9d85b ("dax/bus.c: replace driver-core lock usage by a local
rwsem") aimed to undo device_lock() abuses for protecting changes to
dax-driver internal data-structures like the dax_region resource tree to
device-dax-instance range structures.  However, the device_lock() was
legitimately enforcing that devices to be deleted were not current
actively attached to any driver nor assigned any capacity from the region.

As a result of the device_lock restoration in delete_store(), the
conditional locking in unregister_dev_dax() and unregister_dax_mapping()
can be removed.

Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-2-e3dcd755774c@intel.com
Fixes: c05ae9d85b ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Vishal Verma
c14c647bbe dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
Patch series "dax/bus.c: Fixups for dax-bus locking", v3.

Commit Fixes: c05ae9d85b ("dax/bus.c: replace driver-core lock usage by
a local rwsem") introduced a few problems that this series aims to fix. 
Add back device_lock() where it was correctly used (during device
manipulation operations), remove conditional locking in
unregister_dax_dev() and unregister_dax_mapping(), use non-interruptible
versions of rwsem locks when not called from a user process, and fix up a
write vs.  read usage of an rwsem.


This patch (of 4):

In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
referenced patch should be replaced with lockdep_assert_held, or
lockdep_held_assert_write().  Replace these as appropriate.

Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-0-e3dcd755774c@intel.com
Link: https://lore.kernel.org/r/65f0b5ef41817_aa222941a@dwillia2-mobl3.amr.corp.intel.com.notmuch [1]
Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-1-e3dcd755774c@intel.com
Fixes: c05ae9d85b ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Breno Leitao
1872b3bcd5 mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->nr_pages
A memcg pointer in the per-cpu stock can be accessed by
drain_all_stock() and consume_stock() in parallel, causing a potential
race, which is believed to e harmless.

KCSAN shows this data-race clearly in the splat below:

	BUG: KCSAN: data-race in drain_all_stock.part.0 / try_charge_memcg

	write to 0xffff88903f8b0788 of 4 bytes by task 35901 on cpu 2:
	try_charge_memcg (mm/memcontrol.c:2323 mm/memcontrol.c:2746)
	__mem_cgroup_charge (mm/memcontrol.c:7287 mm/memcontrol.c:7301)
	do_anonymous_page (mm/memory.c:1054 mm/memory.c:4375 mm/memory.c:4433)
	__handle_mm_fault (mm/memory.c:3878 mm/memory.c:5300 mm/memory.c:5441)
	handle_mm_fault (mm/memory.c:5606)
	do_user_addr_fault (arch/x86/mm/fault.c:1363)
	exc_page_fault (./arch/x86/include/asm/irqflags.h:37
		        ./arch/x86/include/asm/irqflags.h:72
			arch/x86/mm/fault.c:1513
			arch/x86/mm/fault.c:1563)
	asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)

	read to 0xffff88903f8b0788 of 4 bytes by task 287 on cpu 27:
	drain_all_stock.part.0 (mm/memcontrol.c:2433)
	mem_cgroup_css_offline (mm/memcontrol.c:5398 mm/memcontrol.c:5687)
	css_killed_work_fn (kernel/cgroup/cgroup.c:5521 kernel/cgroup/cgroup.c:5794)
	process_one_work (kernel/workqueue.c:3254)
	worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
	kthread (kernel/kthread.c:388)
	ret_from_fork (arch/x86/kernel/process.c:147)
	ret_from_fork_asm (arch/x86/entry/entry_64.S:257)

	value changed: 0x00000014 -> 0x00000013

This happens because drain_all_stock() is reading stock->nr_pages, while
consume_stock() might be updating the same address, causing a potential
data-race.

Make the shared addresses bulletproof regarding to reads and writes,
similarly to what stock->cached_objcg and stock->cached.
Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE().

Link: https://lkml.kernel.org/r/20240501095420.679208-1-leitao@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Ryan Roberts
3a5a8d343e mm: fix race between __split_huge_pmd_locked() and GUP-fast
__split_huge_pmd_locked() can be called for a present THP, devmap or
(non-present) migration entry.  It calls pmdp_invalidate() unconditionally
on the pmdp and only determines if it is present or not based on the
returned old pmd.  This is a problem for the migration entry case because
pmd_mkinvalid(), called by pmdp_invalidate() must only be called for a
present pmd.

On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future
call to pmd_present() will return true.  And therefore any lockless
pgtable walker could see the migration entry pmd in this state and start
interpretting the fields as if it were present, leading to BadThings (TM).
GUP-fast appears to be one such lockless pgtable walker.

x86 does not suffer the above problem, but instead pmd_mkinvalid() will
corrupt the offset field of the swap entry within the swap pte.  See link
below for discussion of that problem.

Fix all of this by only calling pmdp_invalidate() for a present pmd.  And
for good measure let's add a warning to all implementations of
pmdp_invalidate[_ad]().  I've manually reviewed all other
pmdp_invalidate[_ad]() call sites and believe all others to be conformant.

This is a theoretical bug found during code review.  I don't have any test
case to trigger it in practice.

Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Ryan Roberts
b0d7e15a9f mm/debug_vm_pgtable: test pmd_leaf() behavior with pmd_mkinvalid()
An invalidated pmd should still cause pmd_leaf() to return true.  Let's
test for that to ensure all arches remain consistent.

Link: https://lkml.kernel.org/r/20240501144439.1389048-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Shakeel Butt
a94032b35e memcg: use proper type for mod_memcg_state
The memcg stats update functions can take arbitrary integer but the only
input which make sense is enum memcg_stat_item and we don't want these
functions to be called with arbitrary integer, so replace the parameter
type with enum memcg_stat_item and compiler will be able to warn if memcg
stat update functions are called with incorrect index value.

Link: https://lkml.kernel.org/r/20240501172617.678560-9-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Shakeel Butt
acb5fe2f1a memcg: warn for unexpected events and stats
To reduce memory usage by the memcg events and stats, the kernel uses
indirection table and only allocate stats and events which are being used
by the memcg code.  To make this more robust, let's add warnings where
unexpected stats and events indexes are used.

Link: https://lkml.kernel.org/r/20240501172617.678560-8-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:37:00 -07:00
Shakeel Butt
4715c6a753 mm: cleanup WORKINGSET_NODES in workingset
WORKINGSET_NODES is not exposed in the memcg stats and thus there is no
need to use the memcg specific stat update functions for it.  In future if
we decide to expose WORKINGSET_NODES in the memcg stats, we can revert
this patch.

Link: https://lkml.kernel.org/r/20240501172617.678560-7-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Shakeel Butt
0667c7870a memcg: cleanup __mod_memcg_lruvec_state
There are no memcg specific stats for NR_SHMEM_PMDMAPPED and
NR_FILE_PMDMAPPED.  Let's remove them.

Link: https://lkml.kernel.org/r/20240501172617.678560-6-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Shakeel Butt
ff48c71c26 memcg: reduce memory for the lruvec and memcg stats
At the moment, the amount of memory allocated for stats related structs in
the mem_cgroup corresponds to the size of enum node_stat_item.  However
not all fields in enum node_stat_item have corresponding memcg stats.  So,
let's use indirection mechanism similar to the one used for memcg vmstats
management.

For a given x86_64 config, the size of stats with and without patch is:

structs size in bytes         w/o     with

struct lruvec_stats           1128     648
struct lruvec_stats_percpu     752     432
struct memcg_vmstats          1832    1352
struct memcg_vmstats_percpu   1280     960

The memory savings are further compounded by the fact that these structs
are allocated for each cpu and for each node.  To be precise, for each
memcg the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) +
	       (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated.

Link: https://lkml.kernel.org/r/20240501172617.678560-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Roman Gushchin
aab6103b97 mm: memcg: account memory used for memcg vmstats and lruvec stats
The percpu memory used by memcg's memory statistics is already accounted. 
For consistency, let's enable accounting for vmstats and lruvec stats as
well.

Link: https://lkml.kernel.org/r/20240501172617.678560-4-shakeel.butt@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Shakeel Butt
70a64b7919 memcg: dynamically allocate lruvec_stats
To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we
need to dynamically allocate lruvec_stats in the mem_cgroup_per_node
structure.  Also move the definition of lruvec_stats_percpu and
lruvec_stats and related functions to the memcontrol.c to facilitate later
patches.  No functional changes in the patch.

Link: https://lkml.kernel.org/r/20240501172617.678560-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Shakeel Butt
59142d87ab memcg: reduce memory size of mem_cgroup_events_index
Patch series "memcg: reduce memory consumption by memcg stats", v4.

Most of the memory overhead of a memcg object is due to memcg stats
maintained by the kernel.  Since stats updates happen in performance
critical codepaths, the stats are maintained per-cpu and numa specific
stats are maintained per-node * per-cpu.  This drastically increase the
overhead on large machines i.e.  large of CPUs and multiple numa nodes. 
This patch series tries to reduce the overhead by at least not allocating
the memory for stats which are not memcg specific.  


This patch (of 8):

mem_cgroup_events_index is a translation table to get the right index of
the memcg relevant entry for the general vm_event_item.  At the moment, it
is defined as integer array.  However on a typical system the max entry of
vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to use int as
storage type of the array.  For now just use int8_t as type and add a
BUILD_BUG_ON().

Another benefit of this change is that the translation table fits in 2
cachelines while previously it would require 8 cachelines (assuming 64
bytes cacheline).

Link: https://lkml.kernel.org/r/20240501172617.678560-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20240501172617.678560-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
Saurav Shah
a4c43b8a09 selftests/memfd: fix spelling mistakes
Fix spelling mistakes in the comments.

Link: https://lkml.kernel.org/r/20240501231317.24648-1-sauravshah.31@gmail.com
Signed-off-by: Saurav Shah <sauravshah.31@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:59 -07:00
David Hildenbrand
b8a2528835 mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions
Let's document why hugetlb still uses folio_mapcount() and is prone to
leaking memory between processes, for example using vmsplice() that still
uses FOLL_GET.

More details can be found in [1], especially around how hugetlb pages
cannot really be overcommitted, and why we don't particularly care about
these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.

[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/

Link: https://lkml.kernel.org/r/20240502085259.103784-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:58 -07:00
David Hildenbrand
4bf6a4ebc5 selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL
Patch series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".

The failing hugetlb vmsplice() COW tests keep confusing people, and having
tests that have been failing for years and likely will keep failing for
years to come because nobody cares enough is rather suboptimal.  Let's
mark them as XFAIL and document why fixing them is not that easy as it
would appear at first sight.

More details can be found in [1], especially around how hugetlb pages
cannot really be overcommitted, and why we don't particularly care about
these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.

[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/


This patch (of 2):

The vmsplice() hugetlb tests have been failing right from the start, and
we documented that in the introducing commit 7dad331be7 ("selftests/vm:
anon_cow: hugetlb tests"):

	Note that some tests cases still fail. This will, for example, be
	fixed once vmsplice properly uses FOLL_PIN instead of FOLL_GET for
	pinning. With 2 MiB and 1 GiB hugetlb on x86_64, the expected
	failures are:

Until vmsplice() is changed, these tests will likely keep failing: hugetlb
COW reuse logic is harder to change, because using the same COW reuse
logic as we use for !hugetlb could harm other (sane) users when running
out of free hugetlb pages.

More details can be found in [1], especially around how hugetlb pages
cannot really be overcommitted, and why we don't particularly care about
these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.

These (expected) failures keep confusing people, so flag them accordingly.

Before:
	$ ./cow
	[...]
	Bail out! 8 out of 778 tests failed
	# Totals: pass:769 fail:8 xfail:0 xpass:0 skip:1 error:0
	$ echo $?
	1

After:
	$ ./cow
	[...]
	# Totals: pass:769 fail:0 xfail:8 xpass:0 skip:1 error:0
	$ echo $?
	0

[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/

Link: https://lkml.kernel.org/r/20240502085259.103784-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240502085259.103784-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 10:36:58 -07:00
linke li
5ee9562c58 mm/swapfile: mark racy access on si->highest_bit
In scan_swap_map_slots(), si->highest_bit can by changed by
swap_range_alloc() concurrently.  All reads on si->highest_bit except one
is either protected by lock or read using READ_ONCE.  So mark the one racy
read on si->highest_bit as benign using READ_ONCE.

This patch is aimed at reducing the number of benign races reported by
KCSAN in order to focus future debugging effort on harmful races.

Link: https://lkml.kernel.org/r/tencent_912BC3E8B0291DA4A0028AB424076375DA07@qq.com
Signed-off-by: linke li <lilinke99@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:57 -07:00
Hao Ge
637a900b08 mm/rmap: change the type of we_locked from int to bool
Change the type of we_locked from int to bool because folio_trylock return
bool

Link: https://lkml.kernel.org/r/20240428012049.8182-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:56 -07:00
Hao Ge
620875560b mm/pagemap: make trylock_page return bool
Make trylock_page return bool to align the return values of folio_trylock
function and it also corresponds to its comment.

Link: https://lkml.kernel.org/r/20240428014711.11169-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:56 -07:00
Zi Yan
7491f3f348 mm/rmap: do not add fully unmapped large folio to deferred split list
In __folio_remove_rmap(), a large folio is added to deferred split list if
any page in a folio loses its final mapping.  But it is possible that the
folio is fully unmapped and adding it to deferred split list is
unnecessary.

For PMD-mapped THPs, that was not really an issue, because removing the
last PMD mapping in the absence of PTE mappings would not have added the
folio to the deferred split queue.

However, for PTE-mapped THPs, which are now more prominent due to mTHP,
they are always added to the deferred split queue.  One side effect is
that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be
unintentionally increased, making it look like there are many partially
mapped folios -- although the whole folio is fully unmapped stepwise.

Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs
where possible starting from commit b06dc281aa ("mm/rmap: introduce
folio_remove_rmap_[pte|ptes|pmd]()").  When it happens, a whole PTE-mapped
folio is unmapped in one go and can avoid being added to deferred split
list, reducing the THP_DEFERRED_SPLIT_PAGE noise.  But there will still be
noise when we cannot batch-unmap a complete PTE-mapped folio in one go --
or where this type of batching is not implemented yet, e.g., migration.

To avoid the unnecessary addition, folio->_nr_pages_mapped is checked to
tell if the whole folio is unmapped.  If the folio is already on deferred
split list, it will be skipped, too.

Note: commit 98046944a159 ("mm: huge_memory: add the missing
folio_test_pmd_mappable() for THP split statistics") tried to exclude mTHP
deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not fix the
above issue.  A fully unmapped PTE-mapped order-9 THP was still added to
deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is
512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio()
the order-9 folio is folio_test_pmd_mappable().

Link: https://lkml.kernel.org/r/20240502132852.862138-1-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:56 -07:00
SeongJae Park
eedbd23dca Docs/ABI/damon: update for 'youg page' type DAMOS filter
Update DAMON ABI document for the newly added DAMO filter type, 'young
page'.

Link: https://lkml.kernel.org/r/20240426195247.100306-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:56 -07:00
SeongJae Park
ed13c93b93 Docs/admin-guide/mm/damon/usage: update for young page type DAMOS filter
Update DAMON usage document for the newly added DAMOS filter type, 'young
page'.

Link: https://lkml.kernel.org/r/20240426195247.100306-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:55 -07:00
SeongJae Park
26dd7cc7bb Docs/mm/damon/design: document 'young page' type DAMOS filter
Update DAMON design document for the newly added DAMOS filter type, 'young
page'.

Link: https://lkml.kernel.org/r/20240426195247.100306-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:55 -07:00
SeongJae Park
ade414bdf6 mm/damon/paddr: implement DAMOS filter type YOUNG
DAMOS filter of type YOUNG is defined, but not yet implemented by any
DAMON operations set.  Add the implementation on 'paddr', the DAMON
operations set for the physical address space.

Link: https://lkml.kernel.org/r/20240426195247.100306-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:55 -07:00
SeongJae Park
2d8b24654f mm/damon: add DAMOS filter type YOUNG
Define yet another DAMOS filter type, YOUNG.  Like anon and memcg, the
type of filter will be applied to each page in the memory region, and see
if the page is accessed since the last check.  Based on the 'matching'
parameter, the page is filtered out or in.

Note that this commit is adding only the type definition.  The
implementation should be made by DAMON operations sets.  A commit for the
implementation on 'paddr' DAMON operations set will follow.

Link: https://lkml.kernel.org/r/20240426195247.100306-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:55 -07:00
SeongJae Park
6daea38215 mm/damon/paddr: implement damon_folio_mkold()
damon_pa_mkold() receives physical address, get the folio covering the
address, and makes the folio as old.  A following commit will reuse the
internal logic for marking a given folio as old.  To avoid duplication of
the code, split the internal logic.  Also, change the rmap walker
function's name from __damon_pa_mkold() to damon_folio_mkold_one(),
following the change of the caller's name and the naming rule that more
commonly used by other rmap walkers.

Link: https://lkml.kernel.org/r/20240426195247.100306-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:54 -07:00
SeongJae Park
180d928e55 mm/damon/paddr: implement damon_folio_young()
Patch series "mm/damon: add a DAMOS filter type for page granularity
access recheck".

DAMON provides its best-effort accuracy-overhead tradeoff under the
user-defined ranges of acceptable level of the monitoring accuracy and
overhead.  A recent discussion for tiered memory management support from
DAMON[1] concluded that finding memory regions of specific access pattern
with low overhead despite of low accuracy via DAMON first, and then double
checking the access of the region again in a finer (e.g., page)
granularity could be a useful strategy for some DAMOS schemes.

Add a new type of DAMOS filter, namely 'young' for such a case.  It checks
each page of DAMOS target region is accessed since the last check, and
filters it out or in if 'matching' parameter is 'true' or 'false',
respectively.

Because this is a filter type that applied in page granularity, the
support depends on DAMON operations set, similar to 'anon' and 'memcg'
DAMOS filter types.  Implement the support on the DAMON operations set for
the physical address space, 'paddr', since one of the expected usages[1]
is based on the physical address space.

[1] https://lore.kernel.org/r/20240227235121.153277-1-sj@kernel.org


This patch (of 7):

damon_pa_young() receives physical address, get the folio covering the
address, and show if the folio is accessed since the last check.  A
following commit will reuse the internal logic for checking access to a
given folio.  To avoid duplication of the code, split the internal logic. 
Also, change the rmap walker function's name from __damon_pa_young() to
damon_folio_young_one(), following the change of the caller's name and the
naming rule that more commonly used by other rmap walkers.

Link: https://lkml.kernel.org/r/20240426195247.100306-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240426195247.100306-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)
737019cf6a mm: optimise vmf_anon_prepare() for VMAs without an anon_vma
If the mmap_lock can be taken for read, we can call __anon_vma_prepare()
while holding it, saving ourselves a trip back through the fault handler.

Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)
73b4a0cd82 mm: fix some minor per-VMA lock issues in userfaultfd
Rename lock_vma() to uffd_lock_vma() because it really is uffd specific. 
Remove comment referencing unlock_vma() which doesn't exist.  Fix the
comment about lock_vma_under_rcu() which I just made incorrect.

Link: https://lkml.kernel.org/r/20240426144506.1290619-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)
a373baed5a mm: delay the check for a NULL anon_vma
Instead of checking the anon_vma early in the fault path where all page
faults pay the cost, delay it until we know we're going to need the
anon_vma to be filled in.  This will have a slight negative effect on the
first fault in an anonymous VMA, but it shortens every other page fault. 
It also makes the code slightly cleaner as the anon and file backed fault
handling look more similar.

The Intel kernel test bot reports a 3x improvement in vm-scalability
throughput with the small-allocs-mt test.  This is clearly an extreme
situation that won't be replicated in any real-world workload, but it's a
nice win.

https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/

Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:53 -07:00
Matthew Wilcox (Oracle)
3be5106059 mm: assert the mmap_lock is held in __anon_vma_prepare()
Patch series "Improve anon_vma scalability for anon VMAs".

We have a 3x throughput improvement reported by Intel's kernel test robot:
https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/

This is from delaying taking the mmap_lock for page faults until we
actually need the mmap_lock in order to assign an anon_vma to the vma.  It
cleans up the page fault path a little by making the anon fault handler
more similar to the file fault handler.


This patch (of 4):

Convert the comment into an assertion.

Link: https://lkml.kernel.org/r/20240426144506.1290619-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20240426144506.1290619-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:53 -07:00
Matthew Wilcox
e0ffb29bc5 mm: simplify thp_vma_allowable_order
Combine the three boolean arguments into one flags argument for
readability.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:53 -07:00
Kemeng Shi
dc6e0ae5b1 mm: remove stale comment __folio_mark_dirty
The __folio_mark_dirty will not mark inode dirty any longer.  Remove the
stale comment of it.

Link: https://lkml.kernel.org/r/20240425131724.36778-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Howard Cochran <hcochran@kernelspring.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:53 -07:00