linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-11 12:28:41 +08:00

Author	SHA1	Message	Date
Dan Williams	14b80582c4	resource: Introduce alloc_free_mem_region() The core of devm_request_free_mem_region() is a helper that searches for free space in iomem_resource and performs __request_region_locked() on the result of that search. The policy choices of the implementation conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is immediately marked busy, and a preference to search for the first-fit free range in descending order from the top of the physical address space. CXL has a need for a similar allocator, but with the following tweaks: 1/ Search for free space in ascending order 2/ Search for free space relative to a given CXL window 3/ 'insert' rather than 'request' the new resource given downstream drivers from the CXL Region driver (like the pmem or dax drivers) are responsible for request_mem_region() when they activate the memory range. Rework __request_free_mem_region() into get_free_mem_region() which takes a set of GFR_* (Get Free Region) flags to control the allocation policy (ascending vs descending), and "busy" policy (insert_resource() vs request_region()). As part of the consolidation of the legacy GFR_REQUEST_REGION case with the new default of just inserting a new resource into the free space some minor cleanups like not checking for NULL before calling devres_free() (which does its own check) is included. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2022-07-21 17:19:25 -07:00
Hyeonggon Yoo	3041808b52	mm/slab_common: move generic bulk alloc/free functions to SLOB Now that only SLOB use __kmem_cache_{alloc,free}_bulk(), move them to SLOB. No functional change intended. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-20 13:30:12 +02:00
Hyeonggon Yoo	2055e67bb6	mm/sl[au]b: use own bulk free function when bulk alloc failed There is no benefit to call generic bulk free function when kmem_cache_alloc_bulk() failed. Use own kmem_cache_free_bulk() instead of generic function. Note that if kmem_cache_alloc_bulk() fails to allocate first object in SLUB, size is zero. So allow passing size == 0 to kmem_cache_free_bulk() like SLAB's. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-20 13:30:11 +02:00
Barry Song	d0637c505f	arm64: enable THP_SWAP for arm64 THP_SWAP has been proven to improve the swap throughput significantly on x86_64 according to commit `bd4c82c22c` ("mm, THP, swap: delay splitting THP after swapped out"). As long as arm64 uses 4K page size, it is quite similar with x86_64 by having 2MB PMD THP. THP_SWAP is architecture-independent, thus, enabling it on arm64 will benefit arm64 as well. A corner case is that MTE has an assumption that only base pages can be swapped. We won't enable THP_SWAP for ARM64 hardware with MTE support until MTE is reworked to coexist with THP_SWAP. A micro-benchmark is written to measure thp swapout throughput as below, unsigned long long tv_to_ms(struct timeval tv) { return tv.tv_sec * 1000 + tv.tv_usec / 1000; } main() { struct timeval tv_b, tv_e;; #define SIZE 40010241024 volatile void p = mmap(NULL, SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); if (!p) { perror("fail to get memory"); exit(-1); } madvise(p, SIZE, MADV_HUGEPAGE); memset(p, 0x11, SIZE); / write to get mem */ gettimeofday(&tv_b, NULL); madvise(p, SIZE, MADV_PAGEOUT); gettimeofday(&tv_e, NULL); printf("swp out bandwidth: %ld bytes/ms\n", SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b))); } Testing is done on rk3568 64bit Quad Core Cortex-A55 platform - ROCK 3A. thp swp throughput w/o patch: 2734bytes/ms (mean of 10 tests) thp swp throughput w/ patch: 3331bytes/ms (mean of 10 tests) Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Steven Price <steven.price@arm.com> Cc: Yang Shi <shy828301@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Link: https://lore.kernel.org/r/20220720093737.133375-1-21cnbao@gmail.com Signed-off-by: Will Deacon <will@kernel.org>	2022-07-20 10:52:40 +01:00
Miaohe Lin	da9a298f5f	hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte When alloc_huge_page fails, pagep is set to NULL without put_page first. So the hugepage indicated by pagep is leaked. Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com Fixes: `8cc5fcbb5b` ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:52 -07:00
Mike Rapoport	84ac013046	secretmem: fix unhandled fault in truncate syzkaller reports the following issue: BUG: unable to handle page fault for address: ffff888021f7e005 PGD `11401067` P4D `11401067` PUD 11402067 PMD 21f7d063 PTE 800fffffde081060 Oops: 0002 [#1] PREEMPT SMP KASAN CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64 Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01 RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005 RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005 R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005 R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> zero_user_segments include/linux/highmem.h:272 [inline] folio_zero_range include/linux/highmem.h:428 [inline] truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237 truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381 truncate_inode_pages mm/truncate.c:452 [inline] truncate_pagecache+0x63/0x90 mm/truncate.c:753 simple_setattr+0xed/0x110 fs/libfs.c:535 secretmem_setattr+0xae/0xf0 mm/secretmem.c:170 notify_change+0xb8c/0x12b0 fs/attr.c:424 do_truncate+0x13c/0x200 fs/open.c:65 do_sys_ftruncate+0x536/0x730 fs/open.c:193 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fb29d900899 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899 RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003 RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000 </TASK> Modules linked in: CR2: ffff888021f7e005 ---[ end trace 0000000000000000 ]--- Eric Biggers suggested that this happens when secretmem_setattr()->simple_setattr() races with secretmem_fault() so that a page that is faulted in by secretmem_fault() (and thus removed from the direct map) is zeroed by inode truncation right afterwards. Use mapping->invalidate_lock to make secretmem_fault() and secretmem_setattr() mutually exclusive. [rppt@linux.ibm.com: v3] Link: https://lkml.kernel.org/r/20220714091337.412297-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20220707165650.248088-1-rppt@kernel.org Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
Naoya Horiguchi	c2cb0dcce9	mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Originally copy_hugetlb_page_range() handles migration entries and hwpoisoned entries in similar manner. But recently the related code path has more code for migration entries, and when is_writable_migration_entry() was converted to !is_readable_migration_entry(), hwpoison entries on source processes got to be unexpectedly updated (which is legitimate for migration entries, but not for hwpoison entries). This results in unexpected serious issues like kernel panic when forking processes with hwpoison entries in pmd. Separate the if branch into one for hwpoison entries and one for migration entries. Link: https://lkml.kernel.org/r/20220704013312.2415700-3-naoya.horiguchi@linux.dev Fixes: `6c287605fd` ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> [5.18] Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
Muchun Song	f4f451a16d	mm: fix missing wake-up event for FSDAX pages FSDAX page refcounts are 1-based, rather than 0-based: if refcount is 1, then the page is freed. The FSDAX pages can be pinned through GUP, then they will be unpinned via unpin_user_page() using a folio variant to put the page, however, folio variants did not consider this special case, the result will be to miss a wakeup event (like the user of __fuse_dax_break_layouts()). This results in a task being permanently stuck in TASK_INTERRUPTIBLE state. Since FSDAX pages are only possibly obtained by GUP users, so fix GUP instead of folio_put() to lower overhead. Link: https://lkml.kernel.org/r/20220705123532.283-1-songmuchun@bytedance.com Fixes: `d8ddc099c6` ("mm/gup: Add gup_put_folio()") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
Josef Bacik	3fe2895cfe	mm: fix page leak with multiple threads mapping the same page We have an application with a lot of threads that use a shared mmap backed by tmpfs mounted with -o huge=within_size. This application started leaking loads of huge pages when we upgraded to a recent kernel. Using the page ref tracepoints and a BPF program written by Tejun Heo we were able to determine that these pages would have multiple refcounts from the page fault path, but when it came to unmap time we wouldn't drop the number of refs we had added from the faults. I wrote a reproducer that mmap'ed a file backed by tmpfs with -o huge=always, and then spawned 20 threads all looping faulting random offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge page aligned ranges. This very quickly reproduced the problem. The problem here is that we check for the case that we have multiple threads faulting in a range that was previously unmapped. One thread maps the PMD, the other thread loses the race and then returns 0. However at this point we already have the page, and we are no longer putting this page into the processes address space, and so we leak the page. We actually did the correct thing prior to `f9ce0be71d`, however it looks like Kirill copied what we do in the anonymous page case. In the anonymous page case we don't yet have a page, so we don't have to drop a reference on anything. Previously we did the correct thing for file based faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on the page we faulted in. Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable() case, this makes us drop the ref on the page properly, and now my reproducer no longer leaks the huge pages. [josef@toxicpanda.com: v2] Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com Fixes: `f9ce0be71d` ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
ZhaoLong Wang	0c98c8e1e1	tmpfs: fix the issue that the mount and remount results are inconsistent. An undefined-behavior issue has not been completely fixed since commit `d14f5efadd` ("tmpfs: fix undefined-behaviour in shmem_reconfigure()"). In the commit, check in the shmem_reconfigure() is added in remount process to avoid the Ubsan problem. However, the check is not added to the mount process. It causes inconsistent results between mount and remount. The operations to reproduce the problem in user mode as follows: If nr_blocks is set to 0x8000000000000000, the mounting is successful. # mount tmpfs /dev/shm/ -t tmpfs -o nr_blocks=0x8000000000000000 However, when -o remount is used, the mount fails because of the check in the shmem_reconfigure() # mount tmpfs /dev/shm/ -t tmpfs -o remount,nr_blocks=0x8000000000000000 mount: /dev/shm: mount point not mounted or bad option. Therefore, add checks in the shmem_parse_one() function and remove the check in shmem_reconfigure() to avoid this problem. Link: https://lkml.kernel.org/r/20220629124324.1640807-1-wangzhaolong1@huawei.com Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com> Cc: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhihao Cheng <chengzhihao1@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
Yee Lee	07313a2b29	mm: kfence: apply kmemleak_ignore_phys on early allocated pool This patch solves two issues. (1) The pool allocated by memblock needs to unregister from kmemleak scanning. Apply kmemleak_ignore_phys to replace the original kmemleak_free as its address now is stored in the phys tree. (2) The pool late allocated by page-alloc doesn't need to unregister. Move out the freeing operation from its call path. Link: https://lkml.kernel.org/r/20220628113714.7792-2-yee.lee@mediatek.com Fixes: `0c24e06119` ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") Signed-off-by: Yee Lee <yee.lee@mediatek.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:51 -07:00
Miaohe Lin	cdb5c9e53f	mm/mmap: fix obsolete comment of find_extend_vma mmget_still_valid() has already been removed via commit `4d45e75a99` ("mm: remove the now-unnecessary mmget_still_valid() hack"). Update the corresponding comment. Link: https://lkml.kernel.org/r/20220709092527.47778-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:48 -07:00
Miaohe Lin	8f0b747d7d	mm/page_vma_mapped.c: use helper function huge_pte_lock Use helper function huge_pte_lock() to lock the huge pte to simplify the code a bit. No functional change intended. Link: https://lkml.kernel.org/r/20220709092440.43018-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Uros Bizjak	04ec006171	mm/page_alloc: use try_cmpxchg in set_pfnblock_flags_mask Use try_cmpxchg instead of cmpxchg in set_pfnblock_flags_mask. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). The main loop improves from: 1c5d: 48 89 c2 mov %rax,%rdx 1c60: 48 89 c1 mov %rax,%rcx 1c63: 48 21 fa and %rdi,%rdx 1c66: 4c 09 c2 or %r8,%rdx 1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi) 1c6e: 48 39 c1 cmp %rax,%rcx 1c71: 75 ea jne 1c5d <...> to: 1c60: 48 89 ca mov %rcx,%rdx 1c63: 48 21 c2 and %rax,%rdx 1c66: 4c 09 c2 or %r8,%rdx 1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi) 1c6e: 75 f0 jne 1c60 <...> Link: https://lkml.kernel.org/r/20220708140736.8737-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Gang Li	dcadcf1c30	mm, hugetlb: skip irrelevant nodes in show_free_areas() show_free_areas() allows to filter out node specific data which is irrelevant to the allocation request. But hugetlb_show_meminfo() still shows hugetlb on all nodes, which is redundant and unnecessary. Use show_mem_node_skip() to skip irrelevant nodes. And replace hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid). before-and-after sample output of OOM: before: ``` [ 214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB [ 214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig [ 214.388334] lowmem_reserve[]: 0 0 0 0 0 [ 214.390251] Node 1 Normal: 4234kB (UE) 3208kB (UME) 18716kB (UE) 11732kB (UE) 5764kB (UME) 20 [ 214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB ``` after: ``` [ 145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u [ 145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig [ 145.152315] lowmem_reserve[]: 0 0 0 0 0 [ 145.155244] Node 1 Normal: 4704kB (UME) 3738kB (UME) 24716kB (UME) 16832kB (UE) 8664kB (UME) [ 145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB ``` Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Patrick Wang	a317ebccaa	mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free() Kmemleak recently added a rbtree to store the objects allocted with physical address. Those objects can't be freed with kmemleak_free(). According to the comments, percpu allocations are tracked by kmemleak separately. Kmemleak_free() was used to avoid the unnecessary tracking. If kmemleak_free() fails, those objects would be scanned by kmemleak, which is unnecessary but shouldn't lead to other effects. Use kmemleak_ignore_phys() instead of kmemleak_free() for those objects. Link: https://lkml.kernel.org/r/20220705113158.127600-1-patrick.wang.shcn@gmail.com Fixes: `0c24e06119` ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Xiu Jianfeng	48725bbc0c	mm/mprotect: remove the redundant initialization for error The variable error will be assigned correctly before it is used, the initialization is redundant, so remove it. Link: https://lkml.kernel.org/r/20220704114112.163112-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Miaohe Lin	e75858b904	mm/huge_memory: use helper macro IS_ERR_OR_NULL in split_huge_pages_pid Use helper macro IS_ERR_OR_NULL to check the validity of page to simplify the code. Minor readability improvement. Link: https://lkml.kernel.org/r/20220704132201.14611-17-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:47 -07:00
Miaohe Lin	cea3332808	mm/huge_memory: comment the subtly logic in __split_huge_pmd It's dangerous and wrong to call page_folio(pmd_page(*pmd)) when pmd isn't present. But the caller guarantees pmd is present when folio is set. So we should be safe here. Add comment to make it clear. Link: https://lkml.kernel.org/r/20220704132201.14611-16-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:46 -07:00
Miaohe Lin	d764afedfb	mm/huge_memory: correct comment of prep_transhuge_page We use page->mapping and page->index, instead of page->indexlru in second tail page as list_head. Correct it. Link: https://lkml.kernel.org/r/20220704132201.14611-15-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:46 -07:00
Miaohe Lin	a17206dac7	mm/huge_memory: minor cleanup for split_huge_pages_all There is nothing to do if a zone doesn't have any pages managed by the buddy allocator. So we should check managed_zone instead. Also if a thp is found, there's no need to traverse the subpages again. Link: https://lkml.kernel.org/r/20220704132201.14611-13-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:46 -07:00
Miaohe Lin	0b175468a0	mm/huge_memory: try to free subpage in swapcache when possible Subpages in swapcache won't be freed even if it is the last user of the page until next time reclaim. It shouldn't hurt indeed, but we could try to free these pages to save more memory for system. Link: https://lkml.kernel.org/r/20220704132201.14611-12-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:46 -07:00
Miaohe Lin	749290799e	mm/huge_memory: fix comment in zap_huge_pud The comment about deposited pgtable is borrowed from zap_huge_pmd but there's no deposited pgtable stuff for huge pud in zap_huge_pud. Remove it to avoid confusion. Link: https://lkml.kernel.org/r/20220704132201.14611-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:45 -07:00
Miaohe Lin	37139bb02c	mm/huge_memory: use helper macro __ATTR_RW Use helper macro __ATTR_RW to define use_zero_page_attr, defrag_attr and enabled_attr to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220704132201.14611-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:45 -07:00
Miaohe Lin	74ba2b38ba	mm/huge_memory: use helper function vma_lookup in split_huge_pages_pid Use helper function vma_lookup to lookup the needed vma to simplify the code. Minor readability improvement. Link: https://lkml.kernel.org/r/20220704132201.14611-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:45 -07:00
Miaohe Lin	4fba8f2a30	mm/huge_memory: rename mmun_start to haddr in remove_migration_pmd mmun_start indicates mmu_notifier start address but there's no mmu_notifier stuff in remove_migration_pmd. This will make it hard to get the meaning of mmun_start. Rename it to haddr to avoid confusing readers and also imporve readability. Link: https://lkml.kernel.org/r/20220704132201.14611-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:45 -07:00
Miaohe Lin	a69e4717c6	mm/huge_memory: use helper touch_pmd in huge_pmd_set_accessed Use helper touch_pmd to set pmd accessed to simplify the code and improve the readability. No functional change intended. Link: https://lkml.kernel.org/r/20220704132201.14611-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:45 -07:00
Miaohe Lin	5fe653e900	mm/huge_memory: use helper touch_pud in huge_pud_set_accessed Use helper touch_pud to set pud accessed to simplify the code and improve the readability. No functional change intended. Link: https://lkml.kernel.org/r/20220704132201.14611-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:44 -07:00
Miaohe Lin	d965e39075	mm/huge_memory: fix comment of __pud_trans_huge_lock __pud_trans_huge_lock returns page table lock pointer if a given pud maps a thp instead of 'true' since introduced. Fix corresponding comments. Link: https://lkml.kernel.org/r/20220704132201.14611-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:44 -07:00
Miaohe Lin	4286f14748	mm/huge_memory: access vm_page_prot with READ_ONCE in remove_migration_pmd vma->vm_page_prot is read lockless from the rmap_walk, it may be updated concurrently. Using READ_ONCE to prevent the risk of reading intermediate values. Link: https://lkml.kernel.org/r/20220704132201.14611-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:44 -07:00
Miaohe Lin	7c38f1812d	mm/huge_memory: use flush_pmd_tlb_range in move_huge_pmd Patch series "A few cleanup patches for huge_memory", v3. This series contains a few cleaup patches to remove duplicated codes, add/use helper functions, fix some obsolete comments and so on. More details can be found in the respective changelogs. This patch (of 16): Arches with special requirements for evicting THP backing TLB entries can implement flush_pmd_tlb_range. Otherwise also, it can help optimize TLB flush in THP regime. Using flush_pmd_tlb_range to take advantage of this in move_huge_pmd. Link: https://lkml.kernel.org/r/20220704132201.14611-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220704132201.14611-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:44 -07:00
Anshuman Khandual	3d923c5f1e	mm/mmap: drop ARCH_HAS_VM_GET_PAGE_PROT Now all the platforms enable ARCH_HAS_GET_PAGE_PROT. They define and export own vm_get_page_prot() whether custom or standard DECLARE_VM_GET_PAGE_PROT. Hence there is no need for default generic fallback for vm_get_page_prot(). Just drop this fallback and also ARCH_HAS_GET_PAGE_PROT mechanism. Link: https://lkml.kernel.org/r/20220711070600.2378316-27-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:41 -07:00
Anshuman Khandual	09095f7413	mm/mmap: build protect protection_map[] with ARCH_HAS_VM_GET_PAGE_PROT Now that protection_map[] has been moved inside those platforms that enable ARCH_HAS_VM_GET_PAGE_PROT. Hence generic protection_map[] array now can be protected with CONFIG_ARCH_HAS_VM_GET_PAGE_PROT intead of __P000. Link: https://lkml.kernel.org/r/20220711070600.2378316-8-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:38 -07:00
Anshuman Khandual	43957b5d11	mm/mmap: define DECLARE_VM_GET_PAGE_PROT This just converts the generic vm_get_page_prot() implementation into a new macro i.e DECLARE_VM_GET_PAGE_PROT which later can be used across platforms when enabling them with ARCH_HAS_VM_GET_PAGE_PROT. This does not create any functional change. Link: https://lkml.kernel.org/r/20220711070600.2378316-3-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:37 -07:00
Anshuman Khandual	840532711d	mm/mmap: build protect protection_map[] with __P000 Patch series "mm/mmap: Drop __SXXX/__PXXX macros from across platforms", v7. __SXXX/__PXXX macros are unnecessary abstraction layer in creating the generic protection_map[] array which is used for vm_get_page_prot(). This abstraction layer can be avoided, if the platforms just define the array protection_map[] for all possible vm_flags access permission combinations and also export vm_get_page_prot() implementation. This series drops __SXXX/__PXXX macros from across platforms in the tree. First it build protects generic protection_map[] array with '#ifdef __P000' and moves it inside platforms which enable ARCH_HAS_VM_GET_PAGE_PROT. Later this build protects same array with '#ifdef ARCH_HAS_VM_GET_PAGE_PROT' and moves inside remaining platforms while enabling ARCH_HAS_VM_GET_PAGE_PROT. This adds a new macro DECLARE_VM_GET_PAGE_PROT defining the current generic vm_get_page_prot(), in order for it to be reused on platforms that do not require custom implementation. Finally, ARCH_HAS_VM_GET_PAGE_PROT can just be dropped, as all platforms now define and export vm_get_page_prot(), via looking up a private and static protection_map[] array. protection_map[] data type has been changed as 'static const' on all platforms that do not change it during boot. This patch (of 26): Build protect generic protection_map[] array with __P000, so that it can be moved inside all the platforms one after the other. Otherwise there will be build failures during this process. CONFIG_ARCH_HAS_VM_GET_PAGE_PROT cannot be used for this purpose as only certain platforms enable this config now. Link: https://lkml.kernel.org/r/20220711070600.2378316-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20220711070600.2378316-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:37 -07:00
Linus Walleij	9330723c26	mm: nommu: pass a pointer to virt_to_page() Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void ) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void ). If we instead implement a proper virt_to_pfn(void addr) function the following happens (occurred on arch/arm): mm/nommu.c: In function 'free_page_series': mm/nommu.c:501:50: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] struct page page = virt_to_page(from); Fix this with an explicit cast. Link: https://lkml.kernel.org/r/20220630084124.691207-6-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:37 -07:00
Linus Walleij	396a400bc1	mm: gup: pass a pointer to virt_to_page() Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void ) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void ). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): mm/gup.c: In function '__get_user_pages_locked': mm/gup.c:1599:49: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] pages[i] = virt_to_page(start); Fix this with an explicit cast. Link: https://lkml.kernel.org/r/20220630084124.691207-5-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:36 -07:00
Linus Walleij	9e7ee421ac	mm: kfence: pass a pointer to virt_to_page() Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void ) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void ). If we instead implement a proper virt_to_pfn(void addr) function the following happens (occurred on arch/arm): mm/kfence/core.c:558:30: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] In one case we can refer to __kfence_pool directly (and that is a proper (char ) pointer) and in the other call site we use an explicit cast. Link: https://lkml.kernel.org/r/20220630084124.691207-4-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:36 -07:00
Linus Walleij	259ecb34e2	mm/highmem: pass a pointer to virt_to_page() Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void ) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void ). If we instead implement a proper virt_to_pfn(void addr) function the following happens (occurred on arch/arm): mm/highmem.c:153:29: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] We already have a proper void pointer in the scope of this function named "vaddr" so pass that instead. Link: https://lkml.kernel.org/r/20220630084124.691207-3-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:36 -07:00
Xiang Yang	9c94bef9c9	mm/memcontrol.c: replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled() mem_cgroup_kmem_disabled() checks whether the kmem accounting is off. Therefore, replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled(), which is the same work in percpu.c and slab_common.c. Link: https://lkml.kernel.org/r/20220625061844.226764-1-xiangyang3@huawei.com Signed-off-by: Xiang Yang <xiangyang3@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Souptick Joarder (HPE) <jrdr.linux@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:36 -07:00
Mel Gorman	01b44456a7	mm/page_alloc: replace local_lock with normal spinlock struct per_cpu_pages is no longer strictly local as PCP lists can be drained remotely using a lock for protection. While the use of local_lock works, it goes against the intent of local_lock which is for "pure CPU local concurrency control mechanisms and not suited for inter-CPU concurrency control" (Documentation/locking/locktypes.rst) local_lock protects against migration between when the percpu pointer is accessed and the pcp->lock acquired. The lock acquisition is a preemption point so in the worst case, a task could migrate to another NUMA node and accidentally allocate remote memory. The main requirement is to pin the task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT. Replace local_lock with helpers that pin a task to a CPU, lookup the per-cpu structure and acquire the embedded lock. It's similar to local_lock without breaking the intent behind the API. It is not a complete API as only the parts needed for PCP-alloc are implemented but in theory, the generic helpers could be promoted to a general API if there was demand for an embedded lock within a per-cpu struct with a guarantee that the per-cpu structure locked matches the running CPU and cannot use get_cpu_var due to RT concerns. PCP requires these semantics to avoid accidentally allocating remote memory. [mgorman@techsingularity.net: use pcp_spin_trylock_irqsave instead of pcpu_spin_trylock_irqsave] Link: https://lkml.kernel.org/r/20220627084645.GA27531@techsingularity.net Link: https://lkml.kernel.org/r/20220624125423.6126-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Nicolas Saenz Julienne	443c2accd1	mm/page_alloc: remotely drain per-cpu lists Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu drain work queued by __drain_all_pages(). So introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme is that drain operations are now migration safe. There was no observed performance degradation vs. the previous scheme. Both netperf and hackbench were run in parallel to triggering the __drain_all_pages(NULL, true) code path around ~100 times per second. The new scheme performs a bit better (~5%), although the important point here is there are no performance regressions vs. the previous mechanism. Per-cpu lists draining happens only in slow paths. Minchan Kim tested an earlier version and reported; My workload is not NOHZ CPUs but run apps under heavy memory pressure so they goes to direct reclaim and be stuck on drain_all_pages until work on workqueue run. unit: nanosecond max(dur) avg(dur) count(dur) 166713013 487511.77786438033 1283 From traces, system encountered the drain_all_pages 1283 times and worst case was 166ms and avg was 487us. The other problem was alloc_contig_range in CMA. The PCP draining takes several hundred millisecond sometimes though there is no memory pressure or a few of pages to be migrated out but CPU were fully booked. Your patch perfectly removed those wasted time. Link: https://lkml.kernel.org/r/20220624125423.6126-7-mgorman@techsingularity.net Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Mel Gorman	4b23a68f95	mm/page_alloc: protect PCP lists with a spinlock Currently the PCP lists are protected by using local_lock_irqsave to prevent migration and IRQ reentrancy but this is inconvenient. Remote draining of the lists is impossible and a workqueue is required and every task allocation/free must disable then enable interrupts which is expensive. As preparation for dealing with both of those problems, protect the lists with a spinlock. The IRQ-unsafe version of the lock is used because IRQs are already disabled by local_lock_irqsave. spin_trylock is used in combination with local_lock_irqsave() but later will be replaced with a spin_trylock_irqsave when the local_lock is removed. The per_cpu_pages still fits within the same number of cache lines after this patch relative to before the series. struct per_cpu_pages { spinlock_t lock; /* 0 4 / int count; / 4 4 / int high; / 8 4 / int batch; / 12 4 / short int free_factor; / 16 2 / short int expire; / 18 2 / / XXX 4 bytes hole, try to pack / struct list_head lists[13]; / 24 208 / / size: 256, cachelines: 4, members: 7 / / sum members: 228, holes: 1, sum holes: 4 / / padding: 24 / } __attribute__((__aligned__(64))); There is overhead in the fast path due to acquiring the spinlock even though the spinlock is per-cpu and uncontended in the common case. Page Fault Test (PFT) running on a 1-socket reported the following results on a 1 socket machine. 5.19.0-rc3 5.19.0-rc3 vanilla mm-pcpspinirq-v5r16 Hmean faults/sec-1 869275.7381 ( 0.00%) 874597.5167 0.61%* Hmean faults/sec-3 2370266.6681 ( 0.00%) 2379802.0362 * 0.40%* Hmean faults/sec-5 2701099.7019 ( 0.00%) 2664889.7003 * -1.34%* Hmean faults/sec-7 3517170.9157 ( 0.00%) 3491122.8242 * -0.74%* Hmean faults/sec-8 3965729.6187 ( 0.00%) 3939727.0243 * -0.66%* There is a small hit in the number of faults per second but given that the results are more stable, it's borderline noise. [akpm@linux-foundation.org: add missing local_unlock_irqrestore() on contention path] Link: https://lkml.kernel.org/r/20220624125423.6126-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Mel Gorman	e2a66c21b7	mm/page_alloc: remove mistaken page == NULL check in rmqueue If a page allocation fails, the ZONE_BOOSTER_WATERMARK should be tested, cleared and kswapd woken whether the allocation attempt was via the PCP or directly via the buddy list. Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set, mark the branch accordingly. Link: https://lkml.kernel.org/r/20220624125423.6126-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Mel Gorman	589d9973c1	mm/page_alloc: split out buddy removal code from rmqueue into separate helper This is a preparation page to allow the buddy removal code to be reused in a later patch. No functional change. Link: https://lkml.kernel.org/r/20220624125423.6126-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Mel Gorman	5d0a661d80	mm/page_alloc: use only one PCP list for THP-sized allocations The per_cpu_pages is cache-aligned on a standard x86-64 distribution configuration but a later patch will add a new field which would push the structure into the next cache line. Use only one list to store THP-sized pages on the per-cpu list. This assumes that the vast majority of THP-sized allocations are GFP_MOVABLE but even if it was another type, it would not contribute to serious fragmentation that potentially causes a later THP allocation failure. Align per_cpu_pages on the cacheline boundary to ensure there is no false cache sharing. After this patch, the structure sizing is; struct per_cpu_pages { int count; /* 0 4 / int high; / 4 4 / int batch; / 8 4 / short int free_factor; / 12 2 / short int expire; / 14 2 / struct list_head lists[13]; / 16 208 / / size: 256, cachelines: 4, members: 6 / / padding: 32 */ } __attribute__((__aligned__(64))); Link: https://lkml.kernel.org/r/20220624125423.6126-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:35 -07:00
Mel Gorman	bf75f20056	mm/page_alloc: add page->buddy_list and page->pcp_list Patch series "Drain remote per-cpu directly", v5. Some setups, notably NOHZ_FULL CPUs, may be running realtime or latency-sensitive applications that cannot tolerate interference due to per-cpu drain work queued by __drain_all_pages(). Introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has two advantages, the time to drain is more predictable and other unrelated tasks are not interrupted. This series has the same intent as Nicolas' series "mm/page_alloc: Remote per-cpu lists drain support" -- avoid interference of a high priority task due to a workqueue item draining per-cpu page lists. While many workloads can tolerate a brief interruption, it may cause a real-time task running on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is non-deterministic. Currently an IRQ-safe local_lock protects the page allocator per-cpu lists. The local_lock on its own prevents migration and the IRQ disabling protects from corruption due to an interrupt arriving while a page allocation is in progress. This series adjusts the locking. A spinlock is added to struct per_cpu_pages to protect the list contents while local_lock_irq is ultimately replaced by just the spinlock in the final patch. This allows a remote CPU to safely. Follow-on work should allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs being disabled/enabled in most cases. The follow-on patch will be one kernel release later as it is relatively high risk and it'll make bisections more clear if there are any problems. Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages and when it is storing per-cpu pages. Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking this is not necessary but it avoids per_cpu_pages consuming another cache line. Patch 3 is a preparation patch to avoid code duplication. Patch 4 is a minor correction. Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still relying on local_lock to prevent migration, stabilise the pcp lookup and prevent IRQ reentrancy. Patch 6 remote drains per-cpu pages directly instead of using a workqueue. Patch 7 uses a normal spinlock instead of local_lock for remote draining This patch (of 7): The page allocator uses page->lru for storing pages on either buddy or PCP lists. Create page->buddy_list and page->pcp_list as a union with page->lru. This is simply to clarify what type of list a page is on in the page allocator. No functional change intended. [minchan@kernel.org: fix page lru fields in macros] Link: https://lkml.kernel.org/r/20220624125423.6126-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:34 -07:00
Mike Kravetz	bcd51a3c67	hugetlb: lazy page table copies in fork() Lazy page table copying at fork time was introduced with commit `d992895ba2` ("[PATCH] Lazy page table copies in fork()"). At the time, hugetlb was very new and did not support page faulting. As a result, it was excluded. When full page fault support was added for hugetlb, the exclusion was not removed. Simply remove the check that prevents lazy copying of hugetlb page tables at fork. Of course, like other mappings this only applies to shared mappings. Lazy page table copying at fork will be less advantageous for hugetlb mappings because: - There are fewer page table entries with hugetlb - hugetlb pmds can be shared instead of copied In any case, completely eliminating the copy at fork time should speed things up. Link: https://lkml.kernel.org/r/20220621235620.291305-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:34 -07:00
Mike Kravetz	4ddb4d91b8	hugetlb: do not update address in huge_pmd_unshare As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [sfr@canb.auug.org.au: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:34 -07:00
Mike Kravetz	e95a985178	hugetlb: skip to end of PT page mapping when pte not present Patch series "hugetlb: speed up linear address scanning", v2. At unmap, fork and remap time hugetlb address ranges are linearly scanned. We can optimize these scans if the ranges are sparsely populated. Also, enable page table "Lazy copy" for hugetlb at fork. NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to add an arch specific version hugetlb_mask_last_page() to take advantage of sparse address scanning improvements. Baolin Wang added the routine for arm64. Other architectures which could be optimized are: ia64, mips, parisc, powerpc, s390, sh and sparc. This patch (of 4): HugeTLB address ranges are linearly scanned during fork, unmap and remap operations. If a non-present entry is encountered, the code currently continues to the next huge page aligned address. However, a non-present entry implies that the page table page for that entry is not present. Therefore, the linear scan can skip to the end of range mapped by the page table page. This can speed operations on large sparsely populated hugetlb mappings. Create a new routine hugetlb_mask_last_page() that will return an address mask. When the mask is ORed with an address, the result will be the address of the last huge page mapped by the associated page table page. Use this mask to update addresses in routines which linearly scan hugetlb address ranges when a non-present pte is encountered. hugetlb_mask_last_page is related to the implementation of huge_pte_offset as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. This patch only provides a complete hugetlb_mask_last_page implementation when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which provide their own versions of huge_pte_offset can also provide their own version of hugetlb_mask_last_page. Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:34 -07:00
Kuan-Ying Lee	3de0de7580	kasan: separate double free case from invalid free Currently, KASAN describes all invalid-free/double-free bugs as "double-free or invalid-free". This is ambiguous. KASAN should report "double-free" when a double-free is a more likely cause (the address points to the start of an object) and report "invalid-free" otherwise [1]. [1] https://bugzilla.kernel.org/show_bug.cgi?id=212193 Link: https://lkml.kernel.org/r/20220615062219.22618-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Yee Lee <yee.lee@mediatek.com> Cc: Andrew Yang <andrew.yang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:33 -07:00
Yang Shi	1064026bab	mm: khugepaged: reorg some khugepaged helpers The khugepaged_{enabled\|always\|req_madv} are not khugepaged only anymore, move them to huge_mm.h and rename to hugepage_flags_xxx, and remove khugepaged_req_madv due to no users. Also move khugepaged_defrag to khugepaged.c since its only caller is in that file, it doesn't have to be in a header file. Link: https://lkml.kernel.org/r/20220616174840.1202070-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:33 -07:00
Yang Shi	7da4e2cb8b	mm: thp: kill __transhuge_page_enabled() The page fault path checks THP eligibility with __transhuge_page_enabled() which does the similar thing as hugepage_vma_check(), so use hugepage_vma_check() instead. However page fault allows DAX and !anon_vma cases, so added a new flag, in_pf, to hugepage_vma_check() to make page fault work correctly. The in_pf flag is also used to skip shmem and file THP for page fault since shmem handles THP in its own shmem_fault() and file THP allocation on fault is not supported yet. Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only caller now, it is not necessary to have a helper function. Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:33 -07:00
Yang Shi	9fec51689f	mm: thp: kill transparent_hugepage_active() The transparent_hugepage_active() was introduced to show THP eligibility bit in smaps in proc, smaps is the only user. But it actually does the similar check as hugepage_vma_check() which is used by khugepaged. We definitely don't have to maintain two similar checks, so kill transparent_hugepage_active(). This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas. Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it is not only for khugepaged anymore. [akpm@linux-foundation.org: check vma->vm_mm, per Zach] [akpm@linux-foundation.org: add comment to vdso check] Link: https://lkml.kernel.org/r/20220616174840.1202070-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:33 -07:00
Yang Shi	f707fa4937	mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate The hugepage_vma_revalidate() needs to check if the vma is still anonymous vma or not since the address may be unmapped then remapped to file before khugepaged reaquired the mmap_lock. The old comment is not quite helpful, elaborate this with better comment. Link: https://lkml.kernel.org/r/20220616174840.1202070-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:32 -07:00
Yang Shi	4fa6893fae	mm: thp: consolidate vma size check to transhuge_vma_suitable There are couple of places that check whether the vma size is ok for THP or whether address fits, they are open coded and duplicate, use transhuge_vma_suitable() to do the job by passing in (vma->end - HPAGE_PMD_SIZE). Move vma size check into hugepage_vma_check(). This will make khugepaged_enter() is as same as khugepaged_enter_vma(). There is just one caller for khugepaged_enter(), replace it to khugepaged_enter_vma() and remove khugepaged_enter(). Link: https://lkml.kernel.org/r/20220616174840.1202070-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:32 -07:00
Yang Shi	66137fb34a	mm: khugepaged: check THP flag in hugepage_vma_check() Patch series "Cleanup transhuge_xxx helpers", v5. This series is the follow-up of the discussion about cleaning up transhuge_xxx helpers at https://lore.kernel.org/linux-mm/627a71f8-e879-69a5-ceb3-fc8d29d2f7f1@suse.cz/. THP has a bunch of helpers that do VMA sanity check for different paths, they do the similar checks for the most callsites and have a lot duplicate codes. And it is confusing what helpers should be used at what conditions. This series reorganized and cleaned up the code so that we could consolidate all the checks into hugepage_vma_check(). The transhuge_vma_enabled(), transparent_hugepage_active() and __transparent_hugepage_enabled() are killed by this series. This patch (of 7): Currently the THP flag check in hugepage_vma_check() will fallthrough if the flag is NEVER and VM_HUGEPAGE is set. This is not a problem for now since all the callers have the flag checked before or can't be invoked if the flag is NEVER. However, the following patch will call hugepage_vma_check() in more places, for example, page fault, so this flag must be checked in hugepge_vma_check(). Link: https://lkml.kernel.org/r/20220616174840.1202070-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20220616174840.1202070-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:32 -07:00
Shiyang Ruan	c36e202495	mm: introduce mf_dax_kill_procs() for fsdax case This new function is a variant of mf_generic_kill_procs that accepts a file, offset pair instead of a struct to support multiple files sharing a DAX mapping. It is intended to be called by the file systems as part of the memory_failure handler after the file system performed a reverse mapping from the storage address to the file and file offset. Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:30 -07:00
Shiyang Ruan	33a8f7f2b3	pagemap,pmem: introduce ->memory_failure() When memory-failure occurs, we call this function which is implemented by each kind of devices. For the fsdax case, pmem device driver implements it. Pmem device driver will find out the filesystem in which the corrupted page located in. With dax_holder notify support, we are able to notify the memory failure from pmem driver to upper layers. If there is something not support in the notify routine, memory_failure will fall back to the generic hanlder. Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:30 -07:00
Shiyang Ruan	00cc790e00	mm: factor helpers for memory_failure_dev_pagemap memory_failure_dev_pagemap code is a bit complex before introduce RMAP feature for fsdax. So it is needed to factor some helper functions to simplify these code. [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build] [zhengbin13@huawei.com: fix redefinition of mf_generic_kill_procs] Link: https://lkml.kernel.org/r/20220628112143.1170473-1-zhengbin13@huawei.com Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:30 -07:00
Alistair Popple	b05a79d437	mm/gup: migrate device coherent pages when pinning instead of failing Currently any attempts to pin a device coherent page will fail. This is because device coherent pages need to be managed by a device driver, and pinning them would prevent a driver from migrating them off the device. However this is no reason to fail pinning of these pages. These are coherent and accessible from the CPU so can be migrated just like pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin them first try migrating them out of ZONE_DEVICE. [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c] Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:28 -07:00
Alex Sierra	dd19e6d8ff	mm: add device coherent vma selection for memory migration This case is used to migrate pages from device memory, back to system memory. Device coherent type memory is cache coherent from device and CPU point of view. Link: https://lkml.kernel.org/r/20220715150521.18165-6-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Poppple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:28 -07:00
Alex Sierra	3218f8712d	mm: handling Non-LRU pages returned by vm_normal_pages With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. Callers to follow_page() currently don't expect ZONE_DEVICE pages, however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well. Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> [v2] Reviewed-by: Alistair Popple <apopple@nvidia.com> [v6] Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:28 -07:00
Alex Sierra	f25cbb7a95	mm: add zone device coherent type memory support Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page] Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:27 -07:00
Alex Sierra	6077c943be	mm: rename is_pinnable_page() to is_longterm_pinnable_page() Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory mapping", v9. This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE. This patch series is mostly self-contained except for a few places where it needs to update other subsystems to handle the new memory type. System stability and performance are not affected according to our ongoing testing, including xfstests. How it works: The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for this hardware page migration capability is the Frontier supercomputer project. This functionality is not AMD-specific. We expect other GPU vendors to find this functionality useful, and possibly other hardware types in the future. Our test nodes in the lab are similar to the Frontier configuration, with .5 TB of system memory plus 256 GB of device memory split across 4 GPUs, all in a single coherent address space. Page migration is expected to improve application efficiency significantly. We will report empirical results as they become available. Coherent device type pages at gup are now migrated back to system memory if they are being pinned long-term (FOLL_LONGTERM). The reason is, that long-term pinning would interfere with the device memory manager owning the device-coherent pages (e.g. evictions in TTM). These series incorporate Alistair Popple patches to do this migration from pin_user_pages() calls. hmm_gup_test has been added to hmm-test to test different get user pages calls. This series includes handling of device-managed anonymous pages returned by vm_normal_pages. Although they behave like normal pages for purposes of mapping in CPU page tables and for COW, they do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. This patch (of 14): is_pinnable_page() and folio_is_pinnable() are renamed to is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively. These functions are used in the FOLL_LONGTERM flag context. Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:27 -07:00
SeongJae Park	ec1658f0f9	mm/damon/lru_sort: fix potential memory leak in damon_lru_sort_init() damon_lru_sort_init() returns an error when damon_select_ops() fails without freeing 'ctx' which allocated before. This commit fixes the potential memory leak by freeing 'ctx' under the situation. Link: https://lkml.kernel.org/r/20220714170458.49727-1-sj@kernel.org Fixes: `40e983cca9` ("mm/damon: introduce DAMON-based LRU-lists Sorting") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:27 -07:00
Patricia Alfonso	5b301409e8	UML: add support for KASAN under x86_64 Make KASAN run on User Mode Linux on x86_64. The UML-specific KASAN initializer uses mmap to map the ~16TB of shadow memory to the location defined by KASAN_SHADOW_OFFSET. kasan_init() utilizes constructors to initialize KASAN before main(). The location of the KASAN shadow memory, starting at KASAN_SHADOW_OFFSET, can be configured using the KASAN_SHADOW_OFFSET option. The default location of this offset is 0x100000000000, which keeps it out-of-the-way even on UML setups with more "physical" memory. For low-memory setups, 0x7fff8000 can be used instead, which fits in an immediate and is therefore faster, as suggested by Dmitry Vyukov. There is usually enough free space at this location; however, it is a config option so that it can be easily changed if needed. Note that, unlike KASAN on other architectures, vmalloc allocations still use the shadow memory allocated upfront, rather than allocating and free-ing it per-vmalloc allocation. If another architecture chooses to go down the same path, we should replace the checks for CONFIG_UML with something more generic, such as: - A CONFIG_KASAN_NO_SHADOW_ALLOC option, which architectures could set - or, a way of having architecture-specific versions of these vmalloc and module shadow memory allocation options. Also note that, while UML supports both KASAN in inline mode (CONFIG_KASAN_INLINE) and static linking (CONFIG_STATIC_LINK), it does not support both at the same time. Signed-off-by: Patricia Alfonso <trishalfonso@google.com> Co-developed-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2022-07-17 23:35:22 +02:00
Catalin Marinas	6d05141a39	mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON Currently post_alloc_hook() skips the kasan unpoisoning if the tags will be zeroed (__GFP_ZEROTAGS) or __GFP_SKIP_KASAN_UNPOISON is passed. Since __GFP_ZEROTAGS is now accompanied by __GFP_SKIP_KASAN_UNPOISON, remove the extra check. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20220610152141.2148929-4-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>	2022-07-07 10:48:37 +01:00
Catalin Marinas	70c248aca9	mm: kasan: Skip unpoisoning of user pages Commit `c275c5c6d5` ("kasan: disable freed user page poisoning with HW tags") added __GFP_SKIP_KASAN_POISON to GFP_HIGHUSER_MOVABLE. A similar argument can be made about unpoisoning, so also add __GFP_SKIP_KASAN_UNPOISON to user pages. To ensure the user page is still accessible via page_address() without a kasan fault, reset the page->flags tag. With the above changes, there is no need for the arm64 tag_clear_highpage() to reset the page->flags tag. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20220610152141.2148929-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>	2022-07-07 10:48:37 +01:00
Catalin Marinas	ed0a6d1d97	mm: kasan: Ensure the tags are visible before the tag in page->flags __kasan_unpoison_pages() colours the memory with a random tag and stores it in page->flags in order to re-create the tagged pointer via page_to_virt() later. When the tag from the page->flags is read, ensure that the in-memory tags are already visible by re-ordering the page_kasan_tag_set() after kasan_unpoison(). The former already has barriers in place through try_cmpxchg(). On the reader side, the order is ensured by the address dependency between page->flags and the memory access. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lore.kernel.org/r/20220610152141.2148929-2-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>	2022-07-07 10:48:37 +01:00
Muchun Song	b77d5b1b83	mm: slab: optimize memcg_slab_free_hook() Most callers of memcg_slab_free_hook() already know the slab, which could be passed to memcg_slab_free_hook() directly to reduce the overhead of an another call of virt_to_slab(). For bulk freeing of objects, the call of slab_objcgs() in the loop in memcg_slab_free_hook() is redundant as well. Rework memcg_slab_free_hook() and build_detached_freelist() to reduce those unnecessary overhead and make memcg_slab_free_hook() can handle bulk freeing in slab_free(). Move the calling site of memcg_slab_free_hook() from do_slab_free() to slab_free() for slub to make the code clearer since the logic is weird (e.g. the caller need to judge whether it needs to call memcg_slab_free_hook()). It is easy to make mistakes like missing calling of memcg_slab_free_hook() like fixes of: commit `d1b2cf6cb8` ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") commit `ae085d7f93` ("mm: kfence: fix missing objcg housekeeping for SLAB") This optimization is mainly for bulk objects freeing. The following numbers is shown for 16-object freeing. before after kmem_cache_free_bulk: ~430 ns ~400 ns The overhead is reduced by about 7% for 16-object freeing. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/20220429123044.37885-1-songmuchun@bytedance.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-04 17:13:05 +02:00
Vasily Averin	b347aa7b57	mm/tracing: add 'accounted' entry into output of allocation tracepoints Slab caches marked with SLAB_ACCOUNT force accounting for every allocation from this cache even if __GFP_ACCOUNT flag is not passed. Unfortunately, at the moment this flag is not visible in ftrace output, and this makes it difficult to analyze the accounted allocations. This patch adds boolean "accounted" entry into trace output, and set it to 'true' for calls used __GFP_ACCOUNT flag and for allocations from caches marked with SLAB_ACCOUNT. Set it to 'false' if accounting is disabled in configs. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/c418ed25-65fe-f623-fbf8-1676528859ed@openvz.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-04 17:11:27 +02:00
Xiongwei Song	efb9352700	mm/slub: Simplify __kmem_cache_alias() There is no need to do anything if sysfs_slab_alias() return nonzero value after getting a mergeable cache. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/all/e5ebc952-af17-321f-5343-bc914d47c931@suse.cz/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-04 17:08:47 +02:00
Jiapeng Chong	d1ca263d0d	mm, slab: fix bad alignments As reported by coccicheck: ./mm/slab.c:3253:2-59: code aligned with following code on line 3255. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2022-07-04 17:04:37 +02:00
Miaohe Lin	1baec203b7	mm/khugepaged: try to free transhuge swapcache when possible Transhuge swapcaches won't be freed in __collapse_huge_page_copy(). It's because release_pte_page() is not called for these pages and thus free_page_and_swap_cache can't grab the page lock. These pages won't be freed from swap cache even if we are the only user until next time reclaim. It shouldn't hurt indeed, but we could try to free these pages to save more memory for system. Link: https://lkml.kernel.org/r/20220625092816.4856-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:52 -07:00
Miaohe Lin	081c32564b	mm/khugepaged: remove unneeded return value of khugepaged_add_pte_mapped_thp() The return value of khugepaged_add_pte_mapped_thp() is always 0 and also ignored. Remove it to clean up the code. Link: https://lkml.kernel.org/r/20220625092816.4856-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
Miaohe Lin	6dcdc94db1	mm/khugepaged: use helper macro __ATTR_RW Use helper macro __ATTR_RW to define the khugepaged attributes. Minor readability improvement. Link: https://lkml.kernel.org/r/20220625092816.4856-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
Miaohe Lin	2f55f070e5	mm/khugepaged: minor cleanup for collapse_file nr_none is always 0 for non-shmem case because the page can be read from the backend store. So when nr_none ! = 0, it must be in is_shmem case. Also only adjust the nrpages and uncharge shmem when nr_none != 0 to save cpu cycles. Link: https://lkml.kernel.org/r/20220625092816.4856-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
Miaohe Lin	36ee2c784a	mm/khugepaged: trivial typo and codestyle cleanup Fix some typos and tweak the code to meet codestyle. No functional change intended. Link: https://lkml.kernel.org/r/20220625092816.4856-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
Miaohe Lin	4d928e20fd	mm/khugepaged: stop swapping in page when VM_FAULT_RETRY occurs When do_swap_page returns VM_FAULT_RETRY, we do not retry here and thus swap entry will remain in pagetable. This will result in later failure. So stop swapping in pages in this case to save cpu cycles. As A further optimization, mmap_lock is released when __collapse_huge_page_swapin() fails to avoid relocking mmap_lock. And "swapped_in++" is moved after error handling to make it more accurate. Link: https://lkml.kernel.org/r/20220625092816.4856-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
Miaohe Lin	dd5ff79d4a	mm/khugepaged: remove unneeded shmem_huge_enabled() check Patch series "A few cleanup patches for khugepaged", v2. This series contains a few cleaup patches to remove unneeded return value, use helper macro, fix typos and so on. More details can be found in the respective changelogs. This patch (of 7): If we reach here, khugepaged_scan_mm_slot() has already made sure that hugepage is enabled for shmem, via its call to hugepage_vma_check(). Remove this duplicated check. Link: https://lkml.kernel.org/r/20220625092816.4856-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220625092816.4856-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:51 -07:00
XueBing Chen	f673bd7c26	mm: sparsemem: drop unexpected word 'a' in comments there is an unexpected word 'a' in the comments that need to be dropped Link: https://lkml.kernel.org/r/24fbdae3.c86.1819a0f31b9.Coremail.chenxuebing@jari.cn Signed-off-by: XueBing Chen <chenxuebing@jari.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:50 -07:00
Qi Zheng	18f3962953	mm: hugetlb: kill set_huge_swap_pte_at() Commit `e5251fd430` ("mm/hugetlb: introduce set_huge_swap_pte_at() helper") add set_huge_swap_pte_at() to handle swap entries on architectures that support hugepages consisting of contiguous ptes. And currently the set_huge_swap_pte_at() is only overridden by arm64. set_huge_swap_pte_at() provide a sz parameter to help determine the number of entries to be updated. But in fact, all hugetlb swap entries contain pfn information, so we can find the corresponding folio through the pfn recorded in the swap entry, then the folio_size() is the number of entries that need to be updated. And considering that users will easily cause bugs by ignoring the difference between set_huge_swap_pte_at() and set_huge_pte_at(). Let's handle swap entries in set_huge_pte_at() and remove the set_huge_swap_pte_at(), then we can call set_huge_pte_at() anywhere, which simplifies our coding. Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:50 -07:00
Yang Yang	ade63b419c	mm/page_alloc: make the annotations of available memory more accurate Not all systems use swap, so estimating available memory would help to prevent swapping or OOM of system that not use swap. And we need to reserve some page cache to prevent swapping or thrashing. If somebody is accessing the pages in pagecache, and if too much would be freed, most accesses might mean reading data from disk, i.e. thrashing. Link: https://lkml.kernel.org/r/20220623020833.972979-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:50 -07:00
Yun-Ze Li	e8da368a1e	mm, docs: fix comments that mention mem_hotplug_end() Comments that mention mem_hotplug_end() are confusing as there is no function called mem_hotplug_end(). Fix them by replacing all the occurences of mem_hotplug_end() in the comments with mem_hotplug_done(). [akpm@linux-foundation.org: grammatical fixes] Link: https://lkml.kernel.org/r/20220620071516.1286101-1-p76091292@gs.ncku.edu.tw Signed-off-by: Yun-Ze Li <p76091292@gs.ncku.edu.tw> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:50 -07:00
Baolin Wang	0506c31d0a	mm: rmap: simplify the hugetlb handling when unmapping or migration According to previous discussion [1], there are so many levels of indenting to handle the hugetlb case when unmapping or migration. We can combine folio_test_anon() and huge_pmd_unshare() to save one level of indenting, by adding a local variable and moving the VM_BUG_ON() a little forward. No intended functional changes in this patch. [1] https://lore.kernel.org/all/0b986dc4-5843-3e2d-c2df-5a2e9f13e6ab@oracle.com/ Link: https://lkml.kernel.org/r/28414b1b96f095e838c1e548074f8e0fc70d78cf.1655724713.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
Miaohe Lin	f7cc67ae7f	mm/madvise: minor cleanup for swapin_walk_pmd_entry() Passing index to pte_offset_map_lock() directly so the below calculation can be avoided. Rename orig_pte to ptep as it's not changed. Also use helper is_swap_pte() to improve the readability. No functional change intended. [akpm@linux-foundation.org: reduce scope of `ptep'] Link: https://lkml.kernel.org/r/20220618090527.37843-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
Muchun Song	dc2628f395	mm: hugetlb: remove minimum_order variable commit `641844f561` ("mm/hugetlb: introduce minimum hugepage order") fixed a static checker warning and introduced a global variable minimum_order to fix the warning. However, the local variable in dissolve_free_huge_pages() can be initialized to huge_page_order(&default_hstate) to fix the warning. So remove minimum_order to simplify the code. Link: https://lkml.kernel.org/r/20220616033846.96937-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
Muchun Song	6636109512	mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
Muchun Song	ed7802dd48	mm: memory_hotplug: enumerate all supported section flags Patch series "make hugetlb_optimize_vmemmap compatible with memmap_on_memory", v3. This series makes hugetlb_optimize_vmemmap compatible with memmap_on_memory. This patch (of 2): We are almost running out of section flags, only one bit is available in the worst case (powerpc with 256k pages). However, there are still some free bits (in ->section_mem_map) on other architectures (e.g. x86_64 has 10 bits available, arm64 has 8 bits available with worst case of 64K pages). We have hard coded those numbers in code, it is inconvenient to use those bits on other architectures except powerpc. So transfer those section flags to enumeration to make it easy to add new section flags in the future. Also, move SECTION_TAINT_ZONE_DEVICE into the scope of CONFIG_ZONE_DEVICE to save a bit on non-zone-device case. [songmuchun@bytedance.com: replace enum with defines per David] Link: https://lkml.kernel.org/r/20220620110616.12056-2-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
Matthew Wilcox (Oracle)	ceff9d3354	mm/swap: convert __delete_from_swap_cache() to a folio All callers now have a folio, so convert the entire function to operate on folios. Link: https://lkml.kernel.org/r/20220617175020.717127-23-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:48 -07:00
Matthew Wilcox (Oracle)	75fa68a5d8	mm/swap: convert delete_from_swap_cache() to take a folio All but one caller already has a folio, so convert it to use a folio. Link: https://lkml.kernel.org/r/20220617175020.717127-22-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:48 -07:00
Matthew Wilcox (Oracle)	b98c359f1d	mm: convert page_swap_flags to folio_swap_flags The only caller already has a folio, so push the folio->page conversion down a level. Link: https://lkml.kernel.org/r/20220617175020.717127-21-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:48 -07:00
Matthew Wilcox (Oracle)	5375336c8c	mm: convert destroy_compound_page() to destroy_large_folio() All callers now have a folio, so push the folio->page conversion down to this function. [akpm@linux-foundation.org: uninline destroy_large_folio() to fix build issue] Link: https://lkml.kernel.org/r/20220617175020.717127-20-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:48 -07:00
Matthew Wilcox (Oracle)	188e8caee9	mm/swap: convert __page_cache_release() to use a folio All the callers now have a folio. Saves several calls to compound_head, totalling 502 bytes of text. Link: https://lkml.kernel.org/r/20220617175020.717127-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:48 -07:00
Matthew Wilcox (Oracle)	5ef82fe7f6	mm/swap: convert __put_compound_page() to __folio_put_large() All the callers now have a folio, so pass it in. This doesn't save any text, but it does save a call to compound_head() as folio_test_hugetlb() does not contain a call like PageHuge() does. Link: https://lkml.kernel.org/r/20220617175020.717127-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	83d9965995	mm/swap: convert __put_single_page() to __folio_put_small() Saves 56 bytes of text by removing a call to compound_head(). Link: https://lkml.kernel.org/r/20220617175020.717127-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	8d29c7036f	mm/swap: convert __put_page() to __folio_put() Saves 11 bytes of text by removing a check of PageTail. Link: https://lkml.kernel.org/r/20220617175020.717127-16-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	2f58e5de66	mm/swap: convert put_pages_list to use folios Pages linked through the LRU list cannot be tail pages as ->compound_head is in a union with one of the words of the list_head, and they cannot be ZONE_DEVICE pages as ->pgmap is in a union with the same word. Saves 60 bytes of text by removing a call to page_is_fake_head(). Link: https://lkml.kernel.org/r/20220617175020.717127-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	ab5e653ee8	mm/swap: convert release_pages to use a folio internally This function was already calling compound_head(), but now it can cache the result of calling compound_head() and avoid calling it again. Saves 299 bytes of text by avoiding various calls to compound_page() and avoiding checks of PageTail. Link: https://lkml.kernel.org/r/20220617175020.717127-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	2397f780e1	mm/swap: convert try_to_free_swap to use a folio Save a few calls to compound_head by converting the passed page to a folio. Reduces kernel text size by 74 bytes. Link: https://lkml.kernel.org/r/20220617175020.717127-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:47 -07:00
Matthew Wilcox (Oracle)	a2d33b5dd6	mm/swap: optimise lru_add_drain_cpu() Do the per-cpu dereferencing of the fbatches once which saves 14 bytes of text and several percpu relocations. Link: https://lkml.kernel.org/r/20220617175020.717127-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	4864545a46	mm/swap: pull the CPU conditional out of __lru_add_drain_all() The function is too long, so pull this complicated conditional out into cpu_needs_drain(). This ends up shrinking the text by 14 bytes, by allowing GCC to cache the result of calling per_cpu() instead of relocating each lookup individually. Link: https://lkml.kernel.org/r/20220617175020.717127-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	82ac64d86f	mm/swap: rename lru_pvecs to cpu_fbatches No change to generated code, but this struct no longer contains any pagevecs, and not all the folio batches it contains are lru. Link: https://lkml.kernel.org/r/20220617175020.717127-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	3a44610b12	mm/swap: convert activate_page to a folio_batch Rename it to just 'activate', saving 696 bytes of text from removals of compound_page() and the pagevec_lru_move_fn() infrastructure. Inline need_activate_page_drain() into its only caller. Link: https://lkml.kernel.org/r/20220617175020.717127-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	cec394bafa	mm/swap: convert lru_lazyfree to a folio_batch Using folios instead of pages removes several calls to compound_head(), shrinking the kernel by 1089 bytes of text. Link: https://lkml.kernel.org/r/20220617175020.717127-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	85cd7791a8	mm/swap: convert lru_deactivate to a folio_batch Using folios instead of pages shrinks deactivate_page() and lru_deactivate_fn() by 778 bytes between them. Link: https://lkml.kernel.org/r/20220617175020.717127-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:46 -07:00
Matthew Wilcox (Oracle)	7a3dbfe8a5	mm/swap: convert lru_deactivate_file to a folio_batch Use a folio throughout lru_deactivate_file_fn(), removing many hidden calls to compound_head(). Shrinks the kernel by 864 bytes of text. Link: https://lkml.kernel.org/r/20220617175020.717127-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:45 -07:00
Matthew Wilcox (Oracle)	70dea5346e	mm/swap: convert lru_add to a folio_batch When adding folios to the LRU for the first time, the LRU flag will already be clear, so skip the test-and-clear part of moving from one LRU to another. Removes 285 bytes from kernel text, mostly due to removing __pagevec_lru_add(). Link: https://lkml.kernel.org/r/20220617175020.717127-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:45 -07:00
Matthew Wilcox (Oracle)	7d80dd096f	mm/swap: make __pagevec_lru_add static __pagevec_lru_add has no callers outside swap.c, so make it static, and move it to a more logical position in the file. Link: https://lkml.kernel.org/r/20220617175020.717127-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:45 -07:00
Matthew Wilcox (Oracle)	c2bc16817a	mm/swap: add folio_batch_move_lru() Start converting the LRU from pagevecs to folio_batches. Combine the functionality of pagevec_add_and_need_flush() with pagevec_lru_move_fn() in the new folio_batch_add_and_move(). Convert the lru_rotate pagevec to a folio_batch. Adds 223 bytes total to kernel text, because we're duplicating infrastructure. This will be more than made up for in future patches. Link: https://lkml.kernel.org/r/20220617175020.717127-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:45 -07:00
Matthew Wilcox (Oracle)	a83f0551f4	mm/vmscan: convert reclaim_pages() to use a folio Remove a few hidden calls to compound_head, saving 76 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
Matthew Wilcox (Oracle)	07f67a8ded	mm/vmscan: convert shrink_active_list() to use a folio Remove a few hidden calls to compound_head, saving 411 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
Matthew Wilcox (Oracle)	ff00a170d9	mm/vmscan: convert move_pages_to_lru() to use a folio Remove a few hidden calls to compound_head, saving 387 bytes of text on my test configuration. Link: https://lkml.kernel.org/r/20220617154248.700416-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
Matthew Wilcox (Oracle)	166e3d3227	mm/vmscan: convert isolate_lru_pages() to use a folio Remove a few hidden calls to compound_head, saving 279 bytes of text. Link: https://lkml.kernel.org/r/20220617154248.700416-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
Matthew Wilcox (Oracle)	b8cecb9376	mm/vmscan: convert reclaim_clean_pages_from_list() to folios Patch series "nvert much of vmscan to folios" vmscan always operates on folios since it puts the pages on the LRU list. Switching all of these functions from pages to folios saves 1483 bytes of text from removing all the baggage around calling compound_page() and similar functions. This patch (of 5): This is a straightforward conversion which removes several hidden calls to compound_head, saving 330 bytes of kernel text. Link: https://lkml.kernel.org/r/20220617154248.700416-1-willy@infradead.org Link: https://lkml.kernel.org/r/20220617154248.700416-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
David Hildenbrand	64fe24a3e0	mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we can try mapping anonymous pages in a private writable mapping writable if they are exclusive, the PTE is already dirty, and no special handling applies. Mapping the anonymous page writable is essentially the same thing the write fault handler would do in this case. Special handling is required for uffd-wp and softdirty tracking, so take care of that properly. Also, leave PROT_NONE handling alone for now; in the future, we could similarly extend the logic in do_numa_page() or use pte_mk_savedwrite() here. While this improves mprotect(PROT_READ)+mprotect(PROT_READ\|PROT_WRITE) performance, it should also be a valuable optimization for uffd-wp, when un-protecting. This has been previously suggested by Peter Collingbourne in [1], relevant in the context of the Scudo memory allocator, before we had PageAnonExclusive. This commit doesn't add the same handling for PMDs (i.e., anonymous THP, anonymous hugetlb); benchmark results from Andrea indicate that there are minor performance gains, so it's might still be valuable to streamline that logic for all anonymous pages in the future. As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename it to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually happening. Micro-benchmark courtesy of Andrea: === #define _GNU_SOURCE #include <sys/mman.h> #include <stdlib.h> #include <string.h> #include <stdio.h> #include <unistd.h> #define SIZE (102410241024) int main(int argc, char argv[]) { char p; if (posix_memalign((void *)&p, sysconf(_SC_PAGESIZE)512, SIZE)) perror("posix_memalign"), exit(1); if (madvise(p, SIZE, argc > 1 ? MADV_HUGEPAGE : MADV_NOHUGEPAGE)) perror("madvise"); explicit_bzero(p, SIZE); for (int loops = 0; loops < 40; loops++) { if (mprotect(p, SIZE, PROT_READ)) perror("mprotect"), exit(1); if (mprotect(p, SIZE, PROT_READ\|PROT_WRITE)) perror("mprotect"), exit(1); explicit_bzero(p, SIZE); } } === Results on my Ryzen 9 3900X: Stock 10 runs (lower is better): AVG 6.398s, STDEV 0.043 Patched 10 runs (lower is better): AVG 3.780s, STDEV 0.026 === [1] https://lkml.kernel.org/r/20210429214801.2583336-1-pcc@google.com Link: https://lkml.kernel.org/r/20220614093629.76309-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Collingbourne <pcc@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:44 -07:00
SeongJae Park	40e983cca9	mm/damon: introduce DAMON-based LRU-lists Sorting Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. However, finding best parameters including the hotness/coldness thresholds, CPU quota, and watermarks could be challenging for some users. To make the scheme easy to be used without complex tuning for common situations, this commit implements a static kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. It proactively sorts LRU-lists using DAMON with conservatively chosen default values of the parameters. That is, the module under its default parameters will make no harm for common situations but provide some level of efficiency improvements for systems having clear hot/cold access pattern under a level of memory pressure while consuming only a limited small portion of CPU time. Link: https://lkml.kernel.org/r/20220613192301.8817-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:43 -07:00
SeongJae Park	99cdc2cd18	mm/damon/schemes: add 'LRU_DEPRIO' action This commit adds a new DAMON-based operation scheme action called 'LRU_DEPRIO' for physical address space. The action deprioritizes pages in the memory area of the target access pattern on their LRU lists. This is hence supposed to be used for rarely accessed (cold) memory regions so that cold pages could be more likely reclaimed first under memory pressure. Internally, it simply calls 'lru_deactivate()'. Using this with 'LRU_PRIO' action for hot pages, users can proactively sort LRU lists based on the access pattern. That is, it can make the LRU lists somewhat more trustworthy source of access temperature. As a result, efficiency of LRU-lists based mechanisms including the reclamation target selection could be improved. Link: https://lkml.kernel.org/r/20220613192301.8817-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:43 -07:00
SeongJae Park	8cdcc53226	mm/damon/schemes: add 'LRU_PRIO' DAMOS action This commit adds a new DAMOS action called 'LRU_PRIO' for the physical address space. The action prioritizes pages in the memory regions of the user-specified target access pattern on their LRU lists. This is hence supposed to be used for frequently accessed (hot) memory regions so that hot pages could be more likely protected under memory pressure. Internally, it simply calls 'mark_page_accessed()'. Link: https://lkml.kernel.org/r/20220613192301.8817-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:42 -07:00
SeongJae Park	0e93e8bfd0	mm/damon/paddr: use a separate function for 'DAMOS_PAGEOUT' handling This commit moves code for 'DAMOS_PAGEOUT' handling of the physical address space monitoring operations set to a separate function so that its caller, 'damon_pa_apply_scheme()', can be more easily extended for additional DAMOS actions later. Link: https://lkml.kernel.org/r/20220613192301.8817-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:42 -07:00
SeongJae Park	c364f9af29	mm/damon/dbgfs: add and use mappings between 'schemes' action inputs and 'damos_action' values Patch series "Extend DAMOS for Proactive LRU-lists Sorting". Introduction ============ In short, this patchset 1) extends DAMON-based Operation Schemes (DAMOS) for low overhead data access pattern based LRU-lists sorting, and 2) implements a static kernel module for easy use of conservatively-tuned version of that using the extended DAMOS capability. Background ---------- As page-granularity access checking overhead could be significant on huge systems, LRU lists are normally not proactively sorted but partially and reactively sorted for special events including specific user requests, system calls and memory pressure. As a result, LRU lists are sometimes not so perfectly prepared to be used as a trustworthy access pattern source for some situations including reclamation target pages selection under sudden memory pressure. DAMON-based Proactive LRU-lists Sorting --------------------------------------- Because DAMON can identify access patterns of best-effort accuracy while inducing only user-specified range of overhead, using DAMON for Proactive LRU-lists Sorting (PLRUS) could be helpful for this situation. The idea is quite simple. Find hot pages and cold pages using DAMON, and prioritize hot pages while deprioritizing cold pages on their LRU-lists. This patchset extends DAMON to support such schemes by introducing a couple of new DAMOS actions for prioritizing and deprioritizing memory regions of specific access patterns on their LRU-lists. In detail, this patchset simply uses 'mark_page_accessed()' and 'deactivate_page()' functions for prioritization and deprioritization of pages on their LRU lists, respectively. To make the scheme easy to use without complex tuning for common situations, this patchset further implements a static kernel module called 'DAMON_LRU_SORT' using the extended DAMOS functionality. It proactively sorts LRU-lists using DAMON with conservatively chosen default hotness/coldness thresholds and small CPU usage quota limit. That is, the module under its default parameters will make no harm for common situation but provide some level of benefit for systems having clear hot/cold access pattern under only memory pressure while consuming only limited small portion of CPU time. Related Works ------------- Proactive reclamation is well known to be helpful for reducing non-optimal reclamation target selection caused performance drops. However, proactive reclamation is not a best option for some cases, because it could incur additional I/O. For an example, it could be prohitive for systems using storage devices that total number of writes is limited, or cloud block storages that charges every I/O. Some proactive reclamation approaches[1,2] induce a level of memory pressure using memcg files or swappiness while monitoring PSI. As reclamation target selection is still relying on the original LRU-lists mechanism, using DAMON-based proactive reclamation before inducing the proactive reclamation could allow more memory saving with same level of performance overhead, or less performance overhead with same level of memory saving. [1] https://blogs.oracle.com/linux/post/anticipating-your-memory-needs [2] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf Evaluation ========== In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. Setup ----- To show the effect of PLRUS, I run PARSEC3 and SPLASH-2X benchmarks under below variant systems and measure a few metrics including the runtime of each workload, number of system-wide major page faults, and system-wide memory PSI (some). - orig: v5.18-rc4 based mm-unstable kernel + this patchset, but no DAMON scheme applied. - mprs: Same to 'orig' but artificial memory pressure is induced. - plrus: Same to 'mprs' but a radically tuned PLRUS scheme is applied to the entire physical address space of the system. For the artificial memory pressure, I set 'memory.limit_in_bytes' to 75% of the running workload's peak RSS, wait 1 second, remove the pressure by setting it to 200% of the peak RSS, wait 10 seconds, and repeat the procedure until the workload finishes[1]. I use zram based swap device. The tests are automated[2]. [1] https://github.com/awslabs/damon-tests/blob/next/perf/runners/back/0009_memcg_pressure.sh [2] https://github.com/awslabs/damon-tests/blob/next/perf/full_once_config.sh Radically Tuned PLRUS --------------------- To show effect of PLRUS on the PARSEC3/SPLASH-2X workloads which runs for no long time, we use radically tuned version of PLRUS. The version asks DAMON to do the proactive LRU-lists sorting as below. 1. Find any memory regions shown some accesses (approximately >=20 accesses per 100 sampling) and prioritize pages of the regions on their LRU lists using up to 2% CPU time. Under the CPU time limit, prioritize regions having higher access frequency and kept the access frequency longer first. 2. Find any memory regions shown no access for at least >=5 seconds and deprioritize pages of the rgions on their LRU lists using up to 2% CPU time. Under the CPU time limit, deprioritize regions that not accessed for longer time first. Results ------- I repeat the tests 25 times and calculate average of the measured numbers. The results are as below: metric orig mprs plrus plrus/mprs runtime_seconds 190.06 292.83 281.87 0.96 pgmajfaults 852.55 8769420.00 7525040.00 0.86 memory_psi_some_us 106911.00 6943420.00 6220920.00 0.90 The first row is for legend. The first cell shows the metric that the following cells of the row shows. Second, third, and fourth cells show the metrics under the configs shown at the first row of the cell, and the fifth cell shows the metric under 'plrus' divided by the metric under 'mprs'. Second row shows the averaged runtime of the workloads in seconds. Third row shows the number of system-wide major page faults while the test was ongoing. Fourth row shows the system-wide memory pressure stall for some processes in microseconds while the test was ongoing. In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. We also confirmed the CPU usage of kdamond was 2.61% of single CPU, which is below 4% as expected. Sequence of Patches =================== The first and second patch cleans up DAMON debugfs interface and DAMOS_PAGEOUT handling code of physical address space monitoring operations implementation for easier extension of the code. The thrid and fourth patches implement a new DAMOS action called 'lru_prio', which prioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The fifth and sixth patches implement yet another new DAMOS action called 'lru_deprio', which deprioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The seventh patch implements a static kernel module called 'damon_lru_sort', which utilizes the DAMON-based proactive LRU-lists sorting under conservatively chosen default parameter. Finally, the eighth patch documents 'damon_lru_sort'. This patch (of 8): DAMON debugfs interface assumes users will write 'damos_action' value directly to the 'schemes' file. This makes adding new 'damos_action' in the middle of its definition breaks the backward compatibility of DAMON debugfs interface, as values of some 'damos_action' could be changed. To mitigate the situation, this commit adds mappings between the user inputs and 'damos_action' value and makes DAMON debugfs code uses those. Link: https://lkml.kernel.org/r/20220613192301.8817-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220613192301.8817-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:42 -07:00
Miaohe Lin	442701e705	mm/swap: remove swap_cache_info statistics swap_cache_info are not statistics that could be easily used to tune system performance because they are not easily accessile. Also they can't provide really useful info when OOM occurs. Remove these statistics can also help mitigate unneeded global swap_cache_info cacheline contention. Link: https://lkml.kernel.org/r/20220608144031.829-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:42 -07:00
Miaohe Lin	c894530697	mm/swapfile: fix possible data races of inuse_pages si->inuse_pages could still be accessed concurrently now. The plain reads outside si->lock critical section, i.e. swap_show and si_swapinfo, which results in data races. READ_ONCE and WRITE_ONCE is used to fix such data races. Note these data races should be ok because they're just used for showing swap info. [linmiaohe@huawei.com: use WRITE_ONCE to pair with READ_ONCE] Link: https://lkml.kernel.org/r/20220625093346.48894-2-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220608144031.829-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:42 -07:00
Uladzislau Rezki (Sony)	899c6efe58	mm/vmalloc: extend __find_vmap_area() with one more argument __find_vmap_area() finds a "vmap_area" based on passed address. It scan the specific "vmap_area_root" rb-tree. Extend the function with one extra argument, so any tree can be specified where the search has to be done. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	5d7a7c54d3	mm/vmalloc: initialize VA's list node after unlink A vmap_area can travel between different places. For example attached/detached to/from different rb-trees. In order to prevent fancy bugs, initialize a VA's list node after it is removed from the list, so it pairs with VA's rb_node which is also initialized. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	f9863be493	mm/vmalloc: extend __alloc_vmap_area() with extra arguments It implies that __alloc_vmap_area() allocates only from the global vmap space, therefore a list-head and rb-tree, which represent a free vmap space, are not passed as parameters to this function and are accessed directly from this function. Extend the __alloc_vmap_area() and other dependent functions to have a possibility to allocate from different trees making an interface common and not specific. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	8eb510db21	mm/vmalloc: make link_va()/unlink_va() common to different rb_root Patch series "Reduce a vmalloc internal lock contention preparation work". This small serias is preparation work to implement per-cpu vmalloc allocation in order to reduce a high internal lock contention. This series does not introduce any functional changes, it is only about preparation. This patch (of 5): Currently link_va() and unlik_va(), in order to figure out a tree type, compares a passed root value with a global free_vmap_area_root variable to distinguish the augmented rb-tree from a regular one. It is hard coded since such functions can manipulate only with specific "free_vmap_area_root" tree that represents a global free vmap space. Make it common by introducing "_augment" versions of both internal functions, so it is possible to deal with different trees. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20220607093449.3100-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Roman Gushchin	bbf535fd6f	mm: shrinkers: add scan interface for shrinker debugfs Add a scan interface which allows to trigger scanning of a particular shrinker and specify memcg and numa node. It's useful for testing, debugging and profiling of a specific scan_objects() callback. Unlike alternatives (creating a real memory pressure and dropping caches via /proc/sys/vm/drop_caches) this interface allows to interact with only one shrinker at once. Also, if a shrinker is misreporting the number of objects (as some do), it doesn't affect scanning. [roman.gushchin@linux.dev: improve typing, fix arg count checking] Link: https://lkml.kernel.org/r/YpgKttTowT22mKPQ@carbon [akpm@linux-foundation.org: fix arg count checking] Link: https://lkml.kernel.org/r/20220601032227.4076670-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Roman Gushchin	e33c267ab7	mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Roman Gushchin	5035ebc644	mm: shrinkers: introduce debugfs interface for memory shrinkers This commit introduces the /sys/kernel/debug/shrinker debugfs interface which provides an ability to observe the state of individual kernel memory shrinkers. Because the feature adds some memory overhead (which shouldn't be large unless there is a huge amount of registered shrinkers), it's guarded by a config option (enabled by default). This commit introduces the "count" interface for each shrinker registered in the system. The output is in the following format: <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1>... ... To reduce the size of output on machines with many thousands cgroups, if the total number of objects on all nodes is 0, the line is omitted. If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed as cgroup inode id. If the shrinker is not numa-aware, 0's are printed for all nodes except the first one. This commit gives debugfs entries simple numeric names, which are not very convenient. The following commit in the series will provide shrinkers with more meaningful names. [akpm@linux-foundation.org: remove WARN_ON_ONCE(), per Roman] Reported-by: syzbot+300d27c79fe6d4cbcc39@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/20220601032227.4076670-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Roman Gushchin	c15187a4a2	mm: memcontrol: introduce mem_cgroup_ino() and mem_cgroup_get_from_ino() Patch series "mm: introduce shrinker debugfs interface", v5. The only existing debugging mechanism is a couple of tracepoints in do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't covering everything though: shrinkers which report 0 objects will never show up, there is no support for memcg-aware shrinkers. Shrinkers are identified by their scan function, which is not always enough (e.g. hard to guess which super block's shrinker it is having only "super_cache_scan"). To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab. For each shrinker registered in the system a directory is created. As now, the directory will contain only a "scan" file, which allows to get the number of managed objects for each memory cgroup (for memcg-aware shrinkers) and each numa node (for numa-aware shrinkers on a numa machine). Other interfaces might be added in the future. To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have meaningful names. This patch (of 5): Shrinker debugfs requires a way to represent memory cgroups without using full paths, both for displaying information and getting input from a user. Cgroup inode number is a perfect way, already used by bpf. This commit adds a couple of helper functions which will be used to handle memcg-aware shrinkers. Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Tianyu Li	000eca5d04	mm/mempolicy: fix get_nodes out of bound access When user specified more nodes than supported, get_nodes will access nmask array out of bounds. Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com Fixes: `e130242dc3` ("mm: simplify compat numa syscalls") Signed-off-by: Tianyu Li <tianyu.li@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
Baolin Wang	8edaec0756	mm/hugetlb: remove unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte() There is no need to update the hugetlb access flags after just setting the hugetlb page table entry by set_huge_pte_at(), since the page table entry value has no changes. Thus remove the unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte(). Link: https://lkml.kernel.org/r/f3e28b897b53a69967a8b98a6fdcda3be80c9229.1653616175.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
Andrey Konovalov	6c2f761dad	kasan: fix zeroing vmalloc memory with HW_TAGS HW_TAGS KASAN skips zeroing page_alloc allocations backing vmalloc mappings via __GFP_SKIP_ZERO. Instead, these pages are zeroed via kasan_unpoison_vmalloc() by passing the KASAN_VMALLOC_INIT flag. The problem is that __kasan_unpoison_vmalloc() does not zero pages when either kasan_vmalloc_enabled() or is_vmalloc_or_module_addr() fail. Thus: 1. Change __vmalloc_node_range() to only set KASAN_VMALLOC_INIT when __GFP_SKIP_ZERO is set. 2. Change __kasan_unpoison_vmalloc() to always zero pages when the KASAN_VMALLOC_INIT flag is set. 3. Add WARN_ON() asserts to check that KASAN_VMALLOC_INIT cannot be set in other early return paths of __kasan_unpoison_vmalloc(). Also clean up the comment in __kasan_unpoison_vmalloc. Link: https://lkml.kernel.org/r/4bc503537efdc539ffc3f461c1b70162eea31cf6.1654798516.git.andreyknvl@google.com Fixes: `23689e91fb` ("kasan, vmalloc: add vmalloc tagging for HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
Andrey Konovalov	d9da8f6cf5	mm: introduce clear_highpage_kasan_tagged Add a clear_highpage_kasan_tagged() helper that does clear_highpage() on a page potentially tagged by KASAN. This helper is used by the following patch. Link: https://lkml.kernel.org/r/4471979b46b2c487787ddcd08b9dc5fedd1b6ffd.1654798516.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
Andrey Konovalov	aeaec8e27e	mm: rename kernel_init_free_pages to kernel_init_pages Rename kernel_init_free_pages() to kernel_init_pages(). This function is not only used for free pages but also for pages that were just allocated. Link: https://lkml.kernel.org/r/1ecaffc0a9c1404d4d7cf52efe0b2dc8a0c681d8.1654798516.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
SeongJae Park	d79905c77f	mm/damon/reclaim: add 'damon_reclaim_' prefix to 'enabled_store()' This commit adds 'damon_reclaim_' prefix to 'enabled_store()', so that we can distinguish it easily from the stack trace using 'faddr2line.sh' like tools. Link: https://lkml.kernel.org/r/20220606182310.48781-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:38 -07:00
SeongJae Park	f943e7e3a4	mm/damon/reclaim: make 'enabled' checking timer simpler DAMON_RECLAIM's 'enabled' parameter store callback ('enabled_store()') schedules the parameter check timer ('damon_reclaim_timer') if the parameter is set as 'Y'. Then, the timer schedules itself to check if user has set the parameter as 'N'. It's unnecessarily complex. This commit makes it simpler by making the parameter store callback to schedule the timer regardless of the parameter value and disabling the timer's self scheduling. Link: https://lkml.kernel.org/r/20220606182310.48781-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:38 -07:00
SeongJae Park	a79b68ee3e	mm/damon/sysfs: deduplicate inputs applying DAMON sysfs interface's DAMON context building and its online parameter update have duplicated code. This commit removes the duplicate. Link: https://lkml.kernel.org/r/20220606182310.48781-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:38 -07:00
SeongJae Park	f25ab3bdfb	mm/damon/reclaim: deduplicate 'commit_inputs' handling DAMON_RECLAIM's handling of 'commit_inputs' parameter is duplicated in 'after_aggregation()' and 'after_wmarks_check()' callbacks. This commit deduplicates the code for better maintenance. Link: https://lkml.kernel.org/r/20220606182310.48781-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:38 -07:00
SeongJae Park	c9e124e038	mm/damon/{dbgfs,sysfs}: move target_has_pid() from dbgfs to damon.h The function for knowing if given monitoring context's targets will have pid or not is defined and used in dbgfs only. However, the logic is also needed for sysfs. This commit moves the code to damon.h and makes both dbgfs and sysfs to use it. Link: https://lkml.kernel.org/r/20220606182310.48781-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:38 -07:00
Miaohe Lin	ad1ac596e8	mm/migration: fix potential pte_unmap on an not mapped pte __migration_entry_wait and migration_entry_wait_on_locked assume pte is always mapped from caller. But this is not the case when it's called from migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue. Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com Fixes: `30dad30922` ("mm: migration: add migrate_entry_wait_huge()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Miaohe Lin	7ce82f4c3f	mm/migration: return errno when isolate_huge_page failed We might fail to isolate huge page due to e.g. the page is under migration which cleared HPageMigratable. We should return errno in this case rather than always return 1 which could confuse the user, i.e. the caller might think all of the memory is migrated while the hugetlb page is left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve the readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: `e8db67eb0d` ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Miaohe Lin	160088b3b6	mm/migration: remove unneeded lock page and PageMovable check When non-lru movable page was freed from under us, __ClearPageMovable must have been done. So we can remove unneeded lock page and PageMovable check here. Also free_pages_prepare() will clear PG_isolated for us, so we can further remove ClearPageIsolated as suggested by David. Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Yang Shi	c453d8c7d1	mm/page_vma_mapped.c: check possible huge PMD map with transhuge_vma_suitable() IIUC page_vma_mapped_walk() checks if the vma is possibly huge PMD mapped with transparent_hugepage_active() and "pvmw->nr_pages >= HPAGE_PMD_NR". Actually pvmw->nr_pages is returned by compound_nr() or folio_nr_pages(), so the page should be THP as long as "pvmw->nr_pages >= HPAGE_PMD_NR". And it is guaranteed THP is allocated for valid VMA in the first place. But it may be not PMD mapped if the VMA is file VMA and it is not properly aligned. The transhuge_vma_suitable() is used to do such check, so replace transparent_hugepage_active() to it, which is too heavy and overkilling. Link: https://lkml.kernel.org/r/20220513191705.457775-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Gowans, James	14c99d6594	mm: split huge PUD on wp_huge_pud fallback Currently the implementation will split the PUD when a fallback is taken inside the create_huge_pud function. This isn't where it should be done: the splitting should be done in wp_huge_pud, just like it's done for PMDs. Reason being that if a callback is taken during create, there is no PUD yet so nothing to split, whereas if a fallback is taken when encountering a write protection fault there is something to split. It looks like this was the original intention with the commit where the splitting was introduced, but somehow it got moved to the wrong place between v1 and v2 of the patch series. Rebase mistake perhaps. Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com Fixes: `327e9fd489` ("mm: Split huge pages on write-notify or COW") Signed-off-by: James Gowans <jgowans@amazon.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 15:42:33 -07:00
David Hildenbrand	1118234e4b	mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one() The subpage we calculate is an invalid pointer for device private pages, because device private pages are mapped via non-present device private entries, not ordinary present PTEs. Let's just not compute broken pointers and fixup later. Move the proper assignment of the correct subpage to the beginning of the function and assert that we really only have a single page in our folio. This currently results in a BUG when tying to compute anon_exclusive, because: [ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0 [ 528.739585] #PF: supervisor read access in kernel mode [ 528.745324] #PF: error_code(0x0000) - not-present page [ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0 [ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257 [ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021 [ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000 [ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02 [ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202 [ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0 [ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8 [ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000 [ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540 [ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff [ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000 [ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0 [ 528.874275] PKRU: 55555554 [ 528.877286] Call Trace: [ 528.880016] <TASK> [ 528.882356] ? lock_is_held_type+0xdf/0x130 [ 528.887033] rmap_walk_anon+0x167/0x410 [ 528.891316] try_to_migrate+0x90/0xd0 [ 528.895405] ? try_to_unmap_one+0xe10/0xe10 [ 528.900074] ? anon_vma_ctor+0x50/0x50 [ 528.904260] ? put_anon_vma+0x10/0x10 [ 528.908347] ? invalid_mkclean_vma+0x20/0x20 [ 528.913114] migrate_vma_setup+0x5f4/0x750 [ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm] [ 528.923532] do_swap_page+0xac0/0xe50 [ 528.927623] ? __lock_acquire+0x4b2/0x1ac0 [ 528.932199] __handle_mm_fault+0x949/0x1440 [ 528.936876] handle_mm_fault+0x13f/0x3e0 [ 528.941256] do_user_addr_fault+0x215/0x740 [ 528.945928] exc_page_fault+0x75/0x280 [ 528.950115] asm_exc_page_fault+0x27/0x30 [ 528.954593] RIP: 0033:0x40366b ... Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com Fixes: `6c287605fd` ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 15:42:33 -07:00
Muchun Song	39d35edee4	mm: sparsemem: fix missing higher order allocation splitting Higher order allocations for vmemmap pages from buddy allocator must be able to be treated as indepdenent small pages as they can be freed individually by the caller. There is no problem for higher order vmemmap pages allocated at boot time since each individual small page will be initialized at boot time. However, it will be an issue for memory hotplug case since those higher order vmemmap pages are allocated from buddy allocator without initializing each individual small page's refcount. The system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled if the vmemmap page is freed. Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com Fixes: `d8d55f5616` ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 15:42:32 -07:00
Baolin Wang	ed1523a895	mm/damon: use set_huge_pte_at() to make huge pte old The huge_ptep_set_access_flags() can not make the huge pte old according to the discussion [1], that means we will always mornitor the young state of the hugetlb though we stopped accessing the hugetlb, as a result DAMON will get inaccurate accessing statistics. So changing to use set_huge_pte_at() to make the huge pte old to fix this issue. [1] https://lore.kernel.org/all/Yqy97gXI4Nqb7dYo@arm.com/ Link: https://lkml.kernel.org/r/1655692482-28797-1-git-send-email-baolin.wang@linux.alibaba.com Fixes: `49f4203aae` ("mm/damon: add access checking for hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 15:42:32 -07:00

1 2 3 4 5 ...

18722 Commits