mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-28 22:54:05 +08:00
b8eddff888
15564 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Johannes Weiner
|
b8eddff888 |
mm: memcontrol: add file_thp, shmem_thp to memory.stat
As huge page usage in the page cache and for shmem files proliferates in our production environment, the performance monitoring team has asked for per-cgroup stats on those pages. We already track and export anon_thp per cgroup. We already track file THP and shmem THP per node, so making them per-cgroup is only a matter of switching from node to lruvec counters. All callsites are in places where the pages are charged and locked, so page->memcg is stable. [hannes@cmpxchg.org: add documentation] Link: https://lkml.kernel.org/r/20201026174029.GC548555@cmpxchg.org Link: https://lkml.kernel.org/r/20201022151844.489337-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Hui Su
|
30e6a51dbb |
mm/shmem.c: make shmem_mapping() inline
shmem_mapping() isn't worth an out-of-line call from any callsite. So make it inline by - make shmem_aops global - export shmem_aops - inline the shmem_mapping() and replace the direct call 'shmem_aops' with shmem_mapping() in shmem.c. Link: https://lkml.kernel.org/r/20201115165207.GA265355@rlk Signed-off-by: Hui Su <sh_def@163.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Jeff Layton
|
462680946b |
mm: remove pagevec_lookup_range_nr_tag()
With the merge of commit
|
||
Miaohe Lin
|
661c756643 |
mm/swapfile.c: use memset to fill the swap_map with SWAP_HAS_CACHE
We could use helper memset to fill the swap_map with SWAP_HAS_CACHE instead of a direct loop here to simplify the code. Also we can remove the local variable i and map this way. Link: https://lkml.kernel.org/r/20200921122224.7139-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Miaohe Lin
|
9d9a033403 |
mm/swapfile.c: remove unnecessary out label in __swap_duplicate()
When the code went to the out label, it must have p == NULL. So what out label really does is redundant if check and return err. We should Remove this unnecessary out label because it does not handle resource free and so on. Link: https://lkml.kernel.org/r/20201009130337.29698-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Miaohe Lin
|
e97af69950 |
mm/swap_state: skip meaningless swap cache readahead when ra_info.win == 0
swap_ra_info() may leave ra_info untouched in non_swap_entry() case as page table lock is not held. In this case, we have ra_info.nr_pte == 0 and it is meaningless to continue with swap cache readahead. Skip such ops by init ra_info.win = 1. [akpm@linux-foundation.org: clean up struct init] Link: https://lkml.kernel.org/r/20201009133059.58407-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Miaohe Lin
|
d8aa24e04f |
mm/swapfile.c: use helper function swap_count() in add_swap_count_continuation()
Commit |
||
Ralph Campbell
|
43fbdeb349 |
mm: handle zone device pages in release_pages()
release_pages() is an optimized, inlined version of __put_pages() except that zone device struct pages that are not page_is_devmap_managed() (i.e., memory_type MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA), fall through to the code that could return the zone device page to the page allocator instead of adjusting the pgmap reference count. Clearly these type of pages are not having the reference count decremented to zero via release_pages() or page allocation problems would be seen. Just to be safe, handle the 1 to zero case in release_pages() like __put_page() does. Link: https://lkml.kernel.org/r/20201021194733.11530-1-rcampbell@nvidia.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Jason Gunthorpe
|
4509b42c38 |
mm/gup: combine put_compound_head() and unpin_user_page()
These functions accomplish the same thing but have different
implementations.
unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page() which creates a risk that the page could have been
hot-uplugged from the system.
Fix this by using put_compound_head() as the only implementation.
__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.
Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.
Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
Fixes:
|
||
Jason Gunthorpe
|
52650c8b46 |
mm/gup: remove the vma allocation from gup_longterm_locked()
Long ago there wasn't a FOLL_LONGTERM flag so this DAX check was done by post-processing the VMA list. These days it is trivial to just check each VMA to see if it is DAX before processing it inside __get_user_pages() and return failure if a DAX VMA is encountered with FOLL_LONGTERM. Removing the allocation of the VMA list is a significant speed up for many call sites. Add an IS_ENABLED to vma_is_fsdax so that code generation is unchanged when DAX is compiled out. Remove the dummy version of __gup_longterm_locked() as !CONFIG_CMA already makes memalloc_nocma_save(), check_and_migrate_cma_pages(), and memalloc_nocma_restore() into a NOP. Link: https://lkml.kernel.org/r/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Jason Gunthorpe
|
57efa1fe59 |
mm/gup: prevent gup_fast from racing with COW during fork
Since commit |
||
Jason Gunthorpe
|
c28b1fc703 |
mm/gup: reorganize internal_get_user_pages_fast()
Patch series "Add a seqcount between gup_fast and copy_page_range()", v4. As discussed and suggested by Linus use a seqcount to close the small race between gup_fast and copy_page_range(). Ahmed confirms that raw_write_seqcount_begin() is the correct API to use in this case and it doesn't trigger any lockdeps. I was able to test it using two threads, one forking and the other using ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep made the window large enough to reliably hit to test the logic. This patch (of 2): The next patch in this series makes the lockless flow a little more complex, so move the entire block into a new function and remove a level of indention. Tidy a bit of cruft: - addr is always the same as start, so use start - Use the modern check_add_overflow() for computing end = start + len - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to avoid shift overflow, make the variables unsigned long to avoid coding casts in both places. nr_pinned was missing its cast - The handling of ret and nr_pinned can be streamlined a bit No functional change. Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Barry Song
|
d0de824118 |
mm/gup_test: GUP_TEST depends on DEBUG_FS
Without DEBUG_FS, all the code in gup_benchmark becomes meaningless. For sure kernel provides debugfs stub while DEBUG_FS is disabled, but the point here is that GUP_TEST can do nothing without DEBUG_FS. [song.bao.hua@hisilicon.com: add comment as a prompt to users as commented by John and Randy] Link: https://lkml.kernel.org/r/20201108083732.15336-1-song.bao.hua@hisilicon.com Link: https://lkml.kernel.org/r/20201104100552.20156-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Suggested-by: John Garry <john.garry@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Barry Song
|
afaa78886f |
mm/gup_test.c: mark gup_test_init as __init function
gup_test_init() is only called during initialization, mark it as __init to save some memory. Link: https://lkml.kernel.org/r/20201103081016.16532-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
f4f9bda418 |
selftests/vm: gup_test: introduce the dump_pages() sub-test
For quite a while, I was doing a quick hack to gup_test.c (previously, gup_benchmark.c) whenever I wanted to try out my changes to dump_page(). This makes that hack unnecessary, and instead allows anyone to easily get the same coverage from a user space program. That saves a lot of time because you don't have to change the kernel, in order to test different pages and options. The new sub-test takes advantage of the existing gup_test infrastructure, which already provides a simple user space program, some allocated user space pages, an ioctl call, pinning of those pages (via either get_user_pages or pin_user_pages) and a corresponding kernel-side test invocation. There's not much more required, mainly just a couple of inputs from the user. In fact, the new test re-uses the existing command line options in order to get various helpful combinations (THP or normal, _fast or slow gup, gup vs. pup, and more). New command line options are: which pages to dump, and what type of "get/pin" to use. In order to figure out which pages to dump, the logic is: * If the user doesn't specify anything, the page 0 (the first page in the address range that the program sets up for testing) is dumped. * Or, the user can type up to 8 page indices anywhere on the command line. If you type more than 8, then it uses the first 8 and ignores the remaining items. For example: ./gup_test -ct -F 1 0 19 0x1000 Meaning: -c: dump pages sub-test -t: use THP pages -F 1: use pin_user_pages() instead of get_user_pages() 0 19 0x1000: dump pages 0, 19, and 4096 Link: https://lkml.kernel.org/r/20201026064021.3545418-7-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
a9bed1e1c2 |
selftests/vm: only some gup_test items are really benchmarks
Therefore, some minor cleanup and improvements are in order: 1. Rename the other items appropriately. 2. Stop reporting timing information on the non-benchmark items. It's still being recorded and is available, but there's no point in cluttering up the report with data that no one reasonably needs to check. 3. Don't do iterations, for non-benchmark items. 4. Print out a shorter, more appropriate report for the non-benchmark tests. 5. Add the command that was run, to the report. This really helps, as there are quite a lot of options now. 6. Use a larger integer type for cmd, now that it's being compared Otherwise it doesn't work, because in this case cmd is about 3 billion, which is the perfect size for problems with signed vs unsigned int. Link: https://lkml.kernel.org/r/20201026064021.3545418-6-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
b9dcfdff8b |
selftests/vm: use a common gup_test.h
Avoid the need to copy-paste the gup_test ioctl commands and the struct gup_test definition, between the kernel and the user space application, by providing a new header file for these. This allows easier and safer adding of new ioctl calls, as well as reducing the overall line count. Details: The header file has to be able to compile independently, because of the arguably unfortunate way that the Makefile is written: the Makefile tries to build all of its prerequisites, when really it should be only building the .c files, and leaving the other prerequisites (LOCAL_HDRS) as pure dependencies. That Makefile limitation is probably not worth fixing, but it explains why one of the includes had to be moved into the new header file. Also: simplify the ioctl struct (struct gup_test), by deleting the unused __expansion[10] field. This sort of thing is what you might see in a stable ABI, but this low-level, kernel-developer-oriented selftests/vm system is very much not subject to ABI stability. So "expansion" and "reserved" fields are unnecessary here. Link: https://lkml.kernel.org/r/20201026064021.3545418-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
9c84f22926 |
mm/gup_benchmark: rename to mm/gup_test
Patch series "selftests/vm: gup_test, hmm-tests, assorted improvements", v3. Summary: This series provides two main things, and a number of smaller supporting goodies. The two main points are: 1) Add a new sub-test to gup_test, which in turn is a renamed version of gup_benchmark. This sub-test allows nicer testing of dump_pages(), at least on user-space pages. For quite a while, I was doing a quick hack to gup_test.c whenever I wanted to try out changes to dump_page(). Then Matthew Wilcox asked me what I meant when I said "I used my dump_page() unit test", and I realized that it might be nice to check in a polished up version of that. Details about how it works and how to use it are in the commit description for patch #6 ("selftests/vm: gup_test: introduce the dump_pages() sub-test"). 2) Fixes a limitation of hmm-tests: these tests are incredibly useful, but only if people actually build and run them. And it turns out that libhugetlbfs is a little too effective at throwing a wrench in the works, there. So I've added a little configuration check that removes just two of the 21 hmm-tests, if libhugetlbfs is not available. Further details in the commit description of patch #8 ("selftests/vm: hmm-tests: remove the libhugetlbfs dependency"). Other smaller things that this series does: a) Remove code duplication by creating gup_test.h. b) Clear up the sub-test organization, and their invocation within run_vmtests.sh. c) Other minor assorted improvements. [1] v2 is here: https://lore.kernel.org/linux-doc/20200929212747.251804-1-jhubbard@nvidia.com/ [2] https://lore.kernel.org/r/CAHk-=wgh-TMPHLY3jueHX7Y2fWh3D+nMBqVS__AZm6-oorquWA@mail.gmail.com This patch (of 9): Rename nearly every "gup_benchmark" reference and file name to "gup_test". The one exception is for the actual gup benchmark test itself. The current code already does a *little* bit more than benchmarking, and definitely covers more than get_user_pages_fast(). More importantly, however, subsequent patches are about to add some functionality that is non-benchmark related. Closely related changes: * Kconfig: in addition to renaming the options from GUP_BENCHMARK to GUP_TEST, update the help text to reflect that it's no longer a benchmark-only test. Link: https://lkml.kernel.org/r/20201026064021.3545418-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20201026064021.3545418-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Hailong Liu
|
800bca7c56 |
mm/filemap.c: remove else after a return
The `else' is not useful after a `return' in __lock_page_or_retry(). [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20201202154720.115162-1-carver4lio@163.com Signed-off-by: Hailong Liu<liu.hailong6@zte.com.cn> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Alex Shi
|
649c6dfed0 |
mm/truncate: add parameter explanation for invalidate_mapping_pagevec
To fix a kernel-doc markups issue: mm/truncate.c:646: warning: Function parameter or member 'mapping' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'start' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'end' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'nr_pagevec' not described in 'invalidate_mapping_pagevec' Link: https://lkml.kernel.org/r/1605605088-30668-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Kent Overstreet
|
06c0444290 |
mm/filemap.c: generic_file_buffered_read() now uses find_get_pages_contig
Convert generic_file_buffered_read() to get pages to read from in batches, and then copy data to userspace from many pages at once - in particular, we now don't touch any cachelines that might be contended while we're in the loop to copy data to userspace. This is is a performance improvement on workloads that do buffered reads with large blocksizes, and a very large performance improvement if that file is also being accessed concurrently by different threads. On smaller reads (512 bytes), there's a very small performance improvement (1%, within the margin of error). akpm: kernel test robot found a 32% speedup on one test: https://lkml.kernel.org/r/20201030081456.GY31092@shao2-debian Link: https://lkml.kernel.org/r/20201025212949.602194-3-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Kent Overstreet
|
723ef24b9b |
mm/filemap/c: break generic_file_buffered_read up into multiple functions
Patch series "generic_file_buffered_read() improvements", v2. generic_file_buffered_read() has turned into a real monstrosity to work with. And it's a major performance improvement, for both small random and large sequential reads. On my test box, 4k buffered random reads go from ~150k to ~250k iops, and the improvements to big sequential reads are even bigger. This incorporates the fix for IOCB_WAITQ handling that Jens just posted as well, also factors out lock_page_for_iocb() to improve handling of the various iocb flags. This patch (of 2): This is prep work for changing generic_file_buffered_read() to use find_get_pages_contig() to batch up all the pagecache lookups. This patch should be functionally identical to the existing code and changes as little as of the flow control as possible. More refactoring could be done, this patch is intended to be relatively minimal. Link: https://lkml.kernel.org/r/20201025212949.602194-1-kent.overstreet@gmail.com Link: https://lkml.kernel.org/r/20201025212949.602194-2-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Liam Mark
|
9cc7e96aa8 |
mm/page_owner: record timestamp and pid
Collect the time for each allocation recorded in page owner so that allocation "surges" can be measured. Record the pid for each allocation recorded in page owner so that the source of allocation "surges" can be better identified. The above is very useful when doing memory analysis. On a crash for example, we can get this information from kdump (or ramdump) and parse it to figure out memory allocation problems. Please note that on x86_64 this increases the size of struct page_owner from 16 bytes to 32. Vlastimil: it's not a functionality intended for production, so unless somebody says they need to enable page_owner for debugging and this increase prevents them from fitting into available memory, let's not complicate things with making this optional. [lmark@codeaurora.org: v3] Link: https://lkml.kernel.org/r/20201210160357.27779-1-georgi.djakov@linaro.org Link: https://lkml.kernel.org/r/20201209125153.10533-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Zhenhua Huang
|
7fb7ab6d61 |
mm: fix page_owner initializing issue for arm32
Page owner of pages used by page owner itself used is missing on arm32 targets. The reason is dummy_handle and failure_handle is not initialized correctly. Buddy allocator is used to initialize these two handles. However, buddy allocator is not ready when page owner calls it. This change fixed that by initializing page owner after buddy initialization. The working flow before and after this change are: original logic: 1. allocated memory for page_ext(using memblock). 2. invoke the init callback of page_ext_ops like page_owner(using buddy allocator). 3. initialize buddy. after this change: 1. allocated memory for page_ext(using memblock). 2. initialize buddy. 3. invoke the init callback of page_ext_ops like page_owner(using buddy allocator). with the change, failure/dummy_handle can get its correct value and page owner output for example has the one for page owner itself: Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0() init_page_owner+0x28/0x2f8 invoke_init_callbacks_flatmem+0x24/0x34 start_kernel+0x33c/0x5d8 Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Bharata B Rao
|
045ab8c948 |
mm/slub: let number of online CPUs determine the slub page order
The page order of the slab that gets chosen for a given slab cache depends
on the number of objects that can be fit in the slab while meeting other
requirements. We start with a value of minimum objects based on
nr_cpu_ids that is driven by possible number of CPUs and hence could be
higher than the actual number of CPUs present in the system. This leads
to calculate_order() chosing a page order that is on the higher side
leading to increased slab memory consumption on systems that have bigger
page sizes.
Hence rely on the number of online CPUs when determining the mininum
objects, thereby increasing the chances of chosing a lower conservative
page order for the slab.
Vlastimil said:
"Ideally, we would react to hotplug events and update existing caches
accordingly. But for that, recalculation of order for existing caches
would have to be made safe, while not affecting hot paths. We have
removed the sysfs interface with
|
||
Vlastimil Babka
|
965c484815 |
mm, slub: use kmem_cache_debug_flags() in deactivate_slab()
Commit
|
||
Alexander Popov
|
a32d654db5 |
mm/slab: rerform init_on_free earlier
Currently in CONFIG_SLAB init_on_free happens too late, and heap objects go to the heap quarantine not being erased. Lets move init_on_free clearing before calling kasan_slab_free(). In that case heap quarantine will store erased objects, similarly to CONFIG_SLUB=y behavior. Link: https://lkml.kernel.org/r/20201210183729.1261524-1-alex.popov@linux.com Signed-off-by: Alexander Popov <alex.popov@linux.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vlastimil Babka
|
0c06dd7551 |
mm, slab, slub: clear the slab_cache field when freeing page
The page allocator expects that page->mapping is NULL for a page being freed. SLAB and SLUB use the slab_cache field which is in union with mapping, but before freeing the page, the field is referenced with the "mapping" name when set to NULL. It's IMHO more correct (albeit functionally the same) to use the slab_cache name as that's the field we use in SL*B, and document why we clear it in a comment (we don't clear fields such as s_mem or freelist, as page allocator doesn't care about those). While using the 'mapping' name would automagically keep the code correct if the unions in struct page changed, such changes should be done consciously and needed changes evaluated - the comment should help with that. Link: https://lkml.kernel.org/r/20201210160020.21562-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Bartosz Golaszewski
|
15d5de496b |
mm: slab: clarify krealloc()'s behavior with __GFP_ZERO
Patch series "slab: provide and use krealloc_array()", v3. Andy brought to my attention the fact that users allocating an array of equally sized elements should check if the size multiplication doesn't overflow. This is why we have helpers like kmalloc_array(). However we don't have krealloc_array() equivalent and there are many users who do their own multiplication when calling krealloc() for arrays. This series provides krealloc_array() and uses it in a couple places. A separate series will follow adding devm_krealloc_array() which is needed in the xilinx adc driver. This patch (of 9): __GFP_ZERO is ignored by krealloc() (unless we fall-back to kmalloc() path, in which case it's honored). Point that out in the kerneldoc. Link: https://lkml.kernel.org/r/20201109110654.12547-1-brgl@bgdev.pl Link: https://lkml.kernel.org/r/20201109110654.12547-2-brgl@bgdev.pl Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Christian Knig <christian.koenig@amd.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: James Morse <james.morse@arm.com> Cc: Robert Richter <rric@kernel.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Borislav Petkov <bp@suse.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Takashi Iwai <tiwai@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Hui Su
|
7714304f3b |
mm/slab_common.c: use list_for_each_entry in dump_unreclaimable_slab()
dump_unreclaimable_slab() acquires the slab_mutex first, and it won't remove any slab_caches list entry when itering the slab_caches lists. Thus we do not need list_for_each_entry_safe here, which is against removal of list entry. Link: https://lkml.kernel.org/r/20200926043440.GA180545@rlk Signed-off-by: Hui Su <sh_def@163.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Gerald Schaefer
|
ba9c1201be |
mm/hugetlb: clear compound_nr before freeing gigantic pages
Commit |
||
Kuan-Ying Lee
|
6c82d45c7f |
kasan: fix object remaining in offline per-cpu quarantine
We hit this issue in our internal test. When enabling generic kasan, a kfree()'d object is put into per-cpu quarantine first. If the cpu goes offline, object still remains in the per-cpu quarantine. If we call kmem_cache_destroy() now, slub will report "Objects remaining" error. ============================================================================= BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200 CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2b0 show_stack+0x18/0x68 dump_stack+0xfc/0x168 slab_err+0xac/0xd4 __kmem_cache_shutdown+0x1e4/0x3c8 kmem_cache_destroy+0x68/0x130 test_version_show+0x84/0xf0 module_attr_show+0x40/0x60 sysfs_kf_seq_show+0x128/0x1c0 kernfs_seq_show+0xa0/0xb8 seq_read+0x1f0/0x7e8 kernfs_fop_read+0x70/0x338 vfs_read+0xe4/0x250 ksys_read+0xc8/0x180 __arm64_sys_read+0x44/0x58 el0_svc_common.constprop.0+0xac/0x228 do_el0_svc+0x38/0xa0 el0_sync_handler+0x170/0x178 el0_sync+0x174/0x180 INFO: Object 0x(____ptrval____) @offset=15848 INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172 stack_trace_save+0x9c/0xd0 set_track+0x64/0xf0 alloc_debug_processing+0x104/0x1a0 ___slab_alloc+0x628/0x648 __slab_alloc.isra.0+0x2c/0x58 kmem_cache_alloc+0x560/0x588 test_version_show+0x98/0xf0 module_attr_show+0x40/0x60 sysfs_kf_seq_show+0x128/0x1c0 kernfs_seq_show+0xa0/0xb8 seq_read+0x1f0/0x7e8 kernfs_fop_read+0x70/0x338 vfs_read+0xe4/0x250 ksys_read+0xc8/0x180 __arm64_sys_read+0x44/0x58 el0_svc_common.constprop.0+0xac/0x228 kmem_cache_destroy test_module_slab: Slab cache still has objects Register a cpu hotplug function to remove all objects in the offline per-cpu quarantine when cpu is going offline. Set a per-cpu variable to indicate this cpu is offline. [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug] Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Signed-off-by: Zqiang <qiang.zhang@windriver.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Guangye Yang <guangye.yang@mediatek.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Qian Cai <qcai@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Andrew Morton
|
16c0cc0ce3 |
revert "mm/filemap: add static for function __add_to_page_cache_locked"
Revert commit
|
||
Minchan Kim
|
a68a0262ab |
mm/madvise: remove racy mm ownership check
Jann spotted the security hole due to race of mm ownership check.
If the task is sharing the mm_struct but goes through execve() before
mm_access(), it could skip process_madvise_behavior_valid check. That
makes *any advice hint* to reach into the remote process.
This patch removes the mm ownership check. With it, it will lose the
ability that local process could give *any* advice hint with vector
interface for some reason (e.g., performance). Since there is no
concrete example in upstream yet, it would be better to remove the
abiliity at this moment and need to review when such new advice comes
up.
Fixes:
|
||
Liu Zixian
|
309d08d9b3 |
mm/mmap.c: fix mmap return value when vma is merged after call_mmap()
On success, mmap should return the begin address of newly mapped area,
but patch "mm: mmap: merge vma after call_mmap() if possible" set
vm_start of newly merged vma to return value addr. Users of mmap will
get wrong address if vma is merged after call_mmap(). We fix this by
moving the assignment to addr before merging vma.
We have a driver which changes vm_flags, and this bug is found by our
testcases.
Fixes:
|
||
Mike Kravetz
|
7a5bde3798 |
hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload
using hugetlbfs. In this environment the issue is reproduced by:
- Start a simple pod that uses the recently added HugePages medium
feature (pod yaml attached)
- Start a DPDK app. It doesn't need to run successfully (as in transfer
packets) nor interact with real hardware. It seems just initializing
the EAL layer (which handles hugepage reservation and locking) is
enough to trigger the issue
- Delete the Pod (or let it "Complete").
This would result in a kworker thread going into a tight loop (top output):
1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy
'perf top -g' reports:
- 63.28% 0.01% [kernel] [k] worker_thread
- 49.97% worker_thread
- 52.64% process_one_work
- 62.08% css_killed_work_fn
- hugetlb_cgroup_css_offline
41.52% _raw_spin_lock
- 2.82% _cond_resched
rcu_all_qs
2.66% PageHuge
- 0.57% schedule
- 0.57% __schedule
We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning. Little else can be done on the system as the
cgroup_mutex can not be acquired.
Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.
The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup. This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts. Discussion about what to do with
reservation counts when reparenting was discussed here:
https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
The decision was made to leave a zombie cgroup for with reservation
counts. Unfortunately, the code checking reservation counts was
incorrectly added to hugetlb_cgroup_have_usage.
To fix the issue, simply remove the check for reservation counts. While
fixing this issue, a related bug in hugetlb_cgroup_css_offline was
noticed. The hstate index is not reinitialized each time through the
do-while loop. Fix this as well.
Fixes:
|
||
Alex Shi
|
3351b16af4 |
mm/filemap: add static for function __add_to_page_cache_locked
mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Qian Cai
|
b11a76b37a |
mm/swapfile: do not sleep with a spin lock held
We can't call kvfree() with a spin lock held, so defer it. Fixes a
might_sleep() runtime warning.
Fixes:
|
||
Minchan Kim
|
e91d8d7823 |
mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted. With investigation, I found
below commit calls cond_resched unconditionally so it could make a
problem in atomic context if the task is reschedule.
BUG: sleeping function called from invalid context at mm/vmalloc.c:108
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
3 locks held by memhog/946:
#0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
#1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
#2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
unmap_kernel_range_noflush+0x2eb/0x350
unmap_kernel_range+0x14/0x30
zs_unmap_object+0xd5/0xe0
zram_bvec_rw.isra.0+0x38c/0x8e0
zram_rw_page+0x90/0x101
bdev_write_page+0x92/0xe0
__swap_writepage+0x94/0x4a0
pageout+0xe3/0x3a0
shrink_page_list+0xb94/0xd60
shrink_inactive_list+0x158/0x460
We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.
Even though this option showed some amount improvement(e.g., 30%) in
some arm32 platforms, it has been headache to maintain since it have
abused APIs[1](e.g., unmap_kernel_range in atomic context).
Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly it
has been not default option in zsmalloc, it's time to drop the option
for better maintenance.
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
Fixes:
|
||
Yang Shi
|
8199be001a |
mm: list_lru: set shrinker map bit when child nr_items is not zero
When investigating a slab cache bloat problem, significant amount of negative dentry cache was seen, but confusingly they neither got shrunk by reclaimer (the host has very tight memory) nor be shrunk by dropping cache. The vmcore shows there are over 14M negative dentry objects on lru, but tracing result shows they were even not scanned at all. Further investigation shows the memcg's vfs shrinker_map bit is not set. So the reclaimer or dropping cache just skip calling vfs shrinker. So we have to reboot the hosts to get the memory back. I didn't manage to come up with a reproducer in test environment, and the problem can't be reproduced after rebooting. But it seems there is race between shrinker map bit clear and reparenting by code inspection. The hypothesis is elaborated as below. The memcg hierarchy on our production environment looks like: root / \ system user The main workloads are running under user slice's children, and it creates and removes memcg frequently. So reparenting happens very often under user slice, but no task is under user slice directly. So with the frequent reparenting and tight memory pressure, the below hypothetical race condition may happen: CPU A CPU B reparent dst->nr_items == 0 shrinker: total_objects == 0 add src->nr_items to dst set_bit return SHRINK_EMPTY clear_bit child memcg offline replace child's kmemcg_id with parent's (in memcg_offline_kmem()) list_lru_del() between shrinker runs see parent's kmemcg_id dec dst->nr_items reparent again dst->nr_items may go negative due to concurrent list_lru_del() The second run of shrinker: read nr_items without any synchronization, so it may see intermediate negative nr_items then total_objects may return 0 coincidently keep the bit cleared dst->nr_items != 0 skip set_bit add scr->nr_item to dst After this point dst->nr_item may never go zero, so reparenting will not set shrinker_map bit anymore. And since there is no task under user slice directly, so no new object will be added to its lru to set the shrinker map bit either. That bit is kept cleared forever. How does list_lru_del() race with reparenting? It is because reparenting replaces children's kmemcg_id to parent's without protecting from nlru->lock, so list_lru_del() may see parent's kmemcg_id but actually deleting items from child's lru, but dec'ing parent's nr_items, so the parent's nr_items may go negative as commit |
||
Roman Gushchin
|
becaba65f6 |
mm: memcg/slab: fix obj_cgroup_charge() return value handling
Commit |
||
Hugh Dickins
|
073861ed77 |
mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). Matthew Wilcox explaining this to himself: "page is allocated, added to page cache, dirtied, writeback starts, --- thread A --- filesystem calls end_page_writeback() test_clear_page_writeback() --- context switch to thread B --- truncate_inode_pages_range() finds the page, it doesn't have writeback set, we delete it from the page cache. Page gets reallocated, dirtied, writeback starts again. Then we call write_cache_pages(), see PageWriteback() set, call wait_on_page_writeback() --- context switch back to thread A --- wake_up_page(page, PG_writeback); ... thread B is woken, but because the wakeup was for the old use of the page, PageWriteback is still set. Devious" And prior to |
||
Matthew Wilcox (Oracle)
|
66383800df |
mm: fix madvise WILLNEED performance problem
The calculation of the end page index was incorrect, leading to a
regression of 70% when running stress-ng.
With this fix, we instead see a performance improvement of 3%.
Fixes:
|
||
Gerald Schaefer
|
bfe8cc1db0 |
mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
Alexander reported a syzkaller / KASAN finding on s390, see below for complete output. In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be freed in some cases. In the case of userfaultfd_missing(), this will happen after calling handle_userfault(), which might have released the mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will access an unstable vma->vm_mm, which could have been freed or re-used already. For all architectures other than s390 this will go w/o any negative impact, because pte_free() simply frees the page and ignores the passed-in mm. The implementation for SPARC32 would also access mm->page_table_lock for pte_free(), but there is no THP support in SPARC32, so the buggy code path will not be used there. For s390, the mm->context.pgtable_list is being used to maintain the 2K pagetable fragments, and operating on an already freed or even re-used mm could result in various more or less subtle bugs due to list / pagetable corruption. Fix this by calling pte_free() before handle_userfault(), similar to how it is already done in __do_huge_pmd_anonymous_page() for the WRITE / non-huge_zero_page case. Commit |
||
Muchun Song
|
8faeb1ffd7 |
mm: memcg/slab: fix root memcg vmstats
If we reparent the slab objects to the root memcg, when we free the slab
object, we need to update the per-memcg vmstats to keep it correct for
the root memcg. Now this at least affects the vmstat of
NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
smaller than the PAGE_SIZE.
David said:
"I assume that without this fix that the root memcg's vmstat would
always be inflated if we reparented"
Fixes:
|
||
Dan Williams
|
a927bd6ba9 |
mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
||
Eric Dumazet
|
450677dcb0 |
mm/madvise: fix memory leak from process_madvise
The early return in process_madvise() will produce a memory leak.
Fix it.
Fixes:
|
||
Linus Torvalds
|
fa5fca78bb |
io_uring-5.10-2020-11-20
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+4DAwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphdOD/9xOEnYPuekvVH9G9nyNd//Q9fPArG2+j6V /MCnze07GNtDt7z15oR+T07hKXmf+Ejh4nu3JJ6MUNfe/47hhJqHSxRHU6+PJCjk hPrsaTsDedxxLEDiLmvhXnUPzfVzJtefxVAAaKikWOb3SBqLdh7xTFSlor1HbRBl Zk4d343cjBDYfvSSt/zMWDzwwvramdz7rJnnPMKXITu64ITL5314vuK2YVZmBOet YujSah7J8FL1jKhiG1Iw5rayd2Q3smnHWIEQ+lvW6WiTvMJMLOxif2xNF4/VEZs1 CBGJUQt42LI6QGEzRBHohcefZFuPGoxnduSzHCOIhh7d6+k+y9mZfsPGohr3g9Ov NotXpVonnA7GbRqzo1+IfBRve7iRONdZ3/LBwyRmqav4I4jX68wXBNH5IDpVR0Sn c31avxa/ZL7iLIBx32enp0/r3mqNTQotEleSLUdyJQXAZTyG2INRhjLLXTqSQ5BX oVp0fZzKCwsr6HCPZpXZ/f2G7dhzuF0ghoceC02GsOVooni22gdVnQj+AWNus398 e+wcimT4MX6AHNFxO2aUtJow0KWWZRzC1p5Mxu/9W3YiMtJiC0YOGePfSqiTqX0g Uk0H5dOAgBUQrAsusf7bKr0K6W25yEk/JipxhWqi0rC71x42mLTsCT1wxSCvLwqs WxhdtVKroQ== =7PAe -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Mostly regression or stable fodder: - Disallow async path resolution of /proc/self - Tighten constraints for segmented async buffered reads - Fix double completion for a retry error case - Fix for fixed file life times (Pavel)" * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block: io_uring: order refnode recycling io_uring: get an active ref_node from files_data io_uring: don't double complete failed reissue request mm: never attempt async page lock if we've transferred data already io_uring: handle -EOPNOTSUPP on path resolution proc: don't allow async path resolution of /proc/self components |
||
Linus Torvalds
|
4d02da974e |
Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211),
can and bpf (including the strncpy_from_user fix). Current release - regressions: - mac80211: fix memory leak of filtered powersave frames - mac80211: free sta in sta_info_insert_finish() on errors to avoid sleeping in atomic context - netlabel: fix an uninitialized variable warning added in -rc4 Previous release - regressions: - vsock: forward all packets to the host when no H2G is registered, un-breaking AWS Nitro Enclaves - net: Exempt multicast addresses from five-second neighbor lifetime requirement, decreasing the chances neighbor tables fill up - net/tls: fix corrupted data in recvmsg - qed: fix ILT configuration of SRC block - can: m_can: process interrupt only when not runtime suspended Previous release - always broken: - page_frag: Recover from memory pressure by not recycling pages allocating from the reserves - strncpy_from_user: Mask out bytes after NUL terminator - ip_tunnels: Set tunnel option flag only when tunnel metadata is present, always setting it confuses Open vSwitch - bpf, sockmap: - Fix partial copy_page_to_iter so progress can still be made - Fix socket memory accounting and obeying SO_RCVBUF - net: Have netpoll bring-up DSA management interface - net: bridge: add missing counters to ndo_get_stats64 callback - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt - enetc: Workaround MDIO register access HW bug - net/ncsi: move netlink family registration to a subsystem init, instead of tying it to driver probe - net: ftgmac100: unregister NC-SI when removing driver to avoid crash - lan743x: prevent interrupt storm on open - lan743x: fix freeing skbs in the wrong context - net/mlx5e: Fix socket refcount leak on kTLS RX resync - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097 - fix 21 unset return codes and other mistakes on error paths, mostly detected by the Hulk Robot Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+226AACgkQMUZtbf5S IruE1w/+JX3CqJwGIqyzyhwVshNaKxmX9gAOMJzkckjEohn8932zPaNq7kbmNYqt 5QsJoou3cXjFeoIEAkQA5fqR4stTZpZMnLO+7JnxxQ0vb2YBN+tIGQRNCnmd1Q0h u9gb5+5AdORdlmk3E7oC8v50dzQRfboJXLEEZTo2uGJwUgLlEAiqTSV2w4YDHMhL JtgtWA/fraL0CUc2WMoxuimg9NirbRuMijsU6+d/yExzznDpdoho/qsxL+Odu1NF hSdaKirA8B8ml0pOd/b4mj+fm4+lKyXZBfSyLx4Ki1TqluEMLzDp7gQPRnU6yyJm AOu4zsKxx6qitOX2qLQCNlEpkQp6LA2N2Zb1orliUV3Bsq2DJRhU35FgLcghtdRP GTRSdKHr2BvMScOZ7dQo8l4TqVc3e/khSZDRGdvpsM275Dt0JyS/l7yAWxunPqMb +/483/s75OuBRO57ULLJ/hR02TG37g/Jv5sI0sG/7oDpGfnulinQX+fxy9izyTEM KYl0mAPSqhb6RcjE0YXWG0rhJN6FSvc/lwPQHjq8wPSkwEdD/FTb6/eYqbXDi1ld UTYhFpkh1PQrwct14eSScMeJqTsNKvG0VV39/uZLZCzcqa3yOY5+oTzwaCFlMsy3 a5yGGxqoh7/FTM8t1ml21is9uZ31LAQEnNTMPv69pZPwAv5G5yE= =SRwI -----END PGP SIGNATURE----- Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211), can and bpf (including the strncpy_from_user fix). Current release - regressions: - mac80211: fix memory leak of filtered powersave frames - mac80211: free sta in sta_info_insert_finish() on errors to avoid sleeping in atomic context - netlabel: fix an uninitialized variable warning added in -rc4 Previous release - regressions: - vsock: forward all packets to the host when no H2G is registered, un-breaking AWS Nitro Enclaves - net: Exempt multicast addresses from five-second neighbor lifetime requirement, decreasing the chances neighbor tables fill up - net/tls: fix corrupted data in recvmsg - qed: fix ILT configuration of SRC block - can: m_can: process interrupt only when not runtime suspended Previous release - always broken: - page_frag: Recover from memory pressure by not recycling pages allocating from the reserves - strncpy_from_user: Mask out bytes after NUL terminator - ip_tunnels: Set tunnel option flag only when tunnel metadata is present, always setting it confuses Open vSwitch - bpf, sockmap: - Fix partial copy_page_to_iter so progress can still be made - Fix socket memory accounting and obeying SO_RCVBUF - net: Have netpoll bring-up DSA management interface - net: bridge: add missing counters to ndo_get_stats64 callback - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt - enetc: Workaround MDIO register access HW bug - net/ncsi: move netlink family registration to a subsystem init, instead of tying it to driver probe - net: ftgmac100: unregister NC-SI when removing driver to avoid crash - lan743x: - prevent interrupt storm on open - fix freeing skbs in the wrong context - net/mlx5e: Fix socket refcount leak on kTLS RX resync - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097 - fix 21 unset return codes and other mistakes on error paths, mostly detected by the Hulk Robot" * tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits) fail_function: Remove a redundant mutex unlock selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL lib/strncpy_from_user.c: Mask out bytes after NUL terminator. net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid() net/smc: fix matching of existing link groups ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module libbpf: Fix VERSIONED_SYM_COUNT number parsing net/mlx4_core: Fix init_hca fields offset atm: nicstar: Unmap DMA on send error page_frag: Recover from memory pressure net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset mlxsw: core: Use variable timeout for EMAD retries mlxsw: Fix firmware flashing net: Have netpoll bring-up DSA management interface atl1e: fix error return code in atl1e_probe() atl1c: fix error return code in atl1c_probe() ah6: fix error return code in ah6_input() net: usb: qmi_wwan: Set DTR quirk for MR400 can: m_can: process interrupt only when not runtime suspended can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery ... |
||
Dongli Zhang
|
d8c19014bb |
page_frag: Recover from memory pressure
The ethernet driver may allocate skb (and skb->data) via napi_alloc_skb().
This ends up to page_frag_alloc() to allocate skb->data from
page_frag_cache->va.
During the memory pressure, page_frag_cache->va may be allocated as
pfmemalloc page. As a result, the skb->pfmemalloc is always true as
skb->data is from page_frag_cache->va. The skb will be dropped if the
sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
under memory pressure.
However, once kernel is not under memory pressure any longer (suppose large
amount of memory pages are just reclaimed), the page_frag_alloc() may still
re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
result, the skb->pfmemalloc is always true unless page_frag_cache->va is
re-allocated, even if the kernel is not under memory pressure any longer.
Here is how kernel runs into issue.
1. The kernel is under memory pressure and allocation of
PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
the pfmemalloc page is allocated for page_frag_cache->va.
2: All skb->data from page_frag_cache->va (pfmemalloc) will have
skb->pfmemalloc=true. The skb will always be dropped by sock without
SOCK_MEMALLOC. This is an expected behaviour.
3. Suppose a large amount of pages are reclaimed and kernel is not under
memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
4. Unfortunately, page_frag_alloc() does not proactively re-allocate
page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
skb->pfmemalloc is always true even kernel is not under memory pressure any
longer.
Fix this by freeing and re-allocating the page instead of recycling it.
References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Bert Barbe <bert.barbe@oracle.com>
Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Fixes:
|