diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index cef6e1d20b18..f1b90cf1249b 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -155,6 +155,12 @@ Contact: SeongJae Park Description: Writing to and reading from this file sets and gets the action of the scheme. +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//target_nid +Date: Jun 2024 +Contact: SeongJae Park +Description: Action's target NUMA node id. Supported by only relevant + actions. + What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//apply_interval_us Date: Sep 2023 Contact: SeongJae Park diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 05862f06ed26..86311c2907cd 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1306,17 +1306,10 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1328,6 +1321,17 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 530ebe9b11cd..c1134ad5f06d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -7239,9 +7239,12 @@ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an exact size of . This can be used to increase - the minimum size (128MB on x86). It can also be - used to decrease the size and leave more room - for directly mapped kernel RAM. + the minimum size (128MB on x86, arm32 platforms). + It can also be used to decrease the size and leave more room + for directly mapped kernel RAM. Note that this parameter does + not exist on many other platforms (including arm64, alpha, + loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, + parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc). vmcp_cma=nn[MG] [KNL,S390,EARLY] Sets the memory size reserved for contiguous memory diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 7aa0071ff1c3..054010a7f3d8 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -34,18 +34,56 @@ detail) of DAMON, you should ensure :doc:`sysfs ` is mounted. +Snapshot Data Access Patterns +============================= + +The commands below show the memory access pattern of a program at the moment of +the execution. :: + + $ git clone https://github.com/sjp38/masim; cd masim; make + $ sudo damo start "./masim ./configs/stairs.cfg --quiet" + $ sudo ./damo show + 0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s + 1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s + 2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s + 3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms + 4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms + 5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns + 6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s + 7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s + 8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms + 9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s + total size: 577.590 MiB + $ sudo ./damo stop + +The first command of the above example downloads and builds an artificial +memory access generator program called ``masim``. The second command asks DAMO +to execute the artificial generator process start via the given command and +make DAMON monitors the generator process. The third command retrieves the +current snapshot of the monitored access pattern of the process from DAMON and +shows the pattern in a human readable format. + +Each line of the output shows which virtual address range (``addr [XX, XX)``) +of the process is how frequently (``access XX %``) accessed for how long time +(``age XX``). For example, the fifth region of ~9 MiB size is being most +frequently accessed for last 300 milliseconds. Finally, the fourth command +stops DAMON. + +Note that DAMON can monitor not only virtual address spaces but multiple types +of address spaces including the physical address space. + + Recording Data Access Patterns ============================== The commands below record the memory access patterns of a program and save the monitoring results to a file. :: - $ git clone https://github.com/sjp38/masim - $ cd masim; make; ./masim ./configs/zigzag.cfg & + $ ./masim ./configs/zigzag.cfg & $ sudo damo record -o damon.data $(pidof masim) -The first two lines of the commands download an artificial memory access -generator program and run it in the background. The generator will repeatedly +The line of the commands run the artificial memory access +generator program again. The generator will repeatedly access two 100 MiB sized memory regions one by one. You can substitute this with your real workload. The last line asks ``damo`` to record the access pattern in the ``damon.data`` file. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index e58ceb89ea2a..26df6cfa4441 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -78,7 +78,7 @@ comma (","). │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ │ :ref:`schemes `/nr_schemes - │ │ │ │ │ │ :ref:`0 `/action,apply_interval_us + │ │ │ │ │ │ :ref:`0 `/action,target_nid,apply_interval_us │ │ │ │ │ │ │ :ref:`access_pattern `/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max @@ -289,14 +289,18 @@ schemes// ------------ In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files -(``action`` and ``apply_interval``) exist. +``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files +(``action``, ``target_nid`` and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action `. The keywords that can be written to and read from the file and their meaning are same to those of the list on :ref:`design doc `. +The ``target_nid`` file is for setting the migration target node, which is +only meaningful when the ``action`` is either ``migrate_hot`` or +``migrate_cold``. + The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval ` in microseconds. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index f5f065c67615..caba0f52dd36 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -118,7 +118,7 @@ Short descriptions to the page flags 21 - KSM Identical memory pages dynamically shared between one or more processes. 22 - THP - Contiguous pages which construct transparent hugepages. + Contiguous pages which construct THP of any size and mapped by any granularity. 23 - OFFLINE The page is logically offline. 24 - ZERO_PAGE @@ -173,27 +173,6 @@ LRU related page flags The page-types tool in the tools/mm directory can be used to query the above flags. -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - Exceptions for Shared Memory ============================ @@ -252,7 +231,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PRESENT`` - Page is present in the memory - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN -- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index d414d3f5592a..058485daf186 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,12 +202,11 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when one or more hugepage -sizes are enabled (either by directly setting "always" or "madvise", -or by setting "inherit" while the top-level enabled is set to "always" -or "madvise"), and it'll be automatically shutdown when the last -hugepage size is disabled (either by directly setting "never", or by -setting "inherit" while the top-level enabled is set to "never"). +khugepaged will be automatically started when PMD-sized THP is enabled +(either of the per-size anon control or the top-level control are set +to "always" or "madvise"), and it'll be automatically shutdown when +PMD-sized THP is disabled (when both the per-size anon control and the +top-level control are "never") Khugepaged controls ------------------- @@ -332,6 +331,31 @@ deny force Force the huge option on for all - very useful for testing; +Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to +control mTHP allocation: +'/sys/kernel/mm/transparent_hugepage/hugepages-kB/shmem_enabled', +and its value for each mTHP is essentially consistent with the global +setting. An 'inherit' option is added to ensure compatibility with these +global settings. Conversely, the options 'force' and 'deny' are dropped, +which are rather testing artifacts from the old ages. + +always + Attempt to allocate huge pages every time we need a new page; + +inherit + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages + have enabled="inherit" and all other hugepage sizes have enabled="never"; + +never + Do not allocate huge pages; + +within_size + Only allocate huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate huge pages if requested with fadvise()/madvise(); + Need of application restart =========================== @@ -344,10 +368,6 @@ also applies to the regions registered in khugepaged. Monitoring usage ================ -.. note:: - Currently the below counters only record events relating to - PMD-sized THP. Events relating to other THP sizes are not included. - The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using PMD-sized anonymous transparent huge @@ -392,20 +412,23 @@ thp_collapse_alloc_failed the allocation. thp_file_alloc - is incremented every time a file huge page is successfully - allocated. + is incremented every time a shmem huge page is successfully + allocated (Note that despite being named after "file", the counter + measures only shmem). thp_file_fallback - is incremented if a file huge page is attempted to be allocated - but fails and instead falls back to using small pages. + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. (Note that + despite being named after "file", the counter measures only shmem). thp_file_fallback_charge - is incremented if a file huge page cannot be charged and instead + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was - successful. + successful. (Note that despite being named after "file", the + counter measures only shmem). thp_file_mapped - is incremented every time a file huge page is mapped into + is incremented every time a file or shmem huge page is mapped into user address space. thp_split_page @@ -476,6 +499,34 @@ swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +shmem_alloc + is incremented every time a shmem huge page is successfully + allocated. + +shmem_fallback + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +shmem_fallback_charge + is incremented if a shmem huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + +split + is incremented every time a huge page is successfully split into + smaller orders. This can happen for a variety of reasons but a + common reason is that a huge page is old and is being reclaimed. + +split_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +split_deferred + is incremented when a huge page is put onto split queue. + This happens when a huge page is partially unmapped and splitting + it would free up some memory. Pages on split queue are going to + be split under memory pressure, if splitting is possible. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index e86c968a7a0e..f48eaa98d22d 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm: - dirtytime_expire_seconds - dirty_writeback_centisecs - drop_caches +- enable_soft_offline - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group @@ -267,6 +268,43 @@ used:: These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches. +enable_soft_offline +=================== +Correctable memory errors are very common on servers. Soft-offline is kernel's +solution for memory pages having (excessive) corrected memory errors. + +For different types of page, soft-offline has different behaviors / costs. + +- For a raw error page, soft-offline migrates the in-use page's content to + a new raw page. + +- For a page that is part of a transparent hugepage, soft-offline splits the + transparent hugepage into raw pages, then migrates only the raw error page. + As a result, user is transparently backed by 1 less hugepage, impacting + memory access performance. + +- For a page that is part of a HugeTLB hugepage, soft-offline first migrates + the entire HugeTLB hugepage, during which a free hugepage will be consumed + as migration target. Then the original hugepage is dissolved into raw + pages without compensation, reducing the capacity of the HugeTLB pool by 1. + +It is user's call to choose between reliability (staying away from fragile +physical memory) vs performance / capacity implications in transparent and +HugeTLB cases. + +For all architectures, enable_soft_offline controls whether to soft offline +memory pages. When set to 1, kernel attempts to soft offline the pages +whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to +the request to soft offline the pages. Its default value is 1. + +It is worth mentioning that after setting enable_soft_offline to 0, the +following requests to soft offline pages will not be performed: + +- Request to soft offline pages from RAS Correctable Errors Collector. + +- On ARM, the request to soft offline pages from GHES driver. + +- On PARISC, the request to soft offline pages from Page Deallocation Table. extfrag_threshold ================= diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 6b5f7e6e7155..c16ca163b55e 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -132,7 +132,7 @@ CASE 1: Direct IO (DIO) ----------------------- There are GUP references to pages that are serving as DIO buffers. These buffers are needed for a relatively short time (so they -are not "long term"). No special synchronization with page_mkclean() or +are not "long term"). No special synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are: :: FOLL_PIN @@ -144,7 +144,7 @@ CASE 2: RDMA ------------ There are GUP references to pages that are serving as DMA buffers. These buffers are needed for a long time ("long term"). No special -synchronization with page_mkclean() or munmap() is provided. Therefore, flags +synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are: :: FOLL_PIN | FOLL_LONGTERM @@ -170,7 +170,7 @@ callback, simply remove the range from the device's page tables. Either way, as long as the driver unpins the pages upon mmu notifier callback, then there is proper synchronization with both filesystem and mm -(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. +(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. CASE 4: Pinning for struct page manipulation only ------------------------------------------------- @@ -196,20 +196,20 @@ INCORRECT (uses FOLL_GET calls): write to the data within the pages put_page() -page_maybe_dma_pinned(): the whole point of pinning -=================================================== +folio_maybe_dma_pinned(): the whole point of pinning +==================================================== -The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able -to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean() (and file system writeback code in general) to make informed decisions about -what to do when a page cannot be unmapped due to such pins. +what to do when a folio cannot be unmapped due to such pins. What to do in those cases is the subject of a years-long series of discussions and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: :: - static inline bool page_maybe_dma_pinned(struct page *page) + static inline bool folio_maybe_dma_pinned(struct folio *folio) ...is a prerequisite to solving the long-running gup+DMA problem. diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst index 323eedad53cd..6a48d96c5c85 100644 --- a/Documentation/dev-tools/kmsan.rst +++ b/Documentation/dev-tools/kmsan.rst @@ -110,6 +110,13 @@ in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every function in the file or directory. Most users won't need KMSAN_SANITIZE, unless their code gets broken by KMSAN (e.g. runs at early boot time). +KMSAN checks can also be temporarily disabled for the current task using +``kmsan_disable_current()`` and ``kmsan_enable_current()`` calls. Each +``kmsan_enable_current()`` call must be preceded by a +``kmsan_disable_current()`` call; these call pairs may be nested. One needs to +be careful with these calls, keeping the regions short and preferring other +ways to disable instrumentation, where possible. + Support ======= @@ -338,11 +345,11 @@ Per-task KMSAN state ~~~~~~~~~~~~~~~~~~~~ Every task_struct has an associated KMSAN task state that holds the KMSAN -context (see above) and a per-task flag disallowing KMSAN reports:: +context (see above) and a per-task counter disallowing KMSAN reports:: struct kmsan_context { ... - bool allow_reporting; + unsigned int depth; struct kmsan_context_state cstate; ... } diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 82d142de3461..e834779d9611 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -443,6 +443,15 @@ is not associated with a file: or if empty, the mapping is anonymous. +Starting with 6.11 kernel, /proc/PID/maps provides an alternative +ioctl()-based API that gives ability to flexibly and efficiently query and +filter individual VMAs. This interface is binary and is meant for more +efficient and easy programmatic use. `struct procmap_query`, defined in +linux/fs.h UAPI header, serves as an input/output argument to the +`PROCMAP_QUERY` ioctl() command. See comments in linus/fs.h UAPI header for +details on query semantics, supported flags, data returned, and general API +usage information. + The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. For each mapping (aka Virtual Memory Area, or VMA) there is a series of lines such as the following:: diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index ad50ca6f495e..af245161d8e7 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -90,8 +90,6 @@ PMD Page Table Helpers +---------------------------+--------------------------------------------------+ | pmd_leaf | Tests a leaf mapped PMD | +---------------------------+--------------------------------------------------+ -| pmd_huge | Tests a HugeTLB mapped PMD | -+---------------------------+--------------------------------------------------+ | pmd_trans_huge | Tests a Transparent Huge Page (THP) at PMD | +---------------------------+--------------------------------------------------+ | pmd_present | Tests whether pmd_page() points to valid memory | @@ -169,8 +167,6 @@ PUD Page Table Helpers +---------------------------+--------------------------------------------------+ | pud_leaf | Tests a leaf mapped PUD | +---------------------------+--------------------------------------------------+ -| pud_huge | Tests a HugeTLB mapped PUD | -+---------------------------+--------------------------------------------------+ | pud_trans_huge | Tests a Transparent Huge Page (THP) at PUD | +---------------------------+--------------------------------------------------+ | pud_present | Tests a valid mapped PUD | diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 3df387249937..8730c246ceaa 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -16,53 +16,24 @@ called DAMON ``context``. DAMON executes each context with a kernel thread called ``kdamond``. Multiple kdamonds could run in parallel, for different types of monitoring. +To know how user-space can do the configurations and start/stop DAMON, refer to +:ref:`DAMON sysfs interface ` documentation. + Overall Architecture ==================== DAMON subsystem is configured with three layers including -- Operations Set: Implements fundamental operations for DAMON that depends on - the given monitoring target address-space and available set of - software/hardware primitives, -- Core: Implements core logics including monitoring overhead/accurach control - and access-aware system operations on top of the operations set layer, and -- Modules: Implements kernel modules for various purposes that provides - interfaces for the user space, on top of the core layer. - - -.. _damon_design_configurable_operations_set: - -Configurable Operations Set ---------------------------- - -For data access monitoring and additional low level work, DAMON needs a set of -implementations for specific operations that are dependent on and optimized for -the given target address space. On the other hand, the accuracy and overhead -tradeoff mechanism, which is the core logic of DAMON, is in the pure logic -space. DAMON separates the two parts in different layers, namely DAMON -Operations Set and DAMON Core Logics Layers, respectively. It further defines -the interface between the layers to allow various operations sets to be -configured with the core logic. - -Due to this design, users can extend DAMON for any address space by configuring -the core logic to use the appropriate operations set. If any appropriate set -is unavailable, users can implement one on their own. - -For example, physical memory, virtual memory, swap space, those for specific -processes, NUMA nodes, files, and backing memory devices would be supportable. -Also, if some architectures or devices supporting special optimized access -check primitives, those will be easily configurable. - - -Programmable Modules --------------------- - -Core layer of DAMON is implemented as a framework, and exposes its application -programming interface to all kernel space components such as subsystems and -modules. For common use cases of DAMON, DAMON subsystem provides kernel -modules that built on top of the core layer using the API, which can be easily -used by the user space end users. +- :ref:`Operations Set `: Implements fundamental + operations for DAMON that depends on the given monitoring target + address-space and available set of software/hardware primitives, +- :ref:`Core `: Implements core logics including monitoring + overhead/accuracy control and access-aware system operations on top of the + operations set layer, and +- :ref:`Modules `: Implements kernel modules for various + purposes that provides interfaces for the user space, on top of the core + layer. .. _damon_operations_set: @@ -70,11 +41,32 @@ used by the user space end users. Operations Set Layer ==================== -The monitoring operations are defined in two parts: +.. _damon_design_configurable_operations_set: + +For data access monitoring and additional low level work, DAMON needs a set of +implementations for specific operations that are dependent on and optimized for +the given target address space. For example, below two operations for access +monitoring are address-space dependent. 1. Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space. +DAMON consolidates these implementations in a layer called DAMON Operations +Set, and defines the interface between it and the upper layer. The upper layer +is dedicated for DAMON's core logics including the mechanism for control of the +monitoring accruracy and the overhead. + +Hence, DAMON can easily be extended for any address space and/or available +hardware features by configuring the core logic to use the appropriate +operations set. If there is no available operations set for a given purpose, a +new operations set can be implemented following the interface between the +layers. + +For example, physical memory, virtual memory, swap space, those for specific +processes, NUMA nodes, files, and backing memory devices would be supportable. +Also, if some architectures or devices support special optimized access check +features, those will be easily configurable. + DAMON currently provides below three operation sets. Below two subsections describe how those work. @@ -82,6 +74,10 @@ describe how those work. - fvaddr: Monitor fixed virtual address ranges - paddr: Monitor the physical address space of the system +To know how user-space can do the configuration via :ref:`DAMON sysfs interface +`, refer to :ref:`operations ` file part of the +documentation. + .. _damon_design_vaddr_target_regions_construction: @@ -140,9 +136,12 @@ conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags, as Idle page tracking does. +.. _damon_core_logic: + Core Logics =========== +.. _damon_design_monitoring: Monitoring ---------- @@ -152,6 +151,10 @@ monitoring attributes, ``sampling interval``, ``aggregation interval``, ``update interval``, ``minimum number of regions``, and ``maximum number of regions``. +To know how user-space can set the attributes via :ref:`DAMON sysfs interface +`, refer to :ref:`monitoring_attrs ` +part of the documentation. + Access Frequency Monitoring ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -192,7 +195,7 @@ one page in the region is required to be checked. Thus, for each ``sampling interval``, DAMON randomly picks one page in each region, waits for one ``sampling interval``, checks whether the page is accessed meanwhile, and increases the access frequency counter of the region if so. The counter is -called ``nr_regions`` of the region. Therefore, the monitoring overhead is +called ``nr_accesses`` of the region. Therefore, the monitoring overhead is controllable by setting the number of regions. DAMON allows users to set the minimum and the maximum number of regions for the trade-off. @@ -209,11 +212,18 @@ the data access pattern can be dynamically changed. This will result in low monitoring quality. To keep the assumption as much as possible, DAMON adaptively merges and splits each region based on their access frequency. -For each ``aggregation interval``, it compares the access frequencies of -adjacent regions and merges those if the frequency difference is small. Then, -after it reports and clears the aggregated access frequency of each region, it -splits each region into two or three regions if the total number of regions -will not exceed the user-specified maximum number of regions after the split. +For each ``aggregation interval``, it compares the access frequencies +(``nr_accesses``) of adjacent regions. If the difference is small, and if the +sum of the two regions' sizes is smaller than the size of total regions divided +by the ``minimum number of regions``, DAMON merges the two regions. If the +resulting number of total regions is still higher than ``maximum number of +regions``, it repeats the merging with increasing access frequenceis difference +threshold until the upper-limit of the number of regions is met, or the +threshold becomes higher than possible maximum value (``aggregation interval`` +divided by ``sampling interval``). Then, after it reports and clears the +aggregated access frequency of each region, it splits each region into two or +three regions if the total number of regions will not exceed the user-specified +maximum number of regions after the split. In this way, DAMON provides its best-effort quality and minimal overhead while keeping the bounds users set for their trade-off. @@ -248,6 +258,11 @@ and applies it to monitoring operations-related data structures such as the abstracted monitoring target memory area only for each of a user-specified time interval (``update interval``). +User-space can get the monitoring results via DAMON sysfs interface and/or +tracepoints. For more details, please refer to the documentations for +:ref:`DAMOS tried regions ` and :ref:`tracepoint`, +respectively. + .. _damon_design_damos: @@ -288,6 +303,10 @@ the access pattern of interest, and applies the user-desired operation actions to the regions, for every user-specified time interval called ``apply_interval``. +To know how user-space can set ``apply_interval`` via :ref:`DAMON sysfs +interface `, refer to :ref:`apply_interval_us ` +part of the documentation. + .. _damon_design_damos_action: @@ -325,6 +344,10 @@ that supports each action are as below. Supported by ``paddr`` operations set. - ``lru_deprio``: Deprioritize the region on its LRU lists. Supported by ``paddr`` operations set. + - ``migrate_hot``: Migrate the regions prioritizing warmer regions. + Supported by ``paddr`` operations set. + - ``migrate_cold``: Migrate the regions prioritizing colder regions. + Supported by ``paddr`` operations set. - ``stat``: Do nothing but count the statistics. Supported by all operations sets. @@ -332,6 +355,10 @@ Applying the actions except ``stat`` to a region is considered as changing the region's characteristics. Hence, DAMOS resets the age of regions when any such actions are applied to those. +To know how user-space can set the action via :ref:`DAMON sysfs interface +`, refer to :ref:`action ` part of the +documentation. + .. _damon_design_damos_access_pattern: @@ -345,6 +372,10 @@ interest by setting minimum and maximum values of the three properties. If a region's three properties are in the ranges, DAMOS classifies it as one of the regions that the scheme is having an interest in. +To know how user-space can set the access pattern via :ref:`DAMON sysfs +interface `, refer to :ref:`access_pattern +` part of the documentation. + .. _damon_design_damos_quotas: @@ -364,6 +395,10 @@ feature called quotas. It lets users specify an upper limit of time that DAMOS can use for applying the action, and/or a maximum bytes of memory regions that the action can be applied within a user-specified time duration. +To know how user-space can set the basic quotas via :ref:`DAMON sysfs interface +`, refer to :ref:`quotas ` part of the +documentation. + .. _damon_design_damos_quotas_prioritization: @@ -391,6 +426,10 @@ information to the underlying mechanism. Nevertheless, how and even whether the weight will be respected are up to the underlying prioritization mechanism implementation. +To know how user-space can set the prioritization weights via :ref:`DAMON sysfs +interface `, refer to :ref:`weights ` part of +the documentation. + .. _damon_design_damos_quotas_auto_tuning: @@ -420,6 +459,10 @@ Currently, two ``target_metric`` are provided. DAMOS does the measurement on its own, so only ``target_value`` need to be set by users at the initial time. In other words, DAMOS does self-feedback. +To know how user-space can set the tuning goal metric, the target value, and/or +the current value via :ref:`DAMON sysfs interface `, refer to +:ref:`quota goals ` part of the documentation. + .. _damon_design_damos_watermarks: @@ -442,6 +485,10 @@ is activated. If all schemes are deactivated by the watermarks, the monitoring is also deactivated. In this case, the DAMON worker thread only periodically checks the watermarks and therefore incurs nearly zero overhead. +To know how user-space can set the watermarks via :ref:`DAMON sysfs interface +`, refer to :ref:`watermarks ` part of the +documentation. + .. _damon_design_damos_filters: @@ -488,6 +535,10 @@ Below types of filters are currently supported. - Applied to pages that belonging to a given DAMON monitoring target. - Handled by the core logic. +To know how user-space can set the watermarks via :ref:`DAMON sysfs interface +`, refer to :ref:`filters ` part of the +documentation. + Application Programming Interface --------------------------------- @@ -501,6 +552,8 @@ interface, namely ``include/linux/damon.h``. Please refer to the API :doc:`document ` for details of the interface. +.. _damon_modules: + Modules ======= diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 5e0a50583500..dafd6d028924 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -6,7 +6,7 @@ DAMON: Data Access MONitor DAMON is a Linux kernel subsystem that provides a framework for data access monitoring and the monitoring results based system operations. The core -monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it +monitoring :ref:`mechanisms ` of DAMON make it - *accurate* (the monitoring output is useful enough for DRAM level memory management; It might not appropriate for CPU Cache levels, though), @@ -16,15 +16,16 @@ monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it of the size of target workloads). Using this framework, therefore, the kernel can operate system in an -access-aware fashion. Because the features are also exposed to the user space, -users who have special information about their workloads can write personalized -applications for better understanding and optimizations of their workloads and -systems. +access-aware fashion. Because the features are also exposed to the :doc:`user +space `, users who have special information about +their workloads can write personalized applications for better understanding +and optimizations of their workloads and systems. -For easier development of such systems, DAMON provides a feature called DAMOS -(DAMon-based Operation Schemes) in addition to the monitoring. Using the -feature, DAMON users in both kernel and user spaces can do access-aware system -operations with no code but simple configurations. +For easier development of such systems, DAMON provides a feature called +:ref:`DAMOS ` (DAMon-based Operation Schemes) in addition +to the monitoring. Using the feature, DAMON users in both kernel and :doc:`user +spaces ` can do access-aware system operations +with no code but simple configurations. .. toctree:: :maxdepth: 2 @@ -33,3 +34,6 @@ operations with no code but simple configurations. design api maintainer-profile + +To utilize and control DAMON from the user-space, please refer to the +administration :doc:`guide `. diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index 8213cf61d38a..feccf6a0f6c3 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -53,6 +53,40 @@ Mon-Fri) in PT (Pacific Time). The response to patches will occasionally be slow. Do not hesitate to send a ping if you have not heard back within a week of sending a patch. +Mailing tool +------------ + +Like many other Linux kernel subsystems, DAMON uses the mailing lists +(damon@lists.linux.dev and linux-mm@kvack.org) as the major communication +channel. There is a simple tool called HacKerMaiL (``hkml``) [8]_ , which is +for people who are not very familiar with the mailing lists based +communication. The tool could be particularly helpful for DAMON community +members since it is developed and maintained by DAMON maintainer. The tool is +also officially announced to support DAMON and general Linux kernel development +workflow. + +In other words, ``hkml`` [8]_ is a mailing tool for DAMON community, which +DAMON maintainer is committed to support. Please feel free to try and report +issues or feature requests for the tool to the maintainer. + +Community meetup +---------------- + +DAMON community is maintaining two bi-weekly meetup series for community +members who prefer synchronous conversations over mails. + +The first one is for any discussion between every community member. No +reservation is needed. + +The seconds one is for discussions on specific topics between restricted +members including the maintainer. The maintainer shares the available time +slots, and attendees should reserve one of those at least 24 hours before the +time slot, by reaching out to the maintainer. + +Schedules and available reservation time slots are available at the Google doc +[9]_ . DAMON maintainer will also provide periodic reminder to the mailing +list (damon@lists.linux.dev). + .. [1] https://git.kernel.org/akpm/mm/h/mm-unstable .. [2] https://git.kernel.org/sj/h/damon/next @@ -61,3 +95,5 @@ of sending a patch. .. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh .. [6] https://github.com/awslabs/damon-tests/tree/master/corr .. [7] https://github.com/awslabs/damon-tests/tree/master/perf +.. [8] https://github.com/damonitor/hackermail +.. [9] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index b6a07a26b10d..2feb2ed51ae2 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -191,13 +191,13 @@ have become evictable again (via munlock() for example) and have been "rescued" from the unevictable list. However, there may be situations where we decide, for the sake of expediency, to leave an unevictable folio on one of the regular active/inactive LRU lists for vmscan to deal with. vmscan checks for such -folios in all of the shrink_{active|inactive|page}_list() functions and will +folios in all of the shrink_{active|inactive|folio}_list() functions and will "cull" such folios that it encounters: that is, it diverts those folios to the unevictable list for the memory cgroup and node being scanned. There may be situations where a folio is mapped into a VM_LOCKED VMA, but the folio does not have the mlocked flag set. Such folios will make -it all the way to shrink_active_list() or shrink_page_list() where they +it all the way to shrink_active_list() or shrink_folio_list() where they will be detected when vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The folio is culled to the unevictable list when it is released by the shrinker. @@ -269,7 +269,7 @@ the LRU. Such pages can be "noticed" by memory management in several places: (4) in the fault path and when a VM_LOCKED stack segment is expanded; or - (5) as mentioned above, in vmscan:shrink_page_list() when attempting to + (5) as mentioned above, in vmscan:shrink_folio_list() when attempting to reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap(). mlocked pages become unlocked and rescued from the unevictable list when: @@ -548,12 +548,12 @@ Some examples of these unevictable pages on the LRU lists are: (3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked, but events left mlock_count too low, so they were munlocked too early. -vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously +vmscan's shrink_inactive_list() and shrink_folio_list() also divert obviously unevictable pages found on the inactive lists to the appropriate memory cgroup and node unevictable list. rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or -shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), +shrink_folio_list(), and rmap's try_to_unmap_one() called via shrink_folio_list(), check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio() to correct them. Such pages are culled to the unevictable list when released by the shrinker. diff --git a/MAINTAINERS b/MAINTAINERS index 2ce06ad40768..222c9739eb83 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5701,6 +5701,8 @@ L: linux-mm@kvack.org S: Maintained F: include/linux/memcontrol.h F: mm/memcontrol.c +F: mm/memcontrol-v1.c +F: mm/memcontrol-v1.h F: mm/swap_cgroup.c F: samples/cgroup/* F: tools/testing/selftests/cgroup/memcg_protection.m diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 1075534b0a2e..8ed8b9a24efe 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -283,7 +283,7 @@ void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr, * flush_dcache_page is used when the kernel has written to the page * cache page at virtual address page->virtual. * - * If this page isn't mapped (ie, page_mapping == NULL), or it might + * If this page isn't mapped (ie, folio_mapping == NULL), or it might * have userspace mappings, then we _must_ always clean + invalidate * the dcache entries associated with the kernel mapping. * diff --git a/arch/arm/include/asm/hugetlb-3level.h b/arch/arm/include/asm/hugetlb-3level.h index a30be5505793..87d48e2d90ad 100644 --- a/arch/arm/include/asm/hugetlb-3level.h +++ b/arch/arm/include/asm/hugetlb-3level.h @@ -13,12 +13,12 @@ /* * If our huge pte is non-zero then mark the valid bit. - * This allows pte_present(huge_ptep_get(ptep)) to return true for non-zero + * This allows pte_present(huge_ptep_get(mm,addr,ptep)) to return true for non-zero * ptes. * (The valid bit is automatically cleared by set_pte_at for PROT_NONE ptes). */ #define __HAVE_ARCH_HUGE_PTEP_GET -static inline pte_t huge_ptep_get(pte_t *ptep) +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { pte_t retval = *ptep; if (pte_val(retval)) diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h index fefac75fa009..28ab96e808ef 100644 --- a/arch/arm64/include/asm/cacheflush.h +++ b/arch/arm64/include/asm/cacheflush.h @@ -117,7 +117,7 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *, * flush_dcache_folio is used when the kernel has written to the page * cache page at virtual address page->virtual. * - * If this page isn't mapped (ie, page_mapping == NULL), or it might + * If this page isn't mapped (ie, folio_mapping == NULL), or it might * have userspace mappings, then we _must_ always clean + invalidate * the dcache entries associated with the kernel mapping. * diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index 3954cbd2ff56..293f880865e8 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma, extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long sz); #define __HAVE_ARCH_HUGE_PTEP_GET -extern pte_t huge_ptep_get(pte_t *ptep); +extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep); void __init arm64_hugetlb_cma_reserve(void); diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 3f09ac73cce3..5f1e2103888b 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -127,7 +127,7 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize) return contig_ptes; } -pte_t huge_ptep_get(pte_t *ptep) +pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { int ncontig, i; size_t pgsize; diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index af3acdf3481a..161dd6e10479 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -467,8 +467,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1) -#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, addr, ptep, nr) \ + update_mmu_cache_range(NULL, vma, addr, ptep, nr) static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index e27a4c83c548..c29a551eb0ca 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -594,8 +594,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, address, ptep) \ update_mmu_cache_range(NULL, vma, address, ptep, 1) -#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, address, ptep, nr) \ + update_mmu_cache_range(NULL, vma, address, ptep, nr) static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index df1ced4fc3b5..bf9a37c60e9f 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -112,7 +112,7 @@ void __flush_dcache_pages(struct page *page, unsigned int nr) } /* - * We could delay the flush for the !page_mapping case too. But that + * We could delay the flush for the !folio_mapping case too. But that * case is for exec env/arg pages and those are %99 certainly going to * get faulted into the tlb (and thus flushed) anyways. */ diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index f8891fbe7c16..bc5a1612be72 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,7 +135,6 @@ config PPC select ARCH_HAS_DMA_MAP_DIRECT if PPC_PSERIES select ARCH_HAS_FORTIFY_SOURCE select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_HAS_HUGEPD if HUGETLB_PAGE select ARCH_HAS_KCOV select ARCH_HAS_KERNEL_FPU_SUPPORT if PPC64 && PPC_FPU select ARCH_HAS_MEMBARRIER_CALLBACKS diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h index dc5c039eb28e..dd4eb3063175 100644 --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h @@ -47,8 +47,6 @@ static inline void pgtable_free(void *table, unsigned index_size) } } -#define get_hugepd_cache_index(x) (x) - static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift) { diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index 6472b08fa1b0..c654c376ef8b 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -74,21 +74,6 @@ #define remap_4k_pfn(vma, addr, pfn, prot) \ remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot)) -#ifdef CONFIG_HUGETLB_PAGE -static inline int hash__hugepd_ok(hugepd_t hpd) -{ - unsigned long hpdval = hpd_val(hpd); - /* - * if it is not a pte and have hugepd shift mask - * set, then it is a hugepd directory pointer - */ - if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) && - ((hpdval & HUGEPD_SHIFT_MASK) != 0)) - return true; - return false; -} -#endif - /* * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just * a matter of returning the PTE bits that need to be modified. On 64K PTE, diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index faf3e3b4e4b2..0755f2567021 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -4,6 +4,7 @@ #ifdef __KERNEL__ #include +#include /* * Common bits between 4K and 64K pages in a linux-style PTE. @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long pte, int huge); unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags); /* Atomic PTE updates */ -static inline unsigned long hash__pte_update(struct mm_struct *mm, - unsigned long addr, - pte_t *ptep, unsigned long clr, - unsigned long set, - int huge) +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr, + unsigned long set) { __be64 old_be, tmp_be; - unsigned long old; __asm__ __volatile__( "1: ldarx %0,0,%3 # pte_update\n\ @@ -182,11 +179,40 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm, : "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep), "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set)) : "cc" ); + + return be64_to_cpu(old_be); +} + +static inline unsigned long hash__pte_update(struct mm_struct *mm, + unsigned long addr, + pte_t *ptep, unsigned long clr, + unsigned long set, + int huge) +{ + unsigned long old; + + old = hash__pte_update_one(ptep, clr, set); + + if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && huge) { + unsigned int psize = get_slice_psize(mm, addr); + int nb, i; + + if (psize == MMU_PAGE_16M) + nb = SZ_16M / PMD_SIZE; + else if (psize == MMU_PAGE_16G) + nb = SZ_16G / PUD_SIZE; + else + nb = 1; + + WARN_ON_ONCE(nb == 1); /* Should never happen */ + + for (i = 1; i < nb; i++) + hash__pte_update_one(ptep + i, clr, set); + } /* huge pages use the old page table lock */ if (!huge) assert_pte_locked(mm, addr); - old = be64_to_cpu(old_be); if (old & H_PAGE_HASHPTE) hpte_need_flush(mm, addr, ptep, old, huge); diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h index aa1c67c8bfc8..f0bba9c5f9c3 100644 --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h @@ -49,9 +49,6 @@ static inline bool gigantic_page_runtime_supported(void) return true; } -/* hugepd entry valid bit */ -#define HUGEPD_VAL_BITS (0x8000000000000000UL) - #define huge_ptep_modify_prot_start huge_ptep_modify_prot_start extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep); @@ -60,29 +57,7 @@ extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); -/* - * This should work for other subarchs too. But right now we use the - * new format only for 64bit book3s - */ -static inline pte_t *hugepd_page(hugepd_t hpd) -{ - BUG_ON(!hugepd_ok(hpd)); - /* - * We have only four bits to encode, MMU page size - */ - BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf); - return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK); -} -static inline unsigned int hugepd_mmu_psize(hugepd_t hpd) -{ - return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2; -} - -static inline unsigned int hugepd_shift(hugepd_t hpd) -{ - return mmu_psize_to_shift(hugepd_mmu_psize(hpd)); -} static inline void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr) { @@ -90,19 +65,6 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma, return radix__flush_hugetlb_page(vma, vmaddr); } -static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, - unsigned int pdshift) -{ - unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd); - - return hugepd_page(hpd) + idx; -} - -static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) -{ - *hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS | (shift_to_mmu_psize(pshift) << 2)); -} - void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); static inline int check_and_get_huge_psize(int shift) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h deleted file mode 100644 index baf934578c3a..000000000000 --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h +++ /dev/null @@ -1,47 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_4K_H -#define _ASM_POWERPC_BOOK3S_64_PGTABLE_4K_H -/* - * hash 4k can't share hugetlb and also doesn't support THP - */ -#ifndef __ASSEMBLY__ -#ifdef CONFIG_HUGETLB_PAGE -/* - * With radix , we have hugepage ptes in the pud and pmd entries. We don't - * need to setup hugepage directory for them. Our pte and page directory format - * enable us to have this enabled. - */ -static inline int hugepd_ok(hugepd_t hpd) -{ - if (radix_enabled()) - return 0; - return hash__hugepd_ok(hpd); -} -#define is_hugepd(hpd) (hugepd_ok(hpd)) - -/* - * 16M and 16G huge page directory tables are allocated from slab cache - * - */ -#define H_16M_CACHE_INDEX (PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE - 24) -#define H_16G_CACHE_INDEX \ - (PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + H_PUD_INDEX_SIZE - 34) - -static inline int get_hugepd_cache_index(int index) -{ - switch (index) { - case H_16M_CACHE_INDEX: - return HTLB_16M_INDEX; - case H_16G_CACHE_INDEX: - return HTLB_16G_INDEX; - default: - BUG(); - } - /* should not reach */ -} - -#endif /* CONFIG_HUGETLB_PAGE */ - -#endif /* __ASSEMBLY__ */ - -#endif /*_ASM_POWERPC_BOOK3S_64_PGTABLE_4K_H */ diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h index 6ac73da7b80e..4d8d7b4ea16b 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h @@ -5,26 +5,6 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_HUGETLB_PAGE -/* - * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't - * need to setup hugepage directory for them. Our pte and page directory format - * enable us to have this enabled. - */ -static inline int hugepd_ok(hugepd_t hpd) -{ - return 0; -} - -#define is_hugepd(pdep) 0 - -/* - * This should never get called - */ -static __always_inline int get_hugepd_cache_index(int index) -{ - BUILD_BUG(); -} - #endif /* CONFIG_HUGETLB_PAGE */ static inline int remap_4k_pfn(struct vm_area_struct *vma, unsigned long addr, diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8f9432e3855a..519b1743a0f4 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -274,6 +274,24 @@ static inline bool pud_leaf(pud_t pud) { return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); } + +#define pmd_leaf_size pmd_leaf_size +static inline unsigned long pmd_leaf_size(pmd_t pmd) +{ + if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !radix_enabled()) + return SZ_16M; + else + return PMD_SIZE; +} + +#define pud_leaf_size pud_leaf_size +static inline unsigned long pud_leaf_size(pud_t pud) +{ + if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !radix_enabled()) + return SZ_16G; + else + return PUD_SIZE; +} #endif /* __ASSEMBLY__ */ #include @@ -285,11 +303,9 @@ static inline bool pud_leaf(pud_t pud) #define MAX_PHYSMEM_BITS R_MAX_PHYSMEM_BITS #endif - +/* hash 4k can't share hugetlb and also doesn't support THP */ #ifdef CONFIG_PPC_64K_PAGES #include -#else -#include #endif #include diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h index ea71f7245a63..18a3028ac3b6 100644 --- a/arch/powerpc/include/asm/hugetlb.h +++ b/arch/powerpc/include/asm/hugetlb.h @@ -30,10 +30,9 @@ static inline int is_hugepage_only_range(struct mm_struct *mm, } #define is_hugepage_only_range is_hugepage_only_range -#define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE -void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr, - unsigned long end, unsigned long floor, - unsigned long ceiling); +#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte, unsigned long sz); #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm, @@ -67,14 +66,6 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma, { } -#define hugepd_shift(x) 0 -static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, - unsigned pdshift) -{ - return NULL; -} - - static inline void __init gigantic_hugetlb_cma_reserve(void) { } diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index 92df40c6cc6b..014799557f60 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -4,42 +4,12 @@ #define PAGE_SHIFT_8M 23 -static inline pte_t *hugepd_page(hugepd_t hpd) -{ - BUG_ON(!hugepd_ok(hpd)); - - return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK); -} - -static inline unsigned int hugepd_shift(hugepd_t hpd) -{ - return PAGE_SHIFT_8M; -} - -static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, - unsigned int pdshift) -{ - unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT; - - return hugepd_page(hpd) + idx; -} - static inline void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr) { flush_tlb_page(vma, vmaddr); } -static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) -{ - *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M); -} - -static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned int pshift) -{ - *hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M); -} - static inline int check_and_get_huge_psize(int shift) { return shift_to_mmu_psize(shift); @@ -49,6 +19,14 @@ static inline int check_and_get_huge_psize(int shift) void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned long sz); +#define __HAVE_ARCH_HUGE_PTEP_GET +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) +{ + if (ptep_is_8m_pmdp(mm, addr, ptep)) + ptep = pte_offset_kernel((pmd_t *)ptep, ALIGN_DOWN(addr, SZ_8M)); + return ptep_get(ptep); +} + #define __HAVE_ARCH_HUGE_PTE_CLEAR static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long sz) diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 141d82e249a8..a756a1e59c54 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -189,19 +189,14 @@ typedef struct { #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff80000) -/* Page size definitions, common between 32 and 64-bit +/* + * Page size definitions for 8xx * * shift : is the "PAGE_SHIFT" value for that page size - * penc : is the pte encoding mask * */ struct mmu_psize_def { unsigned int shift; /* number of bits */ - unsigned int enc; /* PTE encoding */ - unsigned int ind; /* Corresponding indirect page size shift */ - unsigned int flags; -#define MMU_PAGE_SIZE_DIRECT 0x1 /* Supported as a direct size */ -#define MMU_PAGE_SIZE_INDIRECT 0x2 /* Supported as an indirect size */ }; extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT]; diff --git a/arch/powerpc/include/asm/nohash/32/pte-44x.h b/arch/powerpc/include/asm/nohash/32/pte-44x.h index 851813725237..da0469928273 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-44x.h +++ b/arch/powerpc/include/asm/nohash/32/pte-44x.h @@ -75,9 +75,6 @@ #define _PAGE_NO_CACHE 0x00000400 /* H: I bit */ #define _PAGE_WRITETHRU 0x00000800 /* H: W bit */ -/* No page size encoding in the linux PTE */ -#define _PAGE_PSIZE 0 - /* TODO: Add large page lowmem mapping support */ #define _PMD_PRESENT 0 #define _PMD_PRESENT_MASK (PAGE_MASK) diff --git a/arch/powerpc/include/asm/nohash/32/pte-85xx.h b/arch/powerpc/include/asm/nohash/32/pte-85xx.h index 653a342d3b25..14d64b4f3f14 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-85xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-85xx.h @@ -31,9 +31,6 @@ #define _PAGE_WRITETHRU 0x00400 /* H: W bit */ #define _PAGE_SPECIAL 0x00800 /* S: Special page */ -/* No page size encoding in the linux PTE */ -#define _PAGE_PSIZE 0 - #define _PMD_PRESENT 0 #define _PMD_PRESENT_MASK (PAGE_MASK) #define _PMD_BAD (~PAGE_MASK) diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h index 137dc3c84e45..54ebb91dbdcf 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h @@ -74,12 +74,11 @@ #define _PTE_NONE_MASK 0 #ifdef CONFIG_PPC_16K_PAGES -#define _PAGE_PSIZE _PAGE_SPS +#define _PAGE_BASE_NC (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_SPS) #else -#define _PAGE_PSIZE 0 +#define _PAGE_BASE_NC (_PAGE_PRESENT | _PAGE_ACCESSED) #endif -#define _PAGE_BASE_NC (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_PSIZE) #define _PAGE_BASE (_PAGE_BASE_NC) #include @@ -120,7 +119,7 @@ static inline pte_t pte_mkhuge(pte_t pte) #define pte_mkhuge pte_mkhuge -static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long clr, unsigned long set, int huge); static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -142,19 +141,12 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *pt } #define __ptep_set_access_flags __ptep_set_access_flags -static inline unsigned long pgd_leaf_size(pgd_t pgd) -{ - if (pgd_val(pgd) & _PMD_PAGE_8M) - return SZ_8M; - return SZ_4M; -} - -#define pgd_leaf_size pgd_leaf_size - -static inline unsigned long pte_leaf_size(pte_t pte) +static inline unsigned long __pte_leaf_size(pmd_t pmd, pte_t pte) { pte_basic_t val = pte_val(pte); + if (pmd_val(pmd) & _PMD_PAGE_8M) + return SZ_8M; if (val & _PAGE_HUGE) return SZ_512K; if (val & _PAGE_SPS) @@ -162,31 +154,38 @@ static inline unsigned long pte_leaf_size(pte_t pte) return SZ_4K; } -#define pte_leaf_size pte_leaf_size +#define __pte_leaf_size __pte_leaf_size /* * On the 8xx, the page tables are a bit special. For 16k pages, we have * 4 identical entries. For 512k pages, we have 128 entries as if it was * 4k pages, but they are flagged as 512k pages for the hardware. - * For other page sizes, we have a single entry in the table. + * For 8M pages, we have 1024 entries as if it was 4M pages (PMD_SIZE) + * but they are flagged as 8M pages for the hardware. + * For 4k pages, we have a single entry in the table. */ static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr); -static int hugepd_ok(hugepd_t hpd); +static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address); + +static inline bool ptep_is_8m_pmdp(struct mm_struct *mm, unsigned long addr, pte_t *ptep) +{ + return (pmd_t *)ptep == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)); +} static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge) { if (!huge) return PAGE_SIZE / SZ_4K; - else if (hugepd_ok(*((hugepd_t *)pmd))) - return 1; + else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M) + return SZ_4M / SZ_4K; else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE)) return SZ_16K / SZ_4K; else return SZ_512K / SZ_4K; } -static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, - unsigned long clr, unsigned long set, int huge) +static inline pte_basic_t __pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, + unsigned long clr, unsigned long set, int huge) { pte_basic_t *entry = (pte_basic_t *)p; pte_basic_t old = pte_val(*p); @@ -198,7 +197,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p for (i = 0; i < num; i += PAGE_SIZE / SZ_4K, new += PAGE_SIZE) { *entry++ = new; - if (IS_ENABLED(CONFIG_PPC_16K_PAGES) && num != 1) { + if (IS_ENABLED(CONFIG_PPC_16K_PAGES)) { *entry++ = new; *entry++ = new; *entry++ = new; @@ -208,6 +207,21 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p return old; } +static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + unsigned long clr, unsigned long set, int huge) +{ + pte_basic_t old; + + if (huge && ptep_is_8m_pmdp(mm, addr, ptep)) { + pmd_t *pmdp = (pmd_t *)ptep; + + old = __pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, huge); + __pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, huge); + } else { + old = __pte_update(mm, addr, ptep, clr, set, huge); + } + return old; +} #define pte_update pte_update #ifdef CONFIG_PPC_16K_PAGES diff --git a/arch/powerpc/include/asm/nohash/hugetlb-e500.h b/arch/powerpc/include/asm/nohash/hugetlb-e500.h index 8f04ad20e040..cab0e1f1eea0 100644 --- a/arch/powerpc/include/asm/nohash/hugetlb-e500.h +++ b/arch/powerpc/include/asm/nohash/hugetlb-e500.h @@ -2,38 +2,8 @@ #ifndef _ASM_POWERPC_NOHASH_HUGETLB_E500_H #define _ASM_POWERPC_NOHASH_HUGETLB_E500_H -static inline pte_t *hugepd_page(hugepd_t hpd) -{ - if (WARN_ON(!hugepd_ok(hpd))) - return NULL; - - return (pte_t *)((hpd_val(hpd) & ~HUGEPD_SHIFT_MASK) | PD_HUGE); -} - -static inline unsigned int hugepd_shift(hugepd_t hpd) -{ - return hpd_val(hpd) & HUGEPD_SHIFT_MASK; -} - -static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr, - unsigned int pdshift) -{ - /* - * On FSL BookE, we have multiple higher-level table entries that - * point to the same hugepte. Just use the first one since they're all - * identical. So for that case, idx=0. - */ - return hugepd_page(hpd); -} - void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr); -static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift) -{ - /* We use the old format for PPC_E500 */ - *hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift); -} - static inline int check_and_get_huge_psize(int shift) { if (shift & 1) /* Not a power of 4 */ @@ -42,4 +12,13 @@ static inline int check_and_get_huge_psize(int shift) return shift_to_mmu_psize(shift); } +static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) +{ + unsigned int tsize = shift - _PAGE_PSIZE_SHIFT_OFFSET; + pte_basic_t val = (tsize << _PAGE_PSIZE_SHIFT) & _PAGE_PSIZE_MSK; + + return __pte((pte_val(entry) & ~(pte_basic_t)_PAGE_PSIZE_MSK) | val); +} +#define arch_make_huge_pte arch_make_huge_pte + #endif /* _ASM_POWERPC_NOHASH_HUGETLB_E500_H */ diff --git a/arch/powerpc/include/asm/nohash/mmu-e500.h b/arch/powerpc/include/asm/nohash/mmu-e500.h index 6ddced0415cb..b281d9eeaf1e 100644 --- a/arch/powerpc/include/asm/nohash/mmu-e500.h +++ b/arch/powerpc/include/asm/nohash/mmu-e500.h @@ -244,14 +244,11 @@ typedef struct { /* Page size definitions, common between 32 and 64-bit * * shift : is the "PAGE_SHIFT" value for that page size - * penc : is the pte encoding mask * */ struct mmu_psize_def { unsigned int shift; /* number of bits */ - unsigned int enc; /* PTE encoding */ - unsigned int ind; /* Corresponding indirect page size shift */ unsigned int flags; #define MMU_PAGE_SIZE_DIRECT 0x1 /* Supported as a direct size */ #define MMU_PAGE_SIZE_INDIRECT 0x2 /* Supported as an indirect size */ @@ -303,8 +300,7 @@ extern unsigned long linear_map_top; extern int book3e_htw_mode; #define PPC_HTW_NONE 0 -#define PPC_HTW_IBM 1 -#define PPC_HTW_E6500 2 +#define PPC_HTW_E6500 1 /* * 64-bit booke platforms don't load the tlb in the tlb miss handler code. diff --git a/arch/powerpc/include/asm/nohash/pgalloc.h b/arch/powerpc/include/asm/nohash/pgalloc.h index 4b62376318e1..d06efac6d7aa 100644 --- a/arch/powerpc/include/asm/nohash/pgalloc.h +++ b/arch/powerpc/include/asm/nohash/pgalloc.h @@ -44,8 +44,6 @@ static inline void pgtable_free(void *table, int shift) } } -#define get_hugepd_cache_index(x) (x) - static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift) { unsigned long pgf = (unsigned long)table; diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index f5f39d4f03c8..8d1f0b7062eb 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -31,6 +31,13 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p extern int icache_44x_need_flush; +#ifndef pte_huge_size +static inline unsigned long pte_huge_size(pte_t pte) +{ + return PAGE_SIZE; +} +#endif + /* * PTE updates. This function is called whenever an existing * valid PTE is updated. This does -not- include set_pte_at() @@ -52,11 +59,34 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p { pte_basic_t old = pte_val(*p); pte_basic_t new = (old & ~(pte_basic_t)clr) | set; + unsigned long sz; + unsigned long pdsize; + int i; if (new == old) return old; - *p = __pte(new); + if (huge) + sz = pte_huge_size(__pte(old)); + else + sz = PAGE_SIZE; + + if (sz < PMD_SIZE) + pdsize = PAGE_SIZE; + else if (sz < PUD_SIZE) + pdsize = PMD_SIZE; + else if (sz < P4D_SIZE) + pdsize = PUD_SIZE; + else if (sz < PGDIR_SIZE) + pdsize = P4D_SIZE; + else + pdsize = PGDIR_SIZE; + + for (i = 0; i < sz / pdsize; i++, p++) { + *p = __pte(new); + if (new) + new += (unsigned long long)(pdsize / PAGE_SIZE) << PTE_RPN_SHIFT; + } if (IS_ENABLED(CONFIG_44x) && !is_kernel_addr(addr) && (old & _PAGE_EXEC)) icache_44x_need_flush = 1; @@ -340,20 +370,6 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, #define pgprot_writecombine pgprot_noncached_wc -#ifdef CONFIG_HUGETLB_PAGE -static inline int hugepd_ok(hugepd_t hpd) -{ -#ifdef CONFIG_PPC_8xx - return ((hpd_val(hpd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M); -#else - /* We clear the top bit to indicate hugepd */ - return (hpd_val(hpd) && (hpd_val(hpd) & PD_HUGE) == 0); -#endif -} - -#define is_hugepd(hpd) (hugepd_ok(hpd)) -#endif - int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot); void unmap_kernel_page(unsigned long va); diff --git a/arch/powerpc/include/asm/nohash/pte-e500.h b/arch/powerpc/include/asm/nohash/pte-e500.h index f516f0b5b7a8..cb78392494da 100644 --- a/arch/powerpc/include/asm/nohash/pte-e500.h +++ b/arch/powerpc/include/asm/nohash/pte-e500.h @@ -19,20 +19,7 @@ #define _PAGE_BAP_SX 0x000040 #define _PAGE_BAP_UX 0x000080 #define _PAGE_PSIZE_MSK 0x000f00 -#define _PAGE_PSIZE_4K 0x000200 -#define _PAGE_PSIZE_8K 0x000300 -#define _PAGE_PSIZE_16K 0x000400 -#define _PAGE_PSIZE_32K 0x000500 -#define _PAGE_PSIZE_64K 0x000600 -#define _PAGE_PSIZE_128K 0x000700 -#define _PAGE_PSIZE_256K 0x000800 -#define _PAGE_PSIZE_512K 0x000900 -#define _PAGE_PSIZE_1M 0x000a00 -#define _PAGE_PSIZE_2M 0x000b00 -#define _PAGE_PSIZE_4M 0x000c00 -#define _PAGE_PSIZE_8M 0x000d00 -#define _PAGE_PSIZE_16M 0x000e00 -#define _PAGE_PSIZE_32M 0x000f00 +#define _PAGE_TSIZE_4K 0x000100 #define _PAGE_DIRTY 0x001000 /* C: page changed */ #define _PAGE_SW0 0x002000 #define _PAGE_U3 0x004000 @@ -46,6 +33,9 @@ #define _PAGE_NO_CACHE 0x400000 /* I: cache inhibit */ #define _PAGE_WRITETHRU 0x800000 /* W: cache write-through */ +#define _PAGE_PSIZE_SHIFT 7 +#define _PAGE_PSIZE_SHIFT_OFFSET 10 + /* "Higher level" linux bit combinations */ #define _PAGE_EXEC (_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was cache cleaned */ #define _PAGE_READ (_PAGE_BAP_SR | _PAGE_BAP_UR) /* User read permission */ @@ -65,8 +55,6 @@ #define _PAGE_SPECIAL _PAGE_SW0 -/* Base page size */ -#define _PAGE_PSIZE _PAGE_PSIZE_4K #define PTE_RPN_SHIFT (24) #define PTE_WIMGE_SHIFT (19) @@ -89,7 +77,7 @@ * pages. We always set _PAGE_COHERENT when SMP is enabled or * the processor might need it for DMA coherency. */ -#define _PAGE_BASE_NC (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_PSIZE) +#define _PAGE_BASE_NC (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_TSIZE_4K) #if defined(CONFIG_SMP) #define _PAGE_BASE (_PAGE_BASE_NC | _PAGE_COHERENT) #else @@ -105,6 +93,47 @@ static inline pte_t pte_mkexec(pte_t pte) } #define pte_mkexec pte_mkexec +static inline unsigned long pte_huge_size(pte_t pte) +{ + pte_basic_t val = pte_val(pte); + + return 1UL << (((val & _PAGE_PSIZE_MSK) >> _PAGE_PSIZE_SHIFT) + _PAGE_PSIZE_SHIFT_OFFSET); +} +#define pte_huge_size pte_huge_size + +static inline int pmd_leaf(pmd_t pmd) +{ + if (IS_ENABLED(CONFIG_PPC64)) + return (long)pmd_val(pmd) > 0; + else + return pmd_val(pmd) & _PAGE_PSIZE_MSK; +} +#define pmd_leaf pmd_leaf + +static inline unsigned long pmd_leaf_size(pmd_t pmd) +{ + return pte_huge_size(__pte(pmd_val(pmd))); +} +#define pmd_leaf_size pmd_leaf_size + +#ifdef CONFIG_PPC64 +static inline int pud_leaf(pud_t pud) +{ + if (IS_ENABLED(CONFIG_PPC64)) + return (long)pud_val(pud) > 0; + else + return pud_val(pud) & _PAGE_PSIZE_MSK; +} +#define pud_leaf pud_leaf + +static inline unsigned long pud_leaf_size(pud_t pud) +{ + return pte_huge_size(__pte(pud_val(pud))); +} +#define pud_leaf_size pud_leaf_size + +#endif + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index e411e5a70ea3..83d0a4fc5f75 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -269,38 +269,6 @@ static inline const void *pfn_to_kaddr(unsigned long pfn) #define is_kernel_addr(x) ((x) >= TASK_SIZE) #endif -#ifndef CONFIG_PPC_BOOK3S_64 -/* - * Use the top bit of the higher-level page table entries to indicate whether - * the entries we point to contain hugepages. This works because we know that - * the page tables live in kernel space. If we ever decide to support having - * page tables at arbitrary addresses, this breaks and will have to change. - */ -#ifdef CONFIG_PPC64 -#define PD_HUGE 0x8000000000000000UL -#else -#define PD_HUGE 0x80000000 -#endif - -#else /* CONFIG_PPC_BOOK3S_64 */ -/* - * Book3S 64 stores real addresses in the hugepd entries to - * avoid overlaps with _PAGE_PRESENT and _PAGE_PTE. - */ -#define HUGEPD_ADDR_MASK (0x0ffffffffffffffful & ~HUGEPD_SHIFT_MASK) -#endif /* CONFIG_PPC_BOOK3S_64 */ - -/* - * Some number of bits at the level of the page table that points to - * a hugepte are used to encode the size. This masks those bits. - * On 8xx, HW assistance requires 4k alignment for the hugepte. - */ -#ifdef CONFIG_PPC_8xx -#define HUGEPD_SHIFT_MASK 0xfff -#else -#define HUGEPD_SHIFT_MASK 0x3f -#endif - #ifndef __ASSEMBLY__ #ifdef CONFIG_PPC_BOOK3S_64 diff --git a/arch/powerpc/include/asm/pgtable-be-types.h b/arch/powerpc/include/asm/pgtable-be-types.h index 82633200b500..6bd8f89b25dc 100644 --- a/arch/powerpc/include/asm/pgtable-be-types.h +++ b/arch/powerpc/include/asm/pgtable-be-types.h @@ -101,14 +101,4 @@ static inline bool pmd_xchg(pmd_t *pmdp, pmd_t old, pmd_t new) return pmd_raw(old) == prev; } -#ifdef CONFIG_ARCH_HAS_HUGEPD -typedef struct { __be64 pdbe; } hugepd_t; -#define __hugepd(x) ((hugepd_t) { cpu_to_be64(x) }) - -static inline unsigned long hpd_val(hugepd_t x) -{ - return be64_to_cpu(x.pdbe); -} -#endif - #endif /* _ASM_POWERPC_PGTABLE_BE_TYPES_H */ diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h index 082c85cc09b1..7b3d4c592a10 100644 --- a/arch/powerpc/include/asm/pgtable-types.h +++ b/arch/powerpc/include/asm/pgtable-types.h @@ -49,7 +49,11 @@ static inline unsigned long pud_val(pud_t x) #endif /* CONFIG_PPC64 */ /* PGD level */ +#if defined(CONFIG_PPC_E500) && defined(CONFIG_PTE_64BIT) +typedef struct { unsigned long long pgd; } pgd_t; +#else typedef struct { unsigned long pgd; } pgd_t; +#endif #define __pgd(x) ((pgd_t) { (x) }) static inline unsigned long pgd_val(pgd_t x) { @@ -83,13 +87,4 @@ static inline bool pte_xchg(pte_t *ptep, pte_t old, pte_t new) } #endif -#ifdef CONFIG_ARCH_HAS_HUGEPD -typedef struct { unsigned long pd; } hugepd_t; -#define __hugepd(x) ((hugepd_t) { (x) }) -static inline unsigned long hpd_val(hugepd_t x) -{ - return x.pd; -} -#endif - #endif /* _ASM_POWERPC_PGTABLE_TYPES_H */ diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 239709a2f68e..264a6c09517a 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -106,6 +106,9 @@ unsigned long vmalloc_to_phys(void *vmalloc_addr); void pgtable_cache_add(unsigned int shift); +#ifdef CONFIG_PPC32 +void __init *early_alloc_pgtable(unsigned long size); +#endif pte_t *early_pte_alloc_kernel(pmd_t *pmdp, unsigned long va); #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_PPC32) diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index dcf0591ad3c2..63f6b9f513a4 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -485,8 +485,8 @@ interrupt_base_book3e: /* fake trap */ EXCEPTION_STUB(0x160, decrementer) /* 0x0900 */ EXCEPTION_STUB(0x180, fixed_interval) /* 0x0980 */ EXCEPTION_STUB(0x1a0, watchdog) /* 0x09f0 */ - EXCEPTION_STUB(0x1c0, data_tlb_miss) - EXCEPTION_STUB(0x1e0, instruction_tlb_miss) + EXCEPTION_STUB(0x1c0, data_tlb_miss_bolted) + EXCEPTION_STUB(0x1e0, instruction_tlb_miss_bolted) EXCEPTION_STUB(0x200, altivec_unavailable) EXCEPTION_STUB(0x220, altivec_assist) EXCEPTION_STUB(0x260, perfmon) diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S index 39724ff5ae1f..f9a73fae6464 100644 --- a/arch/powerpc/kernel/head_85xx.S +++ b/arch/powerpc/kernel/head_85xx.S @@ -294,9 +294,10 @@ set_ivor: /* Macros to hide the PTE size differences * * FIND_PTE -- walks the page tables given EA & pgdir pointer - * r10 -- EA of fault + * r10 -- free * r11 -- PGDIR pointer * r12 -- free + * r13 -- EA of fault * label 2: is the bailout case * * if we find the pte (fall through): @@ -307,34 +308,34 @@ set_ivor: #ifdef CONFIG_PTE_64BIT #ifdef CONFIG_HUGETLB_PAGE #define FIND_PTE \ - rlwinm r12, r10, 13, 19, 29; /* Compute pgdir/pmd offset */ \ - lwzx r11, r12, r11; /* Get pgd/pmd entry */ \ + rlwinm r12, r13, 14, 18, 28; /* Compute pgdir/pmd offset */ \ + add r12, r11, r12; \ + lwz r11, 4(r12); /* Get pgd/pmd entry */ \ + rlwinm. r10, r11, 32 - _PAGE_PSIZE_SHIFT, 0x1e; /* get tsize*/ \ + bne 1000f; /* Huge page (leaf entry) */ \ rlwinm. r12, r11, 0, 0, 20; /* Extract pt base address */ \ - blt 1000f; /* Normal non-huge page */ \ beq 2f; /* Bail if no table */ \ - oris r11, r11, PD_HUGE@h; /* Put back address bit */ \ - andi. r10, r11, HUGEPD_SHIFT_MASK@l; /* extract size field */ \ - xor r12, r10, r11; /* drop size bits from pointer */ \ - b 1001f; \ -1000: rlwimi r12, r10, 23, 20, 28; /* Compute pte address */ \ + rlwimi r12, r13, 23, 20, 28; /* Compute pte address */ \ li r10, 0; /* clear r10 */ \ -1001: lwz r11, 4(r12); /* Get pte entry */ + lwz r11, 4(r12); /* Get pte entry */ \ +1000: #else #define FIND_PTE \ - rlwinm r12, r10, 13, 19, 29; /* Compute pgdir/pmd offset */ \ - lwzx r11, r12, r11; /* Get pgd/pmd entry */ \ + rlwinm r12, r13, 14, 18, 28; /* Compute pgdir/pmd offset */ \ + add r12, r11, r12; \ + lwz r11, 4(r12); /* Get pgd/pmd entry */ \ rlwinm. r12, r11, 0, 0, 20; /* Extract pt base address */ \ beq 2f; /* Bail if no table */ \ - rlwimi r12, r10, 23, 20, 28; /* Compute pte address */ \ + rlwimi r12, r13, 23, 20, 28; /* Compute pte address */ \ lwz r11, 4(r12); /* Get pte entry */ #endif /* HUGEPAGE */ #else /* !PTE_64BIT */ #define FIND_PTE \ - rlwimi r11, r10, 12, 20, 29; /* Create L1 (pgdir/pmd) address */ \ + rlwimi r11, r13, 12, 20, 29; /* Create L1 (pgdir/pmd) address */ \ lwz r11, 0(r11); /* Get L1 entry */ \ rlwinm. r12, r11, 0, 0, 19; /* Extract L2 (pte) base address */ \ beq 2f; /* Bail if no table */ \ - rlwimi r12, r10, 22, 20, 29; /* Compute PTE address */ \ + rlwimi r12, r13, 22, 20, 29; /* Compute PTE address */ \ lwz r11, 0(r12); /* Get Linux PTE */ #endif @@ -441,13 +442,13 @@ START_BTB_FLUSH_SECTION BTB_FLUSH(r10) 1: END_BTB_FLUSH_SECTION - mfspr r10, SPRN_DEAR /* Get faulting address */ + mfspr r13, SPRN_DEAR /* Get faulting address */ /* If we are faulting a kernel address, we have to use the * kernel page tables. */ lis r11, PAGE_OFFSET@h - cmplw 5, r10, r11 + cmplw 5, r13, r11 blt 5, 3f lis r11, swapper_pg_dir@h ori r11, r11, swapper_pg_dir@l @@ -470,29 +471,14 @@ END_BTB_FLUSH_SECTION #endif 4: - /* Mask of required permission bits. Note that while we - * do copy ESR:ST to _PAGE_WRITE position as trying to write - * to an RO page is pretty common, we don't do it with - * _PAGE_DIRTY. We could do it, but it's a fairly rare - * event so I'd rather take the overhead when it happens - * rather than adding an instruction here. We should measure - * whether the whole thing is worth it in the first place - * as we could avoid loading SPRN_ESR completely in the first - * place... - * - * TODO: Is it worth doing that mfspr & rlwimi in the first - * place or can we save a couple of instructions here ? - */ - mfspr r12,SPRN_ESR + FIND_PTE + #ifdef CONFIG_PTE_64BIT li r13,_PAGE_PRESENT|_PAGE_BAP_SR oris r13,r13,_PAGE_ACCESSED@h #else li r13,_PAGE_PRESENT|_PAGE_READ|_PAGE_ACCESSED #endif - rlwimi r13,r12,11,29,29 - - FIND_PTE andc. r13,r13,r11 /* Check permission */ #ifdef CONFIG_PTE_64BIT @@ -549,13 +535,13 @@ START_BTB_FLUSH_SECTION 1: END_BTB_FLUSH_SECTION - mfspr r10, SPRN_SRR0 /* Get faulting address */ + mfspr r13, SPRN_SRR0 /* Get faulting address */ /* If we are faulting a kernel address, we have to use the * kernel page tables. */ lis r11, PAGE_OFFSET@h - cmplw 5, r10, r11 + cmplw 5, r13, r11 blt 5, 3f lis r11, swapper_pg_dir@h ori r11, r11, swapper_pg_dir@l @@ -564,6 +550,7 @@ END_BTB_FLUSH_SECTION rlwinm r12,r12,0,16,1 mtspr SPRN_MAS1,r12 + FIND_PTE /* Make up the required permissions for kernel code */ #ifdef CONFIG_PTE_64BIT li r13,_PAGE_PRESENT | _PAGE_BAP_SX @@ -584,6 +571,7 @@ END_BTB_FLUSH_SECTION beq 2f /* KUAP fault */ #endif + FIND_PTE /* Make up the required permissions for user code */ #ifdef CONFIG_PTE_64BIT li r13,_PAGE_PRESENT | _PAGE_BAP_UX @@ -593,7 +581,6 @@ END_BTB_FLUSH_SECTION #endif 4: - FIND_PTE andc. r13,r13,r11 /* Check permission */ #ifdef CONFIG_PTE_64BIT @@ -746,17 +733,12 @@ finish_tlb_load: lwz r15, 0(r14) 100: stw r15, 0(r17) - /* - * Calc MAS1_TSIZE from r10 (which has pshift encoded) - * tlb_enc = (pshift - 10). - */ - subi r15, r10, 10 mfspr r16, SPRN_MAS1 - rlwimi r16, r15, 7, 20, 24 + rlwimi r16, r10, MAS1_TSIZE_SHIFT, MAS1_TSIZE_MASK mtspr SPRN_MAS1, r16 /* copy the pshift for use later */ - mr r14, r10 + addi r14, r10, _PAGE_PSIZE_SHIFT_OFFSET /* fall through */ diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index edc479a7c2bc..ac74321b1192 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -415,14 +415,13 @@ FixupDAR:/* Entry point for dcbx workaround. */ oris r11, r11, (swapper_pg_dir - PAGE_OFFSET)@ha 3: lwz r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11) /* Get the level 1 entry */ + rlwinm r11, r11, 0, ~_PMD_PAGE_8M mtspr SPRN_MD_TWC, r11 - mtcrf 0x01, r11 mfspr r11, SPRN_MD_TWC lwz r11, 0(r11) /* Get the pte */ - bt 28,200f /* bit 28 = Large page (8M) */ /* concat physical page address(r11) and page offset(r10) */ rlwimi r11, r10, 0, 32 - PAGE_SHIFT, 31 -201: lwz r11,0(r11) + lwz r11,0(r11) /* Check if it really is a dcbx instruction. */ /* dcbt and dcbtst does not generate DTLB Misses/Errors, * no need to include them here */ @@ -441,11 +440,6 @@ FixupDAR:/* Entry point for dcbx workaround. */ 141: mfspr r10,SPRN_M_TW b DARFixed /* Nope, go back to normal TLB processing */ -200: - /* concat physical page address(r11) and page offset(r10) */ - rlwimi r11, r10, 0, 32 - PAGE_SHIFT_8M, 31 - b 201b - 144: mfspr r10, SPRN_DSISR rlwinm r10, r10,0,7,5 /* Clear store bit for buggy dcbst insn */ mtspr SPRN_DSISR, r10 diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index ae36a129789f..22f83fbbc762 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -696,11 +696,7 @@ __init u64 ppc64_bolted_size(void) { #ifdef CONFIG_PPC_BOOK3E_64 /* Freescale BookE bolts the entire linear mapping */ - /* XXX: BookE ppc64_rma_limit setup seems to disagree? */ - if (early_mmu_has_feature(MMU_FTR_TYPE_FSL_E)) - return linear_map_top; - /* Other BookE, we assume the first GB is bolted */ - return 1ul << 30; + return linear_map_top; #else /* BookS radix, does not take faults on linear mapping */ if (early_radix_enabled()) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index 01c3b4b65241..6727a15ab94f 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -1233,10 +1233,6 @@ void __init hash__early_init_mmu(void) __pmd_table_size = H_PMD_TABLE_SIZE; __pud_table_size = H_PUD_TABLE_SIZE; __pgd_table_size = H_PGD_TABLE_SIZE; - /* - * 4k use hugepd format, so for hash set then to - * zero - */ __pmd_val_bits = HASH_PMD_VAL_BITS; __pud_val_bits = HASH_PUD_VAL_BITS; __pgd_val_bits = HASH_PGD_VAL_BITS; @@ -1546,6 +1542,13 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, goto bail; } + if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !radix_enabled()) { + if (hugeshift == PMD_SHIFT && psize == MMU_PAGE_16M) + hugeshift = mmu_psize_defs[MMU_PAGE_16M].shift; + if (hugeshift == PUD_SHIFT && psize == MMU_PAGE_16G) + hugeshift = mmu_psize_defs[MMU_PAGE_16G].shift; + } + /* * Add _PAGE_PRESENT to the required access perm. If there are parallel * updates to the pte that can possibly clear _PAGE_PTE, catch that too. diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c index 5a2e512e96db..83c3361b358b 100644 --- a/arch/powerpc/mm/book3s64/hugetlbpage.c +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c @@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid, /* If PTE permissions don't match, take page fault */ if (unlikely(!check_pte_access(access, old_pte))) return 1; + /* + * If hash-4k, hugepages use seeral contiguous PxD entries + * so bail out and let mm make the page young or dirty + */ + if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) { + if (!(old_pte & _PAGE_ACCESSED)) + return 1; + if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY)) + return 1; + } /* * Try to lock the PTE, add ACCESSED and DIRTY if it was diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 2975ea0841ba..f4d8d3c40e5c 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -461,18 +461,6 @@ static inline void pgtable_free(void *table, int index) case PUD_INDEX: __pud_free(table); break; -#if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE) - /* 16M hugepd directory at pud level */ - case HTLB_16M_INDEX: - BUILD_BUG_ON(H_16M_CACHE_INDEX <= 0); - kmem_cache_free(PGT_CACHE(H_16M_CACHE_INDEX), table); - break; - /* 16G hugepd directory at the pgd level */ - case HTLB_16G_INDEX: - BUILD_BUG_ON(H_16G_CACHE_INDEX <= 0); - kmem_cache_free(PGT_CACHE(H_16G_CACHE_INDEX), table); - break; -#endif /* We don't free pgd table via RCU callback */ default: BUG(); diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 594a4b7b2ca2..6b043180220a 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -28,8 +28,6 @@ bool hugetlb_disabled = false; -#define hugepd_none(hpd) (hpd_val(hpd) == 0) - #define PTE_T_ORDER (__builtin_ffs(sizeof(pte_basic_t)) - \ __builtin_ffs(sizeof(void *))) @@ -42,156 +40,43 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long s return __find_linux_pte(mm->pgd, addr, NULL, NULL); } -static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, - unsigned long address, unsigned int pdshift, - unsigned int pshift, spinlock_t *ptl) -{ - struct kmem_cache *cachep; - pte_t *new; - int i; - int num_hugepd; - - if (pshift >= pdshift) { - cachep = PGT_CACHE(PTE_T_ORDER); - num_hugepd = 1 << (pshift - pdshift); - } else { - cachep = PGT_CACHE(pdshift - pshift); - num_hugepd = 1; - } - - if (!cachep) { - WARN_ONCE(1, "No page table cache created for hugetlb tables"); - return -ENOMEM; - } - - new = kmem_cache_alloc(cachep, pgtable_gfp_flags(mm, GFP_KERNEL)); - - BUG_ON(pshift > HUGEPD_SHIFT_MASK); - BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK); - - if (!new) - return -ENOMEM; - - /* - * Make sure other cpus find the hugepd set only after a - * properly initialized page table is visible to them. - * For more details look for comment in __pte_alloc(). - */ - smp_wmb(); - - spin_lock(ptl); - /* - * We have multiple higher-level entries that point to the same - * actual pte location. Fill in each as we go and backtrack on error. - * We need all of these so the DTLB pgtable walk code can find the - * right higher-level entry without knowing if it's a hugepage or not. - */ - for (i = 0; i < num_hugepd; i++, hpdp++) { - if (unlikely(!hugepd_none(*hpdp))) - break; - hugepd_populate(hpdp, new, pshift); - } - /* If we bailed from the for loop early, an error occurred, clean up */ - if (i < num_hugepd) { - for (i = i - 1 ; i >= 0; i--, hpdp--) - *hpdp = __hugepd(0); - kmem_cache_free(cachep, new); - } else { - kmemleak_ignore(new); - } - spin_unlock(ptl); - return 0; -} - -/* - * At this point we do the placement change only for BOOK3S 64. This would - * possibly work on other subarchs. - */ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long sz) { - pgd_t *pg; - p4d_t *p4; - pud_t *pu; - pmd_t *pm; - hugepd_t *hpdp = NULL; - unsigned pshift = __ffs(sz); - unsigned pdshift = PGDIR_SHIFT; - spinlock_t *ptl; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; - addr &= ~(sz-1); - pg = pgd_offset(mm, addr); - p4 = p4d_offset(pg, addr); + addr &= ~(sz - 1); -#ifdef CONFIG_PPC_BOOK3S_64 - if (pshift == PGDIR_SHIFT) - /* 16GB huge page */ - return (pte_t *) p4; - else if (pshift > PUD_SHIFT) { - /* - * We need to use hugepd table - */ - ptl = &mm->page_table_lock; - hpdp = (hugepd_t *)p4; - } else { - pdshift = PUD_SHIFT; - pu = pud_alloc(mm, p4, addr); - if (!pu) - return NULL; - if (pshift == PUD_SHIFT) - return (pte_t *)pu; - else if (pshift > PMD_SHIFT) { - ptl = pud_lockptr(mm, pu); - hpdp = (hugepd_t *)pu; - } else { - pdshift = PMD_SHIFT; - pm = pmd_alloc(mm, pu, addr); - if (!pm) - return NULL; - if (pshift == PMD_SHIFT) - /* 16MB hugepage */ - return (pte_t *)pm; - else { - ptl = pmd_lockptr(mm, pm); - hpdp = (hugepd_t *)pm; + p4d = p4d_offset(pgd_offset(mm, addr), addr); + if (!mm_pud_folded(mm) && sz >= P4D_SIZE) + return (pte_t *)p4d; + + pud = pud_alloc(mm, p4d, addr); + if (!pud) + return NULL; + if (!mm_pmd_folded(mm) && sz >= PUD_SIZE) + return (pte_t *)pud; + + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + + if (sz >= PMD_SIZE) { + /* On 8xx, all hugepages are handled as contiguous PTEs */ + if (IS_ENABLED(CONFIG_PPC_8xx)) { + int i; + + for (i = 0; i < sz / PMD_SIZE; i++) { + if (!pte_alloc_huge(mm, pmd + i, addr)) + return NULL; } } + return (pte_t *)pmd; } -#else - if (pshift >= PGDIR_SHIFT) { - ptl = &mm->page_table_lock; - hpdp = (hugepd_t *)p4; - } else { - pdshift = PUD_SHIFT; - pu = pud_alloc(mm, p4, addr); - if (!pu) - return NULL; - if (pshift >= PUD_SHIFT) { - ptl = pud_lockptr(mm, pu); - hpdp = (hugepd_t *)pu; - } else { - pdshift = PMD_SHIFT; - pm = pmd_alloc(mm, pu, addr); - if (!pm) - return NULL; - ptl = pmd_lockptr(mm, pm); - hpdp = (hugepd_t *)pm; - } - } -#endif - if (!hpdp) - return NULL; - if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT) - return pte_alloc_huge(mm, (pmd_t *)hpdp, addr); - - BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp)); - - if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, - pdshift, pshift, ptl)) - return NULL; - - return hugepte_offset(*hpdp, addr, pdshift); + return pte_alloc_huge(mm, pmd, addr); } #ifdef CONFIG_PPC_BOOK3S_64 @@ -248,264 +133,6 @@ int __init alloc_bootmem_huge_page(struct hstate *h, int nid) return __alloc_bootmem_huge_page(h, nid); } -#ifndef CONFIG_PPC_BOOK3S_64 -#define HUGEPD_FREELIST_SIZE \ - ((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t)) - -struct hugepd_freelist { - struct rcu_head rcu; - unsigned int index; - void *ptes[]; -}; - -static DEFINE_PER_CPU(struct hugepd_freelist *, hugepd_freelist_cur); - -static void hugepd_free_rcu_callback(struct rcu_head *head) -{ - struct hugepd_freelist *batch = - container_of(head, struct hugepd_freelist, rcu); - unsigned int i; - - for (i = 0; i < batch->index; i++) - kmem_cache_free(PGT_CACHE(PTE_T_ORDER), batch->ptes[i]); - - free_page((unsigned long)batch); -} - -static void hugepd_free(struct mmu_gather *tlb, void *hugepte) -{ - struct hugepd_freelist **batchp; - - batchp = &get_cpu_var(hugepd_freelist_cur); - - if (atomic_read(&tlb->mm->mm_users) < 2 || - mm_is_thread_local(tlb->mm)) { - kmem_cache_free(PGT_CACHE(PTE_T_ORDER), hugepte); - put_cpu_var(hugepd_freelist_cur); - return; - } - - if (*batchp == NULL) { - *batchp = (struct hugepd_freelist *)__get_free_page(GFP_ATOMIC); - (*batchp)->index = 0; - } - - (*batchp)->ptes[(*batchp)->index++] = hugepte; - if ((*batchp)->index == HUGEPD_FREELIST_SIZE) { - call_rcu(&(*batchp)->rcu, hugepd_free_rcu_callback); - *batchp = NULL; - } - put_cpu_var(hugepd_freelist_cur); -} -#else -static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte) {} -#endif - -/* Return true when the entry to be freed maps more than the area being freed */ -static bool range_is_outside_limits(unsigned long start, unsigned long end, - unsigned long floor, unsigned long ceiling, - unsigned long mask) -{ - if ((start & mask) < floor) - return true; - if (ceiling) { - ceiling &= mask; - if (!ceiling) - return true; - } - return end - 1 > ceiling - 1; -} - -static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift, - unsigned long start, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pte_t *hugepte = hugepd_page(*hpdp); - int i; - - unsigned long pdmask = ~((1UL << pdshift) - 1); - unsigned int num_hugepd = 1; - unsigned int shift = hugepd_shift(*hpdp); - - /* Note: On fsl the hpdp may be the first of several */ - if (shift > pdshift) - num_hugepd = 1 << (shift - pdshift); - - if (range_is_outside_limits(start, end, floor, ceiling, pdmask)) - return; - - for (i = 0; i < num_hugepd; i++, hpdp++) - *hpdp = __hugepd(0); - - if (shift >= pdshift) - hugepd_free(tlb, hugepte); - else - pgtable_free_tlb(tlb, hugepte, - get_hugepd_cache_index(pdshift - shift)); -} - -static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pgtable_t token = pmd_pgtable(*pmd); - - if (range_is_outside_limits(addr, end, floor, ceiling, PMD_MASK)) - return; - - pmd_clear(pmd); - pte_free_tlb(tlb, token, addr); - mm_dec_nr_ptes(tlb->mm); -} - -static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pmd_t *pmd; - unsigned long next; - unsigned long start; - - start = addr; - do { - unsigned long more; - - pmd = pmd_offset(pud, addr); - next = pmd_addr_end(addr, end); - if (!is_hugepd(__hugepd(pmd_val(*pmd)))) { - if (pmd_none_or_clear_bad(pmd)) - continue; - - /* - * if it is not hugepd pointer, we should already find - * it cleared. - */ - WARN_ON(!IS_ENABLED(CONFIG_PPC_8xx)); - - hugetlb_free_pte_range(tlb, pmd, addr, end, floor, ceiling); - - continue; - } - /* - * Increment next by the size of the huge mapping since - * there may be more than one entry at this level for a - * single hugepage, but all of them point to - * the same kmem cache that holds the hugepte. - */ - more = addr + (1UL << hugepd_shift(*(hugepd_t *)pmd)); - if (more > next) - next = more; - - free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT, - addr, next, floor, ceiling); - } while (addr = next, addr != end); - - if (range_is_outside_limits(start, end, floor, ceiling, PUD_MASK)) - return; - - pmd = pmd_offset(pud, start & PUD_MASK); - pud_clear(pud); - pmd_free_tlb(tlb, pmd, start & PUD_MASK); - mm_dec_nr_pmds(tlb->mm); -} - -static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pud_t *pud; - unsigned long next; - unsigned long start; - - start = addr; - do { - pud = pud_offset(p4d, addr); - next = pud_addr_end(addr, end); - if (!is_hugepd(__hugepd(pud_val(*pud)))) { - if (pud_none_or_clear_bad(pud)) - continue; - hugetlb_free_pmd_range(tlb, pud, addr, next, floor, - ceiling); - } else { - unsigned long more; - /* - * Increment next by the size of the huge mapping since - * there may be more than one entry at this level for a - * single hugepage, but all of them point to - * the same kmem cache that holds the hugepte. - */ - more = addr + (1UL << hugepd_shift(*(hugepd_t *)pud)); - if (more > next) - next = more; - - free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT, - addr, next, floor, ceiling); - } - } while (addr = next, addr != end); - - if (range_is_outside_limits(start, end, floor, ceiling, PGDIR_MASK)) - return; - - pud = pud_offset(p4d, start & PGDIR_MASK); - p4d_clear(p4d); - pud_free_tlb(tlb, pud, start & PGDIR_MASK); - mm_dec_nr_puds(tlb->mm); -} - -/* - * This function frees user-level page tables of a process. - */ -void hugetlb_free_pgd_range(struct mmu_gather *tlb, - unsigned long addr, unsigned long end, - unsigned long floor, unsigned long ceiling) -{ - pgd_t *pgd; - p4d_t *p4d; - unsigned long next; - - /* - * Because there are a number of different possible pagetable - * layouts for hugepage ranges, we limit knowledge of how - * things should be laid out to the allocation path - * (huge_pte_alloc(), above). Everything else works out the - * structure as it goes from information in the hugepd - * pointers. That means that we can't here use the - * optimization used in the normal page free_pgd_range(), of - * checking whether we're actually covering a large enough - * range to have to do anything at the top level of the walk - * instead of at the bottom. - * - * To make sense of this, you should probably go read the big - * block comment at the top of the normal free_pgd_range(), - * too. - */ - - do { - next = pgd_addr_end(addr, end); - pgd = pgd_offset(tlb->mm, addr); - p4d = p4d_offset(pgd, addr); - if (!is_hugepd(__hugepd(pgd_val(*pgd)))) { - if (p4d_none_or_clear_bad(p4d)) - continue; - hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling); - } else { - unsigned long more; - /* - * Increment next by the size of the huge mapping since - * there may be more than one entry at the pgd level - * for a single hugepage, but all of them point to the - * same kmem cache that holds the hugepte. - */ - more = addr + (1UL << hugepd_shift(*(hugepd_t *)pgd)); - if (more > next) - next = more; - - free_hugepd_range(tlb, (hugepd_t *)p4d, PGDIR_SHIFT, - addr, next, floor, ceiling); - } - } while (addr = next, addr != end); -} - bool __init arch_hugetlb_valid_size(unsigned long size) { int shift = __ffs(size); @@ -552,44 +179,14 @@ static int __init hugetlbpage_init(void) for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { unsigned shift; - unsigned pdshift; if (!mmu_psize_defs[psize].shift) continue; shift = mmu_psize_to_shift(psize); -#ifdef CONFIG_PPC_BOOK3S_64 - if (shift > PGDIR_SHIFT) - continue; - else if (shift > PUD_SHIFT) - pdshift = PGDIR_SHIFT; - else if (shift > PMD_SHIFT) - pdshift = PUD_SHIFT; - else - pdshift = PMD_SHIFT; -#else - if (shift < PUD_SHIFT) - pdshift = PMD_SHIFT; - else if (shift < PGDIR_SHIFT) - pdshift = PUD_SHIFT; - else - pdshift = PGDIR_SHIFT; -#endif - if (add_huge_page_size(1ULL << shift) < 0) continue; - /* - * if we have pdshift and shift value same, we don't - * use pgt cache for hugepd. - */ - if (pdshift > shift) { - if (!IS_ENABLED(CONFIG_PPC_8xx)) - pgtable_cache_add(pdshift - shift); - } else if (IS_ENABLED(CONFIG_PPC_E500) || - IS_ENABLED(CONFIG_PPC_8xx)) { - pgtable_cache_add(PTE_T_ORDER); - } configured = true; } diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c index 21131b96d209..9b4a675eb8f8 100644 --- a/arch/powerpc/mm/init-common.c +++ b/arch/powerpc/mm/init-common.c @@ -123,12 +123,8 @@ void pgtable_cache_add(unsigned int shift) /* When batching pgtable pointers for RCU freeing, we store * the index size in the low bits. Table alignment must be * big enough to fit it. - * - * Likewise, hugeapge pagetable pointers contain a (different) - * shift value in the low bits. All tables must be aligned so - * as to leave enough 0 bits in the address to contain it. */ - unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1, - HUGEPD_SHIFT_MASK + 1); + */ + unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1; struct kmem_cache *new = NULL; /* It would be nice if this was a BUILD_BUG_ON(), but at the diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c index 2784224054f8..989d6cdf4141 100644 --- a/arch/powerpc/mm/kasan/8xx.c +++ b/arch/powerpc/mm/kasan/8xx.c @@ -6,28 +6,33 @@ #include #include +#include + static int __init kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block) { pmd_t *pmd = pmd_off_k(k_start); unsigned long k_cur, k_next; - for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) { - pte_basic_t *new; + for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++, block += SZ_4M) { + pte_t *ptep; + int i; k_next = pgd_addr_end(k_cur, k_end); - k_next = pgd_addr_end(k_next, k_end); if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte) continue; - new = memblock_alloc(sizeof(pte_basic_t), SZ_4K); - if (!new) + ptep = memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE); + if (!ptep) return -ENOMEM; - *new = pte_val(pte_mkhuge(pfn_pte(PHYS_PFN(__pa(block)), PAGE_KERNEL))); + for (i = 0; i < PTRS_PER_PTE; i++) { + pte_t pte = pte_mkhuge(pfn_pte(PHYS_PFN(__pa(block + i * PAGE_SIZE)), PAGE_KERNEL)); - hugepd_populate_kernel((hugepd_t *)pmd, (pte_t *)new, PAGE_SHIFT_8M); - hugepd_populate_kernel((hugepd_t *)pmd + 1, (pte_t *)new, PAGE_SHIFT_8M); + __set_pte_at(&init_mm, k_cur, ptep + i, pte, 1); + } + pmd_populate_kernel(&init_mm, pmd, ptep); + *pmd = __pmd(pmd_val(*pmd) | _PMD_PAGE_8M); } return 0; } diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c index 43d4842bb1c7..388bba0ab3e7 100644 --- a/arch/powerpc/mm/nohash/8xx.c +++ b/arch/powerpc/mm/nohash/8xx.c @@ -11,6 +11,7 @@ #include #include +#include #include @@ -48,20 +49,6 @@ unsigned long p_block_mapped(phys_addr_t pa) return 0; } -static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va) -{ - if (hpd_val(*pmdp) == 0) { - pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K); - - if (!ptep) - return NULL; - - hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M); - hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M); - } - return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT); -} - static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa, pgprot_t prot, int psize, bool new) { @@ -75,26 +62,36 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa, if (WARN_ON(slab_is_available())) return -EINVAL; - if (psize == MMU_PAGE_512K) + if (psize == MMU_PAGE_512K) { ptep = early_pte_alloc_kernel(pmdp, va); - else - ptep = early_hugepd_alloc_kernel((hugepd_t *)pmdp, va); + /* The PTE should never be already present */ + if (WARN_ON(pte_present(*ptep) && pgprot_val(prot))) + return -EINVAL; + } else { + if (WARN_ON(!pmd_none(*pmdp) || !pmd_none(*(pmdp + 1)))) + return -EINVAL; + + ptep = early_alloc_pgtable(PTE_FRAG_SIZE); + pmd_populate_kernel(&init_mm, pmdp, ptep); + + ptep = early_alloc_pgtable(PTE_FRAG_SIZE); + pmd_populate_kernel(&init_mm, pmdp + 1, ptep); + + ptep = (pte_t *)pmdp; + } } else { if (psize == MMU_PAGE_512K) ptep = pte_offset_kernel(pmdp, va); else - ptep = hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT); + ptep = (pte_t *)pmdp; } if (WARN_ON(!ptep)) return -ENOMEM; - /* The PTE should never be already present */ - if (new && WARN_ON(pte_present(*ptep) && pgprot_val(prot))) - return -EINVAL; - set_huge_pte_at(&init_mm, va, ptep, - pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize); + pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), + 1UL << mmu_psize_to_shift(psize)); return 0; } diff --git a/arch/powerpc/mm/nohash/Makefile b/arch/powerpc/mm/nohash/Makefile index 86d0fe434824..cf60c776c883 100644 --- a/arch/powerpc/mm/nohash/Makefile +++ b/arch/powerpc/mm/nohash/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-y += mmu_context.o tlb.o tlb_low.o kup.o -obj-$(CONFIG_PPC_BOOK3E_64) += tlb_low_64e.o book3e_pgtable.o +obj-$(CONFIG_PPC_BOOK3E_64) += tlb_64e.o tlb_low_64e.o book3e_pgtable.o obj-$(CONFIG_44x) += 44x.o obj-$(CONFIG_PPC_8xx) += 8xx.o obj-$(CONFIG_PPC_E500) += e500.o diff --git a/arch/powerpc/mm/nohash/book3e_pgtable.c b/arch/powerpc/mm/nohash/book3e_pgtable.c index 1c5e4ecbebeb..ad2a7c26f2a0 100644 --- a/arch/powerpc/mm/nohash/book3e_pgtable.c +++ b/arch/powerpc/mm/nohash/book3e_pgtable.c @@ -29,10 +29,10 @@ int __meminit vmemmap_create_mapping(unsigned long start, _PAGE_KERNEL_RW; /* PTEs only contain page size encodings up to 32M */ - BUG_ON(mmu_psize_defs[mmu_vmemmap_psize].enc > 0xf); + BUG_ON(mmu_psize_defs[mmu_vmemmap_psize].shift - 10 > 0xf); /* Encode the size in the PTE */ - flags |= mmu_psize_defs[mmu_vmemmap_psize].enc << 8; + flags |= (mmu_psize_defs[mmu_vmemmap_psize].shift - 10) << 8; /* For each PTE for that area, map things. Note that we don't * increment phys because all PTEs are of the large size and diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c index 5ffa0af4328a..b653a7be4cb1 100644 --- a/arch/powerpc/mm/nohash/tlb.c +++ b/arch/powerpc/mm/nohash/tlb.c @@ -53,37 +53,30 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = { [MMU_PAGE_4K] = { .shift = 12, - .enc = BOOK3E_PAGESZ_4K, }, [MMU_PAGE_2M] = { .shift = 21, - .enc = BOOK3E_PAGESZ_2M, }, [MMU_PAGE_4M] = { .shift = 22, - .enc = BOOK3E_PAGESZ_4M, }, [MMU_PAGE_16M] = { .shift = 24, - .enc = BOOK3E_PAGESZ_16M, }, [MMU_PAGE_64M] = { .shift = 26, - .enc = BOOK3E_PAGESZ_64M, }, [MMU_PAGE_256M] = { .shift = 28, - .enc = BOOK3E_PAGESZ_256M, }, [MMU_PAGE_1G] = { .shift = 30, - .enc = BOOK3E_PAGESZ_1GB, }, }; static inline int mmu_get_tsize(int psize) { - return mmu_psize_defs[psize].enc; + return mmu_psize_defs[psize].shift - 10; } #else static inline int mmu_get_tsize(int psize) @@ -110,28 +103,6 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = { }; #endif -/* The variables below are currently only used on 64-bit Book3E - * though this will probably be made common with other nohash - * implementations at some point - */ -#ifdef CONFIG_PPC64 - -int mmu_pte_psize; /* Page size used for PTE pages */ -int mmu_vmemmap_psize; /* Page size used for the virtual mem map */ -int book3e_htw_mode; /* HW tablewalk? Value is PPC_HTW_* */ -unsigned long linear_map_top; /* Top of linear mapping */ - - -/* - * Number of bytes to add to SPRN_SPRG_TLB_EXFRAME on crit/mcheck/debug - * exceptions. This is used for bolted and e6500 TLB miss handlers which - * do not modify this SPRG in the TLB miss code; for other TLB miss handlers, - * this is set to zero. - */ -int extlb_level_exc; - -#endif /* CONFIG_PPC64 */ - #ifdef CONFIG_PPC_E500 /* next_tlbcam_idx is used to round-robin tlbcam entry assignment */ DEFINE_PER_CPU(int, next_tlbcam_idx); @@ -358,381 +329,7 @@ void tlb_flush(struct mmu_gather *tlb) flush_tlb_mm(tlb->mm); } -/* - * Below are functions specific to the 64-bit variant of Book3E though that - * may change in the future - */ - -#ifdef CONFIG_PPC64 - -/* - * Handling of virtual linear page tables or indirect TLB entries - * flushing when PTE pages are freed - */ -void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address) -{ - int tsize = mmu_psize_defs[mmu_pte_psize].enc; - - if (book3e_htw_mode != PPC_HTW_NONE) { - unsigned long start = address & PMD_MASK; - unsigned long end = address + PMD_SIZE; - unsigned long size = 1UL << mmu_psize_defs[mmu_pte_psize].shift; - - /* This isn't the most optimal, ideally we would factor out the - * while preempt & CPU mask mucking around, or even the IPI but - * it will do for now - */ - while (start < end) { - __flush_tlb_page(tlb->mm, start, tsize, 1); - start += size; - } - } else { - unsigned long rmask = 0xf000000000000000ul; - unsigned long rid = (address & rmask) | 0x1000000000000000ul; - unsigned long vpte = address & ~rmask; - - vpte = (vpte >> (PAGE_SHIFT - 3)) & ~0xffful; - vpte |= rid; - __flush_tlb_page(tlb->mm, vpte, tsize, 0); - } -} - -static void __init setup_page_sizes(void) -{ - unsigned int tlb0cfg; - unsigned int tlb0ps; - unsigned int eptcfg; - int i, psize; - -#ifdef CONFIG_PPC_E500 - unsigned int mmucfg = mfspr(SPRN_MMUCFG); - int fsl_mmu = mmu_has_feature(MMU_FTR_TYPE_FSL_E); - - if (fsl_mmu && (mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V1) { - unsigned int tlb1cfg = mfspr(SPRN_TLB1CFG); - unsigned int min_pg, max_pg; - - min_pg = (tlb1cfg & TLBnCFG_MINSIZE) >> TLBnCFG_MINSIZE_SHIFT; - max_pg = (tlb1cfg & TLBnCFG_MAXSIZE) >> TLBnCFG_MAXSIZE_SHIFT; - - for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { - struct mmu_psize_def *def; - unsigned int shift; - - def = &mmu_psize_defs[psize]; - shift = def->shift; - - if (shift == 0 || shift & 1) - continue; - - /* adjust to be in terms of 4^shift Kb */ - shift = (shift - 10) >> 1; - - if ((shift >= min_pg) && (shift <= max_pg)) - def->flags |= MMU_PAGE_SIZE_DIRECT; - } - - goto out; - } - - if (fsl_mmu && (mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2) { - u32 tlb1cfg, tlb1ps; - - tlb0cfg = mfspr(SPRN_TLB0CFG); - tlb1cfg = mfspr(SPRN_TLB1CFG); - tlb1ps = mfspr(SPRN_TLB1PS); - eptcfg = mfspr(SPRN_EPTCFG); - - if ((tlb1cfg & TLBnCFG_IND) && (tlb0cfg & TLBnCFG_PT)) - book3e_htw_mode = PPC_HTW_E6500; - - /* - * We expect 4K subpage size and unrestricted indirect size. - * The lack of a restriction on indirect size is a Freescale - * extension, indicated by PSn = 0 but SPSn != 0. - */ - if (eptcfg != 2) - book3e_htw_mode = PPC_HTW_NONE; - - for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { - struct mmu_psize_def *def = &mmu_psize_defs[psize]; - - if (!def->shift) - continue; - - if (tlb1ps & (1U << (def->shift - 10))) { - def->flags |= MMU_PAGE_SIZE_DIRECT; - - if (book3e_htw_mode && psize == MMU_PAGE_2M) - def->flags |= MMU_PAGE_SIZE_INDIRECT; - } - } - - goto out; - } -#endif - - tlb0cfg = mfspr(SPRN_TLB0CFG); - tlb0ps = mfspr(SPRN_TLB0PS); - eptcfg = mfspr(SPRN_EPTCFG); - - /* Look for supported direct sizes */ - for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { - struct mmu_psize_def *def = &mmu_psize_defs[psize]; - - if (tlb0ps & (1U << (def->shift - 10))) - def->flags |= MMU_PAGE_SIZE_DIRECT; - } - - /* Indirect page sizes supported ? */ - if ((tlb0cfg & TLBnCFG_IND) == 0 || - (tlb0cfg & TLBnCFG_PT) == 0) - goto out; - - book3e_htw_mode = PPC_HTW_IBM; - - /* Now, we only deal with one IND page size for each - * direct size. Hopefully all implementations today are - * unambiguous, but we might want to be careful in the - * future. - */ - for (i = 0; i < 3; i++) { - unsigned int ps, sps; - - sps = eptcfg & 0x1f; - eptcfg >>= 5; - ps = eptcfg & 0x1f; - eptcfg >>= 5; - if (!ps || !sps) - continue; - for (psize = 0; psize < MMU_PAGE_COUNT; psize++) { - struct mmu_psize_def *def = &mmu_psize_defs[psize]; - - if (ps == (def->shift - 10)) - def->flags |= MMU_PAGE_SIZE_INDIRECT; - if (sps == (def->shift - 10)) - def->ind = ps + 10; - } - } - -out: - /* Cleanup array and print summary */ - pr_info("MMU: Supported page sizes\n"); - for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { - struct mmu_psize_def *def = &mmu_psize_defs[psize]; - const char *__page_type_names[] = { - "unsupported", - "direct", - "indirect", - "direct & indirect" - }; - if (def->flags == 0) { - def->shift = 0; - continue; - } - pr_info(" %8ld KB as %s\n", 1ul << (def->shift - 10), - __page_type_names[def->flags & 0x3]); - } -} - -static void __init setup_mmu_htw(void) -{ - /* - * If we want to use HW tablewalk, enable it by patching the TLB miss - * handlers to branch to the one dedicated to it. - */ - - switch (book3e_htw_mode) { - case PPC_HTW_IBM: - patch_exception(0x1c0, exc_data_tlb_miss_htw_book3e); - patch_exception(0x1e0, exc_instruction_tlb_miss_htw_book3e); - break; -#ifdef CONFIG_PPC_E500 - case PPC_HTW_E6500: - extlb_level_exc = EX_TLB_SIZE; - patch_exception(0x1c0, exc_data_tlb_miss_e6500_book3e); - patch_exception(0x1e0, exc_instruction_tlb_miss_e6500_book3e); - break; -#endif - } - pr_info("MMU: Book3E HW tablewalk %s\n", - book3e_htw_mode != PPC_HTW_NONE ? "enabled" : "not supported"); -} - -/* - * Early initialization of the MMU TLB code - */ -static void early_init_this_mmu(void) -{ - unsigned int mas4; - - /* Set MAS4 based on page table setting */ - - mas4 = 0x4 << MAS4_WIMGED_SHIFT; - switch (book3e_htw_mode) { - case PPC_HTW_E6500: - mas4 |= MAS4_INDD; - mas4 |= BOOK3E_PAGESZ_2M << MAS4_TSIZED_SHIFT; - mas4 |= MAS4_TLBSELD(1); - mmu_pte_psize = MMU_PAGE_2M; - break; - - case PPC_HTW_IBM: - mas4 |= MAS4_INDD; - mas4 |= BOOK3E_PAGESZ_1M << MAS4_TSIZED_SHIFT; - mmu_pte_psize = MMU_PAGE_1M; - break; - - case PPC_HTW_NONE: - mas4 |= BOOK3E_PAGESZ_4K << MAS4_TSIZED_SHIFT; - mmu_pte_psize = mmu_virtual_psize; - break; - } - mtspr(SPRN_MAS4, mas4); - -#ifdef CONFIG_PPC_E500 - if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) { - unsigned int num_cams; - bool map = true; - - /* use a quarter of the TLBCAM for bolted linear map */ - num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4; - - /* - * Only do the mapping once per core, or else the - * transient mapping would cause problems. - */ -#ifdef CONFIG_SMP - if (hweight32(get_tensr()) > 1) - map = false; -#endif - - if (map) - linear_map_top = map_mem_in_cams(linear_map_top, - num_cams, false, true); - } -#endif - - /* A sync won't hurt us after mucking around with - * the MMU configuration - */ - mb(); -} - -static void __init early_init_mmu_global(void) -{ - /* XXX This should be decided at runtime based on supported - * page sizes in the TLB, but for now let's assume 16M is - * always there and a good fit (which it probably is) - * - * Freescale booke only supports 4K pages in TLB0, so use that. - */ - if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) - mmu_vmemmap_psize = MMU_PAGE_4K; - else - mmu_vmemmap_psize = MMU_PAGE_16M; - - /* XXX This code only checks for TLB 0 capabilities and doesn't - * check what page size combos are supported by the HW. It - * also doesn't handle the case where a separate array holds - * the IND entries from the array loaded by the PT. - */ - /* Look for supported page sizes */ - setup_page_sizes(); - - /* Look for HW tablewalk support */ - setup_mmu_htw(); - -#ifdef CONFIG_PPC_E500 - if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) { - if (book3e_htw_mode == PPC_HTW_NONE) { - extlb_level_exc = EX_TLB_SIZE; - patch_exception(0x1c0, exc_data_tlb_miss_bolted_book3e); - patch_exception(0x1e0, - exc_instruction_tlb_miss_bolted_book3e); - } - } -#endif - - /* Set the global containing the top of the linear mapping - * for use by the TLB miss code - */ - linear_map_top = memblock_end_of_DRAM(); - - ioremap_bot = IOREMAP_BASE; -} - -static void __init early_mmu_set_memory_limit(void) -{ -#ifdef CONFIG_PPC_E500 - if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) { - /* - * Limit memory so we dont have linear faults. - * Unlike memblock_set_current_limit, which limits - * memory available during early boot, this permanently - * reduces the memory available to Linux. We need to - * do this because highmem is not supported on 64-bit. - */ - memblock_enforce_memory_limit(linear_map_top); - } -#endif - - memblock_set_current_limit(linear_map_top); -} - -/* boot cpu only */ -void __init early_init_mmu(void) -{ - early_init_mmu_global(); - early_init_this_mmu(); - early_mmu_set_memory_limit(); -} - -void early_init_mmu_secondary(void) -{ - early_init_this_mmu(); -} - -void setup_initial_memory_limit(phys_addr_t first_memblock_base, - phys_addr_t first_memblock_size) -{ - /* On non-FSL Embedded 64-bit, we adjust the RMA size to match - * the bolted TLB entry. We know for now that only 1G - * entries are supported though that may eventually - * change. - * - * on FSL Embedded 64-bit, usually all RAM is bolted, but with - * unusual memory sizes it's possible for some RAM to not be mapped - * (such RAM is not used at all by Linux, since we don't support - * highmem on 64-bit). We limit ppc64_rma_size to what would be - * mappable if this memblock is the only one. Additional memblocks - * can only increase, not decrease, the amount that ends up getting - * mapped. We still limit max to 1G even if we'll eventually map - * more. This is due to what the early init code is set up to do. - * - * We crop it to the size of the first MEMBLOCK to - * avoid going over total available memory just in case... - */ -#ifdef CONFIG_PPC_E500 - if (early_mmu_has_feature(MMU_FTR_TYPE_FSL_E)) { - unsigned long linear_sz; - unsigned int num_cams; - - /* use a quarter of the TLBCAM for bolted linear map */ - num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4; - - linear_sz = map_mem_in_cams(first_memblock_size, num_cams, - true, true); - - ppc64_rma_size = min_t(u64, linear_sz, 0x40000000); - } else -#endif - ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000); - - /* Finally limit subsequent allocations */ - memblock_set_current_limit(first_memblock_base + ppc64_rma_size); -} -#else /* ! CONFIG_PPC64 */ +#ifndef CONFIG_PPC64 void __init early_init_mmu(void) { unsigned long root = of_get_flat_dt_root(); diff --git a/arch/powerpc/mm/nohash/tlb_64e.c b/arch/powerpc/mm/nohash/tlb_64e.c new file mode 100644 index 000000000000..113edf76d3ce --- /dev/null +++ b/arch/powerpc/mm/nohash/tlb_64e.c @@ -0,0 +1,314 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright 2008,2009 Ben Herrenschmidt + * IBM Corp. + * + * Derived from arch/ppc/mm/init.c: + * Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org) + * + * Modifications by Paul Mackerras (PowerMac) (paulus@cs.anu.edu.au) + * and Cort Dougan (PReP) (cort@cs.nmt.edu) + * Copyright (C) 1996 Paul Mackerras + * + * Derived from "arch/i386/mm/init.c" + * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds + */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include + +/* The variables below are currently only used on 64-bit Book3E + * though this will probably be made common with other nohash + * implementations at some point + */ +int mmu_pte_psize; /* Page size used for PTE pages */ +int mmu_vmemmap_psize; /* Page size used for the virtual mem map */ +int book3e_htw_mode; /* HW tablewalk? Value is PPC_HTW_* */ +unsigned long linear_map_top; /* Top of linear mapping */ + + +/* + * Number of bytes to add to SPRN_SPRG_TLB_EXFRAME on crit/mcheck/debug + * exceptions. This is used for bolted and e6500 TLB miss handlers which + * do not modify this SPRG in the TLB miss code; for other TLB miss handlers, + * this is set to zero. + */ +int extlb_level_exc; + +/* + * Handling of virtual linear page tables or indirect TLB entries + * flushing when PTE pages are freed + */ +void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address) +{ + int tsize = mmu_psize_defs[mmu_pte_psize].shift - 10; + + if (book3e_htw_mode != PPC_HTW_NONE) { + unsigned long start = address & PMD_MASK; + unsigned long end = address + PMD_SIZE; + unsigned long size = 1UL << mmu_psize_defs[mmu_pte_psize].shift; + + /* This isn't the most optimal, ideally we would factor out the + * while preempt & CPU mask mucking around, or even the IPI but + * it will do for now + */ + while (start < end) { + __flush_tlb_page(tlb->mm, start, tsize, 1); + start += size; + } + } else { + unsigned long rmask = 0xf000000000000000ul; + unsigned long rid = (address & rmask) | 0x1000000000000000ul; + unsigned long vpte = address & ~rmask; + + vpte = (vpte >> (PAGE_SHIFT - 3)) & ~0xffful; + vpte |= rid; + __flush_tlb_page(tlb->mm, vpte, tsize, 0); + } +} + +static void __init setup_page_sizes(void) +{ + unsigned int tlb0cfg; + unsigned int eptcfg; + int psize; + + unsigned int mmucfg = mfspr(SPRN_MMUCFG); + + if ((mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V1) { + unsigned int tlb1cfg = mfspr(SPRN_TLB1CFG); + unsigned int min_pg, max_pg; + + min_pg = (tlb1cfg & TLBnCFG_MINSIZE) >> TLBnCFG_MINSIZE_SHIFT; + max_pg = (tlb1cfg & TLBnCFG_MAXSIZE) >> TLBnCFG_MAXSIZE_SHIFT; + + for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { + struct mmu_psize_def *def; + unsigned int shift; + + def = &mmu_psize_defs[psize]; + shift = def->shift; + + if (shift == 0 || shift & 1) + continue; + + /* adjust to be in terms of 4^shift Kb */ + shift = (shift - 10) >> 1; + + if ((shift >= min_pg) && (shift <= max_pg)) + def->flags |= MMU_PAGE_SIZE_DIRECT; + } + + goto out; + } + + if ((mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2) { + u32 tlb1cfg, tlb1ps; + + tlb0cfg = mfspr(SPRN_TLB0CFG); + tlb1cfg = mfspr(SPRN_TLB1CFG); + tlb1ps = mfspr(SPRN_TLB1PS); + eptcfg = mfspr(SPRN_EPTCFG); + + if ((tlb1cfg & TLBnCFG_IND) && (tlb0cfg & TLBnCFG_PT)) + book3e_htw_mode = PPC_HTW_E6500; + + /* + * We expect 4K subpage size and unrestricted indirect size. + * The lack of a restriction on indirect size is a Freescale + * extension, indicated by PSn = 0 but SPSn != 0. + */ + if (eptcfg != 2) + book3e_htw_mode = PPC_HTW_NONE; + + for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { + struct mmu_psize_def *def = &mmu_psize_defs[psize]; + + if (!def->shift) + continue; + + if (tlb1ps & (1U << (def->shift - 10))) { + def->flags |= MMU_PAGE_SIZE_DIRECT; + + if (book3e_htw_mode && psize == MMU_PAGE_2M) + def->flags |= MMU_PAGE_SIZE_INDIRECT; + } + } + + goto out; + } +out: + /* Cleanup array and print summary */ + pr_info("MMU: Supported page sizes\n"); + for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) { + struct mmu_psize_def *def = &mmu_psize_defs[psize]; + const char *__page_type_names[] = { + "unsupported", + "direct", + "indirect", + "direct & indirect" + }; + if (def->flags == 0) { + def->shift = 0; + continue; + } + pr_info(" %8ld KB as %s\n", 1ul << (def->shift - 10), + __page_type_names[def->flags & 0x3]); + } +} + +/* + * Early initialization of the MMU TLB code + */ +static void early_init_this_mmu(void) +{ + unsigned int mas4; + + /* Set MAS4 based on page table setting */ + + mas4 = 0x4 << MAS4_WIMGED_SHIFT; + switch (book3e_htw_mode) { + case PPC_HTW_E6500: + mas4 |= MAS4_INDD; + mas4 |= BOOK3E_PAGESZ_2M << MAS4_TSIZED_SHIFT; + mas4 |= MAS4_TLBSELD(1); + mmu_pte_psize = MMU_PAGE_2M; + break; + + case PPC_HTW_NONE: + mas4 |= BOOK3E_PAGESZ_4K << MAS4_TSIZED_SHIFT; + mmu_pte_psize = mmu_virtual_psize; + break; + } + mtspr(SPRN_MAS4, mas4); + + unsigned int num_cams; + bool map = true; + + /* use a quarter of the TLBCAM for bolted linear map */ + num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4; + + /* + * Only do the mapping once per core, or else the + * transient mapping would cause problems. + */ +#ifdef CONFIG_SMP + if (hweight32(get_tensr()) > 1) + map = false; +#endif + + if (map) + linear_map_top = map_mem_in_cams(linear_map_top, + num_cams, false, true); + + /* A sync won't hurt us after mucking around with + * the MMU configuration + */ + mb(); +} + +static void __init early_init_mmu_global(void) +{ + /* + * Freescale booke only supports 4K pages in TLB0, so use that. + */ + mmu_vmemmap_psize = MMU_PAGE_4K; + + /* XXX This code only checks for TLB 0 capabilities and doesn't + * check what page size combos are supported by the HW. It + * also doesn't handle the case where a separate array holds + * the IND entries from the array loaded by the PT. + */ + /* Look for supported page sizes */ + setup_page_sizes(); + + /* + * If we want to use HW tablewalk, enable it by patching the TLB miss + * handlers to branch to the one dedicated to it. + */ + extlb_level_exc = EX_TLB_SIZE; + switch (book3e_htw_mode) { + case PPC_HTW_E6500: + patch_exception(0x1c0, exc_data_tlb_miss_e6500_book3e); + patch_exception(0x1e0, exc_instruction_tlb_miss_e6500_book3e); + break; + } + + pr_info("MMU: Book3E HW tablewalk %s\n", + book3e_htw_mode != PPC_HTW_NONE ? "enabled" : "not supported"); + + /* Set the global containing the top of the linear mapping + * for use by the TLB miss code + */ + linear_map_top = memblock_end_of_DRAM(); + + ioremap_bot = IOREMAP_BASE; +} + +static void __init early_mmu_set_memory_limit(void) +{ + /* + * Limit memory so we dont have linear faults. + * Unlike memblock_set_current_limit, which limits + * memory available during early boot, this permanently + * reduces the memory available to Linux. We need to + * do this because highmem is not supported on 64-bit. + */ + memblock_enforce_memory_limit(linear_map_top); + + memblock_set_current_limit(linear_map_top); +} + +/* boot cpu only */ +void __init early_init_mmu(void) +{ + early_init_mmu_global(); + early_init_this_mmu(); + early_mmu_set_memory_limit(); +} + +void early_init_mmu_secondary(void) +{ + early_init_this_mmu(); +} + +void setup_initial_memory_limit(phys_addr_t first_memblock_base, + phys_addr_t first_memblock_size) +{ + /* + * On FSL Embedded 64-bit, usually all RAM is bolted, but with + * unusual memory sizes it's possible for some RAM to not be mapped + * (such RAM is not used at all by Linux, since we don't support + * highmem on 64-bit). We limit ppc64_rma_size to what would be + * mappable if this memblock is the only one. Additional memblocks + * can only increase, not decrease, the amount that ends up getting + * mapped. We still limit max to 1G even if we'll eventually map + * more. This is due to what the early init code is set up to do. + * + * We crop it to the size of the first MEMBLOCK to + * avoid going over total available memory just in case... + */ + unsigned long linear_sz; + unsigned int num_cams; + + /* use a quarter of the TLBCAM for bolted linear map */ + num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4; + + linear_sz = map_mem_in_cams(first_memblock_size, num_cams, true, true); + ppc64_rma_size = min_t(u64, linear_sz, 0x40000000); + + /* Finally limit subsequent allocations */ + memblock_set_current_limit(first_memblock_base + ppc64_rma_size); +} diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S index 7e0b8fe1c279..de568297d5c5 100644 --- a/arch/powerpc/mm/nohash/tlb_low_64e.S +++ b/arch/powerpc/mm/nohash/tlb_low_64e.S @@ -450,11 +450,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_SMT) tlb_miss_huge_e6500: beq tlb_miss_fault_e6500 - li r10,1 - andi. r15,r14,HUGEPD_SHIFT_MASK@l /* r15 = psize */ - rldimi r14,r10,63,0 /* Set PD_HUGE */ - xor r14,r14,r15 /* Clear size bits */ - ldx r14,0,r14 + rlwinm r15,r14,32-_PAGE_PSIZE_SHIFT,0x1e /* * Now we build the MAS for a huge page. @@ -465,7 +461,6 @@ tlb_miss_huge_e6500: * MAS 2,3+7: Needs to be redone similar to non-tablewalk handler */ - subi r15,r15,10 /* Convert psize to tsize */ mfspr r10,SPRN_MAS1 rlwinm r10,r10,0,~MAS1_IND rlwimi r10,r15,MAS1_TSIZE_SHIFT,MAS1_TSIZE_MASK @@ -511,232 +506,6 @@ itlb_miss_fault_e6500: tlb_epilog_bolted b exc_instruction_storage_book3e -/********************************************************************** - * * - * TLB miss handling for Book3E with TLB reservation and HES support * - * * - **********************************************************************/ - - -/* Data TLB miss */ - START_EXCEPTION(data_tlb_miss) - TLB_MISS_PROLOG - - /* Now we handle the fault proper. We only save DEAR in normal - * fault case since that's the only interesting values here. - * We could probably also optimize by not saving SRR0/1 in the - * linear mapping case but I'll leave that for later - */ - mfspr r14,SPRN_ESR - mfspr r16,SPRN_DEAR /* get faulting address */ - srdi r15,r16,44 /* get region */ - xoris r15,r15,0xc - cmpldi cr0,r15,0 /* linear mapping ? */ - beq tlb_load_linear /* yes -> go to linear map load */ - cmpldi cr1,r15,1 /* vmalloc mapping ? */ - - /* The page tables are mapped virtually linear. At this point, though, - * we don't know whether we are trying to fault in a first level - * virtual address or a virtual page table address. We can get that - * from bit 0x1 of the region ID which we have set for a page table - */ - andis. r10,r15,0x1 - bne- virt_page_table_tlb_miss - - std r14,EX_TLB_ESR(r12); /* save ESR */ - std r16,EX_TLB_DEAR(r12); /* save DEAR */ - - /* We need _PAGE_PRESENT and _PAGE_ACCESSED set */ - li r11,_PAGE_PRESENT - oris r11,r11,_PAGE_ACCESSED@h - - /* We do the user/kernel test for the PID here along with the RW test - */ - srdi. r15,r16,60 /* Check for user region */ - - /* We pre-test some combination of permissions to avoid double - * faults: - * - * We move the ESR:ST bit into the position of _PAGE_BAP_SW in the PTE - * ESR_ST is 0x00800000 - * _PAGE_BAP_SW is 0x00000010 - * So the shift is >> 19. This tests for supervisor writeability. - * If the page happens to be supervisor writeable and not user - * writeable, we will take a new fault later, but that should be - * a rare enough case. - * - * We also move ESR_ST in _PAGE_DIRTY position - * _PAGE_DIRTY is 0x00001000 so the shift is >> 11 - * - * MAS1 is preset for all we need except for TID that needs to - * be cleared for kernel translations - */ - rlwimi r11,r14,32-19,27,27 - rlwimi r11,r14,32-16,19,19 - beq normal_tlb_miss_user - /* XXX replace the RMW cycles with immediate loads + writes */ -1: mfspr r10,SPRN_MAS1 - rlwinm r10,r10,0,16,1 /* Clear TID */ - mtspr SPRN_MAS1,r10 - beq+ cr1,normal_tlb_miss - - /* We got a crappy address, just fault with whatever DEAR and ESR - * are here - */ - TLB_MISS_EPILOG_ERROR - b exc_data_storage_book3e - -/* Instruction TLB miss */ - START_EXCEPTION(instruction_tlb_miss) - TLB_MISS_PROLOG - - /* If we take a recursive fault, the second level handler may need - * to know whether we are handling a data or instruction fault in - * order to get to the right store fault handler. We provide that - * info by writing a crazy value in ESR in our exception frame - */ - li r14,-1 /* store to exception frame is done later */ - - /* Now we handle the fault proper. We only save DEAR in the non - * linear mapping case since we know the linear mapping case will - * not re-enter. We could indeed optimize and also not save SRR0/1 - * in the linear mapping case but I'll leave that for later - * - * Faulting address is SRR0 which is already in r16 - */ - srdi r15,r16,44 /* get region */ - xoris r15,r15,0xc - cmpldi cr0,r15,0 /* linear mapping ? */ - beq tlb_load_linear /* yes -> go to linear map load */ - cmpldi cr1,r15,1 /* vmalloc mapping ? */ - - /* We do the user/kernel test for the PID here along with the RW test - */ - li r11,_PAGE_PRESENT|_PAGE_BAP_UX /* Base perm */ - oris r11,r11,_PAGE_ACCESSED@h - - srdi. r15,r16,60 /* Check for user region */ - std r14,EX_TLB_ESR(r12) /* write crazy -1 to frame */ - beq normal_tlb_miss_user - - li r11,_PAGE_PRESENT|_PAGE_BAP_SX /* Base perm */ - oris r11,r11,_PAGE_ACCESSED@h - /* XXX replace the RMW cycles with immediate loads + writes */ - mfspr r10,SPRN_MAS1 - rlwinm r10,r10,0,16,1 /* Clear TID */ - mtspr SPRN_MAS1,r10 - beq+ cr1,normal_tlb_miss - - /* We got a crappy address, just fault */ - TLB_MISS_EPILOG_ERROR - b exc_instruction_storage_book3e - -/* - * This is the guts of the first-level TLB miss handler for direct - * misses. We are entered with: - * - * r16 = faulting address - * r15 = region ID - * r14 = crap (free to use) - * r13 = PACA - * r12 = TLB exception frame in PACA - * r11 = PTE permission mask - * r10 = crap (free to use) - */ -normal_tlb_miss_user: -#ifdef CONFIG_PPC_KUAP - mfspr r14,SPRN_MAS1 - rlwinm. r14,r14,0,0x3fff0000 - beq- normal_tlb_miss_access_fault /* KUAP fault */ -#endif -normal_tlb_miss: - /* So we first construct the page table address. We do that by - * shifting the bottom of the address (not the region ID) by - * PAGE_SHIFT-3, clearing the bottom 3 bits (get a PTE ptr) and - * or'ing the fourth high bit. - * - * NOTE: For 64K pages, we do things slightly differently in - * order to handle the weird page table format used by linux - */ - srdi r15,r16,44 - oris r10,r15,0x1 - rldicl r14,r16,64-(PAGE_SHIFT-3),PAGE_SHIFT-3+4 - sldi r15,r10,44 - clrrdi r14,r14,19 - or r10,r15,r14 - - ld r14,0(r10) - -finish_normal_tlb_miss: - /* Check if required permissions are met */ - andc. r15,r11,r14 - bne- normal_tlb_miss_access_fault - - /* Now we build the MAS: - * - * MAS 0 : Fully setup with defaults in MAS4 and TLBnCFG - * MAS 1 : Almost fully setup - * - PID already updated by caller if necessary - * - TSIZE need change if !base page size, not - * yet implemented for now - * MAS 2 : Defaults not useful, need to be redone - * MAS 3+7 : Needs to be done - * - * TODO: mix up code below for better scheduling - */ - clrrdi r10,r16,12 /* Clear low crap in EA */ - rlwimi r10,r14,32-19,27,31 /* Insert WIMGE */ - mtspr SPRN_MAS2,r10 - - /* Check page size, if not standard, update MAS1 */ - rldicl r10,r14,64-8,64-8 - cmpldi cr0,r10,BOOK3E_PAGESZ_4K - beq- 1f - mfspr r11,SPRN_MAS1 - rlwimi r11,r14,31,21,24 - rlwinm r11,r11,0,21,19 - mtspr SPRN_MAS1,r11 -1: - /* Move RPN in position */ - rldicr r11,r14,64-(PTE_RPN_SHIFT-PAGE_SHIFT),63-PAGE_SHIFT - clrldi r15,r11,12 /* Clear crap at the top */ - rlwimi r15,r14,32-8,22,25 /* Move in U bits */ - rlwimi r15,r14,32-2,26,31 /* Move in BAP bits */ - - /* Mask out SW and UW if !DIRTY (XXX optimize this !) */ - andi. r11,r14,_PAGE_DIRTY - bne 1f - li r11,MAS3_SW|MAS3_UW - andc r15,r15,r11 -1: - srdi r16,r15,32 - mtspr SPRN_MAS3,r15 - mtspr SPRN_MAS7,r16 - - tlbwe - -normal_tlb_miss_done: - /* We don't bother with restoring DEAR or ESR since we know we are - * level 0 and just going back to userland. They are only needed - * if you are going to take an access fault - */ - TLB_MISS_EPILOG_SUCCESS - rfi - -normal_tlb_miss_access_fault: - /* We need to check if it was an instruction miss */ - andi. r10,r11,_PAGE_BAP_UX - bne 1f - ld r14,EX_TLB_DEAR(r12) - ld r15,EX_TLB_ESR(r12) - mtspr SPRN_DEAR,r14 - mtspr SPRN_ESR,r15 - TLB_MISS_EPILOG_ERROR - b exc_data_storage_book3e -1: TLB_MISS_EPILOG_ERROR - b exc_instruction_storage_book3e - - /* * This is the guts of the second-level TLB miss handler for direct * misses. We are entered with: @@ -893,201 +662,6 @@ virt_page_table_tlb_miss_whacko_fault: TLB_MISS_EPILOG_ERROR b exc_data_storage_book3e - -/************************************************************** - * * - * TLB miss handling for Book3E with hw page table support * - * * - **************************************************************/ - - -/* Data TLB miss */ - START_EXCEPTION(data_tlb_miss_htw) - TLB_MISS_PROLOG - - /* Now we handle the fault proper. We only save DEAR in normal - * fault case since that's the only interesting values here. - * We could probably also optimize by not saving SRR0/1 in the - * linear mapping case but I'll leave that for later - */ - mfspr r14,SPRN_ESR - mfspr r16,SPRN_DEAR /* get faulting address */ - srdi r11,r16,44 /* get region */ - xoris r11,r11,0xc - cmpldi cr0,r11,0 /* linear mapping ? */ - beq tlb_load_linear /* yes -> go to linear map load */ - cmpldi cr1,r11,1 /* vmalloc mapping ? */ - - /* We do the user/kernel test for the PID here along with the RW test - */ - srdi. r11,r16,60 /* Check for user region */ - ld r15,PACAPGD(r13) /* Load user pgdir */ - beq htw_tlb_miss - - /* XXX replace the RMW cycles with immediate loads + writes */ -1: mfspr r10,SPRN_MAS1 - rlwinm r10,r10,0,16,1 /* Clear TID */ - mtspr SPRN_MAS1,r10 - ld r15,PACA_KERNELPGD(r13) /* Load kernel pgdir */ - beq+ cr1,htw_tlb_miss - - /* We got a crappy address, just fault with whatever DEAR and ESR - * are here - */ - TLB_MISS_EPILOG_ERROR - b exc_data_storage_book3e - -/* Instruction TLB miss */ - START_EXCEPTION(instruction_tlb_miss_htw) - TLB_MISS_PROLOG - - /* If we take a recursive fault, the second level handler may need - * to know whether we are handling a data or instruction fault in - * order to get to the right store fault handler. We provide that - * info by keeping a crazy value for ESR in r14 - */ - li r14,-1 /* store to exception frame is done later */ - - /* Now we handle the fault proper. We only save DEAR in the non - * linear mapping case since we know the linear mapping case will - * not re-enter. We could indeed optimize and also not save SRR0/1 - * in the linear mapping case but I'll leave that for later - * - * Faulting address is SRR0 which is already in r16 - */ - srdi r11,r16,44 /* get region */ - xoris r11,r11,0xc - cmpldi cr0,r11,0 /* linear mapping ? */ - beq tlb_load_linear /* yes -> go to linear map load */ - cmpldi cr1,r11,1 /* vmalloc mapping ? */ - - /* We do the user/kernel test for the PID here along with the RW test - */ - srdi. r11,r16,60 /* Check for user region */ - ld r15,PACAPGD(r13) /* Load user pgdir */ - beq htw_tlb_miss - - /* XXX replace the RMW cycles with immediate loads + writes */ -1: mfspr r10,SPRN_MAS1 - rlwinm r10,r10,0,16,1 /* Clear TID */ - mtspr SPRN_MAS1,r10 - ld r15,PACA_KERNELPGD(r13) /* Load kernel pgdir */ - beq+ htw_tlb_miss - - /* We got a crappy address, just fault */ - TLB_MISS_EPILOG_ERROR - b exc_instruction_storage_book3e - - -/* - * This is the guts of the second-level TLB miss handler for direct - * misses. We are entered with: - * - * r16 = virtual page table faulting address - * r15 = PGD pointer - * r14 = ESR - * r13 = PACA - * r12 = TLB exception frame in PACA - * r11 = crap (free to use) - * r10 = crap (free to use) - * - * It can be re-entered by the linear mapping miss handler. However, to - * avoid too much complication, it will save/restore things for us - */ -htw_tlb_miss: -#ifdef CONFIG_PPC_KUAP - mfspr r10,SPRN_MAS1 - rlwinm. r10,r10,0,0x3fff0000 - beq- htw_tlb_miss_fault /* KUAP fault */ -#endif - /* Search if we already have a TLB entry for that virtual address, and - * if we do, bail out. - * - * MAS1:IND should be already set based on MAS4 - */ - PPC_TLBSRX_DOT(0,R16) - beq htw_tlb_miss_done - - /* Now, we need to walk the page tables. First check if we are in - * range. - */ - rldicl. r10,r16,64-PGTABLE_EADDR_SIZE,PGTABLE_EADDR_SIZE+4 - bne- htw_tlb_miss_fault - - /* Get the PGD pointer */ - cmpldi cr0,r15,0 - beq- htw_tlb_miss_fault - - /* Get to PGD entry */ - rldicl r11,r16,64-(PGDIR_SHIFT-3),64-PGD_INDEX_SIZE-3 - clrrdi r10,r11,3 - ldx r15,r10,r15 - cmpdi cr0,r15,0 - bge htw_tlb_miss_fault - - /* Get to PUD entry */ - rldicl r11,r16,64-(PUD_SHIFT-3),64-PUD_INDEX_SIZE-3 - clrrdi r10,r11,3 - ldx r15,r10,r15 - cmpdi cr0,r15,0 - bge htw_tlb_miss_fault - - /* Get to PMD entry */ - rldicl r11,r16,64-(PMD_SHIFT-3),64-PMD_INDEX_SIZE-3 - clrrdi r10,r11,3 - ldx r15,r10,r15 - cmpdi cr0,r15,0 - bge htw_tlb_miss_fault - - /* Ok, we're all right, we can now create an indirect entry for - * a 1M or 256M page. - * - * The last trick is now that because we use "half" pages for - * the HTW (1M IND is 2K and 256M IND is 32K) we need to account - * for an added LSB bit to the RPN. For 64K pages, there is no - * problem as we already use 32K arrays (half PTE pages), but for - * 4K page we need to extract a bit from the virtual address and - * insert it into the "PA52" bit of the RPN. - */ - rlwimi r15,r16,32-9,20,20 - /* Now we build the MAS: - * - * MAS 0 : Fully setup with defaults in MAS4 and TLBnCFG - * MAS 1 : Almost fully setup - * - PID already updated by caller if necessary - * - TSIZE for now is base ind page size always - * MAS 2 : Use defaults - * MAS 3+7 : Needs to be done - */ - ori r10,r15,(BOOK3E_PAGESZ_4K << MAS3_SPSIZE_SHIFT) - - srdi r16,r10,32 - mtspr SPRN_MAS3,r10 - mtspr SPRN_MAS7,r16 - - tlbwe - -htw_tlb_miss_done: - /* We don't bother with restoring DEAR or ESR since we know we are - * level 0 and just going back to userland. They are only needed - * if you are going to take an access fault - */ - TLB_MISS_EPILOG_SUCCESS - rfi - -htw_tlb_miss_fault: - /* We need to check if it was an instruction miss. We know this - * though because r14 would contain -1 - */ - cmpdi cr0,r14,-1 - beq 1f - mtspr SPRN_DEAR,r16 - mtspr SPRN_ESR,r14 - TLB_MISS_EPILOG_ERROR - b exc_data_storage_book3e -1: TLB_MISS_EPILOG_ERROR - b exc_instruction_storage_book3e - /* * This is the guts of "any" level TLB miss handler for kernel linear * mapping misses. We are entered with: diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 9e7ba9c3851f..ab0656115424 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -297,11 +297,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma, } #if defined(CONFIG_PPC_8xx) -void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, - pte_t pte, unsigned long sz) +static void __set_huge_pte_at(pmd_t *pmd, pte_t *ptep, pte_basic_t val) { - pmd_t *pmd = pmd_off(mm, addr); - pte_basic_t val; pte_basic_t *entry = (pte_basic_t *)ptep; int num, i; @@ -311,15 +308,60 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, */ VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); - pte = set_pte_filter(pte, addr); - - val = pte_val(pte); - num = number_of_cells_per_pte(pmd, val, 1); for (i = 0; i < num; i++, entry++, val += SZ_4K) *entry = val; } + +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte, unsigned long sz) +{ + pmd_t *pmdp = pmd_off(mm, addr); + + pte = set_pte_filter(pte, addr); + + if (sz == SZ_8M) { /* Flag both PMD entries as 8M and fill both page tables */ + *pmdp = __pmd(pmd_val(*pmdp) | _PMD_PAGE_8M); + *(pmdp + 1) = __pmd(pmd_val(*(pmdp + 1)) | _PMD_PAGE_8M); + + __set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte)); + __set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M); + } else { + __set_huge_pte_at(pmdp, ptep, pte_val(pte)); + } +} +#else +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, + pte_t pte, unsigned long sz) +{ + unsigned long pdsize; + int i; + + pte = set_pte_filter(pte, addr); + + /* + * Make sure hardware valid bit is not set. We don't do + * tlb flush for this update. + */ + VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); + + if (sz < PMD_SIZE) + pdsize = PAGE_SIZE; + else if (sz < PUD_SIZE) + pdsize = PMD_SIZE; + else if (sz < P4D_SIZE) + pdsize = PUD_SIZE; + else if (sz < PGDIR_SIZE) + pdsize = P4D_SIZE; + else + pdsize = PGDIR_SIZE; + + for (i = 0; i < sz / pdsize; i++, ptep++, addr += pdsize) { + __set_pte_at(mm, addr, ptep, pte, 0); + pte = __pte(pte_val(pte) + ((unsigned long long)pdsize / PAGE_SIZE << PFN_PTE_SHIFT)); + } +} #endif #endif /* CONFIG_HUGETLB_PAGE */ @@ -367,11 +409,10 @@ unsigned long vmalloc_to_phys(void *va) EXPORT_SYMBOL_GPL(vmalloc_to_phys); /* - * We have 4 cases for pgds and pmds: + * We have 3 cases for pgds and pmds: * (1) invalid (all zeroes) * (2) pointer to next table, as normal; bottom 6 bits == 0 * (3) leaf pte for huge page _PAGE_PTE set - * (4) hugepd pointer, _PAGE_PTE = 0 and bits [2..6] indicate size of table * * So long as we atomically load page table pointers we are safe against teardown, * we can follow the address down to the page and take a ref on it. @@ -382,11 +423,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, bool *is_thp, unsigned *hpage_shift) { pgd_t *pgdp; +#ifdef CONFIG_PPC64 p4d_t p4d, *p4dp; pud_t pud, *pudp; +#endif pmd_t pmd, *pmdp; pte_t *ret_pte; - hugepd_t *hpdp = NULL; unsigned pdshift; if (hpage_shift) @@ -401,8 +443,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, * page fault or a page unmap. The return pte_t * is still not * stable. So should be checked there for above conditions. * Top level is an exception because it is folded into p4d. + * + * On PPC32, P4D/PUD/PMD are folded into PGD so go straight to + * PMD level. */ pgdp = pgdir + pgd_index(ea); +#ifdef CONFIG_PPC64 p4dp = p4d_offset(pgdp, ea); p4d = READ_ONCE(*p4dp); pdshift = P4D_SHIFT; @@ -415,11 +461,6 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, goto out; } - if (is_hugepd(__hugepd(p4d_val(p4d)))) { - hpdp = (hugepd_t *)&p4d; - goto out_huge; - } - /* * Even if we end up with an unmap, the pgtable will not * be freed, because we do an rcu free and here we are @@ -437,13 +478,11 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, goto out; } - if (is_hugepd(__hugepd(pud_val(pud)))) { - hpdp = (hugepd_t *)&pud; - goto out_huge; - } - - pdshift = PMD_SHIFT; pmdp = pmd_offset(&pud, ea); +#else + pmdp = pmd_offset(pud_offset(p4d_offset(pgdp, ea), ea), ea); +#endif + pdshift = PMD_SHIFT; pmd = READ_ONCE(*pmdp); /* @@ -476,19 +515,8 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea, goto out; } - if (is_hugepd(__hugepd(pmd_val(pmd)))) { - hpdp = (hugepd_t *)&pmd; - goto out_huge; - } - return pte_offset_kernel(&pmd, ea); -out_huge: - if (!hpdp) - return NULL; - - ret_pte = hugepte_offset(*hpdp, ea, pdshift); - pdshift = hugepd_shift(*hpdp); out: if (hpage_shift) *hpage_shift = pdshift; diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index cfd622ebf774..787b22206386 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -48,7 +48,7 @@ notrace void __init early_ioremap_init(void) early_ioremap_setup(); } -static void __init *early_alloc_pgtable(unsigned long size) +void __init *early_alloc_pgtable(unsigned long size) { void *ptr = memblock_alloc(size, size); diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h index b1ce97a9dbfc..faf3624d8057 100644 --- a/arch/riscv/include/asm/hugetlb.h +++ b/arch/riscv/include/asm/hugetlb.h @@ -44,7 +44,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma, pte_t pte, int dirty); #define __HAVE_ARCH_HUGE_PTEP_GET -pte_t huge_ptep_get(pte_t *ptep); +pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep); pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags); #define arch_make_huge_pte arch_make_huge_pte diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index ab7a759e1a8c..089f3c9f56a3 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -514,8 +514,8 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, #define update_mmu_cache(vma, addr, ptep) \ update_mmu_cache_range(NULL, vma, addr, ptep, 1) -#define __HAVE_ARCH_UPDATE_MMU_TLB -#define update_mmu_tlb update_mmu_cache +#define update_mmu_tlb_range(vma, addr, ptep, nr) \ + update_mmu_cache_range(NULL, vma, addr, ptep, nr) static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c index 0ebd968b33c9..42314f093922 100644 --- a/arch/riscv/mm/hugetlbpage.c +++ b/arch/riscv/mm/hugetlbpage.c @@ -3,7 +3,7 @@ #include #ifdef CONFIG_RISCV_ISA_SVNAPOT -pte_t huge_ptep_get(pte_t *ptep) +pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { unsigned long pte_num; int i; diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 371c2bf88149..59e0d861e26f 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -159,6 +159,7 @@ config S390 select HAVE_ARCH_KASAN select HAVE_ARCH_KASAN_VMALLOC select HAVE_ARCH_KCSAN + select HAVE_ARCH_KMSAN select HAVE_ARCH_KFENCE select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER diff --git a/arch/s390/Makefile b/arch/s390/Makefile index f2b21c7a70ef..7fd57398221e 100644 --- a/arch/s390/Makefile +++ b/arch/s390/Makefile @@ -36,7 +36,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds) UTS_MACHINE := s390x -STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384) +STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384)) CHECKFLAGS += -D__s390__ -D__s390x__ export LD_BFD diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile index 070c9b2e905f..e7658997452b 100644 --- a/arch/s390/boot/Makefile +++ b/arch/s390/boot/Makefile @@ -3,11 +3,13 @@ # Makefile for the linux s390-specific parts of the memory manager. # +# Tooling runtimes are unavailable and cannot be linked for early boot code KCOV_INSTRUMENT := n GCOV_PROFILE := n UBSAN_SANITIZE := n KASAN_SANITIZE := n KCSAN_SANITIZE := n +KMSAN_SANITIZE := n KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR) KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR) @@ -42,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) $(CONFIG_PGSTE)) += obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o obj-y += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o +obj-$(CONFIG_KMSAN) += kmsan.o obj-all := $(obj-y) piggy.o syms.o targets := bzImage section_cmp.boot.data section_cmp.boot.preserved.data $(obj-y) diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c new file mode 100644 index 000000000000..e7b3ac48143e --- /dev/null +++ b/arch/s390/boot/kmsan.c @@ -0,0 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 +#include + +void kmsan_unpoison_memory(const void *address, size_t size) +{ +} diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c index 320c964cd603..c59014945af0 100644 --- a/arch/s390/boot/startup.c +++ b/arch/s390/boot/startup.c @@ -304,11 +304,18 @@ static unsigned long setup_kernel_memory_layout(unsigned long kernel_size) MODULES_END = round_down(kernel_start, _SEGMENT_SIZE); MODULES_VADDR = MODULES_END - MODULES_LEN; VMALLOC_END = MODULES_VADDR; + if (IS_ENABLED(CONFIG_KMSAN)) + VMALLOC_END -= MODULES_LEN * 2; /* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */ vsize = (VMALLOC_END - FIXMAP_SIZE) / 2; vsize = round_down(vsize, _SEGMENT_SIZE); vmalloc_size = min(vmalloc_size, vsize); + if (IS_ENABLED(CONFIG_KMSAN)) { + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ + vmalloc_size = round_down(vmalloc_size / 3, _SEGMENT_SIZE); + VMALLOC_END -= vmalloc_size * 2; + } VMALLOC_START = VMALLOC_END - vmalloc_size; __memcpy_real_area = round_down(VMALLOC_START - MEMCPY_REAL_SIZE, PAGE_SIZE); diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c index faccb33b462c..f6b9b1df48a8 100644 --- a/arch/s390/boot/string.c +++ b/arch/s390/boot/string.c @@ -1,11 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 +#define IN_BOOT_STRING_C 1 #include #include #include #undef CONFIG_KASAN #undef CONFIG_KASAN_GENERIC +#undef CONFIG_KMSAN #include "../lib/string.c" +/* + * Duplicate some functions from the common lib/string.c + * instead of fully including it. + */ + int strncmp(const char *cs, const char *ct, size_t count) { unsigned char c1, c2; @@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count) return 0; } +void *memset64(uint64_t *s, uint64_t v, size_t count) +{ + uint64_t *xs = s; + + while (count--) + *xs++ = v; + return s; +} + char *skip_spaces(const char *str) { while (isspace(*str)) diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h index b89159591ca0..46f5c9660616 100644 --- a/arch/s390/include/asm/checksum.h +++ b/arch/s390/include/asm/checksum.h @@ -13,6 +13,7 @@ #define _S390_CHECKSUM_H #include +#include #include static inline __wsum cksm(const void *buff, int len, __wsum sum) @@ -23,6 +24,7 @@ static inline __wsum cksm(const void *buff, int len, __wsum sum) }; instrument_read(buff, len); + kmsan_check_memory(buff, len); asm volatile("\n" "0: cksm %[sum],%[rp]\n" " jo 0b\n" diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h index c786538e397c..dae8843b164f 100644 --- a/arch/s390/include/asm/cpacf.h +++ b/arch/s390/include/asm/cpacf.h @@ -12,6 +12,7 @@ #define _ASM_S390_CPACF_H #include +#include /* * Instruction opcodes for the CPACF instructions @@ -542,6 +543,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long ucbuf_len, : [ucbuf] "+&d" (u.pair), [cbuf] "+&d" (c.pair) : [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO) : "cc", "memory", "0"); + kmsan_unpoison_memory(ucbuf, ucbuf_len); + kmsan_unpoison_memory(cbuf, cbuf_len); } /** diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h index a0de5b9b02ea..9e4bbc3e53f8 100644 --- a/arch/s390/include/asm/cpu_mf.h +++ b/arch/s390/include/asm/cpu_mf.h @@ -10,6 +10,7 @@ #define _ASM_S390_CPU_MF_H #include +#include #include #include @@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, u64 range, u64 *dest) : "=d" (cc) : "Q" (*dest), "d" (range), "i" (set) : "cc", "memory"); + /* + * If cc == 2, less than RANGE counters are stored, but it's not easy + * to tell how many. Always unpoison the whole range for simplicity. + */ + kmsan_unpoison_memory(dest, range * sizeof(u64)); return cc; } diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h index ce5f4fe8be4d..cf1b5d6fb1a6 100644 --- a/arch/s390/include/asm/hugetlb.h +++ b/arch/s390/include/asm/hugetlb.h @@ -19,7 +19,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned long sz); void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); -pte_t huge_ptep_get(pte_t *ptep); +pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep); pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep); @@ -64,7 +64,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t pte, int dirty) { - int changed = !pte_same(huge_ptep_get(ptep), pte); + int changed = !pte_same(huge_ptep_get(vma->vm_mm, addr, ptep), pte); if (changed) { huge_ptep_get_and_clear(vma->vm_mm, addr, ptep); __set_huge_pte_at(vma->vm_mm, addr, ptep, pte); diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h index 02427b205c11..bcab456dfb80 100644 --- a/arch/s390/include/asm/irqflags.h +++ b/arch/s390/include/asm/irqflags.h @@ -37,12 +37,18 @@ static __always_inline void __arch_local_irq_ssm(unsigned long flags) asm volatile("ssm %0" : : "Q" (flags) : "memory"); } -static __always_inline unsigned long arch_local_save_flags(void) +#ifdef CONFIG_KMSAN +#define arch_local_irq_attributes noinline notrace __no_sanitize_memory __maybe_unused +#else +#define arch_local_irq_attributes __always_inline +#endif + +static arch_local_irq_attributes unsigned long arch_local_save_flags(void) { return __arch_local_irq_stnsm(0xff); } -static __always_inline unsigned long arch_local_irq_save(void) +static arch_local_irq_attributes unsigned long arch_local_irq_save(void) { return __arch_local_irq_stnsm(0xfc); } @@ -52,7 +58,12 @@ static __always_inline void arch_local_irq_disable(void) arch_local_irq_save(); } -static __always_inline void arch_local_irq_enable(void) +static arch_local_irq_attributes void arch_local_irq_enable_external(void) +{ + __arch_local_irq_stosm(0x01); +} + +static arch_local_irq_attributes void arch_local_irq_enable(void) { __arch_local_irq_stosm(0x03); } diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h new file mode 100644 index 000000000000..27db65fbf3f6 --- /dev/null +++ b/arch/s390/include/asm/kmsan.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_S390_KMSAN_H +#define _ASM_S390_KMSAN_H + +#include +#include +#include +#include +#include + +#ifndef MODULE + +static inline bool is_lowcore_addr(void *addr) +{ + return addr >= (void *)&S390_lowcore && + addr < (void *)(&S390_lowcore + 1); +} + +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin) +{ + if (is_lowcore_addr(addr)) { + /* + * Different lowcores accessed via S390_lowcore are described + * by the same struct page. Resolve the prefix manually in + * order to get a distinct struct page. + */ + addr += (void *)lowcore_ptr[raw_smp_processor_id()] - + (void *)&S390_lowcore; + if (KMSAN_WARN_ON(is_lowcore_addr(addr))) + return NULL; + return kmsan_get_metadata(addr, is_origin); + } + return NULL; +} + +static inline bool kmsan_virt_addr_valid(void *addr) +{ + bool ret; + + /* + * pfn_valid() relies on RCU, and may call into the scheduler on exiting + * the critical section. However, this would result in recursion with + * KMSAN. Therefore, disable preemption here, and re-enable preemption + * below while suppressing reschedules to avoid recursion. + * + * Note, this sacrifices occasionally breaking scheduling guarantees. + * Although, a kernel compiled with KMSAN has already given up on any + * performance guarantees due to being heavily instrumented. + */ + preempt_disable(); + ret = virt_addr_valid(addr); + preempt_enable_no_resched(); + + return ret; +} + +#endif /* !MODULE */ + +#endif /* _ASM_S390_KMSAN_H */ diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index b5632dbe5438..3fa280d0672a 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -107,6 +107,18 @@ static inline int is_module_addr(void *addr) return 1; } +#ifdef CONFIG_KMSAN +#define KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START) +#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END +#define KMSAN_VMALLOC_SHADOW_END (KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_VMALLOC_ORIGIN_START KMSAN_VMALLOC_SHADOW_END +#define KMSAN_VMALLOC_ORIGIN_END (KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_SHADOW_START KMSAN_VMALLOC_ORIGIN_END +#define KMSAN_MODULES_SHADOW_END (KMSAN_MODULES_SHADOW_START + MODULES_LEN) +#define KMSAN_MODULES_ORIGIN_START KMSAN_MODULES_SHADOW_END +#define KMSAN_MODULES_ORIGIN_END (KMSAN_MODULES_ORIGIN_START + MODULES_LEN) +#endif + #ifdef CONFIG_RANDOMIZE_BASE #define KASLR_LEN (1UL << 31) #else diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h index 351685de53d2..2ab868cbae6c 100644 --- a/arch/s390/include/asm/string.h +++ b/arch/s390/include/asm/string.h @@ -15,15 +15,12 @@ #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */ #define __HAVE_ARCH_MEMMOVE /* gcc builtin & arch function */ #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */ -#define __HAVE_ARCH_MEMSET16 /* arch function */ -#define __HAVE_ARCH_MEMSET32 /* arch function */ -#define __HAVE_ARCH_MEMSET64 /* arch function */ void *memcpy(void *dest, const void *src, size_t n); void *memset(void *s, int c, size_t n); void *memmove(void *dest, const void *src, size_t n); -#ifndef CONFIG_KASAN +#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) #define __HAVE_ARCH_MEMCHR /* inline & arch function */ #define __HAVE_ARCH_MEMCMP /* arch function */ #define __HAVE_ARCH_MEMSCAN /* inline & arch function */ @@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n); #define __HAVE_ARCH_STRNCPY /* arch function */ #define __HAVE_ARCH_STRNLEN /* inline & arch function */ #define __HAVE_ARCH_STRSTR /* arch function */ +#define __HAVE_ARCH_MEMSET16 /* arch function */ +#define __HAVE_ARCH_MEMSET32 /* arch function */ +#define __HAVE_ARCH_MEMSET64 /* arch function */ /* Prototypes for non-inlined arch strings functions. */ int memcmp(const void *s1, const void *s2, size_t n); @@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n); char *strncat(char *dest, const char *src, size_t n); char *strncpy(char *dest, const char *src, size_t n); char *strstr(const char *s1, const char *s2); -#endif /* !CONFIG_KASAN */ +#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */ #undef __HAVE_ARCH_STRCHR #undef __HAVE_ARCH_STRNCHR @@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count); void *__memset32(uint32_t *s, uint32_t v, size_t count); void *__memset64(uint64_t *s, uint64_t v, size_t count); +#ifdef __HAVE_ARCH_MEMSET16 static inline void *memset16(uint16_t *s, uint16_t v, size_t count) { return __memset16(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET32 static inline void *memset32(uint32_t *s, uint32_t v, size_t count) { return __memset32(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET64 +#ifdef IN_BOOT_STRING_C +void *memset64(uint64_t *s, uint64_t v, size_t count); +#else static inline void *memset64(uint64_t *s, uint64_t v, size_t count) { return __memset64(s, v, count * sizeof(v)); } +#endif +#endif #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || defined(__NO_FORTIFY)) diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index a674c7d25da5..d02a709717b8 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -16,7 +16,7 @@ /* * General size of kernel stacks */ -#ifdef CONFIG_KASAN +#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) #define THREAD_SIZE_ORDER 4 #else #define THREAD_SIZE_ORDER 2 diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 81ae8a98e7ec..9213be0529ee 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -18,6 +18,7 @@ #include #include #include +#include void debug_user_asce(int exit); @@ -78,13 +79,24 @@ union oac { int __noreturn __put_user_bad(void); -#define __put_user_asm(to, from, size) \ -({ \ +#ifdef CONFIG_KMSAN +#define get_put_user_noinstr_attributes \ + noinline __maybe_unused __no_sanitize_memory +#else +#define get_put_user_noinstr_attributes __always_inline +#endif + +#define DEFINE_PUT_USER(type) \ +static get_put_user_noinstr_attributes int \ +__put_user_##type##_noinstr(unsigned type __user *to, \ + unsigned type *from, \ + unsigned long size) \ +{ \ union oac __oac_spec = { \ .oac1.as = PSW_BITS_AS_SECONDARY, \ .oac1.a = 1, \ }; \ - int __rc; \ + int rc; \ \ asm volatile( \ " lr 0,%[spec]\n" \ @@ -93,12 +105,28 @@ int __noreturn __put_user_bad(void); "2:\n" \ EX_TABLE_UA_STORE(0b, 2b, %[rc]) \ EX_TABLE_UA_STORE(1b, 2b, %[rc]) \ - : [rc] "=&d" (__rc), [_to] "+Q" (*(to)) \ + : [rc] "=&d" (rc), [_to] "+Q" (*(to)) \ : [_size] "d" (size), [_from] "Q" (*(from)), \ [spec] "d" (__oac_spec.val) \ : "cc", "0"); \ - __rc; \ -}) + return rc; \ +} \ + \ +static __always_inline int \ +__put_user_##type(unsigned type __user *to, unsigned type *from, \ + unsigned long size) \ +{ \ + int rc; \ + \ + rc = __put_user_##type##_noinstr(to, from, size); \ + instrument_put_user(*from, to, size); \ + return rc; \ +} + +DEFINE_PUT_USER(char); +DEFINE_PUT_USER(short); +DEFINE_PUT_USER(int); +DEFINE_PUT_USER(long); static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned long size) { @@ -106,24 +134,24 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon switch (size) { case 1: - rc = __put_user_asm((unsigned char __user *)ptr, - (unsigned char *)x, - size); + rc = __put_user_char((unsigned char __user *)ptr, + (unsigned char *)x, + size); break; case 2: - rc = __put_user_asm((unsigned short __user *)ptr, - (unsigned short *)x, - size); + rc = __put_user_short((unsigned short __user *)ptr, + (unsigned short *)x, + size); break; case 4: - rc = __put_user_asm((unsigned int __user *)ptr, + rc = __put_user_int((unsigned int __user *)ptr, (unsigned int *)x, size); break; case 8: - rc = __put_user_asm((unsigned long __user *)ptr, - (unsigned long *)x, - size); + rc = __put_user_long((unsigned long __user *)ptr, + (unsigned long *)x, + size); break; default: __put_user_bad(); @@ -134,13 +162,17 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon int __noreturn __get_user_bad(void); -#define __get_user_asm(to, from, size) \ -({ \ +#define DEFINE_GET_USER(type) \ +static get_put_user_noinstr_attributes int \ +__get_user_##type##_noinstr(unsigned type *to, \ + unsigned type __user *from, \ + unsigned long size) \ +{ \ union oac __oac_spec = { \ .oac2.as = PSW_BITS_AS_SECONDARY, \ .oac2.a = 1, \ }; \ - int __rc; \ + int rc; \ \ asm volatile( \ " lr 0,%[spec]\n" \ @@ -149,13 +181,29 @@ int __noreturn __get_user_bad(void); "2:\n" \ EX_TABLE_UA_LOAD_MEM(0b, 2b, %[rc], %[_to], %[_ksize]) \ EX_TABLE_UA_LOAD_MEM(1b, 2b, %[rc], %[_to], %[_ksize]) \ - : [rc] "=&d" (__rc), "=Q" (*(to)) \ + : [rc] "=&d" (rc), "=Q" (*(to)) \ : [_size] "d" (size), [_from] "Q" (*(from)), \ [spec] "d" (__oac_spec.val), [_to] "a" (to), \ [_ksize] "K" (size) \ : "cc", "0"); \ - __rc; \ -}) + return rc; \ +} \ + \ +static __always_inline int \ +__get_user_##type(unsigned type *to, unsigned type __user *from, \ + unsigned long size) \ +{ \ + int rc; \ + \ + rc = __get_user_##type##_noinstr(to, from, size); \ + instrument_get_user(*to); \ + return rc; \ +} + +DEFINE_GET_USER(char); +DEFINE_GET_USER(short); +DEFINE_GET_USER(int); +DEFINE_GET_USER(long); static __always_inline int __get_user_fn(void *x, const void __user *ptr, unsigned long size) { @@ -163,24 +211,24 @@ static __always_inline int __get_user_fn(void *x, const void __user *ptr, unsign switch (size) { case 1: - rc = __get_user_asm((unsigned char *)x, - (unsigned char __user *)ptr, - size); + rc = __get_user_char((unsigned char *)x, + (unsigned char __user *)ptr, + size); break; case 2: - rc = __get_user_asm((unsigned short *)x, - (unsigned short __user *)ptr, - size); + rc = __get_user_short((unsigned short *)x, + (unsigned short __user *)ptr, + size); break; case 4: - rc = __get_user_asm((unsigned int *)x, + rc = __get_user_int((unsigned int *)x, (unsigned int __user *)ptr, size); break; case 8: - rc = __get_user_asm((unsigned long *)x, - (unsigned long __user *)ptr, - size); + rc = __get_user_long((unsigned long *)x, + (unsigned long __user *)ptr, + size); break; default: __get_user_bad(); diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c index 9b65f04c83de..ac7b8c8e3133 100644 --- a/arch/s390/kernel/diag.c +++ b/arch/s390/kernel/diag.c @@ -282,12 +282,14 @@ int diag224(void *ptr) int rc = -EOPNOTSUPP; diag_stat_inc(DIAG_STAT_X224); - asm volatile( - " diag %1,%2,0x224\n" - "0: lhi %0,0x0\n" + asm volatile("\n" + " diag %[type],%[addr],0x224\n" + "0: lhi %[rc],0\n" "1:\n" EX_TABLE(0b,1b) - : "+d" (rc) :"d" (0), "d" (addr) : "memory"); + : [rc] "+d" (rc) + , "=m" (*(struct { char buf[PAGE_SIZE]; } *)ptr) + : [type] "d" (0), [addr] "d" (addr)); return rc; } EXPORT_SYMBOL(diag224); diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index ddf2ee47cb87..0bd6adc40a34 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -303,6 +304,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip, if (bit < 0) return; + kmsan_unpoison_memory(fregs, sizeof(*fregs)); regs = ftrace_get_regs(fregs); p = get_kprobe((kprobe_opcode_t *)ip); if (!regs || unlikely(!p) || kprobe_disabled(p)) diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c index a7c211a3a0c9..160b2acba8db 100644 --- a/arch/s390/kernel/traps.c +++ b/arch/s390/kernel/traps.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -262,6 +263,11 @@ static void monitor_event_exception(struct pt_regs *regs) void kernel_stack_overflow(struct pt_regs *regs) { + /* + * Normally regs are unpoisoned by the generic entry code, but + * kernel_stack_overflow() is a rare case that is called bypassing it. + */ + kmsan_unpoison_entry_regs(regs); bust_spinlocks(1); printk("Kernel stack overflow.\n"); show_regs(regs); diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c index 0ece156fdd7c..cd44be2b6ce8 100644 --- a/arch/s390/kernel/unwind_bc.c +++ b/arch/s390/kernel/unwind_bc.c @@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state *state, READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE; } +/* Avoid KMSAN false positives from touching uninitialized frames. */ +__no_kmsan_checks bool unwind_next_frame(struct unwind_state *state) { struct stack_info *info = &state->stack_info; @@ -118,6 +120,8 @@ out_stop: } EXPORT_SYMBOL_GPL(unwind_next_frame); +/* Avoid KMSAN false positives from touching uninitialized frames. */ +__no_kmsan_checks void __unwind_start(struct unwind_state *state, struct task_struct *task, struct pt_regs *regs, unsigned long first_frame) { diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c index 34d558164f0d..ded0eff58a19 100644 --- a/arch/s390/mm/hugetlbpage.c +++ b/arch/s390/mm/hugetlbpage.c @@ -169,7 +169,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, __set_huge_pte_at(mm, addr, ptep, pte); } -pte_t huge_ptep_get(pte_t *ptep) +pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { return __rste_to_pte(pte_val(*ptep)); } @@ -177,7 +177,7 @@ pte_t huge_ptep_get(pte_t *ptep) pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { - pte_t pte = huge_ptep_get(ptep); + pte_t pte = huge_ptep_get(mm, addr, ptep); pmd_t *pmdp = (pmd_t *) ptep; pud_t *pudp = (pud_t *) ptep; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 00b247d924a9..53d7cb5bbffe 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -490,7 +490,7 @@ void flush_dcache_folio(struct folio *folio) } set_dcache_dirty(folio, this_cpu); } else { - /* We could delay the flush for the !page_mapping + /* We could delay the flush for the !folio_mapping * case too. But that case is for exec env/arg * pages and those are %99 certainly going to get * faulted into the tlb (and thus flushed) anyways. diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 28002cc7a37d..d8dbeac8b206 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -988,8 +988,6 @@ static void __meminit free_pagetable(struct page *page, int order) /* bootmem page has reserved flag */ if (PageReserved(page)) { - __ClearPageReserved(page); - magic = page->index; if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) { while (nr_pages--) @@ -1362,18 +1360,6 @@ void __init mem_init(void) preallocate_vmalloc_pages(); } -#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT -int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask) -{ - /* - * More CPUs always led to greater speedups on tested systems, up to - * all the nodes' CPUs. Use all since the system is otherwise idle - * now. - */ - return max_t(int, cpumask_weight(node_cpumask), 1); -} -#endif - int kernel_set_to_readonly; void mark_rodata_ro(void) diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 443a97e515c0..44f7b2ea6a07 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -1119,8 +1119,8 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, lpinc = PMD_SIZE; /* * Clear the PSE flags if the PRESENT flag is not set - * otherwise pmd_present/pmd_huge will return true - * even on a non present pmd. + * otherwise pmd_present() will return true even on a non + * present pmd. */ if (!(pgprot_val(ref_prot) & _PAGE_PRESENT)) pgprot_val(ref_prot) &= ~_PAGE_PSE; diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index 9a7e5e57ee9a..1647a7cc3fbf 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -410,9 +410,9 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, typedef pte_t *pte_addr_t; -void update_mmu_tlb(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep); -#define __HAVE_ARCH_UPDATE_MMU_TLB +void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr); +#define update_mmu_tlb_range update_mmu_tlb_range #endif /* !defined (__ASSEMBLY__) */ diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c index d8b60d6e50a8..0a1a815dc796 100644 --- a/arch/xtensa/mm/tlb.c +++ b/arch/xtensa/mm/tlb.c @@ -163,10 +163,10 @@ void local_flush_tlb_kernel_range(unsigned long start, unsigned long end) } } -void update_mmu_tlb(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep) +void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr) { - local_flush_tlb_page(vma, address); + local_flush_tlb_range(vma, address, address + PAGE_SIZE * nr); } #ifdef CONFIG_DEBUG_TLB_SANITY diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c index febd9e51350b..1a902a02390f 100644 --- a/drivers/acpi/numa/hmat.c +++ b/drivers/acpi/numa/hmat.c @@ -933,17 +933,14 @@ static int hmat_callback(struct notifier_block *self, return NOTIFY_OK; } -static int hmat_set_default_dram_perf(void) +static int __init hmat_set_default_dram_perf(void) { int rc; int nid, pxm; struct memory_target *target; struct access_coordinate *attrs; - if (!default_dram_type) - return -EIO; - - for_each_node_mask(nid, default_dram_type->nodes) { + for_each_node_mask(nid, default_dram_nodes) { pxm = node_to_pxm(nid); target = find_mem_target(pxm); if (!target) diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 7b29cce60ab2..eacf1cba7bf4 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -2,6 +2,7 @@ config ZRAM tristate "Compressed RAM block device support" depends on BLOCK && SYSFS && MMU + depends on HAVE_ZSMALLOC depends on CRYPTO_LZO || CRYPTO_ZSTD || CRYPTO_LZ4 || CRYPTO_LZ4HC || CRYPTO_842 select ZSMALLOC help diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig index e4dc53a36428..b46eb8a552d7 100644 --- a/drivers/dma-buf/Kconfig +++ b/drivers/dma-buf/Kconfig @@ -35,6 +35,7 @@ config UDMABUF default n depends on DMA_SHARED_BUFFER depends on MEMFD_CREATE || COMPILE_TEST + depends on MMU help A driver to let userspace turn memfd regions into dma-bufs. Qemu can use this to create host dmabufs for guest framebuffers. diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c index c40645999648..047c3cd2ceff 100644 --- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -25,9 +26,16 @@ MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is struct udmabuf { pgoff_t pagecount; - struct page **pages; + struct folio **folios; struct sg_table *sg; struct miscdevice *device; + pgoff_t *offsets; + struct list_head unpin_list; +}; + +struct udmabuf_folio { + struct folio *folio; + struct list_head list; }; static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) @@ -35,12 +43,15 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct udmabuf *ubuf = vma->vm_private_data; pgoff_t pgoff = vmf->pgoff; + unsigned long pfn; if (pgoff >= ubuf->pagecount) return VM_FAULT_SIGBUS; - vmf->page = ubuf->pages[pgoff]; - get_page(vmf->page); - return 0; + + pfn = folio_pfn(ubuf->folios[pgoff]); + pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT; + + return vmf_insert_pfn(vma, vmf->address, pfn); } static const struct vm_operations_struct udmabuf_vm_ops = { @@ -56,17 +67,28 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma) vma->vm_ops = &udmabuf_vm_ops; vma->vm_private_data = ubuf; + vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); return 0; } static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map) { struct udmabuf *ubuf = buf->priv; + struct page **pages; void *vaddr; + pgoff_t pg; dma_resv_assert_held(buf->resv); - vaddr = vm_map_ram(ubuf->pages, ubuf->pagecount, -1); + pages = kmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + for (pg = 0; pg < ubuf->pagecount; pg++) + pages[pg] = &ubuf->folios[pg]->page; + + vaddr = vm_map_ram(pages, ubuf->pagecount, -1); + kfree(pages); if (!vaddr) return -EINVAL; @@ -88,23 +110,30 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf, { struct udmabuf *ubuf = buf->priv; struct sg_table *sg; + struct scatterlist *sgl; + unsigned int i = 0; int ret; sg = kzalloc(sizeof(*sg), GFP_KERNEL); if (!sg) return ERR_PTR(-ENOMEM); - ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount, - 0, ubuf->pagecount << PAGE_SHIFT, - GFP_KERNEL); + + ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL); if (ret < 0) - goto err; + goto err_alloc; + + for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) + sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE, + ubuf->offsets[i]); + ret = dma_map_sgtable(dev, sg, direction, 0); if (ret < 0) - goto err; + goto err_map; return sg; -err: +err_map: sg_free_table(sg); +err_alloc: kfree(sg); return ERR_PTR(ret); } @@ -130,18 +159,45 @@ static void unmap_udmabuf(struct dma_buf_attachment *at, return put_sg_table(at->dev, sg, direction); } +static void unpin_all_folios(struct list_head *unpin_list) +{ + struct udmabuf_folio *ubuf_folio; + + while (!list_empty(unpin_list)) { + ubuf_folio = list_first_entry(unpin_list, + struct udmabuf_folio, list); + unpin_folio(ubuf_folio->folio); + + list_del(&ubuf_folio->list); + kfree(ubuf_folio); + } +} + +static int add_to_unpin_list(struct list_head *unpin_list, + struct folio *folio) +{ + struct udmabuf_folio *ubuf_folio; + + ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL); + if (!ubuf_folio) + return -ENOMEM; + + ubuf_folio->folio = folio; + list_add_tail(&ubuf_folio->list, unpin_list); + return 0; +} + static void release_udmabuf(struct dma_buf *buf) { struct udmabuf *ubuf = buf->priv; struct device *dev = ubuf->device->this_device; - pgoff_t pg; if (ubuf->sg) put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL); - for (pg = 0; pg < ubuf->pagecount; pg++) - put_page(ubuf->pages[pg]); - kfree(ubuf->pages); + unpin_all_folios(&ubuf->unpin_list); + kfree(ubuf->offsets); + kfree(ubuf->folios); kfree(ubuf); } @@ -194,24 +250,64 @@ static const struct dma_buf_ops udmabuf_ops = { #define SEALS_WANTED (F_SEAL_SHRINK) #define SEALS_DENIED (F_SEAL_WRITE) +static int check_memfd_seals(struct file *memfd) +{ + int seals; + + if (!memfd) + return -EBADFD; + + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) + return -EBADFD; + + seals = memfd_fcntl(memfd, F_GET_SEALS, 0); + if (seals == -EINVAL) + return -EBADFD; + + if ((seals & SEALS_WANTED) != SEALS_WANTED || + (seals & SEALS_DENIED) != 0) + return -EINVAL; + + return 0; +} + +static int export_udmabuf(struct udmabuf *ubuf, + struct miscdevice *device, + u32 flags) +{ + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct dma_buf *buf; + + ubuf->device = device; + exp_info.ops = &udmabuf_ops; + exp_info.size = ubuf->pagecount << PAGE_SHIFT; + exp_info.priv = ubuf; + exp_info.flags = O_RDWR; + + buf = dma_buf_export(&exp_info); + if (IS_ERR(buf)) + return PTR_ERR(buf); + + return dma_buf_fd(buf, flags); +} + static long udmabuf_create(struct miscdevice *device, struct udmabuf_create_list *head, struct udmabuf_create_item *list) { - DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + pgoff_t pgoff, pgcnt, pglimit, pgbuf = 0; + long nr_folios, ret = -EINVAL; struct file *memfd = NULL; - struct address_space *mapping = NULL; + struct folio **folios; struct udmabuf *ubuf; - struct dma_buf *buf; - pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit; - struct page *page; - int seals, ret = -EINVAL; - u32 i, flags; + u32 i, j, k, flags; + loff_t end; ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL); if (!ubuf) return -ENOMEM; + INIT_LIST_HEAD(&ubuf->unpin_list); pglimit = (size_limit_mb * 1024 * 1024) >> PAGE_SHIFT; for (i = 0; i < head->count; i++) { if (!IS_ALIGNED(list[i].offset, PAGE_SIZE)) @@ -226,66 +322,84 @@ static long udmabuf_create(struct miscdevice *device, if (!ubuf->pagecount) goto err; - ubuf->pages = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->pages), + ubuf->folios = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->folios), GFP_KERNEL); - if (!ubuf->pages) { + if (!ubuf->folios) { + ret = -ENOMEM; + goto err; + } + ubuf->offsets = kcalloc(ubuf->pagecount, sizeof(*ubuf->offsets), + GFP_KERNEL); + if (!ubuf->offsets) { ret = -ENOMEM; goto err; } pgbuf = 0; for (i = 0; i < head->count; i++) { - ret = -EBADFD; memfd = fget(list[i].memfd); - if (!memfd) + ret = check_memfd_seals(memfd); + if (ret < 0) goto err; - mapping = memfd->f_mapping; - if (!shmem_mapping(mapping)) + + pgcnt = list[i].size >> PAGE_SHIFT; + folios = kmalloc_array(pgcnt, sizeof(*folios), GFP_KERNEL); + if (!folios) { + ret = -ENOMEM; goto err; - seals = memfd_fcntl(memfd, F_GET_SEALS, 0); - if (seals == -EINVAL) - goto err; - ret = -EINVAL; - if ((seals & SEALS_WANTED) != SEALS_WANTED || - (seals & SEALS_DENIED) != 0) - goto err; - pgoff = list[i].offset >> PAGE_SHIFT; - pgcnt = list[i].size >> PAGE_SHIFT; - for (pgidx = 0; pgidx < pgcnt; pgidx++) { - page = shmem_read_mapping_page(mapping, pgoff + pgidx); - if (IS_ERR(page)) { - ret = PTR_ERR(page); - goto err; - } - ubuf->pages[pgbuf++] = page; } + + end = list[i].offset + (pgcnt << PAGE_SHIFT) - 1; + ret = memfd_pin_folios(memfd, list[i].offset, end, + folios, pgcnt, &pgoff); + if (ret <= 0) { + kfree(folios); + if (!ret) + ret = -EINVAL; + goto err; + } + + nr_folios = ret; + pgoff >>= PAGE_SHIFT; + for (j = 0, k = 0; j < pgcnt; j++) { + ubuf->folios[pgbuf] = folios[k]; + ubuf->offsets[pgbuf] = pgoff << PAGE_SHIFT; + + if (j == 0 || ubuf->folios[pgbuf-1] != folios[k]) { + ret = add_to_unpin_list(&ubuf->unpin_list, + folios[k]); + if (ret < 0) { + kfree(folios); + goto err; + } + } + + pgbuf++; + if (++pgoff == folio_nr_pages(folios[k])) { + pgoff = 0; + if (++k == nr_folios) + break; + } + } + + kfree(folios); fput(memfd); memfd = NULL; } - exp_info.ops = &udmabuf_ops; - exp_info.size = ubuf->pagecount << PAGE_SHIFT; - exp_info.priv = ubuf; - exp_info.flags = O_RDWR; - - ubuf->device = device; - buf = dma_buf_export(&exp_info); - if (IS_ERR(buf)) { - ret = PTR_ERR(buf); + flags = head->flags & UDMABUF_FLAGS_CLOEXEC ? O_CLOEXEC : 0; + ret = export_udmabuf(ubuf, device, flags); + if (ret < 0) goto err; - } - flags = 0; - if (head->flags & UDMABUF_FLAGS_CLOEXEC) - flags |= O_CLOEXEC; - return dma_buf_fd(buf, flags); + return ret; err: - while (pgbuf > 0) - put_page(ubuf->pages[--pgbuf]); if (memfd) fput(memfd); - kfree(ubuf->pages); + unpin_all_folios(&ubuf->unpin_list); + kfree(ubuf->offsets); + kfree(ubuf->folios); kfree(ubuf); return ret; } diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 0e7427c2baf5..c38dcdfcb914 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -683,9 +683,8 @@ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg) if (!PageOffline(pg)) __SetPageOffline(pg); return; - } - if (PageOffline(pg)) - __ClearPageOffline(pg); + } else if (!PageOffline(pg)) + return; /* This frame is currently backed; online the page. */ generic_online_page(pg, 0); diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c index fbe29cabcbb8..f3621adbd5de 100644 --- a/drivers/s390/char/sclp.c +++ b/drivers/s390/char/sclp.c @@ -736,7 +736,7 @@ sclp_sync_wait(void) cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK; cr0_sync.val |= 1UL << (63 - 54); local_ctl_load(0, &cr0_sync); - __arch_local_irq_stosm(0x01); + arch_local_irq_enable_external(); /* Loop until driver state indicates finished request */ while (sclp_running_state != sclp_running_state_idle) { /* Check for expired request timer */ diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c index 5ee7e78c2cea..65363df8e81b 100644 --- a/drivers/video/fbdev/core/fb_defio.c +++ b/drivers/video/fbdev/core/fb_defio.c @@ -146,7 +146,7 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf) printk(KERN_ERR "no mapping available\n"); BUG_ON(!page->mapping); - page->index = vmf->pgoff; /* for page_mkclean() */ + page->index = vmf->pgoff; /* for folio_mkclean() */ vmf->page = page; return 0; @@ -194,7 +194,7 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long /* * We want the page to remain locked from ->page_mkwrite until - * the PTE is marked dirty to avoid page_mkclean() being called + * the PTE is marked dirty to avoid folio_mkclean() being called * before the PTE is updated, which would leave the page ignored * by defio. * Do this by locking the page here and informing the caller @@ -277,10 +277,11 @@ static void fb_deferred_io_work(struct work_struct *work) /* here we mkclean the pages, then do all deferred IO */ mutex_lock(&fbdefio->lock); list_for_each_entry(pageref, &fbdefio->pagereflist, list) { - struct page *cur = pageref->page; - lock_page(cur); - page_mkclean(cur); - unlock_page(cur); + struct folio *folio = page_folio(pageref->page); + + folio_lock(folio); + folio_mkclean(folio); + folio_unlock(folio); } /* driver's callback with pagereflist */ diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index a3857bacc844..b0b871441578 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1146,12 +1146,16 @@ static void virtio_mem_set_fake_offline(unsigned long pfn, for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); - __SetPageOffline(page); - if (!onlined) { + if (!onlined) + /* + * Pages that have not been onlined yet were initialized + * to PageOffline(). Remember that we have to route them + * through generic_online_page(). + */ SetPageDirty(page); - /* FIXME: remove after cleanups */ - ClearPageReserved(page); - } + else + __SetPageOffline(page); + VM_WARN_ON_ONCE(!PageOffline(page)); } page_offline_end(); } @@ -1166,9 +1170,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn, for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); - __ClearPageOffline(page); if (!onlined) + /* generic_online_page() will clear PageOffline(). */ ClearPageDirty(page); + else + __ClearPageOffline(page); } } @@ -1263,12 +1269,6 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn, struct page *page; unsigned long i; - /* - * Drop our reference to the pages so the memory can get offlined - * and add the unplugged pages to the managed page counters (so - * offlining code can correctly subtract them again). - */ - adjust_managed_page_count(pfn_to_page(pfn), nr_pages); /* Drop our reference to the pages so the memory can get offlined. */ for (i = 0; i < nr_pages; i++) { page = pfn_to_page(pfn + i); @@ -1287,10 +1287,9 @@ static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn, unsigned long i; /* - * Get the reference we dropped when going offline and subtract the - * unplugged pages from the managed page counters. + * Get the reference again that we dropped via page_ref_dec_and_test() + * when going offline. */ - adjust_managed_page_count(pfn_to_page(pfn), -nr_pages); for (i = 0; i < nr_pages; i++) page_ref_inc(pfn_to_page(pfn + i)); } diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index aaf2514fcfa4..528395133b4f 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -146,7 +146,8 @@ static DECLARE_WAIT_QUEUE_HEAD(balloon_wq); /* balloon_append: add the given page to the balloon. */ static void balloon_append(struct page *page) { - __SetPageOffline(page); + if (!PageOffline(page)) + __SetPageOffline(page); /* Lowmem is re-populated first, so highmem pages go at list tail. */ if (PageHighMem(page)) { @@ -412,7 +413,11 @@ static enum bp_state increase_reservation(unsigned long nr_pages) xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]); - /* Relinquish the page back to the allocator. */ + /* + * Relinquish the page back to the allocator. Note that + * some pages, including ones added via xen_online_page(), might + * not be marked reserved; free_reserved_page() will handle that. + */ free_reserved_page(page); } diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 67afe68972d5..f8622ed72e08 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -533,14 +533,14 @@ static int afs_dir_iterate(struct inode *dir, struct dir_context *ctx, break; } - offset = round_down(ctx->pos, sizeof(*dblock)) - folio_file_pos(folio); + offset = round_down(ctx->pos, sizeof(*dblock)) - folio_pos(folio); size = min_t(loff_t, folio_size(folio), - req->actual_len - folio_file_pos(folio)); + req->actual_len - folio_pos(folio)); do { dblock = kmap_local_folio(folio, offset); ret = afs_dir_iterate_block(dvnode, ctx, dblock, - folio_file_pos(folio) + offset); + folio_pos(folio) + offset); kunmap_local(dblock); if (ret != 1) goto out; diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c index e2fa577b66fe..a71bff10496b 100644 --- a/fs/afs/dir_edit.c +++ b/fs/afs/dir_edit.c @@ -256,7 +256,7 @@ void afs_edit_dir_add(struct afs_vnode *vnode, folio = folio0; } - block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_file_pos(folio)); + block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_pos(folio)); /* Abandon the edit if we got a callback break. */ if (!test_bit(AFS_VNODE_DIR_VALID, &vnode->flags)) @@ -417,7 +417,7 @@ void afs_edit_dir_remove(struct afs_vnode *vnode, folio = folio0; } - block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_file_pos(folio)); + block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_pos(folio)); /* Abandon the edit if we got a callback break. */ if (!test_bit(AFS_VNODE_DIR_VALID, &vnode->flags)) diff --git a/fs/aio.c b/fs/aio.c index 93ef59d358b3..6066f64967b3 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -410,17 +410,7 @@ static int aio_migrate_folio(struct address_space *mapping, struct folio *dst, struct kioctx *ctx; unsigned long flags; pgoff_t idx; - int rc; - - /* - * We cannot support the _NO_COPY case here, because copy needs to - * happen under the ctx->completion_lock. That does not work with the - * migration workflow of MIGRATE_SYNC_NO_COPY. - */ - if (mode == MIGRATE_SYNC_NO_COPY) - return -EINVAL; - - rc = 0; + int rc = 0; /* mapping->i_private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->i_private_lock); @@ -465,7 +455,8 @@ static int aio_migrate_folio(struct address_space *mapping, struct folio *dst, * events from being lost. */ spin_lock_irqsave(&ctx->completion_lock, flags); - folio_migrate_copy(dst, src); + folio_copy(dst, src); + folio_migrate_flags(dst, src); BUG_ON(ctx->ring_folios[idx] != src); ctx->ring_folios[idx] = dst; spin_unlock_irqrestore(&ctx->completion_lock, flags); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index b592fc8cf368..0533d0f82dc9 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2981,8 +2981,7 @@ static int relocate_one_folio(struct reloc_control *rc, if (folio_test_readahead(folio)) page_cache_async_readahead(inode->i_mapping, ra, NULL, - folio, index, - last_index + 1 - index); + folio, last_index + 1 - index); if (!folio_test_uptodate(folio)) { btrfs_read_folio(NULL, folio); diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index fb3675f5bf50..4ca711a773ef 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -5306,7 +5306,7 @@ static int put_file_data(struct send_ctx *sctx, u64 offset, u32 len) if (folio_test_readahead(folio)) page_cache_async_readahead(mapping, &sctx->ra, NULL, folio, - index, last_index + 1 - index); + last_index + 1 - index); if (!folio_test_uptodate(folio)) { btrfs_read_folio(NULL, folio); diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 82a2e2a06a65..5aadc56e0cc0 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -141,7 +141,7 @@ __dcache_find_get_entry(struct dentry *parent, u64 idx, if (ptr_pos >= i_size_read(dir)) return NULL; - if (!cache_ctl->page || ptr_pgoff != page_index(cache_ctl->page)) { + if (!cache_ctl->page || ptr_pgoff != cache_ctl->page->index) { ceph_readdir_cache_release(cache_ctl); cache_ctl->page = find_lock_page(&dir->i_data, ptr_pgoff); if (!cache_ctl->page) { diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 249ddfbb1b03..8f8de8f33abb 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1863,7 +1863,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn, unsigned idx = ctl->index % nsize; pgoff_t pgoff = ctl->index / nsize; - if (!ctl->page || pgoff != page_index(ctl->page)) { + if (!ctl->page || pgoff != ctl->page->index) { ceph_readdir_cache_release(ctl); if (idx == 0) ctl->page = grab_cache_page(&dir->i_data, pgoff); diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 81dab95f67ed..9f6cff356796 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -222,13 +222,13 @@ generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long flags) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma; + struct vm_area_struct *vma, *prev; struct hstate *h = hstate_file(file); const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags); if (len & ~huge_page_mask(h)) return -EINVAL; - if (len > TASK_SIZE) + if (len > mmap_end - mmap_min_addr) return -ENOMEM; if (flags & MAP_FIXED) { @@ -239,9 +239,10 @@ generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr, if (addr) { addr = ALIGN(addr, huge_page_size(h)); - vma = find_vma(mm, addr); - if (mmap_end - len >= addr && - (!vma || addr + len <= vm_start_gap(vma))) + vma = find_vma_prev(mm, addr, &prev); + if (mmap_end - len >= addr && addr >= mmap_min_addr && + (!vma || addr + len <= vm_start_gap(vma)) && + (!prev || addr >= vm_end_gap(prev))) return addr; } @@ -422,7 +423,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma, if (!ptep) return false; - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(vma->vm_mm, addr, ptep); if (huge_pte_none(pte) || !pte_present(pte)) return false; @@ -892,7 +893,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, error = PTR_ERR(folio); goto out; } - clear_huge_page(&folio->page, addr, pages_per_huge_page(h)); + folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); __folio_mark_uptodate(folio); error = hugetlb_add_to_page_cache(folio, mapping, index); if (unlikely(error)) { @@ -1128,10 +1129,7 @@ static int hugetlbfs_migrate_folio(struct address_space *mapping, hugetlb_set_folio_subpool(src, NULL); } - if (mode != MIGRATE_SYNC_NO_COPY) - folio_migrate_copy(dst, src); - else - folio_migrate_flags(dst, src); + folio_migrate_flags(dst, src); return MIGRATEPAGE_SUCCESS; } diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c index 4c0401dbbfcf..a6d5d07cd436 100644 --- a/fs/netfs/buffered_read.c +++ b/fs/netfs/buffered_read.c @@ -271,7 +271,7 @@ int netfs_read_folio(struct file *file, struct folio *folio) kenter("%lx", folio->index); rreq = netfs_alloc_request(mapping, file, - folio_file_pos(folio), folio_size(folio), + folio_pos(folio), folio_size(folio), NETFS_READPAGE); if (IS_ERR(rreq)) { ret = PTR_ERR(rreq); @@ -470,7 +470,7 @@ retry: } rreq = netfs_alloc_request(mapping, file, - folio_file_pos(folio), folio_size(folio), + folio_pos(folio), folio_size(folio), NETFS_READ_FOR_WRITE); if (IS_ERR(rreq)) { ret = PTR_ERR(rreq); diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index ecbc99ec7d36..68a3f1383cee 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -54,7 +54,7 @@ static enum netfs_how_to_modify netfs_how_to_modify(struct netfs_inode *ctx, { struct netfs_folio *finfo = netfs_folio_info(folio); struct netfs_group *group = netfs_folio_group(folio); - loff_t pos = folio_file_pos(folio); + loff_t pos = folio_pos(folio); kenter(""); diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 9aa2ab218c0a..61a8cdb9f1e1 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -591,7 +591,7 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf) dfprintk(PAGECACHE, "NFS: vm_page_mkwrite(%pD2(%lu), offset %lld)\n", filp, filp->f_mapping->host->i_ino, - (long long)folio_file_pos(folio)); + (long long)folio_pos(folio)); sb_start_pagefault(inode->i_sb); diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h index b17a9eb9b148..49862c95b224 100644 --- a/fs/nfs/iostat.h +++ b/fs/nfs/iostat.h @@ -46,6 +46,10 @@ static inline void nfs_add_stats(const struct inode *inode, nfs_add_server_stats(NFS_SERVER(inode), stat, addend); } +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). + */ #define nfs_alloc_iostats() alloc_percpu(struct nfs_iostats) static inline void nfs_free_iostats(struct nfs_iostats __percpu *stats) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 190c1fa8882c..d074d0ceb4f0 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -182,7 +182,7 @@ static void nfs_grow_file(struct folio *folio, unsigned int offset, end_index = ((i_size - 1) >> folio_shift(folio)) << folio_order(folio); if (i_size > 0 && folio->index < end_index) goto out; - end = folio_file_pos(folio) + (loff_t)offset + (loff_t)count; + end = folio_pos(folio) + (loff_t)offset + (loff_t)count; if (i_size >= end) goto out; trace_nfs_size_grow(inode, end); @@ -1344,7 +1344,7 @@ int nfs_update_folio(struct file *file, struct folio *folio, nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE); dprintk("NFS: nfs_update_folio(%pD2 %d@%lld)\n", file, count, - (long long)(folio_file_pos(folio) + offset)); + (long long)(folio_pos(folio) + offset)); if (!count) goto out; diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c index 383f0afa2cea..cd14ea25968c 100644 --- a/fs/nilfs2/bmap.c +++ b/fs/nilfs2/bmap.c @@ -450,15 +450,9 @@ int nilfs_bmap_test_and_clear_dirty(struct nilfs_bmap *bmap) __u64 nilfs_bmap_data_get_key(const struct nilfs_bmap *bmap, const struct buffer_head *bh) { - struct buffer_head *pbh; - __u64 key; + loff_t pos = folio_pos(bh->b_folio) + bh_offset(bh); - key = page_index(bh->b_page) << (PAGE_SHIFT - - bmap->b_inode->i_blkbits); - for (pbh = page_buffers(bh->b_page); pbh != bh; pbh = pbh->b_this_page) - key++; - - return key; + return pos >> bmap->b_inode->i_blkbits; } __u64 nilfs_bmap_find_target_seq(const struct nilfs_bmap *bmap, __u64 key) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index a71ac5379584..a8a8576d8592 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -13,6 +13,7 @@ #include #include #include +#include struct ctl_table_header; struct mempolicy; @@ -142,6 +143,38 @@ unsigned name_to_int(const struct qstr *qstr); /* Worst case buffer size needed for holding an integer. */ #define PROC_NUMBUF 13 +/** + * folio_precise_page_mapcount() - Number of mappings of this folio page. + * @folio: The folio. + * @page: The page. + * + * The number of present user page table entries that reference this page + * as tracked via the RMAP: either referenced directly (PTE) or as part of + * a larger area that covers this page (e.g., PMD). + * + * Use this function only for the calculation of existing statistics + * (USS, PSS, mapcount_max) and for debugging purposes (/proc/kpagecount). + * + * Do not add new users. + * + * Returns: The number of mappings of this folio page. 0 for + * folios that are not mapped to user space or are not tracked via the RMAP + * (e.g., shared zeropage). + */ +static inline int folio_precise_page_mapcount(struct folio *folio, + struct page *page) +{ + int mapcount = atomic_read(&page->_mapcount) + 1; + + /* Handle page_has_type() pages */ + if (mapcount < PAGE_MAPCOUNT_RESERVE + 1) + mapcount = 0; + if (folio_test_large(folio)) + mapcount += folio_entire_mapcount(folio); + + return mapcount; +} + /* * array.c */ diff --git a/fs/proc/page.c b/fs/proc/page.c index 2fb64bdb64eb..b7a5c84b5819 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -37,21 +37,19 @@ static inline unsigned long get_max_dump_pfn(void) #endif } -/* /proc/kpagecount - an array exposing page counts +/* /proc/kpagecount - an array exposing page mapcounts * * Each entry is a u64 representing the corresponding - * physical page count. + * physical page mapcount. */ static ssize_t kpagecount_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { const unsigned long max_dump_pfn = get_max_dump_pfn(); u64 __user *out = (u64 __user *)buf; - struct page *ppage; unsigned long src = *ppos; unsigned long pfn; ssize_t ret = 0; - u64 pcount; pfn = src / KPMSIZE; if (src & KPMMASK || count & KPMMASK) @@ -61,18 +59,19 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src); while (count > 0) { + struct page *page; + u64 mapcount = 0; + /* * TODO: ZONE_DEVICE support requires to identify * memmaps that were actually initialized. */ - ppage = pfn_to_online_page(pfn); + page = pfn_to_online_page(pfn); + if (page) + mapcount = folio_precise_page_mapcount(page_folio(page), + page); - if (!ppage) - pcount = 0; - else - pcount = page_mapcount(ppage); - - if (put_user(pcount, out)) { + if (put_user(mapcount, out)) { ret = -EFAULT; break; } @@ -148,19 +147,16 @@ u64 stable_page_flags(const struct page *page) u |= 1 << KPF_COMPOUND_TAIL; if (folio_test_hugetlb(folio)) u |= 1 << KPF_HUGE; - /* - * We need to check PageLRU/PageAnon - * to make sure a given page is a thp, not a non-huge compound page. - */ - else if (folio_test_large(folio)) { - if ((k & (1 << PG_lru)) || is_anon) - u |= 1 << KPF_THP; - else if (is_huge_zero_folio(folio)) { - u |= 1 << KPF_ZERO_PAGE; - u |= 1 << KPF_THP; - } - } else if (is_zero_pfn(page_to_pfn(page))) + else if (folio_test_large(folio) && + folio_test_large_rmappable(folio)) { + /* Note: we indicate any THPs here, not just PMD-sized ones */ + u |= 1 << KPF_THP; + } else if (is_huge_zero_folio(folio)) { u |= 1 << KPF_ZERO_PAGE; + u |= 1 << KPF_THP; + } else if (is_zero_folio(folio)) { + u |= 1 << KPF_ZERO_PAGE; + } /* * Caveats on high order pages: PG_buddy and PG_slab will only be set diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 71e5039d940d..775a2e8d600c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -239,6 +240,67 @@ static int do_maps_open(struct inode *inode, struct file *file, sizeof(struct proc_maps_private)); } +static void get_vma_name(struct vm_area_struct *vma, + const struct path **path, + const char **name, + const char **name_fmt) +{ + struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL; + + *name = NULL; + *path = NULL; + *name_fmt = NULL; + + /* + * Print the dentry name for named mappings, and a + * special [heap] marker for the heap: + */ + if (vma->vm_file) { + /* + * If user named this anon shared memory via + * prctl(PR_SET_VMA ..., use the provided name. + */ + if (anon_name) { + *name_fmt = "[anon_shmem:%s]"; + *name = anon_name->name; + } else { + *path = file_user_path(vma->vm_file); + } + return; + } + + if (vma->vm_ops && vma->vm_ops->name) { + *name = vma->vm_ops->name(vma); + if (*name) + return; + } + + *name = arch_vma_name(vma); + if (*name) + return; + + if (!vma->vm_mm) { + *name = "[vdso]"; + return; + } + + if (vma_is_initial_heap(vma)) { + *name = "[heap]"; + return; + } + + if (vma_is_initial_stack(vma)) { + *name = "[stack]"; + return; + } + + if (anon_name) { + *name_fmt = "[anon:%s]"; + *name = anon_name->name; + return; + } +} + static void show_vma_header_prefix(struct seq_file *m, unsigned long start, unsigned long end, vm_flags_t flags, unsigned long long pgoff, @@ -262,17 +324,15 @@ static void show_vma_header_prefix(struct seq_file *m, static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) { - struct anon_vma_name *anon_name = NULL; - struct mm_struct *mm = vma->vm_mm; - struct file *file = vma->vm_file; + const struct path *path; + const char *name_fmt, *name; vm_flags_t flags = vma->vm_flags; unsigned long ino = 0; unsigned long long pgoff = 0; unsigned long start, end; dev_t dev = 0; - const char *name = NULL; - if (file) { + if (vma->vm_file) { const struct inode *inode = file_user_inode(vma->vm_file); dev = inode->i_sb->s_dev; @@ -283,57 +343,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma) start = vma->vm_start; end = vma->vm_end; show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino); - if (mm) - anon_name = anon_vma_name(vma); - /* - * Print the dentry name for named mappings, and a - * special [heap] marker for the heap: - */ - if (file) { + get_vma_name(vma, &path, &name, &name_fmt); + if (path) { seq_pad(m, ' '); - /* - * If user named this anon shared memory via - * prctl(PR_SET_VMA ..., use the provided name. - */ - if (anon_name) - seq_printf(m, "[anon_shmem:%s]", anon_name->name); - else - seq_path(m, file_user_path(file), "\n"); - goto done; - } - - if (vma->vm_ops && vma->vm_ops->name) { - name = vma->vm_ops->name(vma); - if (name) - goto done; - } - - name = arch_vma_name(vma); - if (!name) { - if (!mm) { - name = "[vdso]"; - goto done; - } - - if (vma_is_initial_heap(vma)) { - name = "[heap]"; - goto done; - } - - if (vma_is_initial_stack(vma)) { - name = "[stack]"; - goto done; - } - - if (anon_name) { - seq_pad(m, ' '); - seq_printf(m, "[anon:%s]", anon_name->name); - } - } - -done: - if (name) { + seq_path(m, path, "\n"); + } else if (name_fmt) { + seq_pad(m, ' '); + seq_printf(m, name_fmt, name); + } else if (name) { seq_pad(m, ' '); seq_puts(m, name); } @@ -358,11 +376,268 @@ static int pid_maps_open(struct inode *inode, struct file *file) return do_maps_open(inode, file, &proc_pid_maps_op); } +#define PROCMAP_QUERY_VMA_FLAGS ( \ + PROCMAP_QUERY_VMA_READABLE | \ + PROCMAP_QUERY_VMA_WRITABLE | \ + PROCMAP_QUERY_VMA_EXECUTABLE | \ + PROCMAP_QUERY_VMA_SHARED \ +) + +#define PROCMAP_QUERY_VALID_FLAGS_MASK ( \ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA | \ + PROCMAP_QUERY_FILE_BACKED_VMA | \ + PROCMAP_QUERY_VMA_FLAGS \ +) + +static int query_vma_setup(struct mm_struct *mm) +{ + return mmap_read_lock_killable(mm); +} + +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) +{ + mmap_read_unlock(mm); +} + +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) +{ + return find_vma(mm, addr); +} + +static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, + unsigned long addr, u32 flags) +{ + struct vm_area_struct *vma; + +next_vma: + vma = query_vma_find_by_addr(mm, addr); + if (!vma) + goto no_vma; + + /* user requested only file-backed VMA, keep iterating */ + if ((flags & PROCMAP_QUERY_FILE_BACKED_VMA) && !vma->vm_file) + goto skip_vma; + + /* VMA permissions should satisfy query flags */ + if (flags & PROCMAP_QUERY_VMA_FLAGS) { + u32 perm = 0; + + if (flags & PROCMAP_QUERY_VMA_READABLE) + perm |= VM_READ; + if (flags & PROCMAP_QUERY_VMA_WRITABLE) + perm |= VM_WRITE; + if (flags & PROCMAP_QUERY_VMA_EXECUTABLE) + perm |= VM_EXEC; + if (flags & PROCMAP_QUERY_VMA_SHARED) + perm |= VM_MAYSHARE; + + if ((vma->vm_flags & perm) != perm) + goto skip_vma; + } + + /* found covering VMA or user is OK with the matching next VMA */ + if ((flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) || vma->vm_start <= addr) + return vma; + +skip_vma: + /* + * If the user needs closest matching VMA, keep iterating. + */ + addr = vma->vm_end; + if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) + goto next_vma; + +no_vma: + return ERR_PTR(-ENOENT); +} + +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) +{ + struct procmap_query karg; + struct vm_area_struct *vma; + struct mm_struct *mm; + const char *name = NULL; + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; + __u64 usize; + int err; + + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) + return -EFAULT; + /* argument struct can never be that large, reject abuse */ + if (usize > PAGE_SIZE) + return -E2BIG; + /* argument struct should have at least query_flags and query_addr fields */ + if (usize < offsetofend(struct procmap_query, query_addr)) + return -EINVAL; + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); + if (err) + return err; + + /* reject unknown flags */ + if (karg.query_flags & ~PROCMAP_QUERY_VALID_FLAGS_MASK) + return -EINVAL; + /* either both buffer address and size are set, or both should be zero */ + if (!!karg.vma_name_size != !!karg.vma_name_addr) + return -EINVAL; + if (!!karg.build_id_size != !!karg.build_id_addr) + return -EINVAL; + + mm = priv->mm; + if (!mm || !mmget_not_zero(mm)) + return -ESRCH; + + err = query_vma_setup(mm); + if (err) { + mmput(mm); + return err; + } + + vma = query_matching_vma(mm, karg.query_addr, karg.query_flags); + if (IS_ERR(vma)) { + err = PTR_ERR(vma); + vma = NULL; + goto out; + } + + karg.vma_start = vma->vm_start; + karg.vma_end = vma->vm_end; + + karg.vma_flags = 0; + if (vma->vm_flags & VM_READ) + karg.vma_flags |= PROCMAP_QUERY_VMA_READABLE; + if (vma->vm_flags & VM_WRITE) + karg.vma_flags |= PROCMAP_QUERY_VMA_WRITABLE; + if (vma->vm_flags & VM_EXEC) + karg.vma_flags |= PROCMAP_QUERY_VMA_EXECUTABLE; + if (vma->vm_flags & VM_MAYSHARE) + karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED; + + karg.vma_page_size = vma_kernel_pagesize(vma); + + if (vma->vm_file) { + const struct inode *inode = file_user_inode(vma->vm_file); + + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; + karg.dev_major = MAJOR(inode->i_sb->s_dev); + karg.dev_minor = MINOR(inode->i_sb->s_dev); + karg.inode = inode->i_ino; + } else { + karg.vma_offset = 0; + karg.dev_major = 0; + karg.dev_minor = 0; + karg.inode = 0; + } + + if (karg.build_id_size) { + __u32 build_id_sz; + + err = build_id_parse(vma, build_id_buf, &build_id_sz); + if (err) { + karg.build_id_size = 0; + } else { + if (karg.build_id_size < build_id_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.build_id_size = build_id_sz; + } + } + + if (karg.build_id_size) { + __u32 build_id_sz; + + err = build_id_parse(vma, build_id_buf, &build_id_sz); + if (err) { + karg.build_id_size = 0; + } else { + if (karg.build_id_size < build_id_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.build_id_size = build_id_sz; + } + } + + if (karg.vma_name_size) { + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); + const struct path *path; + const char *name_fmt; + size_t name_sz = 0; + + get_vma_name(vma, &path, &name, &name_fmt); + + if (path || name_fmt || name) { + name_buf = kmalloc(name_buf_sz, GFP_KERNEL); + if (!name_buf) { + err = -ENOMEM; + goto out; + } + } + if (path) { + name = d_path(path, name_buf, name_buf_sz); + if (IS_ERR(name)) { + err = PTR_ERR(name); + goto out; + } + name_sz = name_buf + name_buf_sz - name; + } else if (name || name_fmt) { + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); + name = name_buf; + } + if (name_sz > name_buf_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.vma_name_size = name_sz; + } + + /* unlock vma or mmap_lock, and put mm_struct before copying data to user */ + query_vma_teardown(mm, vma); + mmput(mm); + + if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr), + name, karg.vma_name_size)) { + kfree(name_buf); + return -EFAULT; + } + kfree(name_buf); + + if (karg.build_id_size && copy_to_user(u64_to_user_ptr(karg.build_id_addr), + build_id_buf, karg.build_id_size)) + return -EFAULT; + + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) + return -EFAULT; + + return 0; + +out: + query_vma_teardown(mm, vma); + mmput(mm); + kfree(name_buf); + return err; +} + +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct seq_file *seq = file->private_data; + struct proc_maps_private *priv = seq->private; + + switch (cmd) { + case PROCMAP_QUERY: + return do_procmap_query(priv, (void __user *)arg); + default: + return -ENOIOCTLCMD; + } +} + const struct file_operations proc_pid_maps_operations = { .open = pid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = proc_map_release, + .unlocked_ioctl = procfs_procmap_ioctl, + .compat_ioctl = compat_ptr_ioctl, }; /* @@ -442,7 +717,7 @@ static void smaps_page_accumulate(struct mem_size_stats *mss, static void smaps_account(struct mem_size_stats *mss, struct page *page, bool compound, bool young, bool dirty, bool locked, - bool migration) + bool present) { struct folio *folio = page_folio(page); int i, nr = compound ? compound_nr(page) : 1; @@ -471,24 +746,29 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page, * Then accumulate quantities that may depend on sharing, or that may * differ page-by-page. * - * refcount == 1 guarantees the page is mapped exactly once. - * If any subpage of the compound page mapped with PTE it would elevate - * the refcount. + * refcount == 1 for present entries guarantees that the folio is mapped + * exactly once. For large folios this implies that exactly one + * PTE/PMD/... maps (a part of) this folio. * - * The page_mapcount() is called to get a snapshot of the mapcount. - * Without holding the page lock this snapshot can be slightly wrong as - * we cannot always read the mapcount atomically. It is not safe to - * call page_mapcount() even with PTL held if the page is not mapped, - * especially for migration entries. Treat regular migration entries - * as mapcount == 1. + * Treat all non-present entries (where relying on the mapcount and + * refcount doesn't make sense) as "maybe shared, but not sure how + * often". We treat device private entries as being fake-present. + * + * Note that it would not be safe to read the mapcount especially for + * pages referenced by migration entries, even with the PTL held. */ - if ((folio_ref_count(folio) == 1) || migration) { + if (folio_ref_count(folio) == 1 || !present) { smaps_page_accumulate(mss, folio, size, size << PSS_SHIFT, - dirty, locked, true); + dirty, locked, present); return; } + /* + * We obtain a snapshot of the mapcount. Without holding the folio lock + * this snapshot can be slightly wrong as we cannot always read the + * mapcount atomically. + */ for (i = 0; i < nr; i++, page++) { - int mapcount = page_mapcount(page); + int mapcount = folio_precise_page_mapcount(folio, page); unsigned long pss = PAGE_SIZE << PSS_SHIFT; if (mapcount >= 2) pss /= mapcount; @@ -531,13 +811,14 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, struct vm_area_struct *vma = walk->vma; bool locked = !!(vma->vm_flags & VM_LOCKED); struct page *page = NULL; - bool migration = false, young = false, dirty = false; + bool present = false, young = false, dirty = false; pte_t ptent = ptep_get(pte); if (pte_present(ptent)) { page = vm_normal_page(vma, addr, ptent); young = pte_young(ptent); dirty = pte_dirty(ptent); + present = true; } else if (is_swap_pte(ptent)) { swp_entry_t swpent = pte_to_swp_entry(ptent); @@ -555,8 +836,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; } } else if (is_pfn_swap_entry(swpent)) { - if (is_migration_entry(swpent)) - migration = true; + if (is_device_private_entry(swpent)) + present = true; page = pfn_swap_entry_to_page(swpent); } } else { @@ -567,7 +848,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, if (!page) return; - smaps_account(mss, page, false, young, dirty, locked, migration); + smaps_account(mss, page, false, young, dirty, locked, present); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -578,18 +859,17 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, struct vm_area_struct *vma = walk->vma; bool locked = !!(vma->vm_flags & VM_LOCKED); struct page *page = NULL; + bool present = false; struct folio *folio; - bool migration = false; if (pmd_present(*pmd)) { page = vm_normal_page_pmd(vma, addr, *pmd); + present = true; } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) { swp_entry_t entry = pmd_to_swp_entry(*pmd); - if (is_migration_entry(entry)) { - migration = true; + if (is_pfn_swap_entry(entry)) page = pfn_swap_entry_to_page(entry); - } } if (IS_ERR_OR_NULL(page)) return; @@ -604,7 +884,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, mss->file_thp += HPAGE_PMD_SIZE; smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), - locked, migration); + locked, present); } #else static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, @@ -733,19 +1013,23 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, { struct mem_size_stats *mss = walk->private; struct vm_area_struct *vma = walk->vma; - pte_t ptent = huge_ptep_get(pte); + pte_t ptent = huge_ptep_get(walk->mm, addr, pte); struct folio *folio = NULL; + bool present = false; if (pte_present(ptent)) { folio = page_folio(pte_page(ptent)); + present = true; } else if (is_swap_pte(ptent)) { swp_entry_t swpent = pte_to_swp_entry(ptent); if (is_pfn_swap_entry(swpent)) folio = pfn_swap_entry_folio(swpent); } + if (folio) { - if (folio_likely_mapped_shared(folio) || + /* We treat non-present entries as "maybe shared". */ + if (!present || folio_likely_mapped_shared(folio) || hugetlb_pmd_shared(pte)) mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); else @@ -1091,7 +1375,7 @@ struct clear_refs_private { static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { - struct page *page; + struct folio *folio; if (!pte_write(pte)) return false; @@ -1099,10 +1383,10 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, return false; if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))) return false; - page = vm_normal_page(vma, addr, pte); - if (!page) + folio = vm_normal_folio(vma, addr, pte); + if (!folio) return false; - return page_maybe_dma_pinned(page); + return folio_maybe_dma_pinned(folio); } static inline void clear_soft_dirty(struct vm_area_struct *vma, @@ -1418,7 +1702,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, { u64 frame = 0, flags = 0; struct page *page = NULL; - bool migration = false; + struct folio *folio; if (pte_present(pte)) { if (pm->show_pfn) @@ -1450,17 +1734,20 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, (offset << MAX_SWAPFILES_SHIFT); } flags |= PM_SWAP; - migration = is_migration_entry(entry); if (is_pfn_swap_entry(entry)) page = pfn_swap_entry_to_page(entry); if (pte_marker_entry_uffd_wp(entry)) flags |= PM_UFFD_WP; } - if (page && !PageAnon(page)) - flags |= PM_FILE; - if (page && !migration && page_mapcount(page) == 1) - flags |= PM_MMAP_EXCLUSIVE; + if (page) { + folio = page_folio(page); + if (!folio_test_anon(folio)) + flags |= PM_FILE; + if ((flags & PM_PRESENT) && + folio_precise_page_mapcount(folio, page) == 1) + flags |= PM_MMAP_EXCLUSIVE; + } if (vma->vm_flags & VM_SOFTDIRTY) flags |= PM_SOFT_DIRTY; @@ -1476,13 +1763,14 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, pte_t *pte, *orig_pte; int err = 0; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - bool migration = false; ptl = pmd_trans_huge_lock(pmdp, vma); if (ptl) { + unsigned int idx = (addr & ~PMD_MASK) >> PAGE_SHIFT; u64 flags = 0, frame = 0; pmd_t pmd = *pmdp; struct page *page = NULL; + struct folio *folio = NULL; if (vma->vm_flags & VM_SOFTDIRTY) flags |= PM_SOFT_DIRTY; @@ -1496,8 +1784,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pmd_uffd_wp(pmd)) flags |= PM_UFFD_WP; if (pm->show_pfn) - frame = pmd_pfn(pmd) + - ((addr & ~PMD_MASK) >> PAGE_SHIFT); + frame = pmd_pfn(pmd) + idx; } #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION else if (is_swap_pmd(pmd)) { @@ -1506,11 +1793,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pm->show_pfn) { if (is_pfn_swap_entry(entry)) - offset = swp_offset_pfn(entry); + offset = swp_offset_pfn(entry) + idx; else - offset = swp_offset(entry); - offset = offset + - ((addr & ~PMD_MASK) >> PAGE_SHIFT); + offset = swp_offset(entry) + idx; frame = swp_type(entry) | (offset << MAX_SWAPFILES_SHIFT); } @@ -1520,17 +1805,25 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pmd_swp_uffd_wp(pmd)) flags |= PM_UFFD_WP; VM_BUG_ON(!is_pmd_migration_entry(pmd)); - migration = is_migration_entry(entry); page = pfn_swap_entry_to_page(entry); } #endif - if (page && !migration && page_mapcount(page) == 1) - flags |= PM_MMAP_EXCLUSIVE; + if (page) { + folio = page_folio(page); + if (!folio_test_anon(folio)) + flags |= PM_FILE; + } - for (; addr != end; addr += PAGE_SIZE) { - pagemap_entry_t pme = make_pme(frame, flags); + for (; addr != end; addr += PAGE_SIZE, idx++) { + unsigned long cur_flags = flags; + pagemap_entry_t pme; + if (folio && (flags & PM_PRESENT) && + folio_precise_page_mapcount(folio, page + idx) == 1) + cur_flags |= PM_MMAP_EXCLUSIVE; + + pme = make_pme(frame, cur_flags); err = add_to_pagemap(&pme, pm); if (err) break; @@ -1585,7 +1878,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, if (vma->vm_flags & VM_SOFTDIRTY) flags |= PM_SOFT_DIRTY; - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(walk->mm, addr, ptep); if (pte_present(pte)) { struct folio *folio = page_folio(pte_page(pte)); @@ -2274,7 +2567,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask, if (~p->arg.flags & PM_SCAN_WP_MATCHING) { /* Go the short route when not write-protecting pages. */ - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(walk->mm, start, ptep); categories = p->cur_vma_category | pagemap_hugetlb_category(pte); if (!pagemap_scan_is_interesting_page(categories, p)) @@ -2286,7 +2579,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask, i_mmap_lock_write(vma->vm_file->f_mapping); ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep); - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(walk->mm, start, ptep); categories = p->cur_vma_category | pagemap_hugetlb_category(pte); if (!pagemap_scan_is_interesting_page(categories, p)) @@ -2566,7 +2859,7 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty, unsigned long nr_pages) { struct folio *folio = page_folio(page); - int count = page_mapcount(page); + int count = folio_precise_page_mapcount(folio, page); md->pages += nr_pages; if (pte_dirty || folio_test_dirty(folio)) @@ -2682,7 +2975,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { - pte_t huge_pte = huge_ptep_get(pte); + pte_t huge_pte = huge_ptep_get(walk->mm, addr, pte); struct numa_maps *md; struct page *page; diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 17e409ceaa33..27a3e9285fbf 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -257,7 +257,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, goto out; ret = false; - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep); /* * Lockless access: we're in a wait_event so it's ok if it diff --git a/include/acpi/platform/aclinuxex.h b/include/acpi/platform/aclinuxex.h index 62cac266a1c8..eeff40295b4b 100644 --- a/include/acpi/platform/aclinuxex.h +++ b/include/acpi/platform/aclinuxex.h @@ -46,6 +46,9 @@ acpi_status acpi_os_terminate(void); * Interrupts are off during resume, just like they are for boot. * However, boot has (system_state != SYSTEM_RUNNING) * to quiet __might_sleep() in kmalloc() and resume does not. + * + * These specialized allocators have to be macros for their allocations to be + * accounted separately (to have separate alloc_tag). */ #define acpi_os_allocate(_size) \ kmalloc(_size, irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL) @@ -53,14 +56,14 @@ acpi_status acpi_os_terminate(void); #define acpi_os_allocate_zeroed(_size) \ kzalloc(_size, irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL) +#define acpi_os_acquire_object(_cache) \ + kmem_cache_zalloc(_cache, irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL) + static inline void acpi_os_free(void *memory) { kfree(memory); } -#define acpi_os_acquire_object(_cache) \ - kmem_cache_zalloc(_cache, irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL) - static inline acpi_thread_id acpi_os_get_thread_id(void) { return (acpi_thread_id) (unsigned long)current; diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h index 6dcf4d576970..594d5905f615 100644 --- a/include/asm-generic/hugetlb.h +++ b/include/asm-generic/hugetlb.h @@ -144,7 +144,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_HUGE_PTEP_GET -static inline pte_t huge_ptep_get(pte_t *ptep) +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { return ptep_get(ptep); } diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index abd24016a900..8c61ccd161ba 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -122,7 +122,7 @@ static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag "alloc_tag was not cleared (got tag for %s:%u)\n", ref->ct->filename, ref->ct->lineno); - WARN_ONCE(!tag, "current->alloc_tag not set"); + WARN_ONCE(!tag, "current->alloc_tag not set\n"); } static inline void alloc_tag_sub_check(union codetag_ref *ref) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4f1d4a97b9d1..3b94ec161e8c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -275,7 +275,7 @@ struct bpf_map { u32 btf_value_type_id; u32 btf_vmlinux_value_type_id; struct btf *btf; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct obj_cgroup *objcg; #endif char name[BPF_OBJ_NAME_LEN]; @@ -2253,7 +2253,7 @@ struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id); int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array); -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags, int node); void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags); @@ -2262,6 +2262,10 @@ void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align, gfp_t flags); #else +/* + * These specialized allocators have to be macros for their allocations to be + * accounted separately (to have separate alloc_tag). + */ #define bpf_map_kmalloc_node(_map, _size, _flags, _node) \ kmalloc_node(_size, _flags, _node) #define bpf_map_kzalloc(_map, _size, _flags) \ diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index e022e40b099e..14acf1bbe0ce 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -53,7 +53,7 @@ typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); * filesystem and block layers. Nowadays the basic I/O unit * is the bio, and buffer_heads are used for extracting block * mappings (via a get_block_t call), for tracking state within - * a page (via a page_mapping) and for wrapping bio submission + * a folio (via a folio_mapping) and for wrapping bio submission * for backward compatibility reasons (e.g. submit_bh). */ struct buffer_head { diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index b36690ca0d3f..293af7f8a694 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -681,9 +681,7 @@ struct cftype { __poll_t (*poll)(struct kernfs_open_file *of, struct poll_table_struct *pt); -#ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; -#endif }; /* diff --git a/include/linux/damon.h b/include/linux/damon.h index f7da65e1ac04..27c546bfc6d4 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -105,6 +105,8 @@ struct damon_target { * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. + * @DAMOS_MIGRATE_HOT: Migrate the regions prioritizing warmer regions. + * @DAMOS_MIGRATE_COLD: Migrate the regions prioritizing colder regions. * @DAMOS_STAT: Do nothing but count the stat. * @NR_DAMOS_ACTIONS: Total number of DAMOS actions * @@ -122,6 +124,8 @@ enum damos_action { DAMOS_NOHUGEPAGE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, + DAMOS_MIGRATE_HOT, + DAMOS_MIGRATE_COLD, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, }; @@ -374,6 +378,7 @@ struct damos_access_pattern { * @apply_interval_us: The time between applying the @action. * @quota: Control the aggressiveness of this scheme. * @wmarks: Watermarks for automated (in)activation of this scheme. + * @target_nid: Destination node if @action is "migrate_{hot,cold}". * @filters: Additional set of &struct damos_filter for &action. * @stat: Statistics of this scheme. * @list: List head for siblings. @@ -389,6 +394,10 @@ struct damos_access_pattern { * monitoring context are inactive, DAMON stops monitoring either, and just * repeatedly checks the watermarks. * + * @target_nid is used to set the migration target node for migrate_hot or + * migrate_cold actions, which means it's only meaningful when @action is either + * "migrate_hot" or "migrate_cold". + * * Before applying the &action to a memory region, &struct damon_operations * implementation could check pages of the region and skip &action to respect * &filters @@ -410,6 +419,9 @@ struct damos { /* public: */ struct damos_quota quota; struct damos_watermarks wmarks; + union { + int target_nid; + }; struct list_head filters; struct damos_stat stat; struct list_head list; @@ -726,9 +738,11 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks); + struct damos_watermarks *wmarks, + int target_nid); void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); void damon_destroy_scheme(struct damos *s); +int damos_commit_quota_goals(struct damos_quota *dst, struct damos_quota *src); struct damon_target *damon_new_target(void); void damon_add_target(struct damon_ctx *ctx, struct damon_target *t); @@ -742,6 +756,7 @@ void damon_destroy_ctx(struct damon_ctx *ctx); int damon_set_attrs(struct damon_ctx *ctx, struct damon_attrs *attrs); void damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes, ssize_t nr_schemes); +int damon_commit_ctx(struct damon_ctx *old_ctx, struct damon_ctx *new_ctx); int damon_nr_running_ctxs(void); bool damon_is_registered_ops(enum damon_ops_id id); int damon_register_ops(struct damon_operations *ops); diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index ad9e2506c2f4..68c3c1e41014 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -85,6 +85,10 @@ dma_fence_chain_contained(struct dma_fence *fence) * dma_fence_chain_alloc * * Returns a new struct dma_fence_chain object or NULL on failure. + * + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). The typecast is + * intentional to enforce typesafety. */ #define dma_fence_chain_alloc() \ ((struct dma_fence_chain *)kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL)) diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index 6d5edef09d45..354413950d34 100644 --- a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -91,22 +91,19 @@ static inline void fault_config_init(struct fault_config *config, struct kmem_cache; -bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); - #ifdef CONFIG_FAIL_PAGE_ALLOC -bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); +bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); #else -static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { return false; } #endif /* CONFIG_FAIL_PAGE_ALLOC */ -int should_failslab(struct kmem_cache *s, gfp_t gfpflags); #ifdef CONFIG_FAILSLAB -extern bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags); +int should_failslab(struct kmem_cache *s, gfp_t gfpflags); #else -static inline bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) +static inline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) { return false; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 7f9691d375f0..f53f76e0b17e 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -303,6 +303,8 @@ struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order); struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid); struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order); +struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *mpol, pgoff_t ilx, int nid); struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, bool hugepage); #else @@ -319,6 +321,11 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) { return __folio_alloc_node(gfp, order, numa_node_id()); } +static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *mpol, pgoff_t ilx, int nid) +{ + return folio_alloc_noprof(gfp, order); +} #define vma_alloc_folio_noprof(gfp, order, vma, addr, hugepage) \ folio_alloc_noprof(gfp, order) #endif @@ -326,6 +333,7 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) #define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__)) #define alloc_pages_mpol(...) alloc_hooks(alloc_pages_mpol_noprof(__VA_ARGS__)) #define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__)) +#define folio_alloc_mpol(...) alloc_hooks(folio_alloc_mpol_noprof(__VA_ARGS__)) #define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__)) #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) diff --git a/include/linux/hid_bpf.h b/include/linux/hid_bpf.h index 9ca96fc90449..d4d063cf63b5 100644 --- a/include/linux/hid_bpf.h +++ b/include/linux/hid_bpf.h @@ -228,6 +228,11 @@ static inline int hid_bpf_connect_device(struct hid_device *hdev) { return 0; } static inline void hid_bpf_disconnect_device(struct hid_device *hdev) {} static inline void hid_bpf_destroy_device(struct hid_device *hid) {} static inline int hid_bpf_device_init(struct hid_device *hid) { return 0; } +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). The typecast is + * intentional to enforce typesafety. + */ #define call_hid_bpf_rdesc_fixup(_hdev, _rdesc, _size) \ ((u8 *)kmemdup(_rdesc, *(_size), GFP_KERNEL)) diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index a3028e400a9c..dd100e849f5e 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -131,22 +131,17 @@ static inline void __kunmap_atomic(const void *addr) preempt_enable(); } -unsigned int __nr_free_highpages(void); -extern atomic_long_t _totalhigh_pages; +unsigned long __nr_free_highpages(void); +unsigned long __totalhigh_pages(void); -static inline unsigned int nr_free_highpages(void) +static inline unsigned long nr_free_highpages(void) { return __nr_free_highpages(); } static inline unsigned long totalhigh_pages(void) { - return (unsigned long)atomic_long_read(&_totalhigh_pages); -} - -static inline void totalhigh_pages_add(long count) -{ - atomic_long_add(count, &_totalhigh_pages); + return __totalhigh_pages(); } static inline bool is_kmap_addr(const void *x) @@ -239,8 +234,8 @@ static inline void __kunmap_atomic(const void *addr) preempt_enable(); } -static inline unsigned int nr_free_highpages(void) { return 0; } -static inline unsigned long totalhigh_pages(void) { return 0UL; } +static inline unsigned long nr_free_highpages(void) { return 0; } +static inline unsigned long totalhigh_pages(void) { return 0; } static inline bool is_kmap_addr(const void *x) { diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 00341b56d291..930a591b9b61 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -179,7 +179,7 @@ static inline void *kmap_local_folio(struct folio *folio, size_t offset); static inline void *kmap_atomic(struct page *page); /* Highmem related interfaces for management code */ -static inline unsigned int nr_free_highpages(void); +static inline unsigned long nr_free_highpages(void); static inline unsigned long totalhigh_pages(void); #ifndef ARCH_HAS_FLUSH_ANON_PAGE @@ -352,6 +352,9 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from, kunmap_local(vto); kunmap_local(vfrom); + if (ret) + memory_failure_queue(page_to_pfn(from), 0); + return ret; } @@ -368,6 +371,9 @@ static inline int copy_mc_highpage(struct page *to, struct page *from) kunmap_local(vto); kunmap_local(vfrom); + if (ret) + memory_failure_queue(page_to_pfn(from), 0); + return ret; } #else diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2aa986a5cd1b..cff002be83eb 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -6,6 +6,7 @@ #include #include /* only for vma_is_dax() */ +#include vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf, enum transparent_hugepage_flag flag); extern struct kobj_attribute shmem_enabled_attr; +extern struct kobj_attribute thpsize_shmem_enabled_attr; /* * Mask of all large folio orders supported for anonymous THP; all orders up to @@ -126,18 +128,6 @@ static inline bool hugepage_global_always(void) (1<private; \ return test_bit(HPG_##flname, private); \ - } \ -static inline int HPage##uname(struct page *page) \ - { return test_bit(HPG_##flname, &(page->private)); } + } #define SETHPAGEFLAG(uname, flname) \ static __always_inline \ void folio_set_hugetlb_##flname(struct folio *folio) \ { void *private = &folio->private; \ set_bit(HPG_##flname, private); \ - } \ -static inline void SetHPage##uname(struct page *page) \ - { set_bit(HPG_##flname, &(page->private)); } + } #define CLEARHPAGEFLAG(uname, flname) \ static __always_inline \ void folio_clear_hugetlb_##flname(struct folio *folio) \ { void *private = &folio->private; \ clear_bit(HPG_##flname, private); \ - } \ -static inline void ClearHPage##uname(struct page *page) \ - { clear_bit(HPG_##flname, &(page->private)); } + } #else #define TESTHPAGEFLAG(uname, flname) \ static inline bool \ folio_test_hugetlb_##flname(struct folio *folio) \ - { return 0; } \ -static inline int HPage##uname(struct page *page) \ { return 0; } #define SETHPAGEFLAG(uname, flname) \ static inline void \ folio_set_hugetlb_##flname(struct folio *folio) \ - { } \ -static inline void SetHPage##uname(struct page *page) \ { } #define CLEARHPAGEFLAG(uname, flname) \ static inline void \ folio_clear_hugetlb_##flname(struct folio *folio) \ - { } \ -static inline void ClearHPage##uname(struct page *page) \ { } #endif @@ -681,6 +663,7 @@ HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable) /* Defines one hugetlb page size */ struct hstate { struct mutex resize_lock; + struct lock_class_key resize_key; int next_nid_to_alloc; int next_nid_to_free; unsigned int order; @@ -698,11 +681,6 @@ struct hstate { unsigned int nr_huge_pages_node[MAX_NUMNODES]; unsigned int free_huge_pages_node[MAX_NUMNODES]; unsigned int surplus_huge_pages_node[MAX_NUMNODES]; -#ifdef CONFIG_CGROUP_HUGETLB - /* cgroup control files */ - struct cftype cgroup_files_dfl[8]; - struct cftype cgroup_files_legacy[10]; -#endif char name[HSTATE_NAME_LEN]; }; diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index b900c642210c..5157d92b6f23 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -1595,6 +1595,11 @@ void jbd2_journal_put_journal_head(struct journal_head *jh); */ extern struct kmem_cache *jbd2_handle_cache; +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). The typecast is + * intentional to enforce typesafety. + */ #define jbd2_alloc_handle(_gfp_flags) \ ((handle_t *)kmem_cache_zalloc(jbd2_handle_cache, _gfp_flags)) @@ -1609,6 +1614,11 @@ static inline void jbd2_free_handle(handle_t *handle) */ extern struct kmem_cache *jbd2_inode_cache; +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). The typecast is + * intentional to enforce typesafety. + */ #define jbd2_alloc_inode(_gfp_flags) \ ((struct jbd2_inode *)kmem_cache_alloc(jbd2_inode_cache, _gfp_flags)) diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index e0c23a32cdf0..2b1432cc16d5 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -230,6 +230,67 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out); */ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); +/** + * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins. + * @addr: kernel address. + * @is_origin: whether to return origins or shadow. + * + * Return NULL if metadata cannot be found. + */ +void *kmsan_get_metadata(void *addr, bool is_origin); + +/** + * kmsan_enable_current(): Enable KMSAN for the current task. + * + * Each kmsan_enable_current() current call must be preceded by a + * kmsan_disable_current() call. These call pairs may be nested. + */ +void kmsan_enable_current(void); + +/** + * kmsan_disable_current(): Disable KMSAN for the current task. + * + * Each kmsan_disable_current() current call must be followed by a + * kmsan_enable_current() call. These call pairs may be nested. + */ +void kmsan_disable_current(void); + +/** + * memset_no_sanitize_memory(): Fill memory without KMSAN instrumentation. + * @s: address of kernel memory to fill. + * @c: constant byte to fill the memory with. + * @n: number of bytes to fill. + * + * This is like memset(), but without KMSAN instrumentation. + */ +static inline void *memset_no_sanitize_memory(void *s, int c, size_t n) +{ + return __memset(s, c, n); +} + +extern bool kmsan_enabled; +extern int panic_on_kmsan; + +/* + * KMSAN performs a lot of consistency checks that are currently enabled by + * default. BUG_ON is normally discouraged in the kernel, unless used for + * debugging, but KMSAN itself is a debugging tool, so it makes little sense to + * recover if something goes wrong. + */ +#define KMSAN_WARN_ON(cond) \ + ({ \ + const bool __cond = WARN_ON(cond); \ + if (unlikely(__cond)) { \ + WRITE_ONCE(kmsan_enabled, false); \ + if (panic_on_kmsan) { \ + /* Can't call panic() here because */ \ + /* of uaccess checks. */ \ + BUG(); \ + } \ + } \ + __cond; \ + }) + #else static inline void kmsan_init_shadow(void) @@ -329,6 +390,21 @@ static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs) { } +static inline void kmsan_enable_current(void) +{ +} + +static inline void kmsan_disable_current(void) +{ +} + +static inline void *memset_no_sanitize_memory(void *s, int c, size_t n) +{ + return memset(s, c, n); +} + +#define KMSAN_WARN_ON WARN_ON + #endif #endif /* _LINUX_KMSAN_H */ diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h index 929287981afe..dfc59918b3c0 100644 --- a/include/linux/kmsan_types.h +++ b/include/linux/kmsan_types.h @@ -31,7 +31,7 @@ struct kmsan_context_state { struct kmsan_ctx { struct kmsan_context_state cstate; int kmsan_in_runtime; - bool allow_reporting; + unsigned int depth; }; #endif /* _LINUX_KMSAN_TYPES_H */ diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 792b67ceb631..5099a8ccd5f4 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -50,7 +50,7 @@ struct list_lru_node { struct list_lru { struct list_lru_node *node; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct list_head list; int shrinker_id; bool memcg_aware; diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 45cac33334c8..fc4d75c6cec3 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -316,8 +316,6 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone, for (; i != U64_MAX; \ __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end)) -int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask); - #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ /** diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 030d34e9d117..7e2eb091049a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -69,18 +69,6 @@ struct mem_cgroup_id { refcount_t ref; }; -/* - * Per memcg event counter is incremented at every pagein/pageout. With THP, - * it will be incremented by the number of pages. This counter is used - * to trigger some periodic events. This is straightforward and better - * than using jiffies etc. to handle periodic memcg event. - */ -enum mem_cgroup_events_target { - MEM_CGROUP_TARGET_THRESH, - MEM_CGROUP_TARGET_SOFTLIMIT, - MEM_CGROUP_NTARGETS, -}; - struct memcg_vmstats_percpu; struct memcg_vmstats; struct lruvec_stats_percpu; @@ -96,23 +84,33 @@ struct mem_cgroup_reclaim_iter { * per-node information in memory controller. */ struct mem_cgroup_per_node { - struct lruvec lruvec; + /* Keep the read-only fields at the start */ + struct mem_cgroup *memcg; /* Back pointer, we cannot */ + /* use container_of */ struct lruvec_stats_percpu __percpu *lruvec_stats_percpu; struct lruvec_stats *lruvec_stats; - - unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; - - struct mem_cgroup_reclaim_iter iter; - struct shrinker_info __rcu *shrinker_info; +#ifdef CONFIG_MEMCG_V1 + /* + * Memcg-v1 only stuff in middle as buffer between read mostly fields + * and update often fields to avoid false sharing. If v1 stuff is + * not present, an explicit padding is needed. + */ + struct rb_node tree_node; /* RB tree node */ unsigned long usage_in_excess;/* Set to the value by which */ /* the soft limit is exceeded*/ bool on_tree; - struct mem_cgroup *memcg; /* Back pointer, we cannot */ - /* use container_of */ +#else + CACHELINE_PADDING(_pad1_); +#endif + + /* Fields which get updated often at the end. */ + struct lruvec lruvec; + unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; + struct mem_cgroup_reclaim_iter iter; }; struct mem_cgroup_threshold { @@ -194,14 +192,10 @@ struct mem_cgroup { struct page_counter memsw; /* v1 only */ }; - /* Legacy consumer-oriented counters */ - struct page_counter kmem; /* v1 only */ - struct page_counter tcpmem; /* v1 only */ - /* Range enforcement for interrupt charges */ struct work_struct high_work; -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#ifdef CONFIG_ZSWAP unsigned long zswap_max; /* @@ -211,8 +205,6 @@ struct mem_cgroup { bool zswap_writeback; #endif - unsigned long soft_limit; - /* vmpressure notifications */ struct vmpressure vmpressure; @@ -221,13 +213,7 @@ struct mem_cgroup { */ bool oom_group; - /* protected by memcg_oom_lock */ - bool oom_lock; - int under_oom; - - int swappiness; - /* OOM-Killer disable */ - int oom_kill_disable; + int swappiness; /* memory.events and memory.events.local */ struct cgroup_file events_file; @@ -236,6 +222,62 @@ struct mem_cgroup { /* handle for "memory.swap.events" */ struct cgroup_file swap_events_file; + /* memory.stat */ + struct memcg_vmstats *vmstats; + + /* memory.events */ + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; + atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS]; + + /* + * Hint of reclaim pressure for socket memroy management. Note + * that this indicator should NOT be used in legacy cgroup mode + * where socket memory is accounted/charged separately. + */ + unsigned long socket_pressure; + + int kmemcg_id; + /* + * memcg->objcg is wiped out as a part of the objcg repaprenting + * process. memcg->orig_objcg preserves a pointer (and a reference) + * to the original objcg until the end of live of memcg. + */ + struct obj_cgroup __rcu *objcg; + struct obj_cgroup *orig_objcg; + /* list of inherited objcgs, protected by objcg_lock */ + struct list_head objcg_list; + + struct memcg_vmstats_percpu __percpu *vmstats_percpu; + +#ifdef CONFIG_CGROUP_WRITEBACK + struct list_head cgwb_list; + struct wb_domain cgwb_domain; + struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; +#endif + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct deferred_split deferred_split_queue; +#endif + +#ifdef CONFIG_LRU_GEN_WALKS_MMU + /* per-memcg mm_struct list */ + struct lru_gen_mm_list mm_list; +#endif + +#ifdef CONFIG_MEMCG_V1 + /* Legacy consumer-oriented counters */ + struct page_counter kmem; /* v1 only */ + struct page_counter tcpmem; /* v1 only */ + + unsigned long soft_limit; + + /* protected by memcg_oom_lock */ + bool oom_lock; + int under_oom; + + /* OOM-Killer disable */ + int oom_kill_disable; + /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -254,70 +296,23 @@ struct mem_cgroup { */ unsigned long move_charge_at_immigrate; /* taken only while moving_account > 0 */ - spinlock_t move_lock; - unsigned long move_lock_flags; - - CACHELINE_PADDING(_pad1_); - - /* memory.stat */ - struct memcg_vmstats *vmstats; - - /* memory.events */ - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; - atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS]; - - /* - * Hint of reclaim pressure for socket memroy management. Note - * that this indicator should NOT be used in legacy cgroup mode - * where socket memory is accounted/charged separately. - */ - unsigned long socket_pressure; + spinlock_t move_lock; + unsigned long move_lock_flags; /* Legacy tcp memory accounting */ - bool tcpmem_active; - int tcpmem_pressure; - -#ifdef CONFIG_MEMCG_KMEM - int kmemcg_id; - /* - * memcg->objcg is wiped out as a part of the objcg repaprenting - * process. memcg->orig_objcg preserves a pointer (and a reference) - * to the original objcg until the end of live of memcg. - */ - struct obj_cgroup __rcu *objcg; - struct obj_cgroup *orig_objcg; - /* list of inherited objcgs, protected by objcg_lock */ - struct list_head objcg_list; -#endif - - CACHELINE_PADDING(_pad2_); + bool tcpmem_active; + int tcpmem_pressure; /* * set > 0 if pages under this cgroup are moving to other cgroup. */ - atomic_t moving_account; - struct task_struct *move_lock_task; - - struct memcg_vmstats_percpu __percpu *vmstats_percpu; - -#ifdef CONFIG_CGROUP_WRITEBACK - struct list_head cgwb_list; - struct wb_domain cgwb_domain; - struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; -#endif + atomic_t moving_account; + struct task_struct *move_lock_task; /* List of events which userspace want to receive */ struct list_head event_list; spinlock_t event_list_lock; - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - struct deferred_split deferred_split_queue; -#endif - -#ifdef CONFIG_LRU_GEN_WALKS_MMU - /* per-memcg mm_struct list */ - struct lru_gen_mm_list mm_list; -#endif +#endif /* CONFIG_MEMCG_V1 */ struct mem_cgroup_per_node *nodeinfo[]; }; @@ -443,11 +438,6 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio) return __folio_memcg(folio); } -static inline struct mem_cgroup *page_memcg(struct page *page) -{ - return folio_memcg(page_folio(page)); -} - /** * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio. * @folio: Pointer to the folio. @@ -540,7 +530,6 @@ retry: return memcg; } -#ifdef CONFIG_MEMCG_KMEM /* * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set. * @folio: Pointer to the folio. @@ -556,15 +545,6 @@ static inline bool folio_memcg_kmem(struct folio *folio) return folio->memcg_data & MEMCG_DATA_KMEM; } - -#else -static inline bool folio_memcg_kmem(struct folio *folio) -{ - return false; -} - -#endif - static inline bool PageMemcgKmem(struct page *page) { return folio_memcg_kmem(page_folio(page)); @@ -949,51 +929,13 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg); -static inline void mem_cgroup_enter_user_fault(void) -{ - WARN_ON(current->in_user_fault); - current->in_user_fault = 1; -} - -static inline void mem_cgroup_exit_user_fault(void) -{ - WARN_ON(!current->in_user_fault); - current->in_user_fault = 0; -} - -static inline bool task_in_memcg_oom(struct task_struct *p) -{ - return p->memcg_in_oom; -} - -bool mem_cgroup_oom_synchronize(bool wait); struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim, struct mem_cgroup *oom_domain); void mem_cgroup_print_oom_group(struct mem_cgroup *memcg); -void folio_memcg_lock(struct folio *folio); -void folio_memcg_unlock(struct folio *folio); - void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val); -/* try to stablize folio_memcg() for all the pages in a memcg */ -static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -{ - rcu_read_lock(); - - if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) - return true; - - rcu_read_unlock(); - return false; -} - -static inline void mem_cgroup_unlock_pages(void) -{ - rcu_read_unlock(); -} - /* idx can be of type enum memcg_stat_item or node_stat_item */ static inline void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val) @@ -1014,7 +956,7 @@ static inline void mod_memcg_page_state(struct page *page, return; rcu_read_lock(); - memcg = page_memcg(page); + memcg = folio_memcg(page_folio(page)); if (memcg) mod_memcg_state(memcg, idx, val); rcu_read_unlock(); @@ -1120,10 +1062,6 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void split_page_memcg(struct page *head, int old_order, int new_order); -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned); - #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1133,11 +1071,6 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio) return NULL; } -static inline struct mem_cgroup *page_memcg(struct page *page) -{ - return NULL; -} - static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) { WARN_ON_ONCE(!rcu_read_lock_held()); @@ -1439,48 +1372,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { } -static inline void folio_memcg_lock(struct folio *folio) -{ -} - -static inline void folio_memcg_unlock(struct folio *folio) -{ -} - -static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) -{ - /* to match folio_memcg_rcu() */ - rcu_read_lock(); - return true; -} - -static inline void mem_cgroup_unlock_pages(void) -{ - rcu_read_unlock(); -} - static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask) { } -static inline void mem_cgroup_enter_user_fault(void) -{ -} - -static inline void mem_cgroup_exit_user_fault(void) -{ -} - -static inline bool task_in_memcg_oom(struct task_struct *p) -{ - return false; -} - -static inline bool mem_cgroup_oom_synchronize(bool wait) -{ - return false; -} - static inline struct mem_cgroup *mem_cgroup_get_oom_group( struct task_struct *victim, struct mem_cgroup *oom_domain) { @@ -1574,14 +1469,6 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) static inline void split_page_memcg(struct page *head, int old_order, int new_order) { } - -static inline -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - return 0; -} #endif /* CONFIG_MEMCG */ /* @@ -1589,7 +1476,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, * if MEMCG_DATA_OBJEXTS is set. */ struct slabobj_ext { -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct obj_cgroup *objcg; #endif #ifdef CONFIG_MEM_ALLOC_PROFILING @@ -1636,7 +1523,7 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } -/* Test requires a stable page->memcg binding, see page_memcg() */ +/* Test requires a stable folio->memcg binding, see folio_memcg() */ static inline bool folio_matches_lruvec(struct folio *folio, struct lruvec *lruvec) { @@ -1734,8 +1621,10 @@ void mem_cgroup_sk_alloc(struct sock *sk); void mem_cgroup_sk_free(struct sock *sk); static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) { +#ifdef CONFIG_MEMCG_V1 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return !!memcg->tcpmem_pressure; +#endif /* CONFIG_MEMCG_V1 */ do { if (time_before(jiffies, READ_ONCE(memcg->socket_pressure))) return true; @@ -1762,7 +1651,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg, } #endif -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG bool mem_cgroup_kmem_disabled(void); int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order); void __memcg_kmem_uncharge_page(struct page *page, int order); @@ -1905,9 +1794,9 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, { } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */ -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP) bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size); @@ -1932,4 +1821,100 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) } #endif + +/* Cgroup v1-related declarations */ + +#ifdef CONFIG_MEMCG_V1 +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned); + +bool mem_cgroup_oom_synchronize(bool wait); + +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return p->memcg_in_oom; +} + +void folio_memcg_lock(struct folio *folio); +void folio_memcg_unlock(struct folio *folio); + +/* try to stablize folio_memcg() for all the pages in a memcg */ +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + rcu_read_lock(); + + if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account)) + return true; + + rcu_read_unlock(); + return false; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + +static inline void mem_cgroup_enter_user_fault(void) +{ + WARN_ON(current->in_user_fault); + current->in_user_fault = 1; +} + +static inline void mem_cgroup_exit_user_fault(void) +{ + WARN_ON(!current->in_user_fault); + current->in_user_fault = 0; +} + +#else /* CONFIG_MEMCG_V1 */ +static inline +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned) +{ + return 0; +} + +static inline void folio_memcg_lock(struct folio *folio) +{ +} + +static inline void folio_memcg_unlock(struct folio *folio) +{ +} + +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg) +{ + /* to match folio_memcg_rcu() */ + rcu_read_lock(); + return true; +} + +static inline void mem_cgroup_unlock_pages(void) +{ + rcu_read_unlock(); +} + +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return false; +} + +static inline bool mem_cgroup_oom_synchronize(bool wait) +{ + return false; +} + +static inline void mem_cgroup_enter_user_fault(void) +{ +} + +static inline void mem_cgroup_exit_user_fault(void) +{ +} + +#endif /* CONFIG_MEMCG_V1 */ + #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..3f2cf339ceaf 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 0d70788558f4..0dc0cf2863e2 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -38,6 +38,7 @@ struct access_coordinate; #ifdef CONFIG_NUMA extern bool numa_demotion_enabled; extern struct memory_dev_type *default_dram_type; +extern nodemask_t default_dram_nodes; struct memory_dev_type *alloc_memory_type(int adistance); void put_memory_type(struct memory_dev_type *memtype); void init_node_memory_type(int node, struct memory_dev_type *default_type); @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node) #define numa_demotion_enabled false #define default_dram_type NULL +#define default_dram_nodes NODE_MASK_NONE /* * CONFIG_NUMA implementation returns non NULL error. */ diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7a9ff464608d..ebe876930e78 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -175,8 +175,8 @@ extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages); extern int online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group); -extern void __offline_isolated_pages(unsigned long start_pfn, - unsigned long end_pfn); +extern unsigned long __offline_isolated_pages(unsigned long start_pfn, + unsigned long end_pfn); typedef void (*online_page_callback_t)(struct page *page, unsigned int order); diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 2ce13e8a309b..644be30b69c8 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -63,8 +63,6 @@ extern const char *migrate_reason_names[MR_TYPES]; #ifdef CONFIG_MIGRATION void putback_movable_pages(struct list_head *l); -int migrate_folio_extra(struct address_space *mapping, struct folio *dst, - struct folio *src, enum migrate_mode mode, int extra_count); int migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode); int migrate_pages(struct list_head *l, new_folio_t new, free_folio_t free, @@ -78,7 +76,6 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl) __releases(ptl); void folio_migrate_flags(struct folio *newfolio, struct folio *folio); -void folio_migrate_copy(struct folio *newfolio, struct folio *folio); int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count); @@ -142,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page) } #ifdef CONFIG_NUMA_BALANCING +int migrate_misplaced_folio_prepare(struct folio *folio, + struct vm_area_struct *vma, int node); int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int node); #else +static inline int migrate_misplaced_folio_prepare(struct folio *folio, + struct vm_area_struct *vma, int node) +{ + return -EAGAIN; /* can't migrate now */ +} static inline int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int node) { diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index f37cc03f9369..265c4328b36a 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -7,16 +7,11 @@ * on most operations but not ->writepage as the potential stall time * is too significant * MIGRATE_SYNC will block when migrating pages - * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages - * with the CPU. Instead, page copy happens outside the migratepage() - * callback and is likely using a DMA engine. See migrate_vma() and HMM - * (mm/hmm.c) for users of this mode. */ enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, - MIGRATE_SYNC_NO_COPY, }; enum migrate_reason { @@ -29,6 +24,7 @@ enum migrate_reason { MR_CONTIG_RANGE, MR_LONGTERM_PIN, MR_DEMOTION, + MR_DAMON, MR_TYPES }; diff --git a/include/linux/mm.h b/include/linux/mm.h index ab3d78116043..7d044e737dba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1202,8 +1202,7 @@ static inline int is_vmalloc_or_module_addr(const void *x) /* * How many times the entire folio is mapped as a single unit (eg by a * PMD or PUD entry). This is probably not what you want, except for - * debugging purposes - it does not include PTE-mapped sub-pages; look - * at folio_mapcount() or page_mapcount() instead. + * debugging purposes or implementation of other core folio_*() primitives. */ static inline int folio_entire_mapcount(const struct folio *folio) { @@ -1211,40 +1210,6 @@ static inline int folio_entire_mapcount(const struct folio *folio) return atomic_read(&folio->_entire_mapcount) + 1; } -/* - * The atomic page->_mapcount, starts from -1: so that transitions - * both from it and to it can be tracked, using atomic_inc_and_test - * and atomic_add_negative(-1). - */ -static inline void page_mapcount_reset(struct page *page) -{ - atomic_set(&(page)->_mapcount, -1); -} - -/** - * page_mapcount() - Number of times this precise page is mapped. - * @page: The page. - * - * The number of times this page is mapped. If this page is part of - * a large folio, it includes the number of times this page is mapped - * as part of that folio. - * - * Will report 0 for pages which cannot be mapped into userspace, eg - * slab, page tables and similar. - */ -static inline int page_mapcount(struct page *page) -{ - int mapcount = atomic_read(&page->_mapcount) + 1; - - /* Handle page_has_type() pages */ - if (mapcount < PAGE_MAPCOUNT_RESERVE + 1) - mapcount = 0; - if (unlikely(PageCompound(page))) - mapcount += folio_entire_mapcount(page_folio(page)); - - return mapcount; -} - static inline int folio_large_mapcount(const struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_large(folio), folio); @@ -1326,6 +1291,7 @@ void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); void folio_copy(struct folio *dst, struct folio *src); +int folio_mc_copy(struct folio *dst, struct folio *src); unsigned long nr_free_buffer_pages(void); @@ -1612,17 +1578,19 @@ static inline void put_page(struct page *page) * issue. * * Locking: the lockless algorithm described in folio_try_get_rcu() - * provides safe operation for get_user_pages(), page_mkclean() and + * provides safe operation for get_user_pages(), folio_mkclean() and * other calls that race to set up page table entries. */ #define GUP_PIN_COUNTING_BIAS (1U << 10) void unpin_user_page(struct page *page); +void unpin_folio(struct folio *folio); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_folios(struct folio **folios, unsigned long nfolios); static inline bool is_cow_mapping(vm_flags_t flags) { @@ -1953,8 +1921,8 @@ static inline struct folio *pfn_folio(unsigned long pfn) * * For more information, please see Documentation/core-api/pin_user_pages.rst. * - * Return: True, if it is likely that the page has been "dma-pinned". - * False, if the page is definitely not dma-pinned. + * Return: True, if it is likely that the folio has been "dma-pinned". + * False, if the folio is definitely not dma-pinned. */ static inline bool folio_maybe_dma_pinned(struct folio *folio) { @@ -1973,11 +1941,6 @@ static inline bool folio_maybe_dma_pinned(struct folio *folio) GUP_PIN_COUNTING_BIAS; } -static inline bool page_maybe_dma_pinned(struct page *page) -{ - return folio_maybe_dma_pinned(page_folio(page)); -} - /* * This should most likely only be called during fork() to see whether we * should break the cow immediately for an anon page on the src mm. @@ -2295,19 +2258,6 @@ static inline void *folio_address(const struct folio *folio) return page_address(&folio->page); } -extern pgoff_t __page_file_index(struct page *page); - -/* - * Return the pagecache index of the passed page. Regular pagecache pages - * use ->index whereas swapcache pages use swp_offset(->private) - */ -static inline pgoff_t page_index(struct page *page) -{ - if (unlikely(PageSwapCache(page))) - return __page_file_index(page); - return page->index; -} - /* * Return true only if the page has been allocated with * ALLOC_NO_WATERMARKS and the low watermark was not @@ -2550,6 +2500,9 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, + struct folio **folios, unsigned int max_folios, + pgoff_t *offset); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); @@ -4038,7 +3991,6 @@ extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, bool *migratable_cleared); void num_poisoned_pages_inc(unsigned long pfn); void num_poisoned_pages_sub(unsigned long pfn, long i); -struct task_struct *task_early_kill(struct task_struct *tsk, int force_early); #else static inline void memory_failure_queue(unsigned long pfn, int flags) { @@ -4059,12 +4011,6 @@ static inline void num_poisoned_pages_sub(unsigned long pfn, long i) } #endif -#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_KSM) -void add_to_kill_ksm(struct task_struct *tsk, struct page *p, - struct vm_area_struct *vma, struct list_head *to_kill, - unsigned long ksm_addr); -#endif - #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) extern void memblk_nr_poison_inc(unsigned long pfn); extern void memblk_nr_poison_sub(unsigned long pfn, long i); @@ -4105,10 +4051,10 @@ enum mf_result { enum mf_action_page_type { MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER, - MF_MSG_SLAB, MF_MSG_DIFFERENT_COMPOUND, MF_MSG_HUGE, MF_MSG_FREE_HUGE, + MF_MSG_GET_HWPOISON, MF_MSG_UNMAP_FAILED, MF_MSG_DIRTY_SWAPCACHE, MF_MSG_CLEAN_SWAPCACHE, @@ -4122,13 +4068,12 @@ enum mf_action_page_type { MF_MSG_BUDDY, MF_MSG_DAX, MF_MSG_UNSPLIT_THP, + MF_MSG_ALREADY_POISONED, MF_MSG_UNKNOWN, }; #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) -extern void clear_huge_page(struct page *page, - unsigned long addr_hint, - unsigned int pages_per_huge_page); +void folio_zero_user(struct folio *folio, unsigned long addr_hint); int copy_user_large_folio(struct folio *dst, struct folio *src, unsigned long addr_hint, struct vm_area_struct *vma); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index a199c48bc462..485424979254 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -46,9 +46,7 @@ struct mem_cgroup; * which is guaranteed to be aligned. If you use the same storage as * page->mapping, you must restore it to NULL before freeing the page. * - * If your page will not be mapped to userspace, you can also use the four - * bytes in the mapcount union, but you must call page_mapcount_reset() - * before freeing it. + * The mapcount field must not be used for own purposes. * * If you want to use the refcount field, it must be used in such a way * that other CPUs temporarily incrementing and then decrementing the @@ -152,18 +150,31 @@ struct page { union { /* This union is 4 bytes in size. */ /* - * If the page can be mapped to userspace, encodes the number - * of times this page is referenced by a page table. - */ - atomic_t _mapcount; - - /* - * If the page is neither PageSlab nor mappable to userspace, - * the value stored here may help determine what this page - * is used for. See page-flags.h for a list of page types - * which are currently stored here. + * For head pages of typed folios, the value stored here + * allows for determining what this page is used for. The + * tail pages of typed folios will not store a type + * (page_type == _mapcount == -1). + * + * See page-flags.h for a list of page types which are currently + * stored here. + * + * Owners of typed folios may reuse the lower 16 bit of the + * head page page_type field after setting the page type, + * but must reset these 16 bit to -1 before clearing the + * page type. */ unsigned int page_type; + + /* + * For pages that are part of non-typed folios for which mappings + * are tracked via the RMAP, encodes the number of times this page + * is directly referenced by a page table. + * + * Note that the mapcount is always initialized to -1, so that + * transitions both from it and to it can be tracked, using + * atomic_inc_and_test() and atomic_add_negative(-1). + */ + atomic_t _mapcount; }; /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1dc6248feb83..41458892bc8a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -220,6 +220,8 @@ enum node_stat_item { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, + NR_MEMMAP, /* page metadata allocated through buddy allocator */ + NR_MEMMAP_BOOT, /* page metadata allocated through boot allocator */ NR_VM_NODE_STAT_ITEMS }; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b9e914e1face..5769fe6e4950 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -30,16 +30,11 @@ * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying * to read/write these pages might end badly. Don't touch! * - The zero page(s) - * - Pages not added to the page allocator when onlining a section because - * they were excluded via the online_page_callback() or because they are - * PG_hwpoison. * - Pages allocated in the context of kexec/kdump (loaded kernel image, * control pages, vmcoreinfo) * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are * not marked PG_reserved (as they might be in use by somebody else who does * not respect the caching strategy). - * - Pages part of an offline section (struct pages of offline sections should - * not be trusted as they will be initialized when first onlined). * - MCA pages on ia64 * - Pages holding CPU notes for POWER Firmware Assisted Dump * - Device memory (e.g. PMEM, DAX, HMM) @@ -616,11 +611,6 @@ PAGEFLAG_FALSE(Uncached, uncached) PAGEFLAG(HWPoison, hwpoison, PF_ANY) TESTSCFLAG(HWPoison, hwpoison, PF_ANY) #define __PG_HWPOISON (1UL << PG_hwpoison) -#define MAGIC_HWPOISON 0x48575053U /* HWPS */ -extern void SetPageHWPoisonTakenOff(struct page *page); -extern void ClearPageHWPoisonTakenOff(struct page *page); -extern bool take_page_off_buddy(struct page *page); -extern bool put_page_back_buddy(struct page *page); #else PAGEFLAG_FALSE(HWPoison, hwpoison) #define __PG_HWPOISON 0 @@ -655,27 +645,28 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) #endif /* - * On an anonymous page mapped into a user virtual memory area, - * page->mapping points to its anon_vma, not to a struct address_space; + * On an anonymous folio mapped into a user virtual memory area, + * folio->mapping points to its anon_vma, not to a struct address_space; * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. * * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON - * bit; and then page->mapping points, not to an anon_vma, but to a private + * bit; and then folio->mapping points, not to an anon_vma, but to a private * structure which KSM associates with that merged page. See ksm.h. * * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable - * page and then page->mapping points to a struct movable_operations. + * page and then folio->mapping points to a struct movable_operations. * - * Please note that, confusingly, "page_mapping" refers to the inode - * address_space which maps the page from disk; whereas "page_mapped" - * refers to user virtual address space into which the page is mapped. + * Please note that, confusingly, "folio_mapping" refers to the inode + * address_space which maps the folio from disk; whereas "folio_mapped" + * refers to user virtual address space into which the folio is mapped. * * For slab pages, since slab reuses the bits in struct page to store its - * internal states, the page->mapping does not exist as such, nor do these - * flags below. So in order to avoid testing non-existent bits, please - * make sure that PageSlab(page) actually evaluates to false before calling - * the following functions (e.g., PageAnon). See mm/slab.h. + * internal states, the folio->mapping does not exist as such, nor do + * these flags below. So in order to avoid testing non-existent bits, + * please make sure that folio_test_slab(folio) actually evaluates to + * false before calling the following functions (e.g., folio_test_anon). + * See mm/slab.h. */ #define PAGE_MAPPING_ANON 0x1 #define PAGE_MAPPING_MOVABLE 0x2 @@ -945,22 +936,28 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned) */ enum pagetype { - PG_buddy = 0x00000080, - PG_offline = 0x00000100, - PG_table = 0x00000200, - PG_guard = 0x00000400, - PG_hugetlb = 0x00000800, - PG_slab = 0x00001000, + PG_buddy = 0x40000000, + PG_offline = 0x20000000, + PG_table = 0x10000000, + PG_guard = 0x08000000, + PG_hugetlb = 0x04000000, + PG_slab = 0x02000000, + PG_zsmalloc = 0x01000000, - PAGE_TYPE_BASE = 0xf0000000, - /* Reserve 0x0000007f to catch underflows of _mapcount */ - PAGE_MAPCOUNT_RESERVE = -128, + PAGE_TYPE_BASE = 0x80000000, + + /* + * Reserve 0xffff0000 - 0xfffffffe to catch _mapcount underflows and + * allow owners that set a type to reuse the lower 16 bit for their own + * purposes. + */ + PAGE_MAPCOUNT_RESERVE = ~0x0000ffff, }; #define PageType(page, flag) \ - ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) + ((READ_ONCE(page->page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) #define folio_test_type(folio, flag) \ - ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) + ((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) static inline int page_type_has_type(unsigned int page_type) { @@ -969,7 +966,7 @@ static inline int page_type_has_type(unsigned int page_type) static inline int page_has_type(const struct page *page) { - return page_type_has_type(page->page_type); + return page_type_has_type(READ_ONCE(page->page_type)); } #define FOLIO_TYPE_OPS(lname, fname) \ @@ -1018,15 +1015,22 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy) * The content of these pages is effectively stale. Such pages should not * be touched (read/write/dump/save) except by their owner. * + * When a memory block gets onlined, all pages are initialized with a + * refcount of 1 and PageOffline(). generic_online_page() will + * take care of clearing PageOffline(). + * * If a driver wants to allow to offline unmovable PageOffline() pages without * putting them back to the buddy, it can do so via the memory notifier by * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline() - * pages (now with a reference count of zero) are treated like free pages, - * allowing the containing memory block to get offlined. A driver that + * pages (now with a reference count of zero) are treated like free (unmanaged) + * pages, allowing the containing memory block to get offlined. A driver that * relies on this feature is aware that re-onlining the memory block will - * require to re-set the pages PageOffline() and not giving them to the - * buddy via online_page_callback_t. + * require not giving them to the buddy via generic_online_page(). + * + * Memory offlining code will not adjust the managed page count for any + * PageOffline() pages, treating them like they were never exposed to the + * buddy using generic_online_page(). * * There are drivers that mark a page PageOffline() and expect there won't be * any further access to page content. PFN walkers that read content of random @@ -1070,6 +1074,8 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb) FOLIO_TEST_FLAG_FALSE(hugetlb) #endif +PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc) + /** * PageHuge - Determine if the page belongs to hugetlbfs * @page: The page to test. diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 8cd858d912c4..904c52f97284 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -81,4 +81,8 @@ static inline void page_counter_reset_watermark(struct page_counter *counter) counter->watermark = page_counter_read(counter); } +void page_counter_calculate_protection(struct page_counter *root, + struct page_counter *counter, + bool recursive_protection); + #endif /* _LINUX_PAGE_COUNTER_H */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 2d72bd89bf7b..483a191bb4df 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -434,7 +434,6 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping) #endif } -struct address_space *page_mapping(struct page *); struct address_space *folio_mapping(struct folio *); struct address_space *swapcache_mapping(struct folio *); @@ -800,7 +799,7 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping, mapping_gfp_mask(mapping)); } -#define swapcache_index(folio) __page_file_index(&(folio)->page) +extern pgoff_t __folio_swap_cache_index(struct folio *folio); /** * folio_index - File index of a folio. @@ -815,9 +814,9 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping, */ static inline pgoff_t folio_index(struct folio *folio) { - if (unlikely(folio_test_swapcache(folio))) - return swapcache_index(folio); - return folio->index; + if (unlikely(folio_test_swapcache(folio))) + return __folio_swap_cache_index(folio); + return folio->index; } /** @@ -940,11 +939,6 @@ static inline loff_t page_offset(struct page *page) return ((loff_t)page->index) << PAGE_SHIFT; } -static inline loff_t page_file_offset(struct page *page) -{ - return ((loff_t)page_index(page)) << PAGE_SHIFT; -} - /** * folio_pos - Returns the byte position of this folio in its file. * @folio: The folio. @@ -954,18 +948,6 @@ static inline loff_t folio_pos(struct folio *folio) return page_offset(&folio->page); } -/** - * folio_file_pos - Returns the byte position of this folio in its file. - * @folio: The folio. - * - * This differs from folio_pos() for folios which belong to a swap file. - * NFS is the only filesystem today which needs to use folio_file_pos(). - */ -static inline loff_t folio_file_pos(struct folio *folio) -{ - return page_file_offset(&folio->page); -} - /* * Get the offset in PAGE_SIZE (even for hugetlb folios). */ @@ -1319,8 +1301,7 @@ void page_cache_sync_readahead(struct address_space *mapping, * @mapping: address_space which holds the pagecache and I/O vectors * @ra: file_ra_state which holds the readahead state * @file: Used by the filesystem for authentication. - * @folio: The folio at @index which triggered the readahead call. - * @index: Index of first page to be read. + * @folio: The folio which triggered the readahead call. * @req_count: Total number of pages being read by the caller. * * page_cache_async_readahead() should be called when a page is used which @@ -1331,9 +1312,9 @@ void page_cache_sync_readahead(struct address_space *mapping, static inline void page_cache_async_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *file, - struct folio *folio, pgoff_t index, unsigned long req_count) + struct folio *folio, unsigned long req_count) { - DEFINE_READAHEAD(ractl, file, ra, mapping, index); + DEFINE_READAHEAD(ractl, file, ra, mapping, folio->index); page_cache_async_ra(&ractl, folio, req_count); } diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index ec3573119923..8efce7414fad 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -475,6 +475,12 @@ do { \ raw_cpu_cmpxchg(pcp, oval, nval); \ }) +#define __this_cpu_try_cmpxchg(pcp, ovalp, nval) \ +({ \ + __this_cpu_preempt_check("try_cmpxchg"); \ + raw_cpu_try_cmpxchg(pcp, ovalp, nval); \ +}) + #define __this_cpu_sub(pcp, val) __this_cpu_add(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc(pcp) __this_cpu_add(pcp, 1) #define __this_cpu_dec(pcp) __this_cpu_sub(pcp, 1) diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 9cacadbd61f8..18cd0c0c73d9 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -15,7 +15,7 @@ extern struct page_ext_operations page_alloc_tagging_ops; static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext) { - return (void *)page_ext + page_alloc_tagging_ops.offset; + return (union codetag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops); } static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref) @@ -71,6 +71,7 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) static inline void pgalloc_tag_split(struct page *page, unsigned int nr) { int i; + struct page_ext *first_page_ext; struct page_ext *page_ext; union codetag_ref *ref; struct alloc_tag *tag; @@ -78,7 +79,7 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr) if (!mem_alloc_profiling_enabled()) return; - page_ext = page_ext_get(page); + first_page_ext = page_ext = page_ext_get(page); if (unlikely(!page_ext)) return; @@ -94,7 +95,7 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr) page_ext = page_ext_next(page_ext); } out: - page_ext_put(page_ext); + page_ext_put(first_page_ext); } static inline struct alloc_tag *pgalloc_tag_get(struct page *page) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 18019f037bae..2a6a3cccfc36 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -729,13 +729,18 @@ static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr, * fault. This function updates TLB only, do nothing with cache or others. * It is the difference with function update_mmu_cache. */ -#ifndef __HAVE_ARCH_UPDATE_MMU_TLB +#ifndef update_mmu_tlb_range +static inline void update_mmu_tlb_range(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, unsigned int nr) +{ +} +#endif + static inline void update_mmu_tlb(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { + update_mmu_tlb_range(vma, address, ptep, 1); } -#define __HAVE_ARCH_UPDATE_MMU_TLB -#endif /* * Some architectures may be able to avoid expensive synchronization @@ -1084,6 +1089,15 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b) }) #ifndef __HAVE_ARCH_DO_SWAP_PAGE +static inline void arch_do_swap_page_nr(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long addr, + pte_t pte, pte_t oldpte, + int nr) +{ + +} +#else /* * Some architectures support metadata associated with a page. When a * page is being swapped out, this metadata must be saved so it can be @@ -1092,12 +1106,17 @@ static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b) * page as metadata for the page. arch_do_swap_page() can restore this * metadata when a page is swapped back in. */ -static inline void arch_do_swap_page(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long addr, - pte_t pte, pte_t oldpte) +static inline void arch_do_swap_page_nr(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long addr, + pte_t pte, pte_t oldpte, + int nr) { - + for (int i = 0; i < nr; i++) { + arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE, + pte_advance_pfn(pte, i), + pte_advance_pfn(oldpte, i)); + } } #endif @@ -1888,9 +1907,12 @@ typedef unsigned int pgtbl_mod_mask; #ifndef pmd_leaf_size #define pmd_leaf_size(x) PMD_SIZE #endif +#ifndef __pte_leaf_size #ifndef pte_leaf_size #define pte_leaf_size(x) PAGE_SIZE #endif +#define __pte_leaf_size(x,y) pte_leaf_size(y) +#endif /* * We always define pmd_pfn for all archs as it's used in lots of generic diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 7229b9baf20d..0978c64f49d8 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -200,6 +200,9 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio, /* hugetlb folios are handled separately. */ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + /* When (un)mapping zeropages, we should never touch ref+mapcount. */ + VM_WARN_ON_FOLIO(is_zero_folio(folio), folio); + /* * TODO: we get driver-allocated folios that have nothing to do with * the rmap using vm_insert_page(); therefore, we cannot assume that @@ -241,7 +244,7 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages, void folio_add_anon_rmap_pmd(struct folio *, struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *, - unsigned long address); + unsigned long address, rmap_t flags); void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages, struct vm_area_struct *); #define folio_add_file_rmap_pte(folio, page, vma) \ @@ -681,16 +684,6 @@ struct page_vma_mapped_walk { unsigned int flags; }; -#define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \ - struct page_vma_mapped_walk name = { \ - .pfn = page_to_pfn(_page), \ - .nr_pages = compound_nr(_page), \ - .pgoff = page_to_pgoff(_page), \ - .vma = _vma, \ - .address = _address, \ - .flags = _flags, \ - } - #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \ struct page_vma_mapped_walk name = { \ .pfn = folio_pfn(_folio), \ @@ -710,6 +703,30 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) spin_unlock(pvmw->ptl); } +/** + * page_vma_mapped_walk_restart - Restart the page table walk. + * @pvmw: Pointer to struct page_vma_mapped_walk. + * + * It restarts the page table walk when changes occur in the page + * table, such as splitting a PMD. Ensures that the PTL held during + * the previous walk is released and resets the state to allow for + * a new walk starting at the current address stored in pvmw->address. + */ +static inline void +page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw) +{ + WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte); + + if (likely(pvmw->ptl)) + spin_unlock(pvmw->ptl); + else + WARN_ON_ONCE(1); + + pvmw->ptl = NULL; + pvmw->pmd = NULL; + pvmw->pte = NULL; +} + bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); /* @@ -730,8 +747,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); -unsigned long page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); - /* * rmap_walk_control: To control rmap traversing for specific needs * @@ -787,8 +802,4 @@ static inline int folio_mkclean(struct folio *folio) } #endif /* CONFIG_MMU */ -static inline int page_mkclean(struct page *page) -{ - return folio_mkclean(page_folio(page)); -} #endif /* _LINUX_RMAP_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index e330ee0205c0..a898d9bde8ca 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -942,7 +942,7 @@ struct task_struct { #ifndef TIF_RESTORE_SIGMASK unsigned restore_sigmask:1; #endif -#ifdef CONFIG_MEMCG +#ifdef CONFIG_MEMCG_V1 unsigned in_user_fault:1; #endif #ifdef CONFIG_LRU_GEN @@ -1458,17 +1458,18 @@ struct task_struct { unsigned int kcov_softirq; #endif -#ifdef CONFIG_MEMCG +#ifdef CONFIG_MEMCG_V1 struct mem_cgroup *memcg_in_oom; +#endif +#ifdef CONFIG_MEMCG /* Number of pages to reclaim on returning to userland: */ unsigned int memcg_nr_pages_over_high; /* Used by memcontrol for targeted memcg charge: */ struct mem_cgroup *active_memcg; -#endif -#ifdef CONFIG_MEMCG_KMEM + /* Cache for current->cgroups->memcg->objcg lookups: */ struct obj_cgroup *objcg; #endif diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 3fb18f7eb73e..1d06b1e5408a 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -113,12 +113,21 @@ int shmem_unuse(unsigned int type); #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, struct mm_struct *mm, unsigned long vm_flags); +unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge); #else static __always_inline bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, struct mm_struct *mm, unsigned long vm_flags) { return false; } +static inline unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + return 0; +} #endif #ifdef CONFIG_SHMEM diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9c29bdd5596d..29c3ea5b6e93 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3429,6 +3429,10 @@ static inline struct page *__dev_alloc_pages_noprof(gfp_t gfp_mask, } #define __dev_alloc_pages(...) alloc_hooks(__dev_alloc_pages_noprof(__VA_ARGS__)) +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). + */ #define dev_alloc_pages(_order) __dev_alloc_pages(GFP_ATOMIC | __GFP_NOWARN, _order) /** @@ -3445,6 +3449,10 @@ static inline struct page *__dev_alloc_page_noprof(gfp_t gfp_mask) } #define __dev_alloc_page(...) alloc_hooks(__dev_alloc_page_noprof(__VA_ARGS__)) +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). + */ #define dev_alloc_page() dev_alloc_pages(0) /** diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index c9efda9df285..d9b03e0746e7 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -414,6 +414,11 @@ void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock); int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, struct sk_msg *msg); +/* + * This specialized allocator has to be a macro for its allocations to be + * accounted separately (to have a separate alloc_tag). The typecast is + * intentional to enforce typesafety. + */ #define sk_psock_init_link() \ ((struct sk_psock_link *)kzalloc(sizeof(struct sk_psock_link), \ GFP_ATOMIC | __GFP_NOWARN)) diff --git a/include/linux/slab.h b/include/linux/slab.h index d99afce36098..eb2bf4629157 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -41,7 +41,7 @@ enum _slab_flag_bits { #ifdef CONFIG_FAILSLAB _SLAB_FAILSLAB, #endif -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG _SLAB_ACCOUNT, #endif #ifdef CONFIG_KASAN_GENERIC @@ -171,7 +171,7 @@ enum _slab_flag_bits { # define SLAB_FAILSLAB __SLAB_FLAG_UNUSED #endif /* Account to memcg */ -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG # define SLAB_ACCOUNT __SLAB_FLAG_BIT(_SLAB_ACCOUNT) #else # define SLAB_ACCOUNT __SLAB_FLAG_UNUSED @@ -407,7 +407,7 @@ enum kmalloc_cache_type { #ifndef CONFIG_ZONE_DMA KMALLOC_DMA = KMALLOC_NORMAL, #endif -#ifndef CONFIG_MEMCG_KMEM +#ifndef CONFIG_MEMCG KMALLOC_CGROUP = KMALLOC_NORMAL, #endif KMALLOC_RANDOM_START = KMALLOC_NORMAL, @@ -420,7 +420,7 @@ enum kmalloc_cache_type { #ifdef CONFIG_ZONE_DMA KMALLOC_DMA, #endif -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG KMALLOC_CGROUP, #endif NR_KMALLOC_TYPES @@ -436,7 +436,7 @@ extern kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES]; #define KMALLOC_NOT_NORMAL_BITS \ (__GFP_RECLAIMABLE | \ (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \ - (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0)) + (IS_ENABLED(CONFIG_MEMCG) ? __GFP_ACCOUNT : 0)) extern unsigned long random_kmalloc_seed; @@ -464,7 +464,7 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigne */ if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA)) return KMALLOC_DMA; - if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE)) + if (!IS_ENABLED(CONFIG_MEMCG) || (flags & __GFP_RECLAIMABLE)) return KMALLOC_RECLAIM; else return KMALLOC_CGROUP; diff --git a/include/linux/swap.h b/include/linux/swap.h index e685e93ba354..ba7ea95d1c57 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -405,10 +405,13 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, #define MEMCG_RECLAIM_MAY_SWAP (1 << 1) #define MEMCG_RECLAIM_PROACTIVE (1 << 2) +#define MIN_SWAPPINESS 0 +#define MAX_SWAPPINESS 200 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options); + unsigned int reclaim_options, + int *swappiness); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, @@ -478,7 +481,7 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t); -extern void swap_free(swp_entry_t); +extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); @@ -556,7 +559,7 @@ static inline int swapcache_prepare(swp_entry_t swp) return 0; } -static inline void swap_free(swp_entry_t swp) +static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } @@ -604,6 +607,11 @@ static inline void free_swap_and_cache(swp_entry_t entry) free_swap_and_cache_nr(entry, 1); } +static inline void swap_free(swp_entry_t entry) +{ + swap_free_nr(entry, 1); +} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/include/linux/swapops.h b/include/linux/swapops.h index a5c560a2f8c2..cb468e418ea1 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -334,7 +334,7 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry) extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address); -extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte); +extern void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *pte); #else /* CONFIG_MIGRATION */ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { @@ -359,7 +359,7 @@ static inline int is_migration_entry(swp_entry_t swp) static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { } static inline void migration_entry_wait_huge(struct vm_area_struct *vma, - pte_t *pte) { } + unsigned long addr, pte_t *pte) { } static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 735eae6e272c..16b0cfa80502 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -624,4 +624,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio, { lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); } + +void __meminit mod_node_early_perpage_metadata(int nid, long delta); +void __meminit store_early_perpage_metadata(void); + #endif /* _LINUX_VMSTAT_H */ diff --git a/include/linux/zswap.h b/include/linux/zswap.h index 2a85b941db97..6cecb4a4f68b 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -35,7 +35,8 @@ void zswap_swapoff(int type); void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); void zswap_lruvec_state_init(struct lruvec *lruvec); void zswap_folio_swapin(struct folio *folio); -bool is_zswap_enabled(void); +bool zswap_is_enabled(void); +bool zswap_never_enabled(void); #else struct zswap_lruvec_state {}; @@ -60,11 +61,16 @@ static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {} static inline void zswap_folio_swapin(struct folio *folio) {} -static inline bool is_zswap_enabled(void) +static inline bool zswap_is_enabled(void) { return false; } +static inline bool zswap_never_enabled(void) +{ + return true; +} + #endif #endif /* _LINUX_ZSWAP_H */ diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h index 7c47151d5c72..e5f7ee0864e7 100644 --- a/include/ras/ras_event.h +++ b/include/ras/ras_event.h @@ -356,10 +356,9 @@ TRACE_EVENT(aer_event, #define MF_PAGE_TYPE \ EM ( MF_MSG_KERNEL, "reserved kernel page" ) \ EM ( MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page" ) \ - EM ( MF_MSG_SLAB, "kernel slab page" ) \ - EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \ EM ( MF_MSG_HUGE, "huge page" ) \ EM ( MF_MSG_FREE_HUGE, "free huge page" ) \ + EM ( MF_MSG_GET_HWPOISON, "get hwpoison page" ) \ EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \ EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \ EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \ @@ -373,6 +372,7 @@ TRACE_EVENT(aer_event, EM ( MF_MSG_BUDDY, "free buddy page" ) \ EM ( MF_MSG_DAX, "dax page" ) \ EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \ + EM ( MF_MSG_ALREADY_POISONED, "already poisoned" ) \ EMe ( MF_MSG_UNKNOWN, "unknown page" ) /* diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 8a829e0f6e55..b37eb0a7060f 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -36,7 +36,7 @@ TRACE_EVENT(kmem_cache_alloc, __entry->bytes_alloc = s->size; __entry->gfp_flags = (__force unsigned long)gfp_flags; __entry->node = node; - __entry->accounted = IS_ENABLED(CONFIG_MEMCG_KMEM) ? + __entry->accounted = IS_ENABLED(CONFIG_MEMCG) ? ((gfp_flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)) : false; ), @@ -87,7 +87,7 @@ TRACE_EVENT(kmalloc, __entry->bytes_alloc, show_gfp_flags(__entry->gfp_flags), __entry->node, - (IS_ENABLED(CONFIG_MEMCG_KMEM) && + (IS_ENABLED(CONFIG_MEMCG) && (__entry->gfp_flags & (__force unsigned long)__GFP_ACCOUNT)) ? "true" : "false") ); diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 0190ef725b43..cd01dd7b3640 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -22,7 +22,8 @@ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ EM( MR_CONTIG_RANGE, "contig_range") \ EM( MR_LONGTERM_PIN, "longterm_pin") \ - EMe(MR_DEMOTION, "demotion") + EM( MR_DEMOTION, "demotion") \ + EMe(MR_DAMON, "damon") /* * First define the enums in the above macros to be exported to userspace diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 191a7e88a8ab..753971770733 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -336,8 +336,10 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC) +#define PROCFS_IOCTL_MAGIC 'f' + /* Pagemap ioctl */ -#define PAGEMAP_SCAN _IOWR('f', 16, struct pm_scan_arg) +#define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg) /* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */ #define PAGE_IS_WPALLOWED (1 << 0) @@ -396,4 +398,158 @@ struct pm_scan_arg { __u64 return_mask; }; +/* /proc//maps ioctl */ +#define PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query) + +enum procmap_query_flags { + /* + * VMA permission flags. + * + * Can be used as part of procmap_query.query_flags field to look up + * only VMAs satisfying specified subset of permissions. E.g., specifying + * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs, + * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only + * return read/write VMAs, though both executable/non-executable and + * private/shared will be ignored. + * + * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags + * field to specify actual VMA permissions. + */ + PROCMAP_QUERY_VMA_READABLE = 0x01, + PROCMAP_QUERY_VMA_WRITABLE = 0x02, + PROCMAP_QUERY_VMA_EXECUTABLE = 0x04, + PROCMAP_QUERY_VMA_SHARED = 0x08, + /* + * Query modifier flags. + * + * By default VMA that covers provided address is returned, or -ENOENT + * is returned. With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest + * VMA with vma_start > addr will be returned if no covering VMA is + * found. + * + * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that + * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA + * to iterate all VMAs with file backing. + */ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10, + PROCMAP_QUERY_FILE_BACKED_VMA = 0x20, +}; + +/* + * Input/output argument structured passed into ioctl() call. It can be used + * to query a set of VMAs (Virtual Memory Areas) of a process. + * + * Each field can be one of three kinds, marked in a short comment to the + * right of the field: + * - "in", input argument, user has to provide this value, kernel doesn't modify it; + * - "out", output argument, kernel sets this field with VMA data; + * - "in/out", input and output argument; user provides initial value (used + * to specify maximum allowable buffer size), and kernel sets it to actual + * amount of data written (or zero, if there is no data). + * + * If matching VMA is found (according to criterias specified by + * query_addr/query_flags, all the out fields are filled out, and ioctl() + * returns 0. If there is no matching VMA, -ENOENT will be returned. + * In case of any other error, negative error code other than -ENOENT is + * returned. + * + * Most of the data is similar to the one returned as text in /proc//maps + * file, but procmap_query provides more querying flexibility. There are no + * consistency guarantees between subsequent ioctl() calls, but data returned + * for matched VMA is self-consistent. + */ +struct procmap_query { + /* Query struct size, for backwards/forward compatibility */ + __u64 size; + /* + * Query flags, a combination of enum procmap_query_flags values. + * Defines query filtering and behavior, see enum procmap_query_flags. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_flags; /* in */ + /* + * Query address. By default, VMA that covers this address will + * be looked up. PROCMAP_QUERY_* flags above modify this default + * behavior further. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_addr; /* in */ + /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */ + __u64 vma_start; /* out */ + __u64 vma_end; /* out */ + /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */ + __u64 vma_flags; /* out */ + /* VMA backing page size granularity. */ + __u64 vma_page_size; /* out */ + /* + * VMA file offset. If VMA has file backing, this specifies offset + * within the file that VMA's start address corresponds to. + * Is set to zero if VMA has no backing file. + */ + __u64 vma_offset; /* out */ + /* Backing file's inode number, or zero, if VMA has no backing file. */ + __u64 inode; /* out */ + /* Backing file's device major/minor number, or zero, if VMA has no backing file. */ + __u32 dev_major; /* out */ + __u32 dev_minor; /* out */ + /* + * If set to non-zero value, signals the request to return VMA name + * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix + * appended, if file was unlinked from FS) for matched VMA. VMA name + * can also be some special name (e.g., "[heap]", "[stack]") or could + * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME). + * + * Kernel will set this field to zero, if VMA has no associated name. + * Otherwise kernel will return actual amount of bytes filled in + * user-supplied buffer (see vma_name_addr field below), including the + * terminating zero. + * + * If VMA name is longer that user-supplied maximum buffer size, + * -E2BIG error is returned. + * + * If this field is set to non-zero value, vma_name_addr should point + * to valid user space memory buffer of at least vma_name_size bytes. + * If set to zero, vma_name_addr should be set to zero as well + */ + __u32 vma_name_size; /* in/out */ + /* + * If set to non-zero value, signals the request to extract and return + * VMA's backing file's build ID, if the backing file is an ELF file + * and it contains embedded build ID. + * + * Kernel will set this field to zero, if VMA has no backing file, + * backing file is not an ELF file, or ELF file has no build ID + * embedded. + * + * Build ID is a binary value (not a string). Kernel will set + * build_id_size field to exact number of bytes used for build ID. + * If build ID is requested and present, but needs more bytes than + * user-supplied maximum buffer size (see build_id_addr field below), + * -E2BIG error will be returned. + * + * If this field is set to non-zero value, build_id_addr should point + * to valid user space memory buffer of at least build_id_size bytes. + * If set to zero, build_id_addr should be set to zero as well + */ + __u32 build_id_size; /* in/out */ + /* + * User-supplied address of a buffer of at least vma_name_size bytes + * for kernel to fill with matched VMA's name (see vma_name_size field + * description above for details). + * + * Should be set to zero if VMA name should not be returned. + */ + __u64 vma_name_addr; /* in */ + /* + * User-supplied address of a buffer of at least build_id_size bytes + * for kernel to fill with matched VMA's ELF build ID, if available + * (see build_id_size field description above for details). + * + * Should be set to zero if build ID should not be returned. + */ + __u64 build_id_addr; /* in */ +}; + #endif /* _UAPI_LINUX_FS_H */ diff --git a/init/Kconfig b/init/Kconfig index 964355d1757e..4b81a49a25c4 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -971,10 +971,22 @@ config MEMCG help Provides control over the memory footprint of tasks in a cgroup. -config MEMCG_KMEM - bool +config MEMCG_V1 + bool "Legacy cgroup v1 memory controller" depends on MEMCG - default y + default n + help + Legacy cgroup v1 memory controller which has been deprecated by + cgroup v2 implementation. The v1 is there for legacy applications + which haven't migrated to the new cgroup v2 interface yet. If you + do not have any such application then you are completely fine leaving + this option disabled. + + Please note that feature set of the legacy memory controller is likely + going to shrink due to deprecation process. New deployments with v1 + controller are highly discouraged. + + San N is unsure. config BLK_CGROUP bool "IO controller" diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index a546aba46d5d..dec892ded031 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -155,12 +155,9 @@ static void *__alloc(struct bpf_mem_cache *c, int node, gfp_t flags) static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c) { -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (c->objcg) return get_mem_cgroup_from_objcg(c->objcg); -#endif - -#ifdef CONFIG_MEMCG return root_mem_cgroup; #else return NULL; @@ -534,7 +531,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) size += LLIST_NODE_SZ; /* room for llist_node */ unit_size = size; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (memcg_bpf_enabled()) objcg = get_obj_cgroup_from_current(); #endif @@ -556,7 +553,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL); if (!pcc) return -ENOMEM; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG objcg = get_obj_cgroup_from_current(); #endif ma->objcg = objcg; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 869265852d51..0719192a3482 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -385,7 +385,7 @@ void bpf_map_free_id(struct bpf_map *map) spin_unlock_irqrestore(&map_idr_lock, flags); } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static void bpf_map_save_memcg(struct bpf_map *map) { /* Currently if a map is created by a process belonging to the root @@ -486,7 +486,7 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid, unsigned long i, j; struct page *pg; int ret = 0; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct mem_cgroup *memcg, *old_memcg; memcg = bpf_map_get_memcg(map); @@ -505,7 +505,7 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid, break; } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG set_active_memcg(old_memcg); mem_cgroup_put(memcg); #endif diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8da132a1ef28..4cb5441ad75f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -21132,8 +21132,12 @@ BTF_SET_START(btf_non_sleepable_error_inject) * Assume non-sleepable from bpf safety point of view. */ BTF_ID(func, __filemap_add_folio) +#ifdef CONFIG_FAIL_PAGE_ALLOC BTF_ID(func, should_fail_alloc_page) +#endif +#ifdef CONFIG_FAILSLAB BTF_ID(func, should_failslab) +#endif BTF_SET_END(btf_non_sleepable_error_inject) static int check_non_sleepable_error_inject(u32 btf_id) diff --git a/kernel/events/core.c b/kernel/events/core.c index ab6c4c942f79..a2f3545f31b2 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7634,7 +7634,7 @@ again: pte = ptep_get_lockless(ptep); if (pte_present(pte)) - size = pte_leaf_size(pte); + size = __pte_leaf_size(pmd, pte); pte_unmap(ptep); #endif /* CONFIG_HAVE_GUP_FAST */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 99be2adedbc0..73cc47708679 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -181,7 +181,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { folio_get(new_folio); - folio_add_new_anon_rmap(new_folio, vma, addr); + folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, vma); } else /* no new page, just dec_mm_counter for old_page */ diff --git a/kernel/exit.c b/kernel/exit.c index be81342caf1b..7430852a8571 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -438,14 +438,46 @@ static void coredump_task_exit(struct task_struct *tsk) } #ifdef CONFIG_MEMCG +/* drops tasklist_lock if succeeds */ +static bool __try_to_set_owner(struct task_struct *tsk, struct mm_struct *mm) +{ + bool ret = false; + + task_lock(tsk); + if (likely(tsk->mm == mm)) { + /* tsk can't pass exit_mm/exec_mmap and exit */ + read_unlock(&tasklist_lock); + WRITE_ONCE(mm->owner, tsk); + lru_gen_migrate_mm(mm); + ret = true; + } + task_unlock(tsk); + return ret; +} + +static bool try_to_set_owner(struct task_struct *g, struct mm_struct *mm) +{ + struct task_struct *t; + + for_each_thread(g, t) { + struct mm_struct *t_mm = READ_ONCE(t->mm); + if (t_mm == mm) { + if (__try_to_set_owner(t, mm)) + return true; + } else if (t_mm) + break; + } + + return false; +} + /* * A task is exiting. If it owned this mm, find a new owner for the mm. */ void mm_update_next_owner(struct mm_struct *mm) { - struct task_struct *c, *g, *p = current; + struct task_struct *g, *p = current; -retry: /* * If the exiting or execing task is not the owner, it's * someone else's problem. @@ -466,19 +498,17 @@ retry: /* * Search in the children */ - list_for_each_entry(c, &p->children, sibling) { - if (c->mm == mm) - goto assign_new_owner; + list_for_each_entry(g, &p->children, sibling) { + if (try_to_set_owner(g, mm)) + goto ret; } - /* * Search in the siblings */ - list_for_each_entry(c, &p->real_parent->children, sibling) { - if (c->mm == mm) - goto assign_new_owner; + list_for_each_entry(g, &p->real_parent->children, sibling) { + if (try_to_set_owner(g, mm)) + goto ret; } - /* * Search through everything else, we should not get here often. */ @@ -487,12 +517,8 @@ retry: break; if (g->flags & PF_KTHREAD) continue; - for_each_thread(g, c) { - if (c->mm == mm) - goto assign_new_owner; - if (c->mm) - break; - } + if (try_to_set_owner(g, mm)) + goto ret; } read_unlock(&tasklist_lock); /* @@ -501,30 +527,9 @@ retry: * ptrace or page migration (get_task_mm()). Mark owner as NULL. */ WRITE_ONCE(mm->owner, NULL); + ret: return; -assign_new_owner: - BUG_ON(c == p); - get_task_struct(c); - /* - * The task_lock protects c->mm from changing. - * We always want mm->owner->mm == mm - */ - task_lock(c); - /* - * Delay read_unlock() till we have the task_lock() - * to ensure that c does not slip away underneath us - */ - read_unlock(&tasklist_lock); - if (c->mm != mm) { - task_unlock(c); - put_task_struct(c); - goto retry; - } - WRITE_ONCE(mm->owner, c); - lru_gen_migrate_mm(mm); - task_unlock(c); - put_task_struct(c); } #endif /* CONFIG_MEMCG */ diff --git a/kernel/fork.c b/kernel/fork.c index 942e3d8617bf..ef48f6bdf175 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -44,6 +44,7 @@ #include #include #include +#include #include #include #include @@ -992,10 +993,10 @@ void __init __weak arch_task_cache_init(void) { } /* * set_max_threads */ -static void set_max_threads(unsigned int max_threads_suggested) +static void __init set_max_threads(unsigned int max_threads_suggested) { u64 threads; - unsigned long nr_pages = totalram_pages(); + unsigned long nr_pages = PHYS_PFN(memblock_phys_mem_size() - memblock_reserved_size()); /* * The number of threads shall be limited such that the thread @@ -1018,7 +1019,7 @@ static void set_max_threads(unsigned int max_threads_suggested) int arch_task_struct_size __read_mostly; #endif -static void task_struct_whitelist(unsigned long *offset, unsigned long *size) +static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size) { /* Fetch thread_struct whitelist for the architecture. */ arch_thread_struct_whitelist(offset, size); @@ -1519,14 +1520,13 @@ struct mm_struct *get_task_mm(struct task_struct *task) { struct mm_struct *mm; + if (task->flags & PF_KTHREAD) + return NULL; + task_lock(task); mm = task->mm; - if (mm) { - if (task->flags & PF_KTHREAD) - mm = NULL; - else - mmget(mm); - } + if (mm) + mmget(mm); task_unlock(task); return mm; } diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 753b8dd42a59..82b884b67152 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -200,12 +200,11 @@ void free_all_swap_pages(int swap) while ((node = swsusp_extents.rb_node)) { struct swsusp_extent *ext; - unsigned long offset; ext = rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - for (offset = ext->start; offset <= ext->end; offset++) - swap_free(swp_entry(swap, offset)); + swap_free_nr(swp_entry(swap, ext->start), + ext->end - ext->start + 1); kfree(ext); } diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index e5d6a4ab433b..0f579430f02a 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7920,6 +7920,7 @@ out: void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { + kmsan_unpoison_memory(fregs, sizeof(*fregs)); __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); } #else diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2d7d27e6ae3c..aa3a5df15b8e 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4203,31 +4203,28 @@ slow_path: * * Return: The contents that was stored at the index. */ -static inline void *mas_wr_store_entry(struct ma_wr_state *wr_mas) +static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) { struct ma_state *mas = wr_mas->mas; wr_mas->content = mas_start(mas); if (mas_is_none(mas) || mas_is_ptr(mas)) { mas_store_root(mas, wr_mas->entry); - return wr_mas->content; + return; } if (unlikely(!mas_wr_walk(wr_mas))) { mas_wr_spanning_store(wr_mas); - return wr_mas->content; + return; } /* At this point, we are at the leaf node that needs to be altered. */ mas_wr_end_piv(wr_mas); /* New root for a single pointer */ - if (unlikely(!mas->index && mas->last == ULONG_MAX)) { + if (unlikely(!mas->index && mas->last == ULONG_MAX)) mas_new_root(mas, wr_mas->entry); - return wr_mas->content; - } - - mas_wr_modify(wr_mas); - return wr_mas->content; + else + mas_wr_modify(wr_mas); } /** diff --git a/lib/test_hmm.c b/lib/test_hmm.c index b823ba7cb6a1..ee20e1f9bae9 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -1550,4 +1550,5 @@ static void __exit hmm_dmirror_exit(void) module_init(hmm_dmirror_init); module_exit(hmm_dmirror_exit); +MODULE_DESCRIPTION("HMM (Heterogeneous Memory Management) test module"); MODULE_LICENSE("GPL"); diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c index 399380db449c..31561e0e1a0d 100644 --- a/lib/test_maple_tree.c +++ b/lib/test_maple_tree.c @@ -3946,4 +3946,5 @@ static void __exit maple_tree_harvest(void) module_init(maple_tree_seed); module_exit(maple_tree_harvest); MODULE_AUTHOR("Liam R. Howlett "); +MODULE_DESCRIPTION("maple tree API test module"); MODULE_LICENSE("GPL"); diff --git a/lib/test_ubsan.c b/lib/test_ubsan.c index c288df9372ed..5d7b10e98610 100644 --- a/lib/test_ubsan.c +++ b/lib/test_ubsan.c @@ -156,4 +156,5 @@ static void __exit test_ubsan_exit(void) module_exit(test_ubsan_exit); MODULE_AUTHOR("Jinbum Park "); +MODULE_DESCRIPTION("UBSAN unit test"); MODULE_LICENSE("GPL v2"); diff --git a/lib/test_xarray.c b/lib/test_xarray.c index ab9cc42a0d74..d5c5cbba33ed 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -2173,4 +2173,5 @@ static void xarray_exit(void) module_init(xarray_checks); module_exit(xarray_exit); MODULE_AUTHOR("Matthew Wilcox "); +MODULE_DESCRIPTION("XArray API test module"); MODULE_LICENSE("GPL"); diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h index b96232bdd44d..0f2a16d7a48a 100644 --- a/lib/zlib_dfltcc/dfltcc.h +++ b/lib/zlib_dfltcc/dfltcc.h @@ -80,6 +80,7 @@ struct dfltcc_param_v0 { uint8_t csb[1152]; }; +static_assert(offsetof(struct dfltcc_param_v0, csb) == 384); static_assert(sizeof(struct dfltcc_param_v0) == 1536); #define CVT_CRC32 0 diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h index 4a46b5009f0d..10509270d822 100644 --- a/lib/zlib_dfltcc/dfltcc_util.h +++ b/lib/zlib_dfltcc/dfltcc_util.h @@ -2,6 +2,8 @@ #ifndef DFLTCC_UTIL_H #define DFLTCC_UTIL_H +#include "dfltcc.h" +#include #include /* @@ -20,6 +22,7 @@ typedef enum { #define DFLTCC_CMPR 2 #define DFLTCC_XPND 4 #define HBT_CIRCULAR (1 << 7) +#define DFLTCC_FN_MASK ((1 << 7) - 1) #define HB_BITS 15 #define HB_SIZE (1 << HB_BITS) @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc( ) { Byte *t2 = op1 ? *op1 : NULL; + unsigned char *orig_t2 = t2; size_t t3 = len1 ? *len1 : 0; const Byte *t4 = op2 ? *op2 : NULL; size_t t5 = len2 ? *len2 : 0; @@ -59,6 +63,30 @@ static inline dfltcc_cc dfltcc( : "cc", "memory"); t2 = r2; t3 = r3; t4 = r4; t5 = r5; + /* + * Unpoison the parameter block and the output buffer. + * This is a no-op in non-KMSAN builds. + */ + switch (fn & DFLTCC_FN_MASK) { + case DFLTCC_QAF: + kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param)); + break; + case DFLTCC_GDHT: + kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb)); + break; + case DFLTCC_CMPR: + kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); + kmsan_unpoison_memory( + orig_t2, + t2 - orig_t2 + + (((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1)); + break; + case DFLTCC_XPND: + kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); + kmsan_unpoison_memory(orig_t2, t2 - orig_t2); + break; + } + if (op1) *op1 = t2; if (len1) diff --git a/mm/Kconfig b/mm/Kconfig index e0dfb268717c..b72e7d040f78 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -128,7 +128,7 @@ config ZSWAP_COMPRESSOR_DEFAULT choice prompt "Default allocator" depends on ZSWAP - default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if MMU + default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if HAVE_ZSMALLOC default ZSWAP_ZPOOL_DEFAULT_ZBUD help Selects the default allocator for the compressed cache for @@ -154,6 +154,7 @@ config ZSWAP_ZPOOL_DEFAULT_Z3FOLD config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC bool "zsmalloc" + depends on HAVE_ZSMALLOC select ZSMALLOC help Use the zsmalloc allocator as the default allocator. @@ -186,10 +187,15 @@ config Z3FOLD page. It is a ZBUD derivative so the simplicity and determinism are still there. +config HAVE_ZSMALLOC + def_bool y + depends on MMU + depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB + config ZSMALLOC tristate prompt "N:1 compression allocator (zsmalloc)" if ZSWAP - depends on MMU + depends on HAVE_ZSMALLOC help zsmalloc is a slab-based memory allocator designed to store pages of various compression levels efficiently. It achieves @@ -731,7 +737,7 @@ config DEFAULT_MMAP_MIN_ADDR from userspace allocation. Keeping a user from writing to low pages can help reduce the impact of kernel NULL pointer bugs. - For most ppc64 and x86 users with lots of address space + For most arm64, ppc64 and x86 users with lots of address space a value of 65536 is reasonable and should cause no problems. On arm and other archs it should not be higher than 32768. Programs which use vm86 functionality or have some need to map @@ -963,6 +969,7 @@ config DEFERRED_STRUCT_PAGE_INIT depends on SPARSEMEM depends on !NEED_PER_CPU_KM depends on 64BIT + depends on !KMSAN select PADATA help Ordinarily all struct pages are initialised during early boot in a @@ -1136,16 +1143,6 @@ config DMAPOOL_TEST config ARCH_HAS_PTE_SPECIAL bool -# -# Some architectures require a special hugepage directory format that is -# required to support multiple hugepage sizes. For example a4fe3ce76 -# "powerpc/mm: Allow more flexible layouts for hugepage pagetables" -# introduced it on powerpc. This allows for a more flexible hugepage -# pagetable layouts. -# -config ARCH_HAS_HUGEPD - bool - config MAPPING_DIRTY_HELPERS bool diff --git a/mm/Makefile b/mm/Makefile index 8fb85acda1b1..d2915f8c9dc0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -26,6 +26,7 @@ KCOV_INSTRUMENT_page_alloc.o := n KCOV_INSTRUMENT_debug-pagealloc.o := n KCOV_INSTRUMENT_kmemleak.o := n KCOV_INSTRUMENT_memcontrol.o := n +KCOV_INSTRUMENT_memcontrol-v1.o := n KCOV_INSTRUMENT_mmzone.o := n KCOV_INSTRUMENT_vmstat.o := n KCOV_INSTRUMENT_failslab.o := n @@ -95,6 +96,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 22c96fed70b5..6597ebea8ae2 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -234,14 +234,6 @@ static int balloon_page_migrate(struct page *newpage, struct page *page, { struct balloon_dev_info *balloon = balloon_page_device(page); - /* - * We can not easily support the no copy case here so ignore it as it - * is unlikely to be used with balloon pages. See include/linux/hmm.h - * for a user of the MIGRATE_SYNC_NO_COPY mode. - */ - if (mode == MIGRATE_SYNC_NO_COPY) - return -EINVAL; - VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); diff --git a/mm/damon/core.c b/mm/damon/core.c index e66823d6b10b..7a87628b76ab 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks) + struct damos_watermarks *wmarks, + int target_nid) { struct damos *scheme; @@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, scheme->wmarks = *wmarks; scheme->wmarks.activated = true; + scheme->target_nid = target_nid; + return scheme; } @@ -663,6 +666,339 @@ void damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes, damon_add_scheme(ctx, schemes[i]); } +static struct damos_quota_goal *damos_nth_quota_goal( + int n, struct damos_quota *q) +{ + struct damos_quota_goal *goal; + int i = 0; + + damos_for_each_quota_goal(goal, q) { + if (i++ == n) + return goal; + } + return NULL; +} + +static void damos_commit_quota_goal( + struct damos_quota_goal *dst, struct damos_quota_goal *src) +{ + dst->metric = src->metric; + dst->target_value = src->target_value; + if (dst->metric == DAMOS_QUOTA_USER_INPUT) + dst->current_value = src->current_value; + /* keep last_psi_total as is, since it will be updated in next cycle */ +} + +/** + * damos_commit_quota_goals() - Commit DAMOS quota goals to another quota. + * @dst: The commit destination DAMOS quota. + * @src: The commit source DAMOS quota. + * + * Copies user-specified parameters for quota goals from @src to @dst. Users + * should use this function for quota goals-level parameters update of running + * DAMON contexts, instead of manual in-place updates. + * + * This function should be called from parameters-update safe context, like + * DAMON callbacks. + */ +int damos_commit_quota_goals(struct damos_quota *dst, struct damos_quota *src) +{ + struct damos_quota_goal *dst_goal, *next, *src_goal, *new_goal; + int i = 0, j = 0; + + damos_for_each_quota_goal_safe(dst_goal, next, dst) { + src_goal = damos_nth_quota_goal(i++, src); + if (src_goal) + damos_commit_quota_goal(dst_goal, src_goal); + else + damos_destroy_quota_goal(dst_goal); + } + damos_for_each_quota_goal_safe(src_goal, next, src) { + if (j++ < i) + continue; + new_goal = damos_new_quota_goal( + src_goal->metric, src_goal->target_value); + if (!new_goal) + return -ENOMEM; + damos_add_quota_goal(dst, new_goal); + } + return 0; +} + +static int damos_commit_quota(struct damos_quota *dst, struct damos_quota *src) +{ + int err; + + dst->reset_interval = src->reset_interval; + dst->ms = src->ms; + dst->sz = src->sz; + err = damos_commit_quota_goals(dst, src); + if (err) + return err; + dst->weight_sz = src->weight_sz; + dst->weight_nr_accesses = src->weight_nr_accesses; + dst->weight_age = src->weight_age; + return 0; +} + +static struct damos_filter *damos_nth_filter(int n, struct damos *s) +{ + struct damos_filter *filter; + int i = 0; + + damos_for_each_filter(filter, s) { + if (i++ == n) + return filter; + } + return NULL; +} + +static void damos_commit_filter_arg( + struct damos_filter *dst, struct damos_filter *src) +{ + switch (dst->type) { + case DAMOS_FILTER_TYPE_MEMCG: + dst->memcg_id = src->memcg_id; + break; + case DAMOS_FILTER_TYPE_ADDR: + dst->addr_range = src->addr_range; + break; + case DAMOS_FILTER_TYPE_TARGET: + dst->target_idx = src->target_idx; + break; + default: + break; + } +} + +static void damos_commit_filter( + struct damos_filter *dst, struct damos_filter *src) +{ + dst->type = src->type; + dst->matching = src->matching; + damos_commit_filter_arg(dst, src); +} + +static int damos_commit_filters(struct damos *dst, struct damos *src) +{ + struct damos_filter *dst_filter, *next, *src_filter, *new_filter; + int i = 0, j = 0; + + damos_for_each_filter_safe(dst_filter, next, dst) { + src_filter = damos_nth_filter(i++, src); + if (src_filter) + damos_commit_filter(dst_filter, src_filter); + else + damos_destroy_filter(dst_filter); + } + + damos_for_each_filter_safe(src_filter, next, src) { + if (j++ < i) + continue; + + new_filter = damos_new_filter( + src_filter->type, src_filter->matching); + if (!new_filter) + return -ENOMEM; + damos_commit_filter_arg(new_filter, src_filter); + damos_add_filter(dst, new_filter); + } + return 0; +} + +static struct damos *damon_nth_scheme(int n, struct damon_ctx *ctx) +{ + struct damos *s; + int i = 0; + + damon_for_each_scheme(s, ctx) { + if (i++ == n) + return s; + } + return NULL; +} + +static int damos_commit(struct damos *dst, struct damos *src) +{ + int err; + + dst->pattern = src->pattern; + dst->action = src->action; + dst->apply_interval_us = src->apply_interval_us; + + err = damos_commit_quota(&dst->quota, &src->quota); + if (err) + return err; + + dst->wmarks = src->wmarks; + + err = damos_commit_filters(dst, src); + return err; +} + +static int damon_commit_schemes(struct damon_ctx *dst, struct damon_ctx *src) +{ + struct damos *dst_scheme, *next, *src_scheme, *new_scheme; + int i = 0, j = 0, err; + + damon_for_each_scheme_safe(dst_scheme, next, dst) { + src_scheme = damon_nth_scheme(i++, src); + if (src_scheme) { + err = damos_commit(dst_scheme, src_scheme); + if (err) + return err; + } else { + damon_destroy_scheme(dst_scheme); + } + } + + damon_for_each_scheme_safe(src_scheme, next, src) { + if (j++ < i) + continue; + new_scheme = damon_new_scheme(&src_scheme->pattern, + src_scheme->action, + src_scheme->apply_interval_us, + &src_scheme->quota, &src_scheme->wmarks, + NUMA_NO_NODE); + if (!new_scheme) + return -ENOMEM; + damon_add_scheme(dst, new_scheme); + } + return 0; +} + +static struct damon_target *damon_nth_target(int n, struct damon_ctx *ctx) +{ + struct damon_target *t; + int i = 0; + + damon_for_each_target(t, ctx) { + if (i++ == n) + return t; + } + return NULL; +} + +/* + * The caller should ensure the regions of @src are + * 1. valid (end >= src) and + * 2. sorted by starting address. + * + * If @src has no region, @dst keeps current regions. + */ +static int damon_commit_target_regions( + struct damon_target *dst, struct damon_target *src) +{ + struct damon_region *src_region; + struct damon_addr_range *ranges; + int i = 0, err; + + damon_for_each_region(src_region, src) + i++; + if (!i) + return 0; + + ranges = kmalloc_array(i, sizeof(*ranges), GFP_KERNEL | __GFP_NOWARN); + if (!ranges) + return -ENOMEM; + i = 0; + damon_for_each_region(src_region, src) + ranges[i++] = src_region->ar; + err = damon_set_regions(dst, ranges, i); + kfree(ranges); + return err; +} + +static int damon_commit_target( + struct damon_target *dst, bool dst_has_pid, + struct damon_target *src, bool src_has_pid) +{ + int err; + + err = damon_commit_target_regions(dst, src); + if (err) + return err; + if (dst_has_pid) + put_pid(dst->pid); + if (src_has_pid) + get_pid(src->pid); + dst->pid = src->pid; + return 0; +} + +static int damon_commit_targets( + struct damon_ctx *dst, struct damon_ctx *src) +{ + struct damon_target *dst_target, *next, *src_target, *new_target; + int i = 0, j = 0, err; + + damon_for_each_target_safe(dst_target, next, dst) { + src_target = damon_nth_target(i++, src); + if (src_target) { + err = damon_commit_target( + dst_target, damon_target_has_pid(dst), + src_target, damon_target_has_pid(src)); + if (err) + return err; + } else { + if (damon_target_has_pid(dst)) + put_pid(dst_target->pid); + damon_destroy_target(dst_target); + } + } + + damon_for_each_target_safe(src_target, next, src) { + if (j++ < i) + continue; + new_target = damon_new_target(); + if (!new_target) + return -ENOMEM; + err = damon_commit_target(new_target, false, + src_target, damon_target_has_pid(src)); + if (err) + return err; + } + return 0; +} + +/** + * damon_commit_ctx() - Commit parameters of a DAMON context to another. + * @dst: The commit destination DAMON context. + * @src: The commit source DAMON context. + * + * This function copies user-specified parameters from @src to @dst and update + * the internal status and results accordingly. Users should use this function + * for context-level parameters update of running context, instead of manual + * in-place updates. + * + * This function should be called from parameters-update safe context, like + * DAMON callbacks. + */ +int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src) +{ + int err; + + err = damon_commit_schemes(dst, src); + if (err) + return err; + err = damon_commit_targets(dst, src); + if (err) + return err; + /* + * schemes and targets should be updated first, since + * 1. damon_set_attrs() updates monitoring results of targets and + * next_apply_sis of schemes, and + * 2. ops update should be done after pid handling is done (target + * committing require putting pids). + */ + err = damon_set_attrs(dst, &src->attrs); + if (err) + return err; + dst->ops = src->ops; + + return 0; +} + /** * damon_nr_running_ctxs() - Return number of currently running contexts. */ diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 2461cfe2e968..51a6f1cac385 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, pos += parsed; scheme = damon_new_scheme(&pattern, action, 0, "a, - &wmarks); + &wmarks, NUMA_NO_NODE); if (!scheme) goto fail; diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 3de2916a65c3..4af8fd4a390b 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme( /* under the quota. */ "a, /* (De)activate this according to the watermarks. */ - &damon_lru_sort_wmarks); + &damon_lru_sort_wmarks, + NUMA_NO_NODE); } /* Create a DAMON-based operation scheme for hot memory regions */ @@ -185,61 +186,48 @@ static struct damos *damon_lru_sort_new_cold_scheme(unsigned int cold_thres) return damon_lru_sort_new_scheme(&pattern, DAMOS_LRU_DEPRIO); } -static void damon_lru_sort_copy_quota_status(struct damos_quota *dst, - struct damos_quota *src) -{ - dst->total_charged_sz = src->total_charged_sz; - dst->total_charged_ns = src->total_charged_ns; - dst->charged_sz = src->charged_sz; - dst->charged_from = src->charged_from; - dst->charge_target_from = src->charge_target_from; - dst->charge_addr_from = src->charge_addr_from; -} - static int damon_lru_sort_apply_parameters(void) { - struct damos *scheme, *hot_scheme, *cold_scheme; - struct damos *old_hot_scheme = NULL, *old_cold_scheme = NULL; + struct damon_ctx *param_ctx; + struct damon_target *param_target; + struct damos *hot_scheme, *cold_scheme; unsigned int hot_thres, cold_thres; - int err = 0; + int err; - err = damon_set_attrs(ctx, &damon_lru_sort_mon_attrs); + err = damon_modules_new_paddr_ctx_target(¶m_ctx, ¶m_target); if (err) return err; - damon_for_each_scheme(scheme, ctx) { - if (!old_hot_scheme) { - old_hot_scheme = scheme; - continue; - } - old_cold_scheme = scheme; - } + err = damon_set_attrs(ctx, &damon_lru_sort_mon_attrs); + if (err) + goto out; + err = -ENOMEM; hot_thres = damon_max_nr_accesses(&damon_lru_sort_mon_attrs) * hot_thres_access_freq / 1000; hot_scheme = damon_lru_sort_new_hot_scheme(hot_thres); if (!hot_scheme) - return -ENOMEM; - if (old_hot_scheme) - damon_lru_sort_copy_quota_status(&hot_scheme->quota, - &old_hot_scheme->quota); + goto out; cold_thres = cold_min_age / damon_lru_sort_mon_attrs.aggr_interval; cold_scheme = damon_lru_sort_new_cold_scheme(cold_thres); if (!cold_scheme) { damon_destroy_scheme(hot_scheme); - return -ENOMEM; + goto out; } - if (old_cold_scheme) - damon_lru_sort_copy_quota_status(&cold_scheme->quota, - &old_cold_scheme->quota); - damon_set_schemes(ctx, &hot_scheme, 1); - damon_add_scheme(ctx, cold_scheme); + damon_set_schemes(param_ctx, &hot_scheme, 1); + damon_add_scheme(param_ctx, cold_scheme); - return damon_set_region_biggest_system_ram_default(target, + err = damon_set_region_biggest_system_ram_default(param_target, &monitor_region_start, &monitor_region_end); + if (err) + goto out; + err = damon_commit_ctx(ctx, param_ctx); +out: + damon_destroy_ctx(param_ctx); + return err; } static int damon_lru_sort_turn(bool on) diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 18797c1b419b..a9ff35341d65 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -12,6 +12,9 @@ #include #include #include +#include +#include +#include #include "../internal.h" #include "ops-common.h" @@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r, return damon_pa_mark_accessed_or_deactivate(r, s, false); } +static unsigned int __damon_pa_migrate_folio_list( + struct list_head *migrate_folios, struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_succeeded = 0; + nodemask_t allowed_mask = NODE_MASK_NONE; + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | + __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; + + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) + return 0; + + if (list_empty(migrate_folios)) + return 0; + + /* Migration ignores all cpuset and mempolicy settings */ + migrate_pages(migrate_folios, alloc_migrate_folio, NULL, + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON, + &nr_succeeded); + + return nr_succeeded; +} + +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list, + struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_migrated = 0; + struct folio *folio; + LIST_HEAD(ret_folios); + LIST_HEAD(migrate_folios); + + while (!list_empty(folio_list)) { + struct folio *folio; + + cond_resched(); + + folio = lru_to_folio(folio_list); + list_del(&folio->lru); + + if (!folio_trylock(folio)) + goto keep; + + /* Relocate its contents to another node. */ + list_add(&folio->lru, &migrate_folios); + folio_unlock(folio); + continue; +keep: + list_add(&folio->lru, &ret_folios); + } + /* 'folio_list' is always empty here */ + + /* Migrate folios selected for migration */ + nr_migrated += __damon_pa_migrate_folio_list( + &migrate_folios, pgdat, target_nid); + /* + * Folios that could not be migrated are still in @migrate_folios. Add + * those back on @folio_list + */ + if (!list_empty(&migrate_folios)) + list_splice_init(&migrate_folios, folio_list); + + try_to_unmap_flush(); + + list_splice(&ret_folios, folio_list); + + while (!list_empty(folio_list)) { + folio = lru_to_folio(folio_list); + list_del(&folio->lru); + folio_putback_lru(folio); + } + + return nr_migrated; +} + +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list, + int target_nid) +{ + int nid; + unsigned long nr_migrated = 0; + LIST_HEAD(node_folio_list); + unsigned int noreclaim_flag; + + if (list_empty(folio_list)) + return nr_migrated; + + noreclaim_flag = memalloc_noreclaim_save(); + + nid = folio_nid(lru_to_folio(folio_list)); + do { + struct folio *folio = lru_to_folio(folio_list); + + if (nid == folio_nid(folio)) { + list_move(&folio->lru, &node_folio_list); + continue; + } + + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list, + NODE_DATA(nid), + target_nid); + nid = folio_nid(lru_to_folio(folio_list)); + } while (!list_empty(folio_list)); + + nr_migrated += damon_pa_migrate_folio_list(&node_folio_list, + NODE_DATA(nid), + target_nid); + + memalloc_noreclaim_restore(noreclaim_flag); + + return nr_migrated; +} + +static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s) +{ + unsigned long addr, applied; + LIST_HEAD(folio_list); + + for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) { + struct folio *folio = damon_get_folio(PHYS_PFN(addr)); + + if (!folio) + continue; + + if (damos_pa_filter_out(s, folio)) + goto put_folio; + + if (!folio_isolate_lru(folio)) + goto put_folio; + list_add(&folio->lru, &folio_list); +put_folio: + folio_put(folio); + } + applied = damon_pa_migrate_pages(&folio_list, s->target_nid); + cond_resched(); + return applied * PAGE_SIZE; +} + + static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, struct damon_region *r, struct damos *scheme) @@ -336,6 +486,9 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, return damon_pa_mark_accessed(r, scheme); case DAMOS_LRU_DEPRIO: return damon_pa_deactivate_pages(r, scheme); + case DAMOS_MIGRATE_HOT: + case DAMOS_MIGRATE_COLD: + return damon_pa_migrate(r, scheme); case DAMOS_STAT: break; default: @@ -356,6 +509,10 @@ static int damon_pa_scheme_score(struct damon_ctx *context, return damon_hot_score(context, r, scheme); case DAMOS_LRU_DEPRIO: return damon_cold_score(context, r, scheme); + case DAMOS_MIGRATE_HOT: + return damon_hot_score(context, r, scheme); + case DAMOS_MIGRATE_COLD: + return damon_cold_score(context, r, scheme); default: break; } diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index 9bd341d62b4c..9e0077a9404e 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -177,76 +177,65 @@ static struct damos *damon_reclaim_new_scheme(void) /* under the quota. */ &damon_reclaim_quota, /* (De)activate this according to the watermarks. */ - &damon_reclaim_wmarks); -} - -static void damon_reclaim_copy_quota_status(struct damos_quota *dst, - struct damos_quota *src) -{ - dst->total_charged_sz = src->total_charged_sz; - dst->total_charged_ns = src->total_charged_ns; - dst->charged_sz = src->charged_sz; - dst->charged_from = src->charged_from; - dst->charge_target_from = src->charge_target_from; - dst->charge_addr_from = src->charge_addr_from; - dst->esz_bp = src->esz_bp; + &damon_reclaim_wmarks, + NUMA_NO_NODE); } static int damon_reclaim_apply_parameters(void) { - struct damos *scheme, *old_scheme; + struct damon_ctx *param_ctx; + struct damon_target *param_target; + struct damos *scheme; struct damos_quota_goal *goal; struct damos_filter *filter; - int err = 0; + int err; - err = damon_set_attrs(ctx, &damon_reclaim_mon_attrs); + err = damon_modules_new_paddr_ctx_target(¶m_ctx, ¶m_target); if (err) return err; - /* Will be freed by next 'damon_set_schemes()' below */ + err = damon_set_attrs(ctx, &damon_reclaim_mon_attrs); + if (err) + goto out; + + err = -ENOMEM; scheme = damon_reclaim_new_scheme(); if (!scheme) - return -ENOMEM; - if (!list_empty(&ctx->schemes)) { - damon_for_each_scheme(old_scheme, ctx) - damon_reclaim_copy_quota_status(&scheme->quota, - &old_scheme->quota); - } + goto out; + damon_set_schemes(ctx, &scheme, 1); if (quota_mem_pressure_us) { goal = damos_new_quota_goal(DAMOS_QUOTA_SOME_MEM_PSI_US, quota_mem_pressure_us); - if (!goal) { - damon_destroy_scheme(scheme); - return -ENOMEM; - } + if (!goal) + goto out; damos_add_quota_goal(&scheme->quota, goal); } if (quota_autotune_feedback) { goal = damos_new_quota_goal(DAMOS_QUOTA_USER_INPUT, 10000); - if (!goal) { - damon_destroy_scheme(scheme); - return -ENOMEM; - } + if (!goal) + goto out; goal->current_value = quota_autotune_feedback; damos_add_quota_goal(&scheme->quota, goal); } if (skip_anon) { filter = damos_new_filter(DAMOS_FILTER_TYPE_ANON, true); - if (!filter) { - /* Will be freed by next 'damon_set_schemes()' below */ - damon_destroy_scheme(scheme); - return -ENOMEM; - } + if (!filter) + goto out; damos_add_filter(scheme, filter); } - damon_set_schemes(ctx, &scheme, 1); - return damon_set_region_biggest_system_ram_default(target, + err = damon_set_region_biggest_system_ram_default(param_target, &monitor_region_start, &monitor_region_end); + if (err) + goto out; + err = damon_commit_ctx(ctx, param_ctx); +out: + damon_destroy_ctx(param_ctx); + return err; } static int damon_reclaim_turn(bool on) diff --git a/mm/damon/sysfs-common.h b/mm/damon/sysfs-common.h index a63f51577cff..9a18f3c535d3 100644 --- a/mm/damon/sysfs-common.h +++ b/mm/damon/sysfs-common.h @@ -38,7 +38,7 @@ void damon_sysfs_schemes_rm_dirs(struct damon_sysfs_schemes *schemes); extern const struct kobj_type damon_sysfs_schemes_ktype; -int damon_sysfs_set_schemes(struct damon_ctx *ctx, +int damon_sysfs_add_schemes(struct damon_ctx *ctx, struct damon_sysfs_schemes *sysfs_schemes); void damon_sysfs_schemes_update_stats( diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index bea5bc52846a..b095457380b5 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -6,6 +6,7 @@ */ #include +#include #include "sysfs-common.h" @@ -1445,6 +1446,7 @@ struct damon_sysfs_scheme { struct damon_sysfs_scheme_filters *filters; struct damon_sysfs_stats *stats; struct damon_sysfs_scheme_regions *tried_regions; + int target_nid; }; /* This should match with enum damos_action */ @@ -1456,6 +1458,8 @@ static const char * const damon_sysfs_damos_action_strs[] = { "nohugepage", "lru_prio", "lru_deprio", + "migrate_hot", + "migrate_cold", "stat", }; @@ -1470,6 +1474,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc( scheme->kobj = (struct kobject){}; scheme->action = action; scheme->apply_interval_us = apply_interval_us; + scheme->target_nid = NUMA_NO_NODE; return scheme; } @@ -1692,6 +1697,28 @@ static ssize_t apply_interval_us_store(struct kobject *kobj, return err ? err : count; } +static ssize_t target_nid_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct damon_sysfs_scheme *scheme = container_of(kobj, + struct damon_sysfs_scheme, kobj); + + return sysfs_emit(buf, "%d\n", scheme->target_nid); +} + +static ssize_t target_nid_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + struct damon_sysfs_scheme *scheme = container_of(kobj, + struct damon_sysfs_scheme, kobj); + int err = 0; + + /* TODO: error handling for target_nid range. */ + err = kstrtoint(buf, 0, &scheme->target_nid); + + return err ? err : count; +} + static void damon_sysfs_scheme_release(struct kobject *kobj) { kfree(container_of(kobj, struct damon_sysfs_scheme, kobj)); @@ -1703,9 +1730,13 @@ static struct kobj_attribute damon_sysfs_scheme_action_attr = static struct kobj_attribute damon_sysfs_scheme_apply_interval_us_attr = __ATTR_RW_MODE(apply_interval_us, 0600); +static struct kobj_attribute damon_sysfs_scheme_target_nid_attr = + __ATTR_RW_MODE(target_nid, 0600); + static struct attribute *damon_sysfs_scheme_attrs[] = { &damon_sysfs_scheme_action_attr.attr, &damon_sysfs_scheme_apply_interval_us_attr.attr, + &damon_sysfs_scheme_target_nid_attr.attr, NULL, }; ATTRIBUTE_GROUPS(damon_sysfs_scheme); @@ -1877,14 +1908,10 @@ static int damon_sysfs_memcg_path_to_id(char *memcg_path, unsigned short *id) return found ? 0 : -EINVAL; } -static int damon_sysfs_set_scheme_filters(struct damos *scheme, +static int damon_sysfs_add_scheme_filters(struct damos *scheme, struct damon_sysfs_scheme_filters *sysfs_filters) { int i; - struct damos_filter *filter, *next; - - damos_for_each_filter_safe(filter, next, scheme) - damos_destroy_filter(filter); for (i = 0; i < sysfs_filters->nr; i++) { struct damon_sysfs_scheme_filter *sysfs_filter = @@ -1920,16 +1947,13 @@ static int damon_sysfs_set_scheme_filters(struct damos *scheme, return 0; } -static int damos_sysfs_set_quota_score( +static int damos_sysfs_add_quota_score( struct damos_sysfs_quota_goals *sysfs_goals, struct damos_quota *quota) { - struct damos_quota_goal *goal, *next; + struct damos_quota_goal *goal; int i; - damos_for_each_quota_goal_safe(goal, next, quota) - damos_destroy_quota_goal(goal); - for (i = 0; i < sysfs_goals->nr; i++) { struct damos_sysfs_quota_goal *sysfs_goal = sysfs_goals->goals_arr[i]; @@ -1952,10 +1976,13 @@ int damos_sysfs_set_quota_scores(struct damon_sysfs_schemes *sysfs_schemes, struct damon_ctx *ctx) { struct damos *scheme; + struct damos_quota quota = {}; int i = 0; + INIT_LIST_HEAD("a.goals); damon_for_each_scheme(scheme, ctx) { struct damon_sysfs_scheme *sysfs_scheme; + struct damos_quota_goal *g, *g_next; int err; /* user could have removed the scheme sysfs dir */ @@ -1963,10 +1990,17 @@ int damos_sysfs_set_quota_scores(struct damon_sysfs_schemes *sysfs_schemes, break; sysfs_scheme = sysfs_schemes->schemes_arr[i]; - err = damos_sysfs_set_quota_score(sysfs_scheme->quotas->goals, - &scheme->quota); + err = damos_sysfs_add_quota_score(sysfs_scheme->quotas->goals, + "a); + if (err) { + damos_for_each_quota_goal_safe(g, g_next, "a) + damos_destroy_quota_goal(g); + return err; + } + err = damos_commit_quota_goals(&scheme->quota, "a); + damos_for_each_quota_goal_safe(g, g_next, "a) + damos_destroy_quota_goal(g); if (err) - /* kdamond will clean up schemes and terminated */ return err; i++; } @@ -2031,17 +2065,18 @@ static struct damos *damon_sysfs_mk_scheme( }; scheme = damon_new_scheme(&pattern, sysfs_scheme->action, - sysfs_scheme->apply_interval_us, "a, &wmarks); + sysfs_scheme->apply_interval_us, "a, &wmarks, + sysfs_scheme->target_nid); if (!scheme) return NULL; - err = damos_sysfs_set_quota_score(sysfs_quotas->goals, &scheme->quota); + err = damos_sysfs_add_quota_score(sysfs_quotas->goals, &scheme->quota); if (err) { damon_destroy_scheme(scheme); return NULL; } - err = damon_sysfs_set_scheme_filters(scheme, sysfs_filters); + err = damon_sysfs_add_scheme_filters(scheme, sysfs_filters); if (err) { damon_destroy_scheme(scheme); return NULL; @@ -2049,66 +2084,12 @@ static struct damos *damon_sysfs_mk_scheme( return scheme; } -static void damon_sysfs_update_scheme(struct damos *scheme, - struct damon_sysfs_scheme *sysfs_scheme) -{ - struct damon_sysfs_access_pattern *access_pattern = - sysfs_scheme->access_pattern; - struct damon_sysfs_quotas *sysfs_quotas = sysfs_scheme->quotas; - struct damon_sysfs_weights *sysfs_weights = sysfs_quotas->weights; - struct damon_sysfs_watermarks *sysfs_wmarks = sysfs_scheme->watermarks; - int err; - - scheme->pattern.min_sz_region = access_pattern->sz->min; - scheme->pattern.max_sz_region = access_pattern->sz->max; - scheme->pattern.min_nr_accesses = access_pattern->nr_accesses->min; - scheme->pattern.max_nr_accesses = access_pattern->nr_accesses->max; - scheme->pattern.min_age_region = access_pattern->age->min; - scheme->pattern.max_age_region = access_pattern->age->max; - - scheme->action = sysfs_scheme->action; - scheme->apply_interval_us = sysfs_scheme->apply_interval_us; - - scheme->quota.ms = sysfs_quotas->ms; - scheme->quota.sz = sysfs_quotas->sz; - scheme->quota.reset_interval = sysfs_quotas->reset_interval_ms; - scheme->quota.weight_sz = sysfs_weights->sz; - scheme->quota.weight_nr_accesses = sysfs_weights->nr_accesses; - scheme->quota.weight_age = sysfs_weights->age; - - err = damos_sysfs_set_quota_score(sysfs_quotas->goals, &scheme->quota); - if (err) { - damon_destroy_scheme(scheme); - return; - } - - scheme->wmarks.metric = sysfs_wmarks->metric; - scheme->wmarks.interval = sysfs_wmarks->interval_us; - scheme->wmarks.high = sysfs_wmarks->high; - scheme->wmarks.mid = sysfs_wmarks->mid; - scheme->wmarks.low = sysfs_wmarks->low; - - err = damon_sysfs_set_scheme_filters(scheme, sysfs_scheme->filters); - if (err) - damon_destroy_scheme(scheme); -} - -int damon_sysfs_set_schemes(struct damon_ctx *ctx, +int damon_sysfs_add_schemes(struct damon_ctx *ctx, struct damon_sysfs_schemes *sysfs_schemes) { - struct damos *scheme, *next; - int i = 0; + int i; - damon_for_each_scheme_safe(scheme, next, ctx) { - if (i < sysfs_schemes->nr) - damon_sysfs_update_scheme(scheme, - sysfs_schemes->schemes_arr[i]); - else - damon_destroy_scheme(scheme); - i++; - } - - for (; i < sysfs_schemes->nr; i++) { + for (i = 0; i < sysfs_schemes->nr; i++) { struct damos *scheme, *next; scheme = damon_sysfs_mk_scheme(sysfs_schemes->schemes_arr[i]); diff --git a/mm/damon/sysfs-test.h b/mm/damon/sysfs-test.h index 73bdce2452c1..1c9b596057a7 100644 --- a/mm/damon/sysfs-test.h +++ b/mm/damon/sysfs-test.h @@ -38,7 +38,7 @@ static int __damon_sysfs_test_get_any_pid(int min, int max) return -1; } -static void damon_sysfs_test_set_targets(struct kunit *test) +static void damon_sysfs_test_add_targets(struct kunit *test) { struct damon_sysfs_targets *sysfs_targets; struct damon_sysfs_target *sysfs_target; @@ -56,13 +56,13 @@ static void damon_sysfs_test_set_targets(struct kunit *test) ctx = damon_new_ctx(); - damon_sysfs_set_targets(ctx, sysfs_targets); + damon_sysfs_add_targets(ctx, sysfs_targets); KUNIT_EXPECT_EQ(test, 1u, nr_damon_targets(ctx)); sysfs_target->pid = __damon_sysfs_test_get_any_pid( sysfs_target->pid + 1, 200); - damon_sysfs_set_targets(ctx, sysfs_targets); - KUNIT_EXPECT_EQ(test, 1u, nr_damon_targets(ctx)); + damon_sysfs_add_targets(ctx, sysfs_targets); + KUNIT_EXPECT_EQ(test, 2u, nr_damon_targets(ctx)); damon_destroy_ctx(ctx); kfree(sysfs_targets->targets_arr); @@ -71,7 +71,7 @@ static void damon_sysfs_test_set_targets(struct kunit *test) } static struct kunit_case damon_sysfs_test_cases[] = { - KUNIT_CASE(damon_sysfs_test_set_targets), + KUNIT_CASE(damon_sysfs_test_add_targets), {}, }; diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 6fee383bc0c5..cffc755e7775 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1162,72 +1162,16 @@ destroy_targets_out: return err; } -static int damon_sysfs_update_target_pid(struct damon_target *target, int pid) -{ - struct pid *pid_new; - - pid_new = find_get_pid(pid); - if (!pid_new) - return -EINVAL; - - if (pid_new == target->pid) { - put_pid(pid_new); - return 0; - } - - put_pid(target->pid); - target->pid = pid_new; - return 0; -} - -static int damon_sysfs_update_target(struct damon_target *target, - struct damon_ctx *ctx, - struct damon_sysfs_target *sys_target) -{ - int err = 0; - - if (damon_target_has_pid(ctx)) { - err = damon_sysfs_update_target_pid(target, sys_target->pid); - if (err) - return err; - } - - /* - * Do monitoring target region boundary update only if one or more - * regions are set by the user. This is for keeping current monitoring - * target results and range easier, especially for dynamic monitoring - * target regions update ops like 'vaddr'. - */ - if (sys_target->regions->nr) - err = damon_sysfs_set_regions(target, sys_target->regions); - return err; -} - -static int damon_sysfs_set_targets(struct damon_ctx *ctx, +static int damon_sysfs_add_targets(struct damon_ctx *ctx, struct damon_sysfs_targets *sysfs_targets) { - struct damon_target *t, *next; - int i = 0, err; + int i, err; /* Multiple physical address space monitoring targets makes no sense */ if (ctx->ops.id == DAMON_OPS_PADDR && sysfs_targets->nr > 1) return -EINVAL; - damon_for_each_target_safe(t, next, ctx) { - if (i < sysfs_targets->nr) { - err = damon_sysfs_update_target(t, ctx, - sysfs_targets->targets_arr[i]); - if (err) - return err; - } else { - if (damon_target_has_pid(ctx)) - put_pid(t->pid); - damon_destroy_target(t); - } - i++; - } - - for (; i < sysfs_targets->nr; i++) { + for (i = 0; i < sysfs_targets->nr; i++) { struct damon_sysfs_target *st = sysfs_targets->targets_arr[i]; err = damon_sysfs_add_target(st, ctx); @@ -1339,12 +1283,15 @@ static int damon_sysfs_apply_inputs(struct damon_ctx *ctx, err = damon_sysfs_set_attrs(ctx, sys_ctx->attrs); if (err) return err; - err = damon_sysfs_set_targets(ctx, sys_ctx->targets); + err = damon_sysfs_add_targets(ctx, sys_ctx->targets); if (err) return err; - return damon_sysfs_set_schemes(ctx, sys_ctx->schemes); + return damon_sysfs_add_schemes(ctx, sys_ctx->schemes); } +static struct damon_ctx *damon_sysfs_build_ctx( + struct damon_sysfs_context *sys_ctx); + /* * damon_sysfs_commit_input() - Commit user inputs to a running kdamond. * @kdamond: The kobject wrapper for the associated kdamond. @@ -1353,14 +1300,22 @@ static int damon_sysfs_apply_inputs(struct damon_ctx *ctx, */ static int damon_sysfs_commit_input(struct damon_sysfs_kdamond *kdamond) { + struct damon_ctx *param_ctx; + int err; + if (!damon_sysfs_kdamond_running(kdamond)) return -EINVAL; /* TODO: Support multiple contexts per kdamond */ if (kdamond->contexts->nr != 1) return -EINVAL; - return damon_sysfs_apply_inputs(kdamond->damon_ctx, - kdamond->contexts->contexts_arr[0]); + param_ctx = damon_sysfs_build_ctx(kdamond->contexts->contexts_arr[0]); + if (IS_ERR(param_ctx)) + return PTR_ERR(param_ctx); + err = damon_commit_ctx(kdamond->damon_ctx, param_ctx); + damon_sysfs_destroy_targets(param_ctx); + damon_destroy_ctx(param_ctx); + return err; } static int damon_sysfs_commit_schemes_quota_goals( diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 381559e4a1fa..58829baf8b5d 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -339,7 +339,7 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr) { bool referenced = false; - pte_t entry = huge_ptep_get(pte); + pte_t entry = huge_ptep_get(mm, addr, pte); struct folio *folio = pfn_folio(pte_pfn(entry)); unsigned long psize = huge_page_size(hstate_vma(vma)); @@ -373,7 +373,7 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask, pte_t entry; ptl = huge_pte_lock(h, walk->mm, pte); - entry = huge_ptep_get(pte); + entry = huge_ptep_get(walk->mm, addr, pte); if (!pte_present(entry)) goto out; @@ -509,7 +509,7 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask, pte_t entry; ptl = huge_pte_lock(h, walk->mm, pte); - entry = huge_ptep_get(pte); + entry = huge_ptep_get(walk->mm, addr, pte); if (!pte_present(entry)) goto out; diff --git a/mm/dmapool_test.c b/mm/dmapool_test.c index 370fb9e209ef..54b1fd1ccfbb 100644 --- a/mm/dmapool_test.c +++ b/mm/dmapool_test.c @@ -144,4 +144,5 @@ static void dmapool_exit(void) module_init(dmapool_checks); module_exit(dmapool_exit); +MODULE_DESCRIPTION("dma_pool timing test"); MODULE_LICENSE("GPL"); diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c index b1b09cce9394..532851ce5132 100644 --- a/mm/fail_page_alloc.c +++ b/mm/fail_page_alloc.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include static struct { @@ -21,7 +22,7 @@ static int __init setup_fail_page_alloc(char *str) } __setup("fail_page_alloc=", setup_fail_page_alloc); -bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { int flags = 0; @@ -41,6 +42,7 @@ bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) return should_fail_ex(&fail_page_alloc.attr, 1 << order, flags); } +ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS diff --git a/mm/failslab.c b/mm/failslab.c index ffc420c0e767..af16c2ed578f 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include #include #include "slab.h" @@ -14,23 +15,23 @@ static struct { .cache_filter = false, }; -bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) +int should_failslab(struct kmem_cache *s, gfp_t gfpflags) { int flags = 0; /* No fault-injection for bootstrap cache */ if (unlikely(s == kmem_cache)) - return false; + return 0; if (gfpflags & __GFP_NOFAIL) - return false; + return 0; if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_DIRECT_RECLAIM)) - return false; + return 0; if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB)) - return false; + return 0; /* * In some cases, it expects to specify __GFP_NOWARN @@ -41,8 +42,9 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) if (gfpflags & __GFP_NOWARN) flags |= FAULT_NOWARN; - return should_fail_ex(&failslab.attr, s->object_size, flags); + return should_fail_ex(&failslab.attr, s->object_size, flags) ? -ENOMEM : 0; } +ALLOW_ERROR_INJECTION(should_failslab, ERRNO); static int __init setup_failslab(char *str) { diff --git a/mm/filemap.c b/mm/filemap.c index ca8c8d889eef..d62150418b91 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -177,7 +177,7 @@ static void filemap_unaccount_folio(struct address_space *mapping, * and we'd rather not leak it: if we're wrong, * another bad page check should catch it later. */ - page_mapcount_reset(&folio->page); + atomic_set(&folio->_mapcount, -1); folio_ref_sub(folio, mapcount); } } @@ -1752,12 +1752,12 @@ pgoff_t page_cache_next_miss(struct address_space *mapping, while (max_scan--) { void *entry = xas_next(&xas); if (!entry || xa_is_value(entry)) - break; + return xas.xa_index; if (xas.xa_index == 0) - break; + return 0; } - return xas.xa_index; + return index + max_scan; } EXPORT_SYMBOL(page_cache_next_miss); diff --git a/mm/folio-compat.c b/mm/folio-compat.c index f31e0ce65b11..f05906006b3c 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -10,12 +10,6 @@ #include #include "internal.h" -struct address_space *page_mapping(struct page *page) -{ - return folio_mapping(page_folio(page)); -} -EXPORT_SYMBOL(page_mapping); - void unlock_page(struct page *page) { return folio_unlock(page_folio(page)); diff --git a/mm/gup.c b/mm/gup.c index f1d6bc06eb52..54d0dc3831fb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -17,6 +18,7 @@ #include #include #include +#include #include #include @@ -188,6 +190,19 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); +/** + * unpin_folio() - release a dma-pinned folio + * @folio: pointer to folio to be released + * + * Folios that were pinned via memfd_pin_folios() or other similar routines + * must be released either using unpin_folio() or unpin_folios(). + */ +void unpin_folio(struct folio *folio) +{ + gup_put_folio(folio, 1, FOLL_PIN); +} +EXPORT_SYMBOL_GPL(unpin_folio); + /** * folio_add_pin - Try to get an additional pin on a pinned folio * @folio: The folio to be pinned @@ -290,7 +305,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, * 1) This code sees the page as already dirty, so it * skips the call to set_page_dirty(). That could happen * because clear_page_dirty_for_io() called - * page_mkclean(), followed by set_page_dirty(). + * folio_mkclean(), followed by set_page_dirty(). * However, now the page is going to get written back, * which meets the original intention of setting it * dirty, so all is well: clear_page_dirty_for_io() goes @@ -400,6 +415,40 @@ void unpin_user_pages(struct page **pages, unsigned long npages) } EXPORT_SYMBOL(unpin_user_pages); +/** + * unpin_folios() - release an array of gup-pinned folios. + * @folios: array of folios to be marked dirty and released. + * @nfolios: number of folios in the @folios array. + * + * For each folio in the @folios array, release the folio using gup_put_folio. + * + * Please see the unpin_folio() documentation for details. + */ +void unpin_folios(struct folio **folios, unsigned long nfolios) +{ + unsigned long i = 0, j; + + /* + * If this WARN_ON() fires, then the system *might* be leaking folios + * (by leaving them pinned), but probably not. More likely, gup/pup + * returned a hard -ERRNO error to the caller, who erroneously passed + * it here. + */ + if (WARN_ON(IS_ERR_VALUE(nfolios))) + return; + + while (i < nfolios) { + for (j = i + 1; j < nfolios; j++) + if (folios[i] != folios[j]) + break; + + if (folios[i]) + gup_put_folio(folios[i], j - i, FOLL_PIN); + i = j; + } +} +EXPORT_SYMBOL_GPL(unpin_folios); + /* * Set the MMF_HAS_PINNED if not set yet; after set it'll be there for the mm's * lifecycle. Avoid setting the bit unless necessary, or it might cause write @@ -413,7 +462,7 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags) #ifdef CONFIG_MMU -#if defined(CONFIG_ARCH_HAS_HUGEPD) || defined(CONFIG_HAVE_GUP_FAST) +#ifdef CONFIG_HAVE_GUP_FAST static int record_subpages(struct page *page, unsigned long sz, unsigned long addr, unsigned long end, struct page **pages) @@ -523,154 +572,7 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs, return folio; } -#endif /* CONFIG_ARCH_HAS_HUGEPD || CONFIG_HAVE_GUP_FAST */ - -#ifdef CONFIG_ARCH_HAS_HUGEPD -static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, - unsigned long sz) -{ - unsigned long __boundary = (addr + sz) & ~(sz-1); - return (__boundary - 1 < end - 1) ? __boundary : end; -} - -/* - * Returns 1 if succeeded, 0 if failed, -EMLINK if unshare needed. - * - * NOTE: for the same entry, gup-fast and gup-slow can return different - * results (0 v.s. -EMLINK) depending on whether vma is available. This is - * the expected behavior, where we simply want gup-fast to fallback to - * gup-slow to take the vma reference first. - */ -static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz, - unsigned long addr, unsigned long end, unsigned int flags, - struct page **pages, int *nr, bool fast) -{ - unsigned long pte_end; - struct page *page; - struct folio *folio; - pte_t pte; - int refs; - - pte_end = (addr + sz) & ~(sz-1); - if (pte_end < end) - end = pte_end; - - pte = huge_ptep_get(ptep); - - if (!pte_access_permitted(pte, flags & FOLL_WRITE)) - return 0; - - /* hugepages are never "special" */ - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - - page = pte_page(pte); - refs = record_subpages(page, sz, addr, end, pages + *nr); - - if (fast) { - folio = try_grab_folio_fast(page, refs, flags); - if (!folio) - return 0; - } else { - folio = page_folio(page); - if (try_grab_folio(folio, refs, flags)) - return 0; - } - - if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { - gup_put_folio(folio, refs, flags); - return 0; - } - - if (!pte_write(pte) && gup_must_unshare(vma, flags, &folio->page)) { - gup_put_folio(folio, refs, flags); - return -EMLINK; - } - - *nr += refs; - folio_set_referenced(folio); - return 1; -} - -/* - * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file - * systems on Power, which does not have issue with folio writeback against - * GUP updates. When hugepd will be extended to support non-hugetlbfs or - * even anonymous memory, we need to do extra check as what we do with most - * of the other folios. See writable_file_mapping_allowed() and - * gup_fast_folio_allowed() for more information. - */ -static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, - unsigned long addr, unsigned int pdshift, - unsigned long end, unsigned int flags, - struct page **pages, int *nr, bool fast) -{ - pte_t *ptep; - unsigned long sz = 1UL << hugepd_shift(hugepd); - unsigned long next; - int ret; - - ptep = hugepte_offset(hugepd, addr, pdshift); - do { - next = hugepte_addr_end(addr, end, sz); - ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr, - fast); - if (ret != 1) - return ret; - } while (ptep++, addr = next, addr != end); - - return 1; -} - -static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, - unsigned long addr, unsigned int pdshift, - unsigned int flags, - struct follow_page_context *ctx) -{ - struct page *page; - struct hstate *h; - spinlock_t *ptl; - int nr = 0, ret; - pte_t *ptep; - - /* Only hugetlb supports hugepd */ - if (WARN_ON_ONCE(!is_vm_hugetlb_page(vma))) - return ERR_PTR(-EFAULT); - - h = hstate_vma(vma); - ptep = hugepte_offset(hugepd, addr, pdshift); - ptl = huge_pte_lock(h, vma->vm_mm, ptep); - ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE, - flags, &page, &nr, false); - spin_unlock(ptl); - - if (ret == 1) { - /* GUP succeeded */ - WARN_ON_ONCE(nr != 1); - ctx->page_mask = (1U << huge_page_order(h)) - 1; - return page; - } - - /* ret can be either 0 (translates to NULL) or negative */ - return ERR_PTR(ret); -} -#else /* CONFIG_ARCH_HAS_HUGEPD */ -static inline int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, - unsigned long addr, unsigned int pdshift, - unsigned long end, unsigned int flags, - struct page **pages, int *nr, bool fast) -{ - return 0; -} - -static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd, - unsigned long addr, unsigned int pdshift, - unsigned int flags, - struct follow_page_context *ctx) -{ - return NULL; -} -#endif /* CONFIG_ARCH_HAS_HUGEPD */ - +#endif /* CONFIG_HAVE_GUP_FAST */ static struct page *no_page_table(struct vm_area_struct *vma, unsigned int flags, unsigned long address) @@ -786,7 +688,7 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page, return false; /* ... and a write-fault isn't required for other reasons. */ - if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd)) + if (pmd_needs_soft_dirty_wp(vma, pmd)) return false; return !userfaultfd_huge_pmd_wp(vma, pmd); } @@ -907,7 +809,7 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page, return false; /* ... and a write-fault isn't required for other reasons. */ - if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) + if (pte_needs_soft_dirty_wp(vma, pte)) return false; return !userfaultfd_pte_wp(vma, pte); } @@ -1040,9 +942,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return no_page_table(vma, flags, address); if (!pmd_present(pmdval)) return no_page_table(vma, flags, address); - if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval))))) - return follow_hugepd(vma, __hugepd(pmd_val(pmdval)), - address, PMD_SHIFT, flags, ctx); if (pmd_devmap(pmdval)) { ptl = pmd_lock(mm, pmd); page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap); @@ -1093,9 +992,6 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, pud = READ_ONCE(*pudp); if (!pud_present(pud)) return no_page_table(vma, flags, address); - if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) - return follow_hugepd(vma, __hugepd(pud_val(pud)), - address, PUD_SHIFT, flags, ctx); if (pud_leaf(pud)) { ptl = pud_lock(mm, pudp); page = follow_huge_pud(vma, address, pudp, flags, ctx); @@ -1121,10 +1017,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, p4d = READ_ONCE(*p4dp); BUILD_BUG_ON(p4d_leaf(p4d)); - if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) - return follow_hugepd(vma, __hugepd(p4d_val(p4d)), - address, P4D_SHIFT, flags, ctx); - if (!p4d_present(p4d) || p4d_bad(p4d)) return no_page_table(vma, flags, address); @@ -1168,10 +1060,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma, ctx->page_mask = 0; pgd = pgd_offset(mm, address); - if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd))))) - page = follow_hugepd(vma, __hugepd(pgd_val(*pgd)), - address, PGDIR_SHIFT, flags, ctx); - else if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) page = no_page_table(vma, flags, address); else page = follow_p4d_mask(vma, address, pgd, flags, ctx); @@ -2394,19 +2283,19 @@ struct page *get_dump_page(unsigned long addr) #ifdef CONFIG_MIGRATION /* - * Returns the number of collected pages. Return value is always >= 0. + * Returns the number of collected folios. Return value is always >= 0. */ -static unsigned long collect_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, - struct page **pages) +static unsigned long collect_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios) { unsigned long i, collected = 0; struct folio *prev_folio = NULL; bool drain_allow = true; - for (i = 0; i < nr_pages; i++) { - struct folio *folio = page_folio(pages[i]); + for (i = 0; i < nr_folios; i++) { + struct folio *folio = folios[i]; if (folio == prev_folio) continue; @@ -2421,7 +2310,7 @@ static unsigned long collect_longterm_unpinnable_pages( continue; if (folio_test_hugetlb(folio)) { - isolate_hugetlb(folio, movable_page_list); + isolate_hugetlb(folio, movable_folio_list); continue; } @@ -2433,7 +2322,7 @@ static unsigned long collect_longterm_unpinnable_pages( if (!folio_isolate_lru(folio)) continue; - list_add_tail(&folio->lru, movable_page_list); + list_add_tail(&folio->lru, movable_folio_list); node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); @@ -2443,27 +2332,28 @@ static unsigned long collect_longterm_unpinnable_pages( } /* - * Unpins all pages and migrates device coherent pages and movable_page_list. - * Returns -EAGAIN if all pages were successfully migrated or -errno for failure - * (or partial success). + * Unpins all folios and migrates device coherent folios and movable_folio_list. + * Returns -EAGAIN if all folios were successfully migrated or -errno for + * failure (or partial success). */ -static int migrate_longterm_unpinnable_pages( - struct list_head *movable_page_list, - unsigned long nr_pages, - struct page **pages) +static int migrate_longterm_unpinnable_folios( + struct list_head *movable_folio_list, + unsigned long nr_folios, + struct folio **folios) { int ret; unsigned long i; - for (i = 0; i < nr_pages; i++) { - struct folio *folio = page_folio(pages[i]); + for (i = 0; i < nr_folios; i++) { + struct folio *folio = folios[i]; if (folio_is_device_coherent(folio)) { /* - * Migration will fail if the page is pinned, so convert - * the pin on the source page to a normal reference. + * Migration will fail if the folio is pinned, so + * convert the pin on the source folio to a normal + * reference. */ - pages[i] = NULL; + folios[i] = NULL; folio_get(folio); gup_put_folio(folio, 1, FOLL_PIN); @@ -2476,24 +2366,24 @@ static int migrate_longterm_unpinnable_pages( } /* - * We can't migrate pages with unexpected references, so drop + * We can't migrate folios with unexpected references, so drop * the reference obtained by __get_user_pages_locked(). - * Migrating pages have been added to movable_page_list after + * Migrating folios have been added to movable_folio_list after * calling folio_isolate_lru() which takes a reference so the - * page won't be freed if it's migrating. + * folio won't be freed if it's migrating. */ - unpin_user_page(pages[i]); - pages[i] = NULL; + unpin_folio(folios[i]); + folios[i] = NULL; } - if (!list_empty(movable_page_list)) { + if (!list_empty(movable_folio_list)) { struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_NOWARN, .reason = MR_LONGTERM_PIN, }; - if (migrate_pages(movable_page_list, alloc_migration_target, + if (migrate_pages(movable_folio_list, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN, NULL)) { ret = -ENOMEM; @@ -2501,48 +2391,71 @@ static int migrate_longterm_unpinnable_pages( } } - putback_movable_pages(movable_page_list); + putback_movable_pages(movable_folio_list); return -EAGAIN; err: - for (i = 0; i < nr_pages; i++) - if (pages[i]) - unpin_user_page(pages[i]); - putback_movable_pages(movable_page_list); + unpin_folios(folios, nr_folios); + putback_movable_pages(movable_folio_list); return ret; } /* - * Check whether all pages are *allowed* to be pinned. Rather confusingly, all - * pages in the range are required to be pinned via FOLL_PIN, before calling - * this routine. + * Check whether all folios are *allowed* to be pinned indefinitely (longterm). + * Rather confusingly, all folios in the range are required to be pinned via + * FOLL_PIN, before calling this routine. * - * If any pages in the range are not allowed to be pinned, then this routine - * will migrate those pages away, unpin all the pages in the range and return + * If any folios in the range are not allowed to be pinned, then this routine + * will migrate those folios away, unpin all the folios in the range and return * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then * call this routine again. * * If an error other than -EAGAIN occurs, this indicates a migration failure. * The caller should give up, and propagate the error back up the call stack. * - * If everything is OK and all pages in the range are allowed to be pinned, then - * this routine leaves all pages pinned and returns zero for success. + * If everything is OK and all folios in the range are allowed to be pinned, + * then this routine leaves all folios pinned and returns zero for success. + */ +static long check_and_migrate_movable_folios(unsigned long nr_folios, + struct folio **folios) +{ + unsigned long collected; + LIST_HEAD(movable_folio_list); + + collected = collect_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); + if (!collected) + return 0; + + return migrate_longterm_unpinnable_folios(&movable_folio_list, + nr_folios, folios); +} + +/* + * This routine just converts all the pages in the @pages array to folios and + * calls check_and_migrate_movable_folios() to do the heavy lifting. + * + * Please see the check_and_migrate_movable_folios() documentation for details. */ static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages) { - unsigned long collected; - LIST_HEAD(movable_page_list); + struct folio **folios; + long i, ret; - collected = collect_longterm_unpinnable_pages(&movable_page_list, - nr_pages, pages); - if (!collected) - return 0; + folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; - return migrate_longterm_unpinnable_pages(&movable_page_list, nr_pages, - pages); + for (i = 0; i < nr_pages; i++) + folios[i] = page_folio(pages[i]); + + ret = check_and_migrate_movable_folios(nr_pages, folios); + + kfree(folios); + return ret; } #else static long check_and_migrate_movable_pages(unsigned long nr_pages, @@ -2550,6 +2463,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, { return 0; } + +static long check_and_migrate_movable_folios(unsigned long nr_folios, + struct folio **folios) +{ + return 0; +} #endif /* CONFIG_MIGRATION */ /* @@ -3283,15 +3202,6 @@ static int gup_fast_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, pages, nr)) return 0; - } else if (unlikely(is_hugepd(__hugepd(pmd_val(pmd))))) { - /* - * architecture have different format for hugetlbfs - * pmd format and THP pmd format - */ - if (gup_hugepd(NULL, __hugepd(pmd_val(pmd)), addr, - PMD_SHIFT, next, flags, pages, nr, - true) != 1) - return 0; } else if (!gup_fast_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) return 0; @@ -3318,11 +3228,6 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, if (!gup_fast_pud_leaf(pud, pudp, addr, next, flags, pages, nr)) return 0; - } else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) { - if (gup_hugepd(NULL, __hugepd(pud_val(pud)), addr, - PUD_SHIFT, next, flags, pages, nr, - true) != 1) - return 0; } else if (!gup_fast_pmd_range(pudp, pud, addr, next, flags, pages, nr)) return 0; @@ -3346,13 +3251,8 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, if (!p4d_present(p4d)) return 0; BUILD_BUG_ON(p4d_leaf(p4d)); - if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) { - if (gup_hugepd(NULL, __hugepd(p4d_val(p4d)), addr, - P4D_SHIFT, next, flags, pages, nr, - true) != 1) - return 0; - } else if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags, - pages, nr)) + if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags, + pages, nr)) return 0; } while (p4dp++, addr = next, addr != end); @@ -3376,11 +3276,6 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end, if (!gup_fast_pgd_leaf(pgd, pgdp, addr, next, flags, pages, nr)) return; - } else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) { - if (gup_hugepd(NULL, __hugepd(pgd_val(pgd)), addr, - PGDIR_SHIFT, next, flags, pages, nr, - true) != 1) - return; } else if (!gup_fast_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) return; @@ -3687,3 +3582,140 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: the first memfd offset + * @end: the last memfd offset (inclusive) + * @folios: array that receives pointers to the folios pinned + * @max_folios: maximum number of entries in @folios + * @offset: the offset into the first folio + * + * Attempt to pin folios associated with a memfd in the contiguous range + * [start, end]. Given that a memfd is either backed by shmem or hugetlb, + * the folios can either be found in the page cache or need to be allocated + * if necessary. Once the folios are located, they are all pinned via + * FOLL_PIN and @offset is populatedwith the offset into the first folio. + * And, eventually, these pinned folios must be released either using + * unpin_folios() or unpin_folio(). + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned, which could be less than @max_folios + * as it depends on the folio sizes that cover the range [start, end]. + * If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, + struct folio **folios, unsigned int max_folios, + pgoff_t *offset) +{ + unsigned int flags, nr_folios, nr_found; + unsigned int i, pgshift = PAGE_SHIFT; + pgoff_t start_idx, end_idx, next_idx; + struct folio *folio = NULL; + struct folio_batch fbatch; + struct hstate *h; + long ret = -EINVAL; + + if (start < 0 || start > end || !max_folios) + return -EINVAL; + + if (!memfd) + return -EINVAL; + + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) + return -EINVAL; + + if (end >= i_size_read(file_inode(memfd))) + return -EINVAL; + + if (is_file_hugepages(memfd)) { + h = hstate_file(memfd); + pgshift = huge_page_shift(h); + } + + flags = memalloc_pin_save(); + do { + nr_folios = 0; + start_idx = start >> pgshift; + end_idx = end >> pgshift; + if (is_file_hugepages(memfd)) { + start_idx <<= huge_page_order(h); + end_idx <<= huge_page_order(h); + } + + folio_batch_init(&fbatch); + while (start_idx <= end_idx && nr_folios < max_folios) { + /* + * In most cases, we should be able to find the folios + * in the page cache. If we cannot find them for some + * reason, we try to allocate them and add them to the + * page cache. + */ + nr_found = filemap_get_folios_contig(memfd->f_mapping, + &start_idx, + end_idx, + &fbatch); + if (folio) { + folio_put(folio); + folio = NULL; + } + + next_idx = 0; + for (i = 0; i < nr_found; i++) { + /* + * As there can be multiple entries for a + * given folio in the batch returned by + * filemap_get_folios_contig(), the below + * check is to ensure that we pin and return a + * unique set of folios between start and end. + */ + if (next_idx && + next_idx != folio_index(fbatch.folios[i])) + continue; + + folio = page_folio(&fbatch.folios[i]->page); + + if (try_grab_folio(folio, 1, FOLL_PIN)) { + folio_batch_release(&fbatch); + ret = -EINVAL; + goto err; + } + + if (nr_folios == 0) + *offset = offset_in_folio(folio, start); + + folios[nr_folios] = folio; + next_idx = folio_next_index(folio); + if (++nr_folios == max_folios) + break; + } + + folio = NULL; + folio_batch_release(&fbatch); + if (!nr_found) { + folio = memfd_alloc_folio(memfd, start_idx); + if (IS_ERR(folio)) { + ret = PTR_ERR(folio); + if (ret != -EEXIST) + goto err; + } + } + } + + ret = check_and_migrate_movable_folios(nr_folios, folios); + } while (ret == -EAGAIN); + + memalloc_pin_restore(flags); + return ret ? ret : nr_folios; +err: + memalloc_pin_restore(flags); + unpin_folios(folios, nr_folios); + + return ret; +} +EXPORT_SYMBOL_GPL(memfd_pin_folios); diff --git a/mm/highmem.c b/mm/highmem.c index bd48ba445dd4..ef3189b36cad 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -111,13 +111,10 @@ static inline wait_queue_head_t *get_pkmap_wait_queue_head(unsigned int color) } #endif -atomic_long_t _totalhigh_pages __read_mostly; -EXPORT_SYMBOL(_totalhigh_pages); - -unsigned int __nr_free_highpages(void) +unsigned long __nr_free_highpages(void) { + unsigned long pages = 0; struct zone *zone; - unsigned int pages = 0; for_each_populated_zone(zone) { if (is_highmem(zone)) @@ -127,6 +124,20 @@ unsigned int __nr_free_highpages(void) return pages; } +unsigned long __totalhigh_pages(void) +{ + unsigned long pages = 0; + struct zone *zone; + + for_each_populated_zone(zone) { + if (is_highmem(zone)) + pages += zone_managed_pages(zone); + } + + return pages; +} +EXPORT_SYMBOL(__totalhigh_pages); + static int pkmap_count[LAST_PKMAP]; static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kmap_lock); diff --git a/mm/hmm.c b/mm/hmm.c index 93aebd9cc130..7e0229ae4a5a 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -480,7 +480,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask, pte_t entry; ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte); - entry = huge_ptep_get(pte); + entry = huge_ptep_get(walk->mm, addr, pte); i = (start - range->start) >> PAGE_SHIFT; pfn_req_flags = range->hmm_pfns[i]; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2120f7478e55..f9696c94e211 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -150,10 +151,15 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, * Must be done before hugepage flags check since shmem has its * own flags. */ - if (!in_pf && shmem_file(vma->vm_file)) - return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags) - ? orders : 0; + if (!in_pf && shmem_file(vma->vm_file)) { + bool global_huge = shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, + !enforce_sysfs, vma->vm_mm, vm_flags); + + if (!vma_is_anon_shmem(vma)) + return global_huge ? orders : 0; + return shmem_allowable_huge_orders(file_inode(vma->vm_file), + vma, vma->vm_pgoff, global_huge); + } if (!vma_is_anonymous(vma)) { /* @@ -449,14 +455,6 @@ static void thpsize_release(struct kobject *kobj); static DEFINE_SPINLOCK(huge_anon_orders_lock); static LIST_HEAD(thpsize_list); -struct thpsize { - struct kobject kobj; - struct list_head node; - int order; -}; - -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj) - static ssize_t thpsize_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -509,6 +507,13 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj, } else ret = -EINVAL; + if (ret > 0) { + int err; + + err = start_stop_khugepaged(); + if (err) + ret = err; + } return ret; } @@ -517,6 +522,9 @@ static struct kobj_attribute thpsize_enabled_attr = static struct attribute *thpsize_attrs[] = { &thpsize_enabled_attr.attr, +#ifdef CONFIG_SHMEM + &thpsize_shmem_enabled_attr.attr, +#endif NULL, }; @@ -560,6 +568,12 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); +DEFINE_MTHP_STAT_ATTR(shmem_alloc, MTHP_STAT_SHMEM_ALLOC); +DEFINE_MTHP_STAT_ATTR(shmem_fallback, MTHP_STAT_SHMEM_FALLBACK); +DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); +DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); +DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); static struct attribute *stats_attrs[] = { &anon_fault_alloc_attr.attr, @@ -567,6 +581,12 @@ static struct attribute *stats_attrs[] = { &anon_fault_fallback_charge_attr.attr, &swpout_attr.attr, &swpout_fallback_attr.attr, + &shmem_alloc_attr.attr, + &shmem_fallback_attr.attr, + &shmem_fallback_charge_attr.attr, + &split_attr.attr, + &split_failed_attr.attr, + &split_deferred_attr.attr, NULL, }; @@ -942,10 +962,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, goto release; } - clear_huge_page(page, vmf->address, HPAGE_PMD_NR); + folio_zero_user(folio, vmf->address); /* * The memory barrier inside __folio_mark_uptodate makes sure that - * clear_huge_page writes become visible before the set_pmd_at() + * folio_zero_user writes become visible before the set_pmd_at() * write. */ __folio_mark_uptodate(folio); @@ -972,7 +992,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - folio_add_new_anon_rmap(folio, vma, haddr); + folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); @@ -1624,7 +1644,7 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma, return false; /* Do we need write faults for softdirty tracking? */ - if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd)) + if (pmd_needs_soft_dirty_wp(vma, pmd)) return false; /* Do we need write faults for uffd-wp tracking? */ @@ -1651,7 +1671,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) unsigned long haddr = vmf->address & HPAGE_PMD_MASK; int nid = NUMA_NO_NODE; int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK); - bool migrated = false, writable = false; + bool writable = false; int flags = 0; vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); @@ -1687,16 +1707,17 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) if (node_is_toptier(nid)) last_cpupid = folio_last_cpupid(folio); target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags); - if (target_nid == NUMA_NO_NODE) { - folio_put(folio); + if (target_nid == NUMA_NO_NODE) + goto out_map; + if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { + flags |= TNF_MIGRATE_FAIL; goto out_map; } - + /* The folio is isolated and isolation code holds a folio reference. */ spin_unlock(vmf->ptl); writable = false; - migrated = migrate_misplaced_folio(folio, vma, target_nid); - if (migrated) { + if (!migrate_misplaced_folio(folio, vma, target_nid)) { flags |= TNF_MIGRATED; nid = target_nid; } else { @@ -2581,6 +2602,27 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pmd_populate(mm, pmd, pgtable); } +void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmd, bool freeze, struct folio *folio) +{ + VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio)); + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE)); + VM_WARN_ON_ONCE(folio && !folio_test_locked(folio)); + VM_BUG_ON(freeze && !folio); + + /* + * When the caller requests to set up a migration entry, we + * require a folio to check the PMD against. Otherwise, there + * is a risk of replacing the wrong folio. + */ + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || + is_pmd_migration_entry(*pmd)) { + if (folio && folio != pmd_folio(*pmd)) + return; + __split_huge_pmd_locked(vma, pmd, address, freeze); + } +} + void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, bool freeze, struct folio *folio) { @@ -2592,26 +2634,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); mmu_notifier_invalidate_range_start(&range); ptl = pmd_lock(vma->vm_mm, pmd); - - /* - * If caller asks to setup a migration entry, we need a folio to check - * pmd against. Otherwise we can end up replacing wrong folio. - */ - VM_BUG_ON(freeze && !folio); - VM_WARN_ON_ONCE(folio && !folio_test_locked(folio)); - - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || - is_pmd_migration_entry(*pmd)) { - /* - * It's safe to call pmd_page when folio is set because it's - * guaranteed that pmd is present. - */ - if (folio && folio != pmd_folio(*pmd)) - goto out; - __split_huge_pmd_locked(vma, pmd, range.start, freeze); - } - -out: + split_huge_pmd_locked(vma, range.start, pmd, freeze, folio); spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); } @@ -2685,6 +2708,71 @@ static void unmap_folio(struct folio *folio) try_to_unmap_flush(); } +static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp, + struct folio *folio) +{ + struct mm_struct *mm = vma->vm_mm; + int ref_count, map_count; + pmd_t orig_pmd = *pmdp; + + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd)) + return false; + + orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp); + + /* + * Syncing against concurrent GUP-fast: + * - clear PMD; barrier; read refcount + * - inc refcount; barrier; read PMD + */ + smp_mb(); + + ref_count = folio_ref_count(folio); + map_count = folio_mapcount(folio); + + /* + * Order reads for folio refcount and dirty flag + * (see comments in __remove_mapping()). + */ + smp_rmb(); + + /* + * If the folio or its PMD is redirtied at this point, or if there + * are unexpected references, we will give up to discard this folio + * and remap it. + * + * The only folio refs must be one from isolation plus the rmap(s). + */ + if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) || + ref_count != map_count + 1) { + set_pmd_at(mm, addr, pmdp, orig_pmd); + return false; + } + + folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma); + zap_deposited_table(mm, pmdp); + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); + if (vma->vm_flags & VM_LOCKED) + mlock_drain_local(); + folio_put(folio); + + return true; +} + +bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmdp, struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE)); + + if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) + return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio); + + return false; +} + static void remap_page(struct folio *folio, unsigned long nr) { int i = 0; @@ -2838,7 +2926,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, split_page_memcg(head, order, new_order); if (folio_test_anon(folio) && folio_test_swapcache(folio)) { - offset = swp_offset(folio->swap); + offset = swap_cache_index(folio->swap); swap_cache = swap_address_space(folio->swap); xa_lock(&swap_cache->i_pages); } @@ -2998,7 +3086,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; - bool is_thp = folio_test_pmd_mappable(folio); + int order = folio_order(folio); int extra_pins, ret; pgoff_t end; bool is_hzp; @@ -3183,27 +3271,17 @@ out_unlock: i_mmap_unlock_read(mapping); out: xas_destroy(&xas); - if (is_thp) + if (order == HPAGE_PMD_ORDER) count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED); + count_mthp_stat(order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED); return ret; } -void folio_undo_large_rmappable(struct folio *folio) +void __folio_undo_large_rmappable(struct folio *folio) { struct deferred_split *ds_queue; unsigned long flags; - if (folio_order(folio) <= 1) - return; - - /* - * At this point, there is no one trying to add the folio to - * deferred_list. If folio is not in deferred_list, it's safe - * to check without acquiring the split_queue_lock. - */ - if (data_race(list_empty(&folio->_deferred_list))) - return; - ds_queue = get_deferred_split_queue(folio); spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { @@ -3248,6 +3326,7 @@ void deferred_split_folio(struct folio *folio) if (list_empty(&folio->_deferred_list)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_DEFERRED_SPLIT_PAGE); + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); list_add_tail(&folio->_deferred_list, &ds_queue->split_queue); ds_queue->split_queue_len++; #ifdef CONFIG_MEMCG diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 43e1af868cfd..0858a1827207 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1355,6 +1355,10 @@ static struct folio *dequeue_hugetlb_folio_nodemask(struct hstate *h, gfp_t gfp_ struct zoneref *z; int node = NUMA_NO_NODE; + /* 'nid' should not be NUMA_NO_NODE. Try to catch any misuse of it and rectifiy. */ + if (nid == NUMA_NO_NODE) + nid = numa_node_id(); + zonelist = node_zonelist(nid, gfp_mask); retry_cpuset: @@ -2257,13 +2261,11 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h, * pages is zero. */ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, - gfp_t gfp_mask, int nid, nodemask_t *nmask, - nodemask_t *node_alloc_noretry) + gfp_t gfp_mask, int nid, nodemask_t *nmask) { struct folio *folio; - folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, - node_alloc_noretry); + folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL); if (!folio) return NULL; @@ -2481,7 +2483,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h, goto out_unlock; spin_unlock_irq(&hugetlb_lock); - folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL); + folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask); if (!folio) return NULL; @@ -2517,7 +2519,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas if (hstate_is_gigantic(h)) return NULL; - folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL); + folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask); if (!folio) return NULL; @@ -2586,6 +2588,23 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask); } +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + /* * Increase the hugetlb pool such that it can accommodate a reservation * of size 'delta'. @@ -2599,6 +2618,8 @@ static int gather_surplus_pages(struct hstate *h, long delta) long i; long needed, allocated; bool alloc_ok = true; + int node; + nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h)); lockdep_assert_held(&hugetlb_lock); needed = (h->resv_huge_pages + delta) - h->free_huge_pages; @@ -2613,8 +2634,15 @@ static int gather_surplus_pages(struct hstate *h, long delta) retry: spin_unlock_irq(&hugetlb_lock); for (i = 0; i < needed; i++) { - folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h), - NUMA_NO_NODE, NULL); + folio = NULL; + for_each_node_mask(node, cpuset_current_mems_allowed) { + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) { + folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h), + node, NULL); + if (folio) + break; + } + } if (!folio) { alloc_ok = false; break; @@ -3439,7 +3467,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, - &node_states[N_MEMORY], NULL); + &node_states[N_MEMORY]); if (!folio) break; free_huge_folio(folio); /* free it into the hugepage allocator */ @@ -4617,7 +4645,7 @@ void __init hugetlb_add_hstate(unsigned int order) BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); h = &hstates[hugetlb_max_hstate++]; - mutex_init(&h->resize_lock); + __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); h->order = order; h->mask = ~(huge_page_size(h) - 1); for (i = 0; i < MAX_NUMNODES; ++i) @@ -4840,23 +4868,6 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); -static nodemask_t *policy_mbind_nodemask(gfp_t gfp) -{ -#ifdef CONFIG_NUMA - struct mempolicy *mpol = get_task_policy(current); - - /* - * Only enforce MPOL_BIND policy which overlaps with cpuset policy - * (from policy_nodemask) specifically for hugetlb case - */ - if (mpol->mode == MPOL_BIND && - (apply_policy_zone(mpol, gfp_zone(gfp)) && - cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) - return &mpol->nodes; -#endif - return NULL; -} - static unsigned int allowed_mems_nr(struct hstate *h) { int node; @@ -4875,7 +4886,7 @@ static unsigned int allowed_mems_nr(struct hstate *h) } #ifdef CONFIG_SYSCTL -static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write, +static int proc_hugetlb_doulongvec_minmax(const struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos, unsigned long *out) { @@ -4892,7 +4903,7 @@ static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write, } static int hugetlb_sysctl_handler_common(bool obey_mempolicy, - struct ctl_table *table, int write, + const struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { struct hstate *h = &default_hstate; @@ -5279,7 +5290,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma, { pte_t entry; - entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep))); + entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(vma->vm_mm, address, ptep))); if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1)) update_mmu_cache(vma, address, ptep); } @@ -5387,7 +5398,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, dst_ptl = huge_pte_lock(h, dst, dst_pte); src_ptl = huge_pte_lockptr(h, src, src_pte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); - entry = huge_ptep_get(src_pte); + entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte); again: if (huge_pte_none(entry)) { /* @@ -5425,7 +5436,7 @@ again: set_huge_pte_at(dst, addr, dst_pte, make_pte_marker(marker), sz); } else { - entry = huge_ptep_get(src_pte); + entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte); pte_folio = page_folio(pte_page(entry)); folio_get(pte_folio); @@ -5454,9 +5465,8 @@ again: ret = PTR_ERR(new_folio); break; } - ret = copy_user_large_folio(new_folio, - pte_folio, - addr, dst_vma); + ret = copy_user_large_folio(new_folio, pte_folio, + ALIGN_DOWN(addr, sz), dst_vma); folio_put(pte_folio); if (ret) { folio_put(new_folio); @@ -5467,7 +5477,7 @@ again: dst_ptl = huge_pte_lock(h, dst, dst_pte); src_ptl = huge_pte_lockptr(h, src, src_pte); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); - entry = huge_ptep_get(src_pte); + entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte); if (!pte_same(src_pte_old, entry)) { restore_reserve_on_error(h, dst_vma, addr, new_folio); @@ -5577,7 +5587,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, new_addr |= last_addr_mask; continue; } - if (huge_pte_none(huge_ptep_get(src_pte))) + if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte))) continue; if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) { @@ -5650,7 +5660,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, continue; } - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(mm, address, ptep); if (huge_pte_none(pte)) { spin_unlock(ptl); continue; @@ -5899,7 +5909,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio, struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; - pte_t pte = huge_ptep_get(vmf->pte); + pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte); struct hstate *h = hstate_vma(vma); struct folio *old_folio; struct folio *new_folio; @@ -6020,7 +6030,7 @@ retry_avoidcopy: vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h)); if (likely(vmf->pte && - pte_same(huge_ptep_get(vmf->pte), pte))) + pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) goto retry_avoidcopy; /* * race occurs while re-acquiring page table @@ -6058,7 +6068,7 @@ retry_avoidcopy: */ spin_lock(vmf->ptl); vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h)); - if (likely(vmf->pte && pte_same(huge_ptep_get(vmf->pte), pte))) { + if (likely(vmf->pte && pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) { pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare); /* Break COW or unshare */ @@ -6159,14 +6169,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_fault *vmf, * Recheck pte with pgtable lock. Returns true if pte didn't change, or * false if pte changed or is changing. */ -static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, +static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t old_pte) { spinlock_t *ptl; bool same; ptl = huge_pte_lock(h, mm, ptep); - same = pte_same(huge_ptep_get(ptep), old_pte); + same = pte_same(huge_ptep_get(mm, addr, ptep), old_pte); spin_unlock(ptl); return same; @@ -6227,7 +6237,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, * never happen on the page after UFFDIO_COPY has * correctly installed the page and returned. */ - if (!hugetlb_pte_stable(h, mm, vmf->pte, vmf->orig_pte)) { + if (!hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) { ret = 0; goto out; } @@ -6256,14 +6266,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, * here. Before returning error, get ptl and make * sure there really is no pte entry. */ - if (hugetlb_pte_stable(h, mm, vmf->pte, vmf->orig_pte)) + if (hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) ret = vmf_error(PTR_ERR(folio)); else ret = 0; goto out; } - clear_huge_page(&folio->page, vmf->real_address, - pages_per_huge_page(h)); + folio_zero_user(folio, vmf->real_address); __folio_mark_uptodate(folio); new_folio = true; @@ -6306,7 +6315,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, folio_unlock(folio); folio_put(folio); /* See comment in userfaultfd_missing() block above */ - if (!hugetlb_pte_stable(h, mm, vmf->pte, vmf->orig_pte)) { + if (!hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) { ret = 0; goto out; } @@ -6333,7 +6342,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping, vmf->ptl = huge_pte_lock(h, mm, vmf->pte); ret = 0; /* If pte changed from under us, retry */ - if (!pte_same(huge_ptep_get(vmf->pte), vmf->orig_pte)) + if (!pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), vmf->orig_pte)) goto backout; if (anon_rmap) @@ -6454,7 +6463,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, return VM_FAULT_OOM; } - vmf.orig_pte = huge_ptep_get(vmf.pte); + vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte); if (huge_pte_none_mostly(vmf.orig_pte)) { if (is_pte_marker(vmf.orig_pte)) { pte_marker marker = @@ -6495,7 +6504,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * be released there. */ mutex_unlock(&hugetlb_fault_mutex_table[hash]); - migration_entry_wait_huge(vma, vmf.pte); + migration_entry_wait_huge(vma, vmf.address, vmf.pte); return 0; } else if (unlikely(is_hugetlb_entry_hwpoisoned(vmf.orig_pte))) ret = VM_FAULT_HWPOISON_LARGE | @@ -6528,11 +6537,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, vmf.ptl = huge_pte_lock(h, mm, vmf.pte); /* Check for a racing update before calling hugetlb_wp() */ - if (unlikely(!pte_same(vmf.orig_pte, huge_ptep_get(vmf.pte)))) + if (unlikely(!pte_same(vmf.orig_pte, huge_ptep_get(mm, vmf.address, vmf.pte)))) goto out_ptl; /* Handle userfault-wp first, before trying to lock more pages */ - if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(vmf.pte)) && + if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, vmf.address, vmf.pte)) && (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) { if (!userfaultfd_wp_async(vma)) { spin_unlock(vmf.ptl); @@ -6647,7 +6656,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct hstate *h = hstate_vma(dst_vma); struct address_space *mapping = dst_vma->vm_file->f_mapping; pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); - unsigned long size; + unsigned long size = huge_page_size(h); int vm_shared = dst_vma->vm_flags & VM_SHARED; pte_t _dst_pte; spinlock_t *ptl; @@ -6660,14 +6669,13 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, ptl = huge_pte_lock(h, dst_mm, dst_pte); /* Don't overwrite any existing PTEs (even markers) */ - if (!huge_pte_none(huge_ptep_get(dst_pte))) { + if (!huge_pte_none(huge_ptep_get(dst_mm, dst_addr, dst_pte))) { spin_unlock(ptl); return -EEXIST; } _dst_pte = make_pte_marker(PTE_MARKER_POISONED); - set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, - huge_page_size(h)); + set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, size); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); @@ -6741,7 +6749,8 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, *foliop = NULL; goto out; } - ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma); + ret = copy_user_large_folio(folio, *foliop, + ALIGN_DOWN(dst_addr, size), dst_vma); folio_put(*foliop); *foliop = NULL; if (ret) { @@ -6768,9 +6777,8 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, /* Add shared, newly allocated pages to the page cache. */ if (vm_shared && !is_continue) { - size = i_size_read(mapping->host) >> huge_page_shift(h); ret = -EFAULT; - if (idx >= size) + if (idx >= (i_size_read(mapping->host) >> huge_page_shift(h))) goto out_release_nounlock; /* @@ -6797,7 +6805,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, * page backing it, then access the page. */ ret = -EEXIST; - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) + if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) goto out_release_unlock; if (folio_in_pagecache) @@ -6827,7 +6835,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, if (wp_enabled) _dst_pte = huge_pte_mkuffd_wp(_dst_pte); - set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, huge_page_size(h)); + set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, size); hugetlb_count_add(pages_per_huge_page(h), dst_mm); @@ -6918,7 +6926,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma, address |= last_addr_mask; continue; } - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(mm, address, ptep); if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) { /* Nothing to do. */ } else if (unlikely(is_hugetlb_entry_migration(pte))) { diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index e20339a346b9..4ff238ba1250 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -27,7 +27,17 @@ #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) +/* Use t->m[0] to encode the offset */ +#define MEMFILE_OFFSET(t, m0) (((offsetof(t, m0) << 16) | sizeof_field(t, m0))) +#define MEMFILE_OFFSET0(val) (((val) >> 16) & 0xffff) +#define MEMFILE_FIELD_SIZE(val) ((val) & 0xffff) + +#define DFL_TMPL_SIZE ARRAY_SIZE(hugetlb_dfl_tmpl) +#define LEGACY_TMPL_SIZE ARRAY_SIZE(hugetlb_legacy_tmpl) + static struct hugetlb_cgroup *root_h_cgroup __read_mostly; +static struct cftype *dfl_files; +static struct cftype *legacy_files; static inline struct page_counter * __hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx, @@ -460,7 +470,7 @@ static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy) int nid; struct cftype *cft = seq_cft(seq); int idx = MEMFILE_IDX(cft->private); - bool legacy = MEMFILE_ATTR(cft->private); + bool legacy = !cgroup_subsys_on_dfl(hugetlb_cgrp_subsys); struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq)); struct cgroup_subsys_state *css; unsigned long usage; @@ -702,166 +712,185 @@ static int hugetlb_events_local_show(struct seq_file *seq, void *v) return __hugetlb_events_show(seq, true); } -static void __init __hugetlb_cgroup_file_dfl_init(int idx) +static struct cftype hugetlb_dfl_tmpl[] = { + { + .name = "max", + .private = RES_LIMIT, + .seq_show = hugetlb_cgroup_read_u64_max, + .write = hugetlb_cgroup_write_dfl, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "rsvd.max", + .private = RES_RSVD_LIMIT, + .seq_show = hugetlb_cgroup_read_u64_max, + .write = hugetlb_cgroup_write_dfl, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "current", + .private = RES_USAGE, + .seq_show = hugetlb_cgroup_read_u64_max, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "rsvd.current", + .private = RES_RSVD_USAGE, + .seq_show = hugetlb_cgroup_read_u64_max, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "events", + .seq_show = hugetlb_events_show, + .file_offset = MEMFILE_OFFSET(struct hugetlb_cgroup, events_file[0]), + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "events.local", + .seq_show = hugetlb_events_local_show, + .file_offset = MEMFILE_OFFSET(struct hugetlb_cgroup, events_local_file[0]), + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "numa_stat", + .seq_show = hugetlb_cgroup_read_numa_stat, + .flags = CFTYPE_NOT_ON_ROOT, + }, + /* don't need terminator here */ +}; + +static struct cftype hugetlb_legacy_tmpl[] = { + { + .name = "limit_in_bytes", + .private = RES_LIMIT, + .read_u64 = hugetlb_cgroup_read_u64, + .write = hugetlb_cgroup_write_legacy, + }, + { + .name = "rsvd.limit_in_bytes", + .private = RES_RSVD_LIMIT, + .read_u64 = hugetlb_cgroup_read_u64, + .write = hugetlb_cgroup_write_legacy, + }, + { + .name = "usage_in_bytes", + .private = RES_USAGE, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "rsvd.usage_in_bytes", + .private = RES_RSVD_USAGE, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "max_usage_in_bytes", + .private = RES_MAX_USAGE, + .write = hugetlb_cgroup_reset, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "rsvd.max_usage_in_bytes", + .private = RES_RSVD_MAX_USAGE, + .write = hugetlb_cgroup_reset, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "failcnt", + .private = RES_FAILCNT, + .write = hugetlb_cgroup_reset, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "rsvd.failcnt", + .private = RES_RSVD_FAILCNT, + .write = hugetlb_cgroup_reset, + .read_u64 = hugetlb_cgroup_read_u64, + }, + { + .name = "numa_stat", + .seq_show = hugetlb_cgroup_read_numa_stat, + }, + /* don't need terminator here */ +}; + +static void __init +hugetlb_cgroup_cfttypes_init(struct hstate *h, struct cftype *cft, + struct cftype *tmpl, int tmpl_size) { char buf[32]; - struct cftype *cft; - struct hstate *h = &hstates[idx]; + int i, idx = hstate_index(h); /* format the size */ mem_fmt(buf, sizeof(buf), huge_page_size(h)); - /* Add the limit file */ - cft = &h->cgroup_files_dfl[0]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT); - cft->seq_show = hugetlb_cgroup_read_u64_max; - cft->write = hugetlb_cgroup_write_dfl; - cft->flags = CFTYPE_NOT_ON_ROOT; + for (i = 0; i < tmpl_size; cft++, tmpl++, i++) { + *cft = *tmpl; + /* rebuild the name */ + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.%s", buf, tmpl->name); + /* rebuild the private */ + cft->private = MEMFILE_PRIVATE(idx, tmpl->private); + /* rebuild the file_offset */ + if (tmpl->file_offset) { + unsigned int offset = tmpl->file_offset; - /* Add the reservation limit file */ - cft = &h->cgroup_files_dfl[1]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT); - cft->seq_show = hugetlb_cgroup_read_u64_max; - cft->write = hugetlb_cgroup_write_dfl; - cft->flags = CFTYPE_NOT_ON_ROOT; + cft->file_offset = MEMFILE_OFFSET0(offset) + + MEMFILE_FIELD_SIZE(offset) * idx; + } - /* Add the current usage file */ - cft = &h->cgroup_files_dfl[2]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); - cft->seq_show = hugetlb_cgroup_read_u64_max; - cft->flags = CFTYPE_NOT_ON_ROOT; + lockdep_register_key(&cft->lockdep_key); + } +} - /* Add the current reservation usage file */ - cft = &h->cgroup_files_dfl[3]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.current", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE); - cft->seq_show = hugetlb_cgroup_read_u64_max; - cft->flags = CFTYPE_NOT_ON_ROOT; +static void __init __hugetlb_cgroup_file_dfl_init(struct hstate *h) +{ + int idx = hstate_index(h); - /* Add the events file */ - cft = &h->cgroup_files_dfl[4]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf); - cft->private = MEMFILE_PRIVATE(idx, 0); - cft->seq_show = hugetlb_events_show; - cft->file_offset = offsetof(struct hugetlb_cgroup, events_file[idx]); - cft->flags = CFTYPE_NOT_ON_ROOT; + hugetlb_cgroup_cfttypes_init(h, dfl_files + idx * DFL_TMPL_SIZE, + hugetlb_dfl_tmpl, DFL_TMPL_SIZE); +} - /* Add the events.local file */ - cft = &h->cgroup_files_dfl[5]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf); - cft->private = MEMFILE_PRIVATE(idx, 0); - cft->seq_show = hugetlb_events_local_show; - cft->file_offset = offsetof(struct hugetlb_cgroup, - events_local_file[idx]); - cft->flags = CFTYPE_NOT_ON_ROOT; +static void __init __hugetlb_cgroup_file_legacy_init(struct hstate *h) +{ + int idx = hstate_index(h); - /* Add the numa stat file */ - cft = &h->cgroup_files_dfl[6]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf); - cft->private = MEMFILE_PRIVATE(idx, 0); - cft->seq_show = hugetlb_cgroup_read_numa_stat; - cft->flags = CFTYPE_NOT_ON_ROOT; + hugetlb_cgroup_cfttypes_init(h, legacy_files + idx * LEGACY_TMPL_SIZE, + hugetlb_legacy_tmpl, LEGACY_TMPL_SIZE); +} - /* NULL terminate the last cft */ - cft = &h->cgroup_files_dfl[7]; - memset(cft, 0, sizeof(*cft)); +static void __init __hugetlb_cgroup_file_init(struct hstate *h) +{ + __hugetlb_cgroup_file_dfl_init(h); + __hugetlb_cgroup_file_legacy_init(h); +} +static void __init __hugetlb_cgroup_file_pre_init(void) +{ + int cft_count; + + cft_count = hugetlb_max_hstate * DFL_TMPL_SIZE + 1; /* add terminator */ + dfl_files = kcalloc(cft_count, sizeof(struct cftype), GFP_KERNEL); + BUG_ON(!dfl_files); + cft_count = hugetlb_max_hstate * LEGACY_TMPL_SIZE + 1; /* add terminator */ + legacy_files = kcalloc(cft_count, sizeof(struct cftype), GFP_KERNEL); + BUG_ON(!legacy_files); +} + +static void __init __hugetlb_cgroup_file_post_init(void) +{ WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys, - h->cgroup_files_dfl)); -} - -static void __init __hugetlb_cgroup_file_legacy_init(int idx) -{ - char buf[32]; - struct cftype *cft; - struct hstate *h = &hstates[idx]; - - /* format the size */ - mem_fmt(buf, sizeof(buf), huge_page_size(h)); - - /* Add the limit file */ - cft = &h->cgroup_files_legacy[0]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT); - cft->read_u64 = hugetlb_cgroup_read_u64; - cft->write = hugetlb_cgroup_write_legacy; - - /* Add the reservation limit file */ - cft = &h->cgroup_files_legacy[1]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.limit_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT); - cft->read_u64 = hugetlb_cgroup_read_u64; - cft->write = hugetlb_cgroup_write_legacy; - - /* Add the usage file */ - cft = &h->cgroup_files_legacy[2]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the reservation usage file */ - cft = &h->cgroup_files_legacy[3]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE); - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the MAX usage file */ - cft = &h->cgroup_files_legacy[4]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); - cft->write = hugetlb_cgroup_reset; - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the MAX reservation usage file */ - cft = &h->cgroup_files_legacy[5]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max_usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_MAX_USAGE); - cft->write = hugetlb_cgroup_reset; - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the failcntfile */ - cft = &h->cgroup_files_legacy[6]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); - cft->write = hugetlb_cgroup_reset; - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the reservation failcntfile */ - cft = &h->cgroup_files_legacy[7]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.failcnt", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_FAILCNT); - cft->write = hugetlb_cgroup_reset; - cft->read_u64 = hugetlb_cgroup_read_u64; - - /* Add the numa stat file */ - cft = &h->cgroup_files_legacy[8]; - snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf); - cft->private = MEMFILE_PRIVATE(idx, 1); - cft->seq_show = hugetlb_cgroup_read_numa_stat; - - /* NULL terminate the last cft */ - cft = &h->cgroup_files_legacy[9]; - memset(cft, 0, sizeof(*cft)); - + dfl_files)); WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, - h->cgroup_files_legacy)); -} - -static void __init __hugetlb_cgroup_file_init(int idx) -{ - __hugetlb_cgroup_file_dfl_init(idx); - __hugetlb_cgroup_file_legacy_init(idx); + legacy_files)); } void __init hugetlb_cgroup_file_init(void) { struct hstate *h; + __hugetlb_cgroup_file_pre_init(); for_each_hstate(h) - __hugetlb_cgroup_file_init(hstate_index(h)); + __hugetlb_cgroup_file_init(h); + __hugetlb_cgroup_file_post_init(); } /* diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 8193906515c6..829112b0a914 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -184,10 +184,13 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end, */ static inline void free_vmemmap_page(struct page *page) { - if (PageReserved(page)) + if (PageReserved(page)) { free_bootmem_page(page); - else + mod_node_page_state(page_pgdat(page), NR_MEMMAP_BOOT, -1); + } else { __free_page(page); + mod_node_page_state(page_pgdat(page), NR_MEMMAP, -1); + } } /* Free a list of the vmemmap pages */ @@ -338,6 +341,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, copy_page(page_to_virt(walk.reuse_page), (void *)walk.reuse_addr); list_add(&walk.reuse_page->lru, vmemmap_pages); + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, 1); } /* @@ -384,14 +388,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, unsigned long nr_pages = (end - start) >> PAGE_SHIFT; int nid = page_to_nid((struct page *)start); struct page *page, *next; + int i; - while (nr_pages--) { + for (i = 0; i < nr_pages; i++) { page = alloc_pages_node(nid, gfp_mask, 0); - if (!page) + if (!page) { + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, i); goto out; + } list_add(&page->lru, list); } + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, nr_pages); + return 0; out: list_for_each_entry_safe(page, next, list, lru) diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c index c9d653f51e45..7ecaa1900137 100644 --- a/mm/hwpoison-inject.c +++ b/mm/hwpoison-inject.c @@ -110,4 +110,5 @@ static int __init pfn_inject_init(void) module_init(pfn_inject_init); module_exit(pfn_inject_exit); +MODULE_DESCRIPTION("HWPoison pages injector"); MODULE_LICENSE("GPL"); diff --git a/mm/internal.h b/mm/internal.h index cc2c5e07fad3..b4d86436565b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, } /** - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte. + * pte_move_swp_offset - Move the swap entry offset field of a swap pte + * forward or backward by delta * @pte: The initial pte state; is_swap_pte(pte) must be true and * non_swap_entry() must be false. + * @delta: The direction and the offset we are moving; forward if delta + * is positive; backward if delta is negative * - * Increments the swap offset, while maintaining all other fields, including + * Moves the swap offset, while maintaining all other fields, including * swap type, and any swp pte bits. The resulting pte is returned. */ -static inline pte_t pte_next_swp_offset(pte_t pte) +static inline pte_t pte_move_swp_offset(pte_t pte, long delta) { swp_entry_t entry = pte_to_swp_entry(pte); pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), - (swp_offset(entry) + 1))); + (swp_offset(entry) + delta))); if (pte_swp_soft_dirty(pte)) new = pte_swp_mksoft_dirty(new); @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte) return new; } + +/** + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte. + * @pte: The initial pte state; is_swap_pte(pte) must be true and + * non_swap_entry() must be false. + * + * Increments the swap offset, while maintaining all other fields, including + * swap type, and any swp pte bits. The resulting pte is returned. + */ +static inline pte_t pte_next_swp_offset(pte_t pte) +{ + return pte_move_swp_offset(pte, 1); +} + /** * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries * @start_ptep: Page table pointer for the first entry. @@ -587,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order, int mt); extern void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order); -extern void __free_pages_core(struct page *page, unsigned int order); +extern void __free_pages_core(struct page *page, unsigned int order, + enum meminit_context context); /* * This will have no effect, other than possibly generating a warning, if the @@ -604,7 +622,22 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) #endif } -void folio_undo_large_rmappable(struct folio *folio); +void __folio_undo_large_rmappable(struct folio *folio); +static inline void folio_undo_large_rmappable(struct folio *folio) +{ + if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio)) + return; + + /* + * At this point, there is no one trying to add the folio to + * deferred_list. If folio is not in deferred_list, it's safe + * to check without acquiring the split_queue_lock. + */ + if (data_race(list_empty(&folio->_deferred_list))) + return; + + __folio_undo_large_rmappable(folio); +} static inline struct folio *page_rmappable_folio(struct page *page) { @@ -1045,12 +1078,23 @@ extern u64 hwpoison_filter_flags_mask; extern u64 hwpoison_filter_flags_value; extern u64 hwpoison_filter_memcg; extern u32 hwpoison_filter_enable; +#define MAGIC_HWPOISON 0x48575053U /* HWPS */ +void SetPageHWPoisonTakenOff(struct page *page); +void ClearPageHWPoisonTakenOff(struct page *page); +bool take_page_off_buddy(struct page *page); +bool put_page_back_buddy(struct page *page); +struct task_struct *task_early_kill(struct task_struct *tsk, int force_early); +void add_to_kill_ksm(struct task_struct *tsk, struct page *p, + struct vm_area_struct *vma, struct list_head *to_kill, + unsigned long ksm_addr); +unsigned long page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); extern void set_pageblock_order(void); +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private); unsigned long reclaim_pages(struct list_head *folio_list); unsigned int reclaim_clean_pages_from_list(struct zone *zone, struct list_head *folio_list); @@ -1316,6 +1360,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) return !(vma->vm_flags & VM_SOFTDIRTY); } +static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t pmd) +{ + return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd); +} + +static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte) +{ + return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte); +} + static inline void vma_iter_config(struct vma_iterator *vmi, unsigned long index, unsigned long last) { @@ -1515,4 +1569,13 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry, void workingset_update_node(struct xa_node *node); extern struct list_lru shadow_nodes; +struct unlink_vma_file_batch { + int count; + struct vm_area_struct *vmas[8]; +}; + +void unlink_file_vma_batch_init(struct unlink_vma_file_batch *); +void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *); +void unlink_file_vma_batch_final(struct unlink_vma_file_batch *); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 964b8482275b..c5cb54fc696d 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex WRITE_ONCE(meta->state, next); } +#ifdef CONFIG_KMSAN +#define check_canary_attributes noinline __no_kmsan_checks +#else +#define check_canary_attributes inline +#endif + /* Check canary byte at @addr. */ -static inline bool check_canary_byte(u8 *addr) +static check_canary_attributes bool check_canary_byte(u8 *addr) { struct kfence_metadata *meta; unsigned long flags; @@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata *meta) *((u64 *)addr) = KFENCE_CANARY_PATTERN_U64; } -static inline void check_canary(const struct kfence_metadata *meta) +static check_canary_attributes void +check_canary(const struct kfence_metadata *meta) { const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE); unsigned long addr = pageaddr; @@ -595,7 +602,7 @@ static unsigned long kfence_init_pool(void) continue; __folio_set_slab(slab_folio(slab)); -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts | MEMCG_DATA_OBJEXTS; #endif @@ -645,7 +652,7 @@ reset_slab: if (!i || (i % 2)) continue; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG slab->obj_exts = 0; #endif __folio_clear_slab(slab_folio(slab)); @@ -1139,7 +1146,7 @@ void __kfence_free(void *addr) { struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr); -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG KFENCE_WARN_ON(meta->obj_exts.objcg); #endif /* diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h index 084f5f36e8e7..db87a05047bd 100644 --- a/mm/kfence/kfence.h +++ b/mm/kfence/kfence.h @@ -97,7 +97,7 @@ struct kfence_metadata { struct kfence_track free_track; /* For updating alloc_covered on frees. */ u32 alloc_stack_hash; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct slabobj_ext obj_exts; #endif }; diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c index 95b2b84c296d..00fd17285285 100644 --- a/mm/kfence/kfence_test.c +++ b/mm/kfence/kfence_test.c @@ -852,3 +852,4 @@ kunit_test_suites(&kfence_test_suite); MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Alexander Potapenko , Marco Elver "); +MODULE_DESCRIPTION("kfence unit test suite"); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index aab471791bd9..cdd1d8655a76 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -385,10 +385,7 @@ int hugepage_madvise(struct vm_area_struct *vma, int __init khugepaged_init(void) { - mm_slot_cache = kmem_cache_create("khugepaged_mm_slot", - sizeof(struct khugepaged_mm_slot), - __alignof__(struct khugepaged_mm_slot), - 0, NULL); + mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0); if (!mm_slot_cache) return -ENOMEM; @@ -416,6 +413,26 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) test_bit(MMF_DISABLE_THP, &mm->flags); } +static bool hugepage_pmd_enabled(void) +{ + /* + * We cover both the anon and the file-backed case here; file-backed + * hugepages, when configured in, are determined by the global control. + * Anon pmd-sized hugepages are determined by the pmd-size control. + */ + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && + hugepage_global_enabled()) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_always)) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_madvise)) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) && + hugepage_global_enabled()) + return true; + return false; +} + void __khugepaged_enter(struct mm_struct *mm) { struct khugepaged_mm_slot *mm_slot; @@ -452,7 +469,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && - hugepage_flags_enabled()) { + hugepage_pmd_enabled()) { if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS, PMD_ORDER)) __khugepaged_enter(vma->vm_mm); @@ -1213,7 +1230,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); - folio_add_new_anon_rmap(folio, vma, address); + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); @@ -2465,8 +2482,7 @@ breakouterloop_mmap_lock: static int khugepaged_has_work(void) { - return !list_empty(&khugepaged_scan.mm_head) && - hugepage_flags_enabled(); + return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled(); } static int khugepaged_wait_event(void) @@ -2539,7 +2555,7 @@ static void khugepaged_wait_work(void) return; } - if (hugepage_flags_enabled()) + if (hugepage_pmd_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); } @@ -2570,7 +2586,7 @@ static void set_recommended_min_free_kbytes(void) int nr_zones = 0; unsigned long recommended_min; - if (!hugepage_flags_enabled()) { + if (!hugepage_pmd_enabled()) { calculate_min_free_kbytes(); goto update_wmarks; } @@ -2620,7 +2636,7 @@ int start_stop_khugepaged(void) int err = 0; mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled()) { + if (hugepage_pmd_enabled()) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -2646,7 +2662,7 @@ fail: void khugepaged_min_free_kbytes_update(void) { mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled() && khugepaged_thread) + if (hugepage_pmd_enabled() && khugepaged_thread) set_recommended_min_free_kbytes(); mutex_unlock(&khugepaged_mutex); } diff --git a/mm/kmemleak.c b/mm/kmemleak.c index d5b6fba44fc9..764b08100570 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -657,10 +657,10 @@ static struct kmemleak_object *__alloc_object(gfp_t gfp) /* task information */ if (in_hardirq()) { object->pid = 0; - strncpy(object->comm, "hardirq", sizeof(object->comm)); + strscpy(object->comm, "hardirq"); } else if (in_serving_softirq()) { object->pid = 0; - strncpy(object->comm, "softirq", sizeof(object->comm)); + strscpy(object->comm, "softirq"); } else { object->pid = current->pid; /* @@ -669,7 +669,7 @@ static struct kmemleak_object *__alloc_object(gfp_t gfp) * dependency issues with current->alloc_lock. In the worst * case, the command line is not correct. */ - strncpy(object->comm, current->comm, sizeof(object->comm)); + strscpy(object->comm, current->comm); } /* kernel backtrace */ diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index 95f859e38c53..a495debf1436 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task) struct thread_info *info = current_thread_info(); __memset(ctx, 0, sizeof(*ctx)); - ctx->allow_reporting = true; kmsan_internal_unpoison_memory(info, sizeof(*info), false); } @@ -250,8 +249,8 @@ struct page *kmsan_vmalloc_to_page_or_null(void *vaddr) return NULL; } -void kmsan_internal_check_memory(void *addr, size_t size, const void *user_addr, - int reason) +void kmsan_internal_check_memory(void *addr, size_t size, + const void __user *user_addr, int reason) { depot_stack_handle_t cur_origin = 0, new_origin = 0; unsigned long addr64 = (unsigned long)addr; diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 22e8657800ef..3ea50f09311f 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -39,12 +39,10 @@ void kmsan_task_create(struct task_struct *task) void kmsan_task_exit(struct task_struct *task) { - struct kmsan_ctx *ctx = &task->kmsan_ctx; - if (!kmsan_enabled || kmsan_in_runtime()) return; - ctx->allow_reporting = false; + kmsan_disable_current(); } void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags) @@ -76,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object) return; /* RCU slabs could be legally used after free within the RCU period */ - if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))) + if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) return; /* * If there's a constructor, freed memory must remain in the same state @@ -267,7 +265,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy, return; ua_flags = user_access_save(); - if ((u64)to < TASK_SIZE) { + if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) || + (u64)to < TASK_SIZE) { /* This is a user memory access, check it. */ kmsan_internal_check_memory((void *)from, to_copy - left, to, REASON_COPY_TO_USER); @@ -304,7 +303,8 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out) if (is_out) kmsan_internal_check_memory(urb->transfer_buffer, urb->transfer_buffer_length, - /*user_addr*/ 0, REASON_SUBMIT_URB); + /*user_addr*/ NULL, + REASON_SUBMIT_URB); else kmsan_internal_unpoison_memory(urb->transfer_buffer, urb->transfer_buffer_length, @@ -317,14 +317,14 @@ static void kmsan_handle_dma_page(const void *addr, size_t size, { switch (dir) { case DMA_BIDIRECTIONAL: - kmsan_internal_check_memory((void *)addr, size, /*user_addr*/ 0, - REASON_ANY); + kmsan_internal_check_memory((void *)addr, size, + /*user_addr*/ NULL, REASON_ANY); kmsan_internal_unpoison_memory((void *)addr, size, /*checked*/ false); break; case DMA_TO_DEVICE: - kmsan_internal_check_memory((void *)addr, size, /*user_addr*/ 0, - REASON_ANY); + kmsan_internal_check_memory((void *)addr, size, + /*user_addr*/ NULL, REASON_ANY); break; case DMA_FROM_DEVICE: kmsan_internal_unpoison_memory((void *)addr, size, @@ -419,7 +419,21 @@ void kmsan_check_memory(const void *addr, size_t size) { if (!kmsan_enabled) return; - return kmsan_internal_check_memory((void *)addr, size, /*user_addr*/ 0, - REASON_ANY); + return kmsan_internal_check_memory((void *)addr, size, + /*user_addr*/ NULL, REASON_ANY); } EXPORT_SYMBOL(kmsan_check_memory); + +void kmsan_enable_current(void) +{ + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); + current->kmsan_ctx.depth--; +} +EXPORT_SYMBOL(kmsan_enable_current); + +void kmsan_disable_current(void) +{ + current->kmsan_ctx.depth++; + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); +} +EXPORT_SYMBOL(kmsan_disable_current); diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 3ac3b8921d36..10f52c085e6c 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) bool merged = false; KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES); - KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend); + KMSAN_WARN_ON((nstart >= nend) || + /* Virtual address 0 is valid on s390. */ + (!IS_ENABLED(CONFIG_S390) && !nstart) || + !nend); nstart = ALIGN_DOWN(nstart, PAGE_SIZE); nend = ALIGN(nend, PAGE_SIZE); @@ -72,7 +75,7 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) */ void __init kmsan_init_shadow(void) { - const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE); + const size_t nd_size = sizeof(pg_data_t); phys_addr_t p_start, p_end; u64 loop; int nid; @@ -172,7 +175,7 @@ static void do_collection(void) shadow = smallstack_pop(&collect); origin = smallstack_pop(&collect); kmsan_setup_meta(page, shadow, origin, collect.order); - __free_pages_core(page, collect.order); + __free_pages_core(page, collect.order, MEMINIT_EARLY); } } diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index cc3907a9c33a..02a405e55d6c 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -14,13 +14,15 @@ #include "kmsan.h" #include +#include #include #include #include static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store) { - if ((u64)addr < TASK_SIZE) + if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) && + (u64)addr < TASK_SIZE) return true; if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW)) return true; @@ -110,11 +112,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t size) ua_flags = user_access_save(); /* - * Most of the accesses are below 32 bytes. The two exceptions so far - * are clwb() (64 bytes) and FPU state (512 bytes). - * It's unlikely that the assembly will touch more than 512 bytes. + * Most of the accesses are below 32 bytes. The exceptions so far are + * clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes). */ - if (size > 512) { + if (size > 4096) { WARN_ONCE(1, "assembly store size too big: %ld\n", size); size = 8; } @@ -314,8 +315,8 @@ void __msan_warning(u32 origin) if (!kmsan_enabled || kmsan_in_runtime()) return; kmsan_enter_runtime(); - kmsan_report(origin, /*address*/ 0, /*size*/ 0, - /*off_first*/ 0, /*off_last*/ 0, /*user_addr*/ 0, + kmsan_report(origin, /*address*/ NULL, /*size*/ 0, + /*off_first*/ 0, /*off_last*/ 0, /*user_addr*/ NULL, REASON_ANY); kmsan_leave_runtime(); } diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index a14744205435..29555a8bc315 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -10,14 +10,15 @@ #ifndef __MM_KMSAN_KMSAN_H #define __MM_KMSAN_KMSAN_H -#include #include +#include +#include +#include +#include +#include #include #include #include -#include -#include -#include #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200 @@ -34,29 +35,6 @@ #define KMSAN_META_SHADOW (false) #define KMSAN_META_ORIGIN (true) -extern bool kmsan_enabled; -extern int panic_on_kmsan; - -/* - * KMSAN performs a lot of consistency checks that are currently enabled by - * default. BUG_ON is normally discouraged in the kernel, unless used for - * debugging, but KMSAN itself is a debugging tool, so it makes little sense to - * recover if something goes wrong. - */ -#define KMSAN_WARN_ON(cond) \ - ({ \ - const bool __cond = WARN_ON(cond); \ - if (unlikely(__cond)) { \ - WRITE_ONCE(kmsan_enabled, false); \ - if (panic_on_kmsan) { \ - /* Can't call panic() here because */ \ - /* of uaccess checks. */ \ - BUG(); \ - } \ - } \ - __cond; \ - }) - /* * A pair of metadata pointers to be returned by the instrumentation functions. */ @@ -66,7 +44,6 @@ struct shadow_origin_ptr { struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size, bool store); -void *kmsan_get_metadata(void *addr, bool is_origin); void __init kmsan_init_alloc_meta_for_range(void *start, void *end); enum kmsan_bug_reason { @@ -96,7 +73,7 @@ void kmsan_print_origin(depot_stack_handle_t origin); * @off_last corresponding to different @origin values. */ void kmsan_report(depot_stack_handle_t origin, void *address, int size, - int off_first, int off_last, const void *user_addr, + int off_first, int off_last, const void __user *user_addr, enum kmsan_bug_reason reason); DECLARE_PER_CPU(struct kmsan_ctx, kmsan_percpu_ctx); @@ -186,8 +163,8 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id); void kmsan_internal_task_create(struct task_struct *task); bool kmsan_metadata_is_contiguous(void *addr, size_t size); -void kmsan_internal_check_memory(void *addr, size_t size, const void *user_addr, - int reason); +void kmsan_internal_check_memory(void *addr, size_t size, + const void __user *user_addr, int reason); struct page *kmsan_vmalloc_to_page_or_null(void *vaddr); void kmsan_setup_meta(struct page *page, struct page *shadow, diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c index 07d3a3a5a9c5..13236d579eba 100644 --- a/mm/kmsan/kmsan_test.c +++ b/mm/kmsan/kmsan_test.c @@ -614,6 +614,32 @@ static void test_stackdepot_roundtrip(struct kunit *test) KUNIT_EXPECT_TRUE(test, report_matches(&expect)); } +/* + * Test case: ensure that kmsan_unpoison_memory() and the instrumentation work + * the same. + */ +static void test_unpoison_memory(struct kunit *test) +{ + EXPECTATION_UNINIT_VALUE_FN(expect, "test_unpoison_memory"); + volatile char a[4], b[4]; + + kunit_info( + test, + "unpoisoning via the instrumentation vs. kmsan_unpoison_memory() (2 UMR reports)\n"); + + /* Initialize a[0] and check a[1]--a[3]. */ + a[0] = 0; + kmsan_check_memory((char *)&a[1], 3); + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); + + report_reset(); + + /* Initialize b[0] and check b[1]--b[3]. */ + kmsan_unpoison_memory((char *)&b[0], 1); + kmsan_check_memory((char *)&b[1], 3); + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); +} + static struct kunit_case kmsan_test_cases[] = { KUNIT_CASE(test_uninit_kmalloc), KUNIT_CASE(test_init_kmalloc), @@ -637,6 +663,7 @@ static struct kunit_case kmsan_test_cases[] = { KUNIT_CASE(test_memset64), KUNIT_CASE(test_long_origin_chain), KUNIT_CASE(test_stackdepot_roundtrip), + KUNIT_CASE(test_unpoison_memory), {}, }; @@ -659,9 +686,13 @@ static void test_exit(struct kunit *test) { } +static int orig_panic_on_kmsan; + static int kmsan_suite_init(struct kunit_suite *suite) { register_trace_console(probe_console, NULL); + orig_panic_on_kmsan = panic_on_kmsan; + panic_on_kmsan = 0; return 0; } @@ -669,6 +700,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite) { unregister_trace_console(probe_console, NULL); tracepoint_synchronize_unregister(); + panic_on_kmsan = orig_panic_on_kmsan; } static struct kunit_suite kmsan_test_suite = { diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c index 02736ec757f2..94a3303fb65e 100644 --- a/mm/kmsan/report.c +++ b/mm/kmsan/report.c @@ -8,6 +8,7 @@ */ #include +#include #include #include #include @@ -20,6 +21,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock); /* Protected by kmsan_report_lock */ static char report_local_descr[DESCR_SIZE]; int panic_on_kmsan __read_mostly; +EXPORT_SYMBOL_GPL(panic_on_kmsan); #ifdef MODULE_PARAM_PREFIX #undef MODULE_PARAM_PREFIX @@ -146,7 +148,7 @@ void kmsan_print_origin(depot_stack_handle_t origin) } void kmsan_report(depot_stack_handle_t origin, void *address, int size, - int off_first, int off_last, const void *user_addr, + int off_first, int off_last, const void __user *user_addr, enum kmsan_bug_reason reason) { unsigned long stack_entries[KMSAN_STACK_DEPTH]; @@ -157,12 +159,12 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, if (!kmsan_enabled) return; - if (!current->kmsan_ctx.allow_reporting) + if (current->kmsan_ctx.depth) return; if (!origin) return; - current->kmsan_ctx.allow_reporting = false; + kmsan_disable_current(); ua_flags = user_access_save(); raw_spin_lock(&kmsan_report_lock); pr_err("=====================================================\n"); @@ -215,5 +217,5 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, if (panic_on_kmsan) panic("kmsan.panic set ...\n"); user_access_restore(ua_flags); - current->kmsan_ctx.allow_reporting = true; + kmsan_enable_current(); } diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index b9d05aff313e..9c58f081d84f 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -123,14 +123,12 @@ return_dummy: */ void *kmsan_get_metadata(void *address, bool is_origin) { - u64 addr = (u64)address, pad, off; + u64 addr = (u64)address, off; struct page *page; void *ret; - if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) { - pad = addr % KMSAN_ORIGIN_SIZE; - addr -= pad; - } + if (is_origin) + addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE); address = (void *)addr; if (kmsan_internal_is_vmalloc_addr(address) || kmsan_internal_is_module_addr(address)) @@ -243,7 +241,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end, s_pages[i] = shadow_page_for(pages[i]); o_pages[i] = origin_page_for(pages[i]); } - prot = __pgprot(pgprot_val(prot) | _PAGE_NX); prot = PAGE_KERNEL; origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN); diff --git a/mm/ksm.c b/mm/ksm.c index 34c4820e0d3d..df6bae3a5a2c 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -488,21 +488,17 @@ static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait); static DEFINE_MUTEX(ksm_thread_mutex); static DEFINE_SPINLOCK(ksm_mmlist_lock); -#define KSM_KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\ - sizeof(struct __struct), __alignof__(struct __struct),\ - (__flags), NULL) - static int __init ksm_slab_init(void) { - rmap_item_cache = KSM_KMEM_CACHE(ksm_rmap_item, 0); + rmap_item_cache = KMEM_CACHE(ksm_rmap_item, 0); if (!rmap_item_cache) goto out; - stable_node_cache = KSM_KMEM_CACHE(ksm_stable_node, 0); + stable_node_cache = KMEM_CACHE(ksm_stable_node, 0); if (!stable_node_cache) goto out_free1; - mm_slot_cache = KSM_KMEM_CACHE(ksm_mm_slot, 0); + mm_slot_cache = KMEM_CACHE(ksm_mm_slot, 0); if (!mm_slot_cache) goto out_free2; @@ -1531,6 +1527,44 @@ out: return err; } +/* + * This function returns 0 if the pages were merged or if they are + * no longer merging candidates (e.g., VMA stale), -EFAULT otherwise. + */ +static int try_to_merge_with_zero_page(struct ksm_rmap_item *rmap_item, + struct page *page) +{ + struct mm_struct *mm = rmap_item->mm; + int err = -EFAULT; + + /* + * Same checksum as an empty page. We attempt to merge it with the + * appropriate zero page if the user enabled this via sysfs. + */ + if (ksm_use_zero_pages && (rmap_item->oldchecksum == zero_checksum)) { + struct vm_area_struct *vma; + + mmap_read_lock(mm); + vma = find_mergeable_vma(mm, rmap_item->address); + if (vma) { + err = try_to_merge_one_page(vma, page, + ZERO_PAGE(rmap_item->address)); + trace_ksm_merge_one_page( + page_to_pfn(ZERO_PAGE(rmap_item->address)), + rmap_item, mm, err); + } else { + /* + * If the vma is out of date, we do not need to + * continue. + */ + err = 0; + } + mmap_read_unlock(mm); + } + + return err; +} + /* * try_to_merge_with_ksm_page - like try_to_merge_two_pages, * but no new kernel page is allocated: kpage must already be a ksm page. @@ -1625,7 +1659,6 @@ static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup, struct ksm_stable_node *dup, *found = NULL, *stable_node = *_stable_node; struct hlist_node *hlist_safe; struct folio *folio, *tree_folio = NULL; - int nr = 0; int found_rmap_hlist_len; if (!prune_stale_stable_nodes || @@ -1652,33 +1685,26 @@ static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup, folio = ksm_get_folio(dup, KSM_GET_FOLIO_NOLOCK); if (!folio) continue; - nr += 1; - if (is_page_sharing_candidate(dup)) { - if (!found || - dup->rmap_hlist_len > found_rmap_hlist_len) { - if (found) - folio_put(tree_folio); - found = dup; - found_rmap_hlist_len = found->rmap_hlist_len; - tree_folio = folio; - - /* skip put_page for found dup */ - if (!prune_stale_stable_nodes) - break; - continue; - } + /* Pick the best candidate if possible. */ + if (!found || (is_page_sharing_candidate(dup) && + (!is_page_sharing_candidate(found) || + dup->rmap_hlist_len > found_rmap_hlist_len))) { + if (found) + folio_put(tree_folio); + found = dup; + found_rmap_hlist_len = found->rmap_hlist_len; + tree_folio = folio; + /* skip put_page for found candidate */ + if (!prune_stale_stable_nodes && + is_page_sharing_candidate(found)) + break; + continue; } folio_put(folio); } if (found) { - /* - * nr is counting all dups in the chain only if - * prune_stale_stable_nodes is true, otherwise we may - * break the loop at nr == 1 even if there are - * multiple entries. - */ - if (prune_stale_stable_nodes && nr == 1) { + if (hlist_is_singular_node(&found->hlist_dup, &stable_node->hlist)) { /* * If there's not just one entry it would * corrupt memory, better BUG_ON. In KSM @@ -1730,25 +1756,15 @@ static struct folio *stable_node_dup(struct ksm_stable_node **_stable_node_dup, hlist_add_head(&found->hlist_dup, &stable_node->hlist); } + } else { + /* Its hlist must be empty if no one found. */ + free_stable_node_chain(stable_node, root); } *_stable_node_dup = found; return tree_folio; } -static struct ksm_stable_node *stable_node_dup_any(struct ksm_stable_node *stable_node, - struct rb_root *root) -{ - if (!is_stable_node_chain(stable_node)) - return stable_node; - if (hlist_empty(&stable_node->hlist)) { - free_stable_node_chain(stable_node, root); - return NULL; - } - return hlist_entry(stable_node->hlist.first, - typeof(*stable_node), hlist_dup); -} - /* * Like for ksm_get_folio, this function can free the *_stable_node and * *_stable_node_dup if the returned tree_page is NULL. @@ -1769,17 +1785,10 @@ static struct folio *__stable_node_chain(struct ksm_stable_node **_stable_node_d bool prune_stale_stable_nodes) { struct ksm_stable_node *stable_node = *_stable_node; + if (!is_stable_node_chain(stable_node)) { - if (is_page_sharing_candidate(stable_node)) { - *_stable_node_dup = stable_node; - return ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); - } - /* - * _stable_node_dup set to NULL means the stable_node - * reached the ksm_max_page_sharing limit. - */ - *_stable_node_dup = NULL; - return NULL; + *_stable_node_dup = stable_node; + return ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK); } return stable_node_dup(_stable_node_dup, _stable_node, root, prune_stale_stable_nodes); @@ -1793,16 +1802,10 @@ static __always_inline struct folio *chain_prune(struct ksm_stable_node **s_n_d, } static __always_inline struct folio *chain(struct ksm_stable_node **s_n_d, - struct ksm_stable_node *s_n, + struct ksm_stable_node **s_n, struct rb_root *root) { - struct ksm_stable_node *old_stable_node = s_n; - struct folio *tree_folio; - - tree_folio = __stable_node_chain(s_n_d, &s_n, root, false); - /* not pruning dups so s_n cannot have changed */ - VM_BUG_ON(s_n != old_stable_node); - return tree_folio; + return __stable_node_chain(s_n_d, s_n, root, false); } /* @@ -1820,7 +1823,7 @@ static struct page *stable_tree_search(struct page *page) struct rb_root *root; struct rb_node **new; struct rb_node *parent; - struct ksm_stable_node *stable_node, *stable_node_dup, *stable_node_any; + struct ksm_stable_node *stable_node, *stable_node_dup; struct ksm_stable_node *page_node; struct folio *folio; @@ -1844,45 +1847,7 @@ again: cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); - stable_node_any = NULL; tree_folio = chain_prune(&stable_node_dup, &stable_node, root); - /* - * NOTE: stable_node may have been freed by - * chain_prune() if the returned stable_node_dup is - * not NULL. stable_node_dup may have been inserted in - * the rbtree instead as a regular stable_node (in - * order to collapse the stable_node chain if a single - * stable_node dup was found in it). In such case the - * stable_node is overwritten by the callee to point - * to the stable_node_dup that was collapsed in the - * stable rbtree and stable_node will be equal to - * stable_node_dup like if the chain never existed. - */ - if (!stable_node_dup) { - /* - * Either all stable_node dups were full in - * this stable_node chain, or this chain was - * empty and should be rb_erased. - */ - stable_node_any = stable_node_dup_any(stable_node, - root); - if (!stable_node_any) { - /* rb_erase just run */ - goto again; - } - /* - * Take any of the stable_node dups page of - * this stable_node chain to let the tree walk - * continue. All KSM pages belonging to the - * stable_node dups in a stable_node chain - * have the same content and they're - * write protected at all times. Any will work - * fine to continue the walk. - */ - tree_folio = ksm_get_folio(stable_node_any, - KSM_GET_FOLIO_NOLOCK); - } - VM_BUG_ON(!stable_node_dup ^ !!stable_node_any); if (!tree_folio) { /* * If we walked over a stale stable_node, @@ -1920,7 +1885,7 @@ again: goto chain_append; } - if (!stable_node_dup) { + if (!is_page_sharing_candidate(stable_node_dup)) { /* * If the stable_node is a chain and * we got a payload match in memcmp @@ -2029,9 +1994,6 @@ replace: return &folio->page; chain_append: - /* stable_node_dup could be null if it reached the limit */ - if (!stable_node_dup) - stable_node_dup = stable_node_any; /* * If stable_node was a chain and chain_prune collapsed it, * stable_node has been updated to be the new regular @@ -2076,7 +2038,7 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio) struct rb_root *root; struct rb_node **new; struct rb_node *parent; - struct ksm_stable_node *stable_node, *stable_node_dup, *stable_node_any; + struct ksm_stable_node *stable_node, *stable_node_dup; bool need_chain = false; kpfn = folio_pfn(kfolio); @@ -2092,33 +2054,7 @@ again: cond_resched(); stable_node = rb_entry(*new, struct ksm_stable_node, node); - stable_node_any = NULL; - tree_folio = chain(&stable_node_dup, stable_node, root); - if (!stable_node_dup) { - /* - * Either all stable_node dups were full in - * this stable_node chain, or this chain was - * empty and should be rb_erased. - */ - stable_node_any = stable_node_dup_any(stable_node, - root); - if (!stable_node_any) { - /* rb_erase just run */ - goto again; - } - /* - * Take any of the stable_node dups page of - * this stable_node chain to let the tree walk - * continue. All KSM pages belonging to the - * stable_node dups in a stable_node chain - * have the same content and they're - * write protected at all times. Any will work - * fine to continue the walk. - */ - tree_folio = ksm_get_folio(stable_node_any, - KSM_GET_FOLIO_NOLOCK); - } - VM_BUG_ON(!stable_node_dup ^ !!stable_node_any); + tree_folio = chain(&stable_node_dup, &stable_node, root); if (!tree_folio) { /* * If we walked over a stale stable_node, @@ -2306,7 +2242,6 @@ static void stable_tree_append(struct ksm_rmap_item *rmap_item, */ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_item) { - struct mm_struct *mm = rmap_item->mm; struct ksm_rmap_item *tree_rmap_item; struct page *tree_page = NULL; struct ksm_stable_node *stable_node; @@ -2333,6 +2268,23 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite */ if (!is_page_sharing_candidate(stable_node)) max_page_sharing_bypass = true; + } else { + remove_rmap_item_from_tree(rmap_item); + + /* + * If the hash value of the page has changed from the last time + * we calculated it, this page is changing frequently: therefore we + * don't want to insert it in the unstable tree, and we don't want + * to waste our time searching for something identical to it there. + */ + checksum = calc_checksum(page); + if (rmap_item->oldchecksum != checksum) { + rmap_item->oldchecksum = checksum; + return; + } + + if (!try_to_merge_with_zero_page(rmap_item, page)) + return; } /* We first start with searching the page inside the stable tree */ @@ -2363,48 +2315,6 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite return; } - /* - * If the hash value of the page has changed from the last time - * we calculated it, this page is changing frequently: therefore we - * don't want to insert it in the unstable tree, and we don't want - * to waste our time searching for something identical to it there. - */ - checksum = calc_checksum(page); - if (rmap_item->oldchecksum != checksum) { - rmap_item->oldchecksum = checksum; - return; - } - - /* - * Same checksum as an empty page. We attempt to merge it with the - * appropriate zero page if the user enabled this via sysfs. - */ - if (ksm_use_zero_pages && (checksum == zero_checksum)) { - struct vm_area_struct *vma; - - mmap_read_lock(mm); - vma = find_mergeable_vma(mm, rmap_item->address); - if (vma) { - err = try_to_merge_one_page(vma, page, - ZERO_PAGE(rmap_item->address)); - trace_ksm_merge_one_page( - page_to_pfn(ZERO_PAGE(rmap_item->address)), - rmap_item, mm, err); - } else { - /* - * If the vma is out of date, we do not need to - * continue. - */ - err = 0; - } - mmap_read_unlock(mm); - /* - * In case of failure, the page was not really empty, so we - * need to continue. Otherwise we're done. - */ - if (!err) - return; - } tree_rmap_item = unstable_tree_search_insert(rmap_item, page, &tree_page); if (tree_rmap_item) { @@ -3088,7 +2998,6 @@ struct folio *ksm_might_need_to_copy(struct folio *folio, if (copy_mc_user_highpage(folio_page(new_folio, 0), page, addr, vma)) { folio_put(new_folio); - memory_failure_queue(folio_pfn(folio), 0); return ERR_PTR(-EHWPOISON); } folio_set_dirty(new_folio); diff --git a/mm/list_lru.c b/mm/list_lru.c index 3fd64736bc45..a29d96929d7c 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -15,7 +15,7 @@ #include "slab.h" #include "internal.h" -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static LIST_HEAD(memcg_list_lrus); static DEFINE_MUTEX(list_lrus_mutex); @@ -83,7 +83,7 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx) { return &lru->node[nid].lru; } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid, struct mem_cgroup *memcg) @@ -294,7 +294,7 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid, isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg, nr_to_walk); -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) { struct list_lru_memcg *mlru; unsigned long index; @@ -324,7 +324,7 @@ static void init_one_lru(struct list_lru_one *l) l->nr_items = 0; } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp) { int nid; @@ -544,14 +544,14 @@ static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) static void memcg_destroy_list_lru(struct list_lru *lru) { } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */ int __list_lru_init(struct list_lru *lru, bool memcg_aware, struct lock_class_key *key, struct shrinker *shrinker) { int i; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (shrinker) lru->shrinker_id = shrinker->id; else @@ -591,7 +591,7 @@ void list_lru_destroy(struct list_lru *lru) kfree(lru->node); lru->node = NULL; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG lru->shrinker_id = -1; #endif } diff --git a/mm/madvise.c b/mm/madvise.c index a77893462b92..96c026fe0c99 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1147,7 +1147,7 @@ static int madvise_inject_error(int behavior, } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); - ret = memory_failure(pfn, MF_COUNT_INCREASED | MF_SW_SIMULATED); + ret = memory_failure(pfn, MF_ACTION_REQUIRED | MF_COUNT_INCREASED | MF_SW_SIMULATED); if (ret == -EOPNOTSUPP) ret = 0; } diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c new file mode 100644 index 000000000000..2aeea4d8bf8e --- /dev/null +++ b/mm/memcontrol-v1.c @@ -0,0 +1,2969 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "internal.h" +#include "swap.h" +#include "memcontrol-v1.h" + +/* + * Cgroups above their limits are maintained in a RB-Tree, independent of + * their hierarchy representation + */ + +struct mem_cgroup_tree_per_node { + struct rb_root rb_root; + struct rb_node *rb_rightmost; + spinlock_t lock; +}; + +struct mem_cgroup_tree { + struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; +}; + +static struct mem_cgroup_tree soft_limit_tree __read_mostly; + +/* + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft + * limit reclaim to prevent infinite loops, if they ever occur. + */ +#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 +#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 + +/* Stuffs for move charges at task migration. */ +/* + * Types of charges to be moved. + */ +#define MOVE_ANON 0x1ULL +#define MOVE_FILE 0x2ULL +#define MOVE_MASK (MOVE_ANON | MOVE_FILE) + +/* "mc" and its members are protected by cgroup_mutex */ +static struct move_charge_struct { + spinlock_t lock; /* for from, to */ + struct mm_struct *mm; + struct mem_cgroup *from; + struct mem_cgroup *to; + unsigned long flags; + unsigned long precharge; + unsigned long moved_charge; + unsigned long moved_swap; + struct task_struct *moving_task; /* a task moving charges */ + wait_queue_head_t waitq; /* a waitq for other context */ +} mc = { + .lock = __SPIN_LOCK_UNLOCKED(mc.lock), + .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq), +}; + +/* for OOM */ +struct mem_cgroup_eventfd_list { + struct list_head list; + struct eventfd_ctx *eventfd; +}; + +/* + * cgroup_event represents events which userspace want to receive. + */ +struct mem_cgroup_event { + /* + * memcg which the event belongs to. + */ + struct mem_cgroup *memcg; + /* + * eventfd to signal userspace about the event. + */ + struct eventfd_ctx *eventfd; + /* + * Each of these stored in a list by the cgroup. + */ + struct list_head list; + /* + * register_event() callback will be used to add new userspace + * waiter for changes related to this event. Use eventfd_signal() + * on eventfd to send notification to userspace. + */ + int (*register_event)(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args); + /* + * unregister_event() callback will be called when userspace closes + * the eventfd or on cgroup removing. This callback must be set, + * if you want provide notification functionality. + */ + void (*unregister_event)(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd); + /* + * All fields below needed to unregister event when + * userspace closes eventfd. + */ + poll_table pt; + wait_queue_head_t *wqh; + wait_queue_entry_t wait; + struct work_struct remove; +}; + +#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val)) +#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff) +#define MEMFILE_ATTR(val) ((val) & 0xffff) + +enum { + RES_USAGE, + RES_LIMIT, + RES_MAX_USAGE, + RES_FAILCNT, + RES_SOFT_LIMIT, +}; + +#ifdef CONFIG_LOCKDEP +static struct lockdep_map memcg_oom_lock_dep_map = { + .name = "memcg_oom_lock", +}; +#endif + +DEFINE_SPINLOCK(memcg_oom_lock); + +static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz, + struct mem_cgroup_tree_per_node *mctz, + unsigned long new_usage_in_excess) +{ + struct rb_node **p = &mctz->rb_root.rb_node; + struct rb_node *parent = NULL; + struct mem_cgroup_per_node *mz_node; + bool rightmost = true; + + if (mz->on_tree) + return; + + mz->usage_in_excess = new_usage_in_excess; + if (!mz->usage_in_excess) + return; + while (*p) { + parent = *p; + mz_node = rb_entry(parent, struct mem_cgroup_per_node, + tree_node); + if (mz->usage_in_excess < mz_node->usage_in_excess) { + p = &(*p)->rb_left; + rightmost = false; + } else { + p = &(*p)->rb_right; + } + } + + if (rightmost) + mctz->rb_rightmost = &mz->tree_node; + + rb_link_node(&mz->tree_node, parent, p); + rb_insert_color(&mz->tree_node, &mctz->rb_root); + mz->on_tree = true; +} + +static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz, + struct mem_cgroup_tree_per_node *mctz) +{ + if (!mz->on_tree) + return; + + if (&mz->tree_node == mctz->rb_rightmost) + mctz->rb_rightmost = rb_prev(&mz->tree_node); + + rb_erase(&mz->tree_node, &mctz->rb_root); + mz->on_tree = false; +} + +static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz, + struct mem_cgroup_tree_per_node *mctz) +{ + unsigned long flags; + + spin_lock_irqsave(&mctz->lock, flags); + __mem_cgroup_remove_exceeded(mz, mctz); + spin_unlock_irqrestore(&mctz->lock, flags); +} + +static unsigned long soft_limit_excess(struct mem_cgroup *memcg) +{ + unsigned long nr_pages = page_counter_read(&memcg->memory); + unsigned long soft_limit = READ_ONCE(memcg->soft_limit); + unsigned long excess = 0; + + if (nr_pages > soft_limit) + excess = nr_pages - soft_limit; + + return excess; +} + +static void memcg1_update_tree(struct mem_cgroup *memcg, int nid) +{ + unsigned long excess; + struct mem_cgroup_per_node *mz; + struct mem_cgroup_tree_per_node *mctz; + + if (lru_gen_enabled()) { + if (soft_limit_excess(memcg)) + lru_gen_soft_reclaim(memcg, nid); + return; + } + + mctz = soft_limit_tree.rb_tree_per_node[nid]; + if (!mctz) + return; + /* + * Necessary to update all ancestors when hierarchy is used. + * because their event counter is not touched. + */ + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + mz = memcg->nodeinfo[nid]; + excess = soft_limit_excess(memcg); + /* + * We have to update the tree if mz is on RB-tree or + * mem is over its softlimit. + */ + if (excess || mz->on_tree) { + unsigned long flags; + + spin_lock_irqsave(&mctz->lock, flags); + /* if on-tree, remove it */ + if (mz->on_tree) + __mem_cgroup_remove_exceeded(mz, mctz); + /* + * Insert again. mz->usage_in_excess will be updated. + * If excess is 0, no tree ops. + */ + __mem_cgroup_insert_exceeded(mz, mctz, excess); + spin_unlock_irqrestore(&mctz->lock, flags); + } + } +} + +void memcg1_remove_from_trees(struct mem_cgroup *memcg) +{ + struct mem_cgroup_tree_per_node *mctz; + struct mem_cgroup_per_node *mz; + int nid; + + for_each_node(nid) { + mz = memcg->nodeinfo[nid]; + mctz = soft_limit_tree.rb_tree_per_node[nid]; + if (mctz) + mem_cgroup_remove_exceeded(mz, mctz); + } +} + +static struct mem_cgroup_per_node * +__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) +{ + struct mem_cgroup_per_node *mz; + +retry: + mz = NULL; + if (!mctz->rb_rightmost) + goto done; /* Nothing to reclaim from */ + + mz = rb_entry(mctz->rb_rightmost, + struct mem_cgroup_per_node, tree_node); + /* + * Remove the node now but someone else can add it back, + * we will to add it back at the end of reclaim to its correct + * position in the tree. + */ + __mem_cgroup_remove_exceeded(mz, mctz); + if (!soft_limit_excess(mz->memcg) || + !css_tryget(&mz->memcg->css)) + goto retry; +done: + return mz; +} + +static struct mem_cgroup_per_node * +mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) +{ + struct mem_cgroup_per_node *mz; + + spin_lock_irq(&mctz->lock); + mz = __mem_cgroup_largest_soft_limit_node(mctz); + spin_unlock_irq(&mctz->lock); + return mz; +} + +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, + pg_data_t *pgdat, + gfp_t gfp_mask, + unsigned long *total_scanned) +{ + struct mem_cgroup *victim = NULL; + int total = 0; + int loop = 0; + unsigned long excess; + unsigned long nr_scanned; + struct mem_cgroup_reclaim_cookie reclaim = { + .pgdat = pgdat, + }; + + excess = soft_limit_excess(root_memcg); + + while (1) { + victim = mem_cgroup_iter(root_memcg, victim, &reclaim); + if (!victim) { + loop++; + if (loop >= 2) { + /* + * If we have not been able to reclaim + * anything, it might because there are + * no reclaimable pages under this hierarchy + */ + if (!total) + break; + /* + * We want to do more targeted reclaim. + * excess >> 2 is not to excessive so as to + * reclaim too much, nor too less that we keep + * coming back to reclaim from this cgroup + */ + if (total >= (excess >> 2) || + (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) + break; + } + continue; + } + total += mem_cgroup_shrink_node(victim, gfp_mask, false, + pgdat, &nr_scanned); + *total_scanned += nr_scanned; + if (!soft_limit_excess(root_memcg)) + break; + } + mem_cgroup_iter_break(root_memcg, victim); + return total; +} + +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order, + gfp_t gfp_mask, + unsigned long *total_scanned) +{ + unsigned long nr_reclaimed = 0; + struct mem_cgroup_per_node *mz, *next_mz = NULL; + unsigned long reclaimed; + int loop = 0; + struct mem_cgroup_tree_per_node *mctz; + unsigned long excess; + + if (lru_gen_enabled()) + return 0; + + if (order > 0) + return 0; + + mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id]; + + /* + * Do not even bother to check the largest node if the root + * is empty. Do it lockless to prevent lock bouncing. Races + * are acceptable as soft limit is best effort anyway. + */ + if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root)) + return 0; + + /* + * This loop can run a while, specially if mem_cgroup's continuously + * keep exceeding their soft limit and putting the system under + * pressure + */ + do { + if (next_mz) + mz = next_mz; + else + mz = mem_cgroup_largest_soft_limit_node(mctz); + if (!mz) + break; + + reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat, + gfp_mask, total_scanned); + nr_reclaimed += reclaimed; + spin_lock_irq(&mctz->lock); + + /* + * If we failed to reclaim anything from this memory cgroup + * it is time to move on to the next cgroup + */ + next_mz = NULL; + if (!reclaimed) + next_mz = __mem_cgroup_largest_soft_limit_node(mctz); + + excess = soft_limit_excess(mz->memcg); + /* + * One school of thought says that we should not add + * back the node to the tree if reclaim returns 0. + * But our reclaim could return 0, simply because due + * to priority we are exposing a smaller subset of + * memory to reclaim from. Consider this as a longer + * term TODO. + */ + /* If excess == 0, no tree ops */ + __mem_cgroup_insert_exceeded(mz, mctz, excess); + spin_unlock_irq(&mctz->lock); + css_put(&mz->memcg->css); + loop++; + /* + * Could not reclaim anything and there are no more + * mem cgroups to try or we seem to be looping without + * reclaiming anything. + */ + if (!nr_reclaimed && + (next_mz == NULL || + loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) + break; + } while (!nr_reclaimed); + if (next_mz) + css_put(&next_mz->memcg->css); + return nr_reclaimed; +} + +/* + * A routine for checking "mem" is under move_account() or not. + * + * Checking a cgroup is mc.from or mc.to or under hierarchy of + * moving cgroups. This is for waiting at high-memory pressure + * caused by "move". + */ +static bool mem_cgroup_under_move(struct mem_cgroup *memcg) +{ + struct mem_cgroup *from; + struct mem_cgroup *to; + bool ret = false; + /* + * Unlike task_move routines, we access mc.to, mc.from not under + * mutual exclusion by cgroup_mutex. Here, we take spinlock instead. + */ + spin_lock(&mc.lock); + from = mc.from; + to = mc.to; + if (!from) + goto unlock; + + ret = mem_cgroup_is_descendant(from, memcg) || + mem_cgroup_is_descendant(to, memcg); +unlock: + spin_unlock(&mc.lock); + return ret; +} + +bool memcg1_wait_acct_move(struct mem_cgroup *memcg) +{ + if (mc.moving_task && current != mc.moving_task) { + if (mem_cgroup_under_move(memcg)) { + DEFINE_WAIT(wait); + prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE); + /* moving charge context might have finished. */ + if (mc.moving_task) + schedule(); + finish_wait(&mc.waitq, &wait); + return true; + } + } + return false; +} + +/** + * folio_memcg_lock - Bind a folio to its memcg. + * @folio: The folio. + * + * This function prevents unlocked LRU folios from being moved to + * another cgroup. + * + * It ensures lifetime of the bound memcg. The caller is responsible + * for the lifetime of the folio. + */ +void folio_memcg_lock(struct folio *folio) +{ + struct mem_cgroup *memcg; + unsigned long flags; + + /* + * The RCU lock is held throughout the transaction. The fast + * path can get away without acquiring the memcg->move_lock + * because page moving starts with an RCU grace period. + */ + rcu_read_lock(); + + if (mem_cgroup_disabled()) + return; +again: + memcg = folio_memcg(folio); + if (unlikely(!memcg)) + return; + +#ifdef CONFIG_PROVE_LOCKING + local_irq_save(flags); + might_lock(&memcg->move_lock); + local_irq_restore(flags); +#endif + + if (atomic_read(&memcg->moving_account) <= 0) + return; + + spin_lock_irqsave(&memcg->move_lock, flags); + if (memcg != folio_memcg(folio)) { + spin_unlock_irqrestore(&memcg->move_lock, flags); + goto again; + } + + /* + * When charge migration first begins, we can have multiple + * critical sections holding the fast-path RCU lock and one + * holding the slowpath move_lock. Track the task who has the + * move_lock for folio_memcg_unlock(). + */ + memcg->move_lock_task = current; + memcg->move_lock_flags = flags; +} + +static void __folio_memcg_unlock(struct mem_cgroup *memcg) +{ + if (memcg && memcg->move_lock_task == current) { + unsigned long flags = memcg->move_lock_flags; + + memcg->move_lock_task = NULL; + memcg->move_lock_flags = 0; + + spin_unlock_irqrestore(&memcg->move_lock, flags); + } + + rcu_read_unlock(); +} + +/** + * folio_memcg_unlock - Release the binding between a folio and its memcg. + * @folio: The folio. + * + * This releases the binding created by folio_memcg_lock(). This does + * not change the accounting of this folio to its memcg, but it does + * permit others to change it. + */ +void folio_memcg_unlock(struct folio *folio) +{ + __folio_memcg_unlock(folio_memcg(folio)); +} + +#ifdef CONFIG_SWAP +/** + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record. + * @entry: swap entry to be moved + * @from: mem_cgroup which the entry is moved from + * @to: mem_cgroup which the entry is moved to + * + * It succeeds only when the swap_cgroup's record for this entry is the same + * as the mem_cgroup's id of @from. + * + * Returns 0 on success, -EINVAL on failure. + * + * The caller must have charged to @to, IOW, called page_counter_charge() about + * both res and memsw, and called css_get(). + */ +static int mem_cgroup_move_swap_account(swp_entry_t entry, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + unsigned short old_id, new_id; + + old_id = mem_cgroup_id(from); + new_id = mem_cgroup_id(to); + + if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) { + mod_memcg_state(from, MEMCG_SWAP, -1); + mod_memcg_state(to, MEMCG_SWAP, 1); + return 0; + } + return -EINVAL; +} +#else +static inline int mem_cgroup_move_swap_account(swp_entry_t entry, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + return -EINVAL; +} +#endif + +static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return mem_cgroup_from_css(css)->move_charge_at_immigrate; +} + +#ifdef CONFIG_MMU +static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); + + if (val & ~MOVE_MASK) + return -EINVAL; + + /* + * No kind of locking is needed in here, because ->can_attach() will + * check this value once in the beginning of the process, and then carry + * on with stale data. This means that changes to this value will only + * affect task migrations starting after the change. + */ + memcg->move_charge_at_immigrate = val; + return 0; +} +#else +static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + return -ENOSYS; +} +#endif + +#ifdef CONFIG_MMU +/* Handlers for move charge at task migration. */ +static int mem_cgroup_do_precharge(unsigned long count) +{ + int ret; + + /* Try a single bulk charge without reclaim first, kswapd may wake */ + ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count); + if (!ret) { + mc.precharge += count; + return ret; + } + + /* Try charges one by one with reclaim, but do not retry */ + while (count--) { + ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1); + if (ret) + return ret; + mc.precharge++; + cond_resched(); + } + return 0; +} + +union mc_target { + struct folio *folio; + swp_entry_t ent; +}; + +enum mc_target_type { + MC_TARGET_NONE = 0, + MC_TARGET_PAGE, + MC_TARGET_SWAP, + MC_TARGET_DEVICE, +}; + +static struct page *mc_handle_present_pte(struct vm_area_struct *vma, + unsigned long addr, pte_t ptent) +{ + struct page *page = vm_normal_page(vma, addr, ptent); + + if (!page) + return NULL; + if (PageAnon(page)) { + if (!(mc.flags & MOVE_ANON)) + return NULL; + } else { + if (!(mc.flags & MOVE_FILE)) + return NULL; + } + get_page(page); + + return page; +} + +#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE) +static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, + pte_t ptent, swp_entry_t *entry) +{ + struct page *page = NULL; + swp_entry_t ent = pte_to_swp_entry(ptent); + + if (!(mc.flags & MOVE_ANON)) + return NULL; + + /* + * Handle device private pages that are not accessible by the CPU, but + * stored as special swap entries in the page table. + */ + if (is_device_private_entry(ent)) { + page = pfn_swap_entry_to_page(ent); + if (!get_page_unless_zero(page)) + return NULL; + return page; + } + + if (non_swap_entry(ent)) + return NULL; + + /* + * Because swap_cache_get_folio() updates some statistics counter, + * we call find_get_page() with swapper_space directly. + */ + page = find_get_page(swap_address_space(ent), swap_cache_index(ent)); + entry->val = ent.val; + + return page; +} +#else +static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, + pte_t ptent, swp_entry_t *entry) +{ + return NULL; +} +#endif + +static struct page *mc_handle_file_pte(struct vm_area_struct *vma, + unsigned long addr, pte_t ptent) +{ + unsigned long index; + struct folio *folio; + + if (!vma->vm_file) /* anonymous vma */ + return NULL; + if (!(mc.flags & MOVE_FILE)) + return NULL; + + /* folio is moved even if it's not RSS of this task(page-faulted). */ + /* shmem/tmpfs may report page out on swap: account for that too. */ + index = linear_page_index(vma, addr); + folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index); + if (IS_ERR(folio)) + return NULL; + return folio_file_page(folio, index); +} + +/** + * mem_cgroup_move_account - move account of the folio + * @folio: The folio. + * @compound: charge the page as compound or small page + * @from: mem_cgroup which the folio is moved from. + * @to: mem_cgroup which the folio is moved to. @from != @to. + * + * The folio must be locked and not on the LRU. + * + * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" + * from old cgroup. + */ +static int mem_cgroup_move_account(struct folio *folio, + bool compound, + struct mem_cgroup *from, + struct mem_cgroup *to) +{ + struct lruvec *from_vec, *to_vec; + struct pglist_data *pgdat; + unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1; + int nid, ret; + + VM_BUG_ON(from == to); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + VM_BUG_ON(compound && !folio_test_large(folio)); + + ret = -EINVAL; + if (folio_memcg(folio) != from) + goto out; + + pgdat = folio_pgdat(folio); + from_vec = mem_cgroup_lruvec(from, pgdat); + to_vec = mem_cgroup_lruvec(to, pgdat); + + folio_memcg_lock(folio); + + if (folio_test_anon(folio)) { + if (folio_mapped(folio)) { + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); + if (folio_test_pmd_mappable(folio)) { + __mod_lruvec_state(from_vec, NR_ANON_THPS, + -nr_pages); + __mod_lruvec_state(to_vec, NR_ANON_THPS, + nr_pages); + } + } + } else { + __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); + + if (folio_test_swapbacked(folio)) { + __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); + __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); + } + + if (folio_mapped(folio)) { + __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); + } + + if (folio_test_dirty(folio)) { + struct address_space *mapping = folio_mapping(folio); + + if (mapping_can_writeback(mapping)) { + __mod_lruvec_state(from_vec, NR_FILE_DIRTY, + -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_DIRTY, + nr_pages); + } + } + } + +#ifdef CONFIG_SWAP + if (folio_test_swapcache(folio)) { + __mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages); + __mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages); + } +#endif + if (folio_test_writeback(folio)) { + __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages); + __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); + } + + /* + * All state has been migrated, let's switch to the new memcg. + * + * It is safe to change page's memcg here because the page + * is referenced, charged, isolated, and locked: we can't race + * with (un)charging, migration, LRU putback, or anything else + * that would rely on a stable page's memory cgroup. + * + * Note that folio_memcg_lock is a memcg lock, not a page lock, + * to save space. As soon as we switch page's memory cgroup to a + * new memcg that isn't locked, the above state can change + * concurrently again. Make sure we're truly done with it. + */ + smp_mb(); + + css_get(&to->css); + css_put(&from->css); + + folio->memcg_data = (unsigned long)to; + + __folio_memcg_unlock(from); + + ret = 0; + nid = folio_nid(folio); + + local_irq_disable(); + mem_cgroup_charge_statistics(to, nr_pages); + memcg1_check_events(to, nid); + mem_cgroup_charge_statistics(from, -nr_pages); + memcg1_check_events(from, nid); + local_irq_enable(); +out: + return ret; +} + +/** + * get_mctgt_type - get target type of moving charge + * @vma: the vma the pte to be checked belongs + * @addr: the address corresponding to the pte to be checked + * @ptent: the pte to be checked + * @target: the pointer the target page or swap ent will be stored(can be NULL) + * + * Context: Called with pte lock held. + * Return: + * * MC_TARGET_NONE - If the pte is not a target for move charge. + * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for + * move charge. If @target is not NULL, the folio is stored in target->folio + * with extra refcnt taken (Caller should release it). + * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a + * target for charge migration. If @target is not NULL, the entry is + * stored in target->ent. + * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and + * thus not on the lru. For now such page is charged like a regular page + * would be as it is just special memory taking the place of a regular page. + * See Documentations/vm/hmm.txt and include/linux/hmm.h + */ +static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, + unsigned long addr, pte_t ptent, union mc_target *target) +{ + struct page *page = NULL; + struct folio *folio; + enum mc_target_type ret = MC_TARGET_NONE; + swp_entry_t ent = { .val = 0 }; + + if (pte_present(ptent)) + page = mc_handle_present_pte(vma, addr, ptent); + else if (pte_none_mostly(ptent)) + /* + * PTE markers should be treated as a none pte here, separated + * from other swap handling below. + */ + page = mc_handle_file_pte(vma, addr, ptent); + else if (is_swap_pte(ptent)) + page = mc_handle_swap_pte(vma, ptent, &ent); + + if (page) + folio = page_folio(page); + if (target && page) { + if (!folio_trylock(folio)) { + folio_put(folio); + return ret; + } + /* + * page_mapped() must be stable during the move. This + * pte is locked, so if it's present, the page cannot + * become unmapped. If it isn't, we have only partial + * control over the mapped state: the page lock will + * prevent new faults against pagecache and swapcache, + * so an unmapped page cannot become mapped. However, + * if the page is already mapped elsewhere, it can + * unmap, and there is nothing we can do about it. + * Alas, skip moving the page in this case. + */ + if (!pte_present(ptent) && page_mapped(page)) { + folio_unlock(folio); + folio_put(folio); + return ret; + } + } + + if (!page && !ent.val) + return ret; + if (page) { + /* + * Do only loose check w/o serialization. + * mem_cgroup_move_account() checks the page is valid or + * not under LRU exclusion. + */ + if (folio_memcg(folio) == mc.from) { + ret = MC_TARGET_PAGE; + if (folio_is_device_private(folio) || + folio_is_device_coherent(folio)) + ret = MC_TARGET_DEVICE; + if (target) + target->folio = folio; + } + if (!ret || !target) { + if (target) + folio_unlock(folio); + folio_put(folio); + } + } + /* + * There is a swap entry and a page doesn't exist or isn't charged. + * But we cannot move a tail-page in a THP. + */ + if (ent.val && !ret && (!page || !PageTransCompound(page)) && + mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) { + ret = MC_TARGET_SWAP; + if (target) + target->ent = ent; + } + return ret; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * We don't consider PMD mapped swapping or file mapped pages because THP does + * not support them for now. + * Caller should make sure that pmd_trans_huge(pmd) is true. + */ +static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, + unsigned long addr, pmd_t pmd, union mc_target *target) +{ + struct page *page = NULL; + struct folio *folio; + enum mc_target_type ret = MC_TARGET_NONE; + + if (unlikely(is_swap_pmd(pmd))) { + VM_BUG_ON(thp_migration_supported() && + !is_pmd_migration_entry(pmd)); + return ret; + } + page = pmd_page(pmd); + VM_BUG_ON_PAGE(!page || !PageHead(page), page); + folio = page_folio(page); + if (!(mc.flags & MOVE_ANON)) + return ret; + if (folio_memcg(folio) == mc.from) { + ret = MC_TARGET_PAGE; + if (target) { + folio_get(folio); + if (!folio_trylock(folio)) { + folio_put(folio); + return MC_TARGET_NONE; + } + target->folio = folio; + } + } + return ret; +} +#else +static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, + unsigned long addr, pmd_t pmd, union mc_target *target) +{ + return MC_TARGET_NONE; +} +#endif + +static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + pte_t *pte; + spinlock_t *ptl; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + /* + * Note their can not be MC_TARGET_DEVICE for now as we do not + * support transparent huge page with MEMORY_DEVICE_PRIVATE but + * this might change. + */ + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) + mc.precharge += HPAGE_PMD_NR; + spin_unlock(ptl); + return 0; + } + + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; + for (; addr != end; pte++, addr += PAGE_SIZE) + if (get_mctgt_type(vma, addr, ptep_get(pte), NULL)) + mc.precharge++; /* increment precharge temporarily */ + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + + return 0; +} + +static const struct mm_walk_ops precharge_walk_ops = { + .pmd_entry = mem_cgroup_count_precharge_pte_range, + .walk_lock = PGWALK_RDLOCK, +}; + +static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) +{ + unsigned long precharge; + + mmap_read_lock(mm); + walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL); + mmap_read_unlock(mm); + + precharge = mc.precharge; + mc.precharge = 0; + + return precharge; +} + +static int mem_cgroup_precharge_mc(struct mm_struct *mm) +{ + unsigned long precharge = mem_cgroup_count_precharge(mm); + + VM_BUG_ON(mc.moving_task); + mc.moving_task = current; + return mem_cgroup_do_precharge(precharge); +} + +/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */ +static void __mem_cgroup_clear_mc(void) +{ + struct mem_cgroup *from = mc.from; + struct mem_cgroup *to = mc.to; + + /* we must uncharge all the leftover precharges from mc.to */ + if (mc.precharge) { + mem_cgroup_cancel_charge(mc.to, mc.precharge); + mc.precharge = 0; + } + /* + * we didn't uncharge from mc.from at mem_cgroup_move_account(), so + * we must uncharge here. + */ + if (mc.moved_charge) { + mem_cgroup_cancel_charge(mc.from, mc.moved_charge); + mc.moved_charge = 0; + } + /* we must fixup refcnts and charges */ + if (mc.moved_swap) { + /* uncharge swap account from the old cgroup */ + if (!mem_cgroup_is_root(mc.from)) + page_counter_uncharge(&mc.from->memsw, mc.moved_swap); + + mem_cgroup_id_put_many(mc.from, mc.moved_swap); + + /* + * we charged both to->memory and to->memsw, so we + * should uncharge to->memory. + */ + if (!mem_cgroup_is_root(mc.to)) + page_counter_uncharge(&mc.to->memory, mc.moved_swap); + + mc.moved_swap = 0; + } + memcg1_oom_recover(from); + memcg1_oom_recover(to); + wake_up_all(&mc.waitq); +} + +static void mem_cgroup_clear_mc(void) +{ + struct mm_struct *mm = mc.mm; + + /* + * we must clear moving_task before waking up waiters at the end of + * task migration. + */ + mc.moving_task = NULL; + __mem_cgroup_clear_mc(); + spin_lock(&mc.lock); + mc.from = NULL; + mc.to = NULL; + mc.mm = NULL; + spin_unlock(&mc.lock); + + mmput(mm); +} + +int memcg1_can_attach(struct cgroup_taskset *tset) +{ + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */ + struct mem_cgroup *from; + struct task_struct *leader, *p; + struct mm_struct *mm; + unsigned long move_flags; + int ret = 0; + + /* charge immigration isn't supported on the default hierarchy */ + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return 0; + + /* + * Multi-process migrations only happen on the default hierarchy + * where charge immigration is not used. Perform charge + * immigration if @tset contains a leader and whine if there are + * multiple. + */ + p = NULL; + cgroup_taskset_for_each_leader(leader, css, tset) { + WARN_ON_ONCE(p); + p = leader; + memcg = mem_cgroup_from_css(css); + } + if (!p) + return 0; + + /* + * We are now committed to this value whatever it is. Changes in this + * tunable will only affect upcoming migrations, not the current one. + * So we need to save it, and keep it going. + */ + move_flags = READ_ONCE(memcg->move_charge_at_immigrate); + if (!move_flags) + return 0; + + from = mem_cgroup_from_task(p); + + VM_BUG_ON(from == memcg); + + mm = get_task_mm(p); + if (!mm) + return 0; + /* We move charges only when we move a owner of the mm */ + if (mm->owner == p) { + VM_BUG_ON(mc.from); + VM_BUG_ON(mc.to); + VM_BUG_ON(mc.precharge); + VM_BUG_ON(mc.moved_charge); + VM_BUG_ON(mc.moved_swap); + + spin_lock(&mc.lock); + mc.mm = mm; + mc.from = from; + mc.to = memcg; + mc.flags = move_flags; + spin_unlock(&mc.lock); + /* We set mc.moving_task later */ + + ret = mem_cgroup_precharge_mc(mm); + if (ret) + mem_cgroup_clear_mc(); + } else { + mmput(mm); + } + return ret; +} + +void memcg1_cancel_attach(struct cgroup_taskset *tset) +{ + if (mc.to) + mem_cgroup_clear_mc(); +} + +static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + int ret = 0; + struct vm_area_struct *vma = walk->vma; + pte_t *pte; + spinlock_t *ptl; + enum mc_target_type target_type; + union mc_target target; + struct folio *folio; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + if (mc.precharge < HPAGE_PMD_NR) { + spin_unlock(ptl); + return 0; + } + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); + if (target_type == MC_TARGET_PAGE) { + folio = target.folio; + if (folio_isolate_lru(folio)) { + if (!mem_cgroup_move_account(folio, true, + mc.from, mc.to)) { + mc.precharge -= HPAGE_PMD_NR; + mc.moved_charge += HPAGE_PMD_NR; + } + folio_putback_lru(folio); + } + folio_unlock(folio); + folio_put(folio); + } else if (target_type == MC_TARGET_DEVICE) { + folio = target.folio; + if (!mem_cgroup_move_account(folio, true, + mc.from, mc.to)) { + mc.precharge -= HPAGE_PMD_NR; + mc.moved_charge += HPAGE_PMD_NR; + } + folio_unlock(folio); + folio_put(folio); + } + spin_unlock(ptl); + return 0; + } + +retry: + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + if (!pte) + return 0; + for (; addr != end; addr += PAGE_SIZE) { + pte_t ptent = ptep_get(pte++); + bool device = false; + swp_entry_t ent; + + if (!mc.precharge) + break; + + switch (get_mctgt_type(vma, addr, ptent, &target)) { + case MC_TARGET_DEVICE: + device = true; + fallthrough; + case MC_TARGET_PAGE: + folio = target.folio; + /* + * We can have a part of the split pmd here. Moving it + * can be done but it would be too convoluted so simply + * ignore such a partial THP and keep it in original + * memcg. There should be somebody mapping the head. + */ + if (folio_test_large(folio)) + goto put; + if (!device && !folio_isolate_lru(folio)) + goto put; + if (!mem_cgroup_move_account(folio, false, + mc.from, mc.to)) { + mc.precharge--; + /* we uncharge from mc.from later. */ + mc.moved_charge++; + } + if (!device) + folio_putback_lru(folio); +put: /* get_mctgt_type() gets & locks the page */ + folio_unlock(folio); + folio_put(folio); + break; + case MC_TARGET_SWAP: + ent = target.ent; + if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { + mc.precharge--; + mem_cgroup_id_get_many(mc.to, 1); + /* we fixup other refcnts and charges later. */ + mc.moved_swap++; + } + break; + default: + break; + } + } + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + + if (addr != end) { + /* + * We have consumed all precharges we got in can_attach(). + * We try charge one by one, but don't do any additional + * charges to mc.to if we have failed in charge once in attach() + * phase. + */ + ret = mem_cgroup_do_precharge(1); + if (!ret) + goto retry; + } + + return ret; +} + +static const struct mm_walk_ops charge_walk_ops = { + .pmd_entry = mem_cgroup_move_charge_pte_range, + .walk_lock = PGWALK_RDLOCK, +}; + +static void mem_cgroup_move_charge(void) +{ + lru_add_drain_all(); + /* + * Signal folio_memcg_lock() to take the memcg's move_lock + * while we're moving its pages to another memcg. Then wait + * for already started RCU-only updates to finish. + */ + atomic_inc(&mc.from->moving_account); + synchronize_rcu(); +retry: + if (unlikely(!mmap_read_trylock(mc.mm))) { + /* + * Someone who are holding the mmap_lock might be waiting in + * waitq. So we cancel all extra charges, wake up all waiters, + * and retry. Because we cancel precharges, we might not be able + * to move enough charges, but moving charge is a best-effort + * feature anyway, so it wouldn't be a big problem. + */ + __mem_cgroup_clear_mc(); + cond_resched(); + goto retry; + } + /* + * When we have consumed all precharges and failed in doing + * additional charge, the page walk just aborts. + */ + walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL); + mmap_read_unlock(mc.mm); + atomic_dec(&mc.from->moving_account); +} + +void memcg1_move_task(void) +{ + if (mc.to) { + mem_cgroup_move_charge(); + mem_cgroup_clear_mc(); + } +} + +#else /* !CONFIG_MMU */ +int memcg1_can_attach(struct cgroup_taskset *tset) +{ + return 0; +} +void memcg1_cancel_attach(struct cgroup_taskset *tset) +{ +} +void memcg1_move_task(void) +{ +} +#endif + +static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) +{ + struct mem_cgroup_threshold_ary *t; + unsigned long usage; + int i; + + rcu_read_lock(); + if (!swap) + t = rcu_dereference(memcg->thresholds.primary); + else + t = rcu_dereference(memcg->memsw_thresholds.primary); + + if (!t) + goto unlock; + + usage = mem_cgroup_usage(memcg, swap); + + /* + * current_threshold points to threshold just below or equal to usage. + * If it's not true, a threshold was crossed after last + * call of __mem_cgroup_threshold(). + */ + i = t->current_threshold; + + /* + * Iterate backward over array of thresholds starting from + * current_threshold and check if a threshold is crossed. + * If none of thresholds below usage is crossed, we read + * only one element of the array here. + */ + for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--) + eventfd_signal(t->entries[i].eventfd); + + /* i = current_threshold + 1 */ + i++; + + /* + * Iterate forward over array of thresholds starting from + * current_threshold+1 and check if a threshold is crossed. + * If none of thresholds above usage is crossed, we read + * only one element of the array here. + */ + for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++) + eventfd_signal(t->entries[i].eventfd); + + /* Update current_threshold */ + t->current_threshold = i - 1; +unlock: + rcu_read_unlock(); +} + +static void mem_cgroup_threshold(struct mem_cgroup *memcg) +{ + while (memcg) { + __mem_cgroup_threshold(memcg, false); + if (do_memsw_account()) + __mem_cgroup_threshold(memcg, true); + + memcg = parent_mem_cgroup(memcg); + } +} + +/* + * Check events in order. + * + */ +void memcg1_check_events(struct mem_cgroup *memcg, int nid) +{ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + return; + + /* threshold event is triggered in finer grain than soft limit */ + if (unlikely(mem_cgroup_event_ratelimit(memcg, + MEM_CGROUP_TARGET_THRESH))) { + bool do_softlimit; + + do_softlimit = mem_cgroup_event_ratelimit(memcg, + MEM_CGROUP_TARGET_SOFTLIMIT); + mem_cgroup_threshold(memcg); + if (unlikely(do_softlimit)) + memcg1_update_tree(memcg, nid); + } +} + +static int compare_thresholds(const void *a, const void *b) +{ + const struct mem_cgroup_threshold *_a = a; + const struct mem_cgroup_threshold *_b = b; + + if (_a->threshold > _b->threshold) + return 1; + + if (_a->threshold < _b->threshold) + return -1; + + return 0; +} + +static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) +{ + struct mem_cgroup_eventfd_list *ev; + + spin_lock(&memcg_oom_lock); + + list_for_each_entry(ev, &memcg->oom_notify, list) + eventfd_signal(ev->eventfd); + + spin_unlock(&memcg_oom_lock); + return 0; +} + +static void mem_cgroup_oom_notify(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + + for_each_mem_cgroup_tree(iter, memcg) + mem_cgroup_oom_notify_cb(iter); +} + +static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args, enum res_type type) +{ + struct mem_cgroup_thresholds *thresholds; + struct mem_cgroup_threshold_ary *new; + unsigned long threshold; + unsigned long usage; + int i, size, ret; + + ret = page_counter_memparse(args, "-1", &threshold); + if (ret) + return ret; + + mutex_lock(&memcg->thresholds_lock); + + if (type == _MEM) { + thresholds = &memcg->thresholds; + usage = mem_cgroup_usage(memcg, false); + } else if (type == _MEMSWAP) { + thresholds = &memcg->memsw_thresholds; + usage = mem_cgroup_usage(memcg, true); + } else + BUG(); + + /* Check if a threshold crossed before adding a new one */ + if (thresholds->primary) + __mem_cgroup_threshold(memcg, type == _MEMSWAP); + + size = thresholds->primary ? thresholds->primary->size + 1 : 1; + + /* Allocate memory for new array of thresholds */ + new = kmalloc(struct_size(new, entries, size), GFP_KERNEL); + if (!new) { + ret = -ENOMEM; + goto unlock; + } + new->size = size; + + /* Copy thresholds (if any) to new array */ + if (thresholds->primary) + memcpy(new->entries, thresholds->primary->entries, + flex_array_size(new, entries, size - 1)); + + /* Add new threshold */ + new->entries[size - 1].eventfd = eventfd; + new->entries[size - 1].threshold = threshold; + + /* Sort thresholds. Registering of new threshold isn't time-critical */ + sort(new->entries, size, sizeof(*new->entries), + compare_thresholds, NULL); + + /* Find current threshold */ + new->current_threshold = -1; + for (i = 0; i < size; i++) { + if (new->entries[i].threshold <= usage) { + /* + * new->current_threshold will not be used until + * rcu_assign_pointer(), so it's safe to increment + * it here. + */ + ++new->current_threshold; + } else + break; + } + + /* Free old spare buffer and save old primary buffer as spare */ + kfree(thresholds->spare); + thresholds->spare = thresholds->primary; + + rcu_assign_pointer(thresholds->primary, new); + + /* To be sure that nobody uses thresholds */ + synchronize_rcu(); + +unlock: + mutex_unlock(&memcg->thresholds_lock); + + return ret; +} + +static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args) +{ + return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM); +} + +static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args) +{ + return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP); +} + +static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, enum res_type type) +{ + struct mem_cgroup_thresholds *thresholds; + struct mem_cgroup_threshold_ary *new; + unsigned long usage; + int i, j, size, entries; + + mutex_lock(&memcg->thresholds_lock); + + if (type == _MEM) { + thresholds = &memcg->thresholds; + usage = mem_cgroup_usage(memcg, false); + } else if (type == _MEMSWAP) { + thresholds = &memcg->memsw_thresholds; + usage = mem_cgroup_usage(memcg, true); + } else + BUG(); + + if (!thresholds->primary) + goto unlock; + + /* Check if a threshold crossed before removing */ + __mem_cgroup_threshold(memcg, type == _MEMSWAP); + + /* Calculate new number of threshold */ + size = entries = 0; + for (i = 0; i < thresholds->primary->size; i++) { + if (thresholds->primary->entries[i].eventfd != eventfd) + size++; + else + entries++; + } + + new = thresholds->spare; + + /* If no items related to eventfd have been cleared, nothing to do */ + if (!entries) + goto unlock; + + /* Set thresholds array to NULL if we don't have thresholds */ + if (!size) { + kfree(new); + new = NULL; + goto swap_buffers; + } + + new->size = size; + + /* Copy thresholds and find current threshold */ + new->current_threshold = -1; + for (i = 0, j = 0; i < thresholds->primary->size; i++) { + if (thresholds->primary->entries[i].eventfd == eventfd) + continue; + + new->entries[j] = thresholds->primary->entries[i]; + if (new->entries[j].threshold <= usage) { + /* + * new->current_threshold will not be used + * until rcu_assign_pointer(), so it's safe to increment + * it here. + */ + ++new->current_threshold; + } + j++; + } + +swap_buffers: + /* Swap primary and spare array */ + thresholds->spare = thresholds->primary; + + rcu_assign_pointer(thresholds->primary, new); + + /* To be sure that nobody uses thresholds */ + synchronize_rcu(); + + /* If all events are unregistered, free the spare array */ + if (!new) { + kfree(thresholds->spare); + thresholds->spare = NULL; + } +unlock: + mutex_unlock(&memcg->thresholds_lock); +} + +static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd) +{ + return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM); +} + +static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd) +{ + return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP); +} + +static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args) +{ + struct mem_cgroup_eventfd_list *event; + + event = kmalloc(sizeof(*event), GFP_KERNEL); + if (!event) + return -ENOMEM; + + spin_lock(&memcg_oom_lock); + + event->eventfd = eventfd; + list_add(&event->list, &memcg->oom_notify); + + /* already in OOM ? */ + if (memcg->under_oom) + eventfd_signal(eventfd); + spin_unlock(&memcg_oom_lock); + + return 0; +} + +static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd) +{ + struct mem_cgroup_eventfd_list *ev, *tmp; + + spin_lock(&memcg_oom_lock); + + list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) { + if (ev->eventfd == eventfd) { + list_del(&ev->list); + kfree(ev); + } + } + + spin_unlock(&memcg_oom_lock); +} + +/* + * DO NOT USE IN NEW FILES. + * + * "cgroup.event_control" implementation. + * + * This is way over-engineered. It tries to support fully configurable + * events for each user. Such level of flexibility is completely + * unnecessary especially in the light of the planned unified hierarchy. + * + * Please deprecate this and replace with something simpler if at all + * possible. + */ + +/* + * Unregister event and free resources. + * + * Gets called from workqueue. + */ +static void memcg_event_remove(struct work_struct *work) +{ + struct mem_cgroup_event *event = + container_of(work, struct mem_cgroup_event, remove); + struct mem_cgroup *memcg = event->memcg; + + remove_wait_queue(event->wqh, &event->wait); + + event->unregister_event(memcg, event->eventfd); + + /* Notify userspace the event is going away. */ + eventfd_signal(event->eventfd); + + eventfd_ctx_put(event->eventfd); + kfree(event); + css_put(&memcg->css); +} + +/* + * Gets called on EPOLLHUP on eventfd when user closes it. + * + * Called with wqh->lock held and interrupts disabled. + */ +static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode, + int sync, void *key) +{ + struct mem_cgroup_event *event = + container_of(wait, struct mem_cgroup_event, wait); + struct mem_cgroup *memcg = event->memcg; + __poll_t flags = key_to_poll(key); + + if (flags & EPOLLHUP) { + /* + * If the event has been detached at cgroup removal, we + * can simply return knowing the other side will cleanup + * for us. + * + * We can't race against event freeing since the other + * side will require wqh->lock via remove_wait_queue(), + * which we hold. + */ + spin_lock(&memcg->event_list_lock); + if (!list_empty(&event->list)) { + list_del_init(&event->list); + /* + * We are in atomic context, but cgroup_event_remove() + * may sleep, so we have to call it in workqueue. + */ + schedule_work(&event->remove); + } + spin_unlock(&memcg->event_list_lock); + } + + return 0; +} + +static void memcg_event_ptable_queue_proc(struct file *file, + wait_queue_head_t *wqh, poll_table *pt) +{ + struct mem_cgroup_event *event = + container_of(pt, struct mem_cgroup_event, pt); + + event->wqh = wqh; + add_wait_queue(wqh, &event->wait); +} + +/* + * DO NOT USE IN NEW FILES. + * + * Parse input and register new cgroup event handler. + * + * Input must be in format ' '. + * Interpretation of args is defined by control file implementation. + */ +static ssize_t memcg_write_event_control(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct cgroup_subsys_state *css = of_css(of); + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct mem_cgroup_event *event; + struct cgroup_subsys_state *cfile_css; + unsigned int efd, cfd; + struct fd efile; + struct fd cfile; + struct dentry *cdentry; + const char *name; + char *endp; + int ret; + + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + return -EOPNOTSUPP; + + buf = strstrip(buf); + + efd = simple_strtoul(buf, &endp, 10); + if (*endp != ' ') + return -EINVAL; + buf = endp + 1; + + cfd = simple_strtoul(buf, &endp, 10); + if ((*endp != ' ') && (*endp != '\0')) + return -EINVAL; + buf = endp + 1; + + event = kzalloc(sizeof(*event), GFP_KERNEL); + if (!event) + return -ENOMEM; + + event->memcg = memcg; + INIT_LIST_HEAD(&event->list); + init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc); + init_waitqueue_func_entry(&event->wait, memcg_event_wake); + INIT_WORK(&event->remove, memcg_event_remove); + + efile = fdget(efd); + if (!efile.file) { + ret = -EBADF; + goto out_kfree; + } + + event->eventfd = eventfd_ctx_fileget(efile.file); + if (IS_ERR(event->eventfd)) { + ret = PTR_ERR(event->eventfd); + goto out_put_efile; + } + + cfile = fdget(cfd); + if (!cfile.file) { + ret = -EBADF; + goto out_put_eventfd; + } + + /* the process need read permission on control file */ + /* AV: shouldn't we check that it's been opened for read instead? */ + ret = file_permission(cfile.file, MAY_READ); + if (ret < 0) + goto out_put_cfile; + + /* + * The control file must be a regular cgroup1 file. As a regular cgroup + * file can't be renamed, it's safe to access its name afterwards. + */ + cdentry = cfile.file->f_path.dentry; + if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) { + ret = -EINVAL; + goto out_put_cfile; + } + + /* + * Determine the event callbacks and set them in @event. This used + * to be done via struct cftype but cgroup core no longer knows + * about these events. The following is crude but the whole thing + * is for compatibility anyway. + * + * DO NOT ADD NEW FILES. + */ + name = cdentry->d_name.name; + + if (!strcmp(name, "memory.usage_in_bytes")) { + event->register_event = mem_cgroup_usage_register_event; + event->unregister_event = mem_cgroup_usage_unregister_event; + } else if (!strcmp(name, "memory.oom_control")) { + event->register_event = mem_cgroup_oom_register_event; + event->unregister_event = mem_cgroup_oom_unregister_event; + } else if (!strcmp(name, "memory.pressure_level")) { + event->register_event = vmpressure_register_event; + event->unregister_event = vmpressure_unregister_event; + } else if (!strcmp(name, "memory.memsw.usage_in_bytes")) { + event->register_event = memsw_cgroup_usage_register_event; + event->unregister_event = memsw_cgroup_usage_unregister_event; + } else { + ret = -EINVAL; + goto out_put_cfile; + } + + /* + * Verify @cfile should belong to @css. Also, remaining events are + * automatically removed on cgroup destruction but the removal is + * asynchronous, so take an extra ref on @css. + */ + cfile_css = css_tryget_online_from_dir(cdentry->d_parent, + &memory_cgrp_subsys); + ret = -EINVAL; + if (IS_ERR(cfile_css)) + goto out_put_cfile; + if (cfile_css != css) { + css_put(cfile_css); + goto out_put_cfile; + } + + ret = event->register_event(memcg, event->eventfd, buf); + if (ret) + goto out_put_css; + + vfs_poll(efile.file, &event->pt); + + spin_lock_irq(&memcg->event_list_lock); + list_add(&event->list, &memcg->event_list); + spin_unlock_irq(&memcg->event_list_lock); + + fdput(cfile); + fdput(efile); + + return nbytes; + +out_put_css: + css_put(css); +out_put_cfile: + fdput(cfile); +out_put_eventfd: + eventfd_ctx_put(event->eventfd); +out_put_efile: + fdput(efile); +out_kfree: + kfree(event); + + return ret; +} + +void memcg1_memcg_init(struct mem_cgroup *memcg) +{ + INIT_LIST_HEAD(&memcg->oom_notify); + mutex_init(&memcg->thresholds_lock); + spin_lock_init(&memcg->move_lock); + INIT_LIST_HEAD(&memcg->event_list); + spin_lock_init(&memcg->event_list_lock); +} + +void memcg1_css_offline(struct mem_cgroup *memcg) +{ + struct mem_cgroup_event *event, *tmp; + + /* + * Unregister events and notify userspace. + * Notify userspace about cgroup removing only after rmdir of cgroup + * directory to avoid race between userspace and kernelspace. + */ + spin_lock_irq(&memcg->event_list_lock); + list_for_each_entry_safe(event, tmp, &memcg->event_list, list) { + list_del_init(&event->list); + schedule_work(&event->remove); + } + spin_unlock_irq(&memcg->event_list_lock); +} + +/* + * Check OOM-Killer is already running under our hierarchy. + * If someone is running, return false. + */ +static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter, *failed = NULL; + + spin_lock(&memcg_oom_lock); + + for_each_mem_cgroup_tree(iter, memcg) { + if (iter->oom_lock) { + /* + * this subtree of our hierarchy is already locked + * so we cannot give a lock. + */ + failed = iter; + mem_cgroup_iter_break(memcg, iter); + break; + } else + iter->oom_lock = true; + } + + if (failed) { + /* + * OK, we failed to lock the whole subtree so we have + * to clean up what we set up to the failing subtree + */ + for_each_mem_cgroup_tree(iter, memcg) { + if (iter == failed) { + mem_cgroup_iter_break(memcg, iter); + break; + } + iter->oom_lock = false; + } + } else + mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_); + + spin_unlock(&memcg_oom_lock); + + return !failed; +} + +static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + + spin_lock(&memcg_oom_lock); + mutex_release(&memcg_oom_lock_dep_map, _RET_IP_); + for_each_mem_cgroup_tree(iter, memcg) + iter->oom_lock = false; + spin_unlock(&memcg_oom_lock); +} + +static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + + spin_lock(&memcg_oom_lock); + for_each_mem_cgroup_tree(iter, memcg) + iter->under_oom++; + spin_unlock(&memcg_oom_lock); +} + +static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg) +{ + struct mem_cgroup *iter; + + /* + * Be careful about under_oom underflows because a child memcg + * could have been added after mem_cgroup_mark_under_oom. + */ + spin_lock(&memcg_oom_lock); + for_each_mem_cgroup_tree(iter, memcg) + if (iter->under_oom > 0) + iter->under_oom--; + spin_unlock(&memcg_oom_lock); +} + +static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq); + +struct oom_wait_info { + struct mem_cgroup *memcg; + wait_queue_entry_t wait; +}; + +static int memcg_oom_wake_function(wait_queue_entry_t *wait, + unsigned mode, int sync, void *arg) +{ + struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg; + struct mem_cgroup *oom_wait_memcg; + struct oom_wait_info *oom_wait_info; + + oom_wait_info = container_of(wait, struct oom_wait_info, wait); + oom_wait_memcg = oom_wait_info->memcg; + + if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) && + !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg)) + return 0; + return autoremove_wake_function(wait, mode, sync, arg); +} + +void memcg1_oom_recover(struct mem_cgroup *memcg) +{ + /* + * For the following lockless ->under_oom test, the only required + * guarantee is that it must see the state asserted by an OOM when + * this function is called as a result of userland actions + * triggered by the notification of the OOM. This is trivially + * achieved by invoking mem_cgroup_mark_under_oom() before + * triggering notification. + */ + if (memcg && memcg->under_oom) + __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); +} + +/** + * mem_cgroup_oom_synchronize - complete memcg OOM handling + * @handle: actually kill/wait or just clean up the OOM state + * + * This has to be called at the end of a page fault if the memcg OOM + * handler was enabled. + * + * Memcg supports userspace OOM handling where failed allocations must + * sleep on a waitqueue until the userspace task resolves the + * situation. Sleeping directly in the charge context with all kinds + * of locks held is not a good idea, instead we remember an OOM state + * in the task and mem_cgroup_oom_synchronize() has to be called at + * the end of the page fault to complete the OOM handling. + * + * Returns %true if an ongoing memcg OOM situation was detected and + * completed, %false otherwise. + */ +bool mem_cgroup_oom_synchronize(bool handle) +{ + struct mem_cgroup *memcg = current->memcg_in_oom; + struct oom_wait_info owait; + bool locked; + + /* OOM is global, do not handle */ + if (!memcg) + return false; + + if (!handle) + goto cleanup; + + owait.memcg = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.entry); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + mem_cgroup_mark_under_oom(memcg); + + locked = mem_cgroup_oom_trylock(memcg); + + if (locked) + mem_cgroup_oom_notify(memcg); + + schedule(); + mem_cgroup_unmark_under_oom(memcg); + finish_wait(&memcg_oom_waitq, &owait.wait); + + if (locked) + mem_cgroup_oom_unlock(memcg); +cleanup: + current->memcg_in_oom = NULL; + css_put(&memcg->css); + return true; +} + + +bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) +{ + /* + * We are in the middle of the charge context here, so we + * don't want to block when potentially sitting on a callstack + * that holds all kinds of filesystem and mm locks. + * + * cgroup1 allows disabling the OOM killer and waiting for outside + * handling until the charge can succeed; remember the context and put + * the task to sleep at the end of the page fault when all locks are + * released. + * + * On the other hand, in-kernel OOM killer allows for an async victim + * memory reclaim (oom_reaper) and that means that we are not solely + * relying on the oom victim to make a forward progress and we can + * invoke the oom killer here. + * + * Please note that mem_cgroup_out_of_memory might fail to find a + * victim and then we have to bail out from the charge path. + */ + if (READ_ONCE(memcg->oom_kill_disable)) { + if (current->in_user_fault) { + css_get(&memcg->css); + current->memcg_in_oom = memcg; + } + return false; + } + + mem_cgroup_mark_under_oom(memcg); + + *locked = mem_cgroup_oom_trylock(memcg); + + if (*locked) + mem_cgroup_oom_notify(memcg); + + mem_cgroup_unmark_under_oom(memcg); + + return true; +} + +void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) +{ + if (locked) + mem_cgroup_oom_unlock(memcg); +} + +static DEFINE_MUTEX(memcg_max_mutex); + +static int mem_cgroup_resize_max(struct mem_cgroup *memcg, + unsigned long max, bool memsw) +{ + bool enlarge = false; + bool drained = false; + int ret; + bool limits_invariant; + struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory; + + do { + if (signal_pending(current)) { + ret = -EINTR; + break; + } + + mutex_lock(&memcg_max_mutex); + /* + * Make sure that the new limit (memsw or memory limit) doesn't + * break our basic invariant rule memory.max <= memsw.max. + */ + limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) : + max <= memcg->memsw.max; + if (!limits_invariant) { + mutex_unlock(&memcg_max_mutex); + ret = -EINVAL; + break; + } + if (max > counter->max) + enlarge = true; + ret = page_counter_set_max(counter, max); + mutex_unlock(&memcg_max_mutex); + + if (!ret) + break; + + if (!drained) { + drain_all_stock(memcg); + drained = true; + continue; + } + + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { + ret = -EBUSY; + break; + } + } while (true); + + if (!ret && enlarge) + memcg1_oom_recover(memcg); + + return ret; +} + +/* + * Reclaims as many pages from the given memcg as possible. + * + * Caller is responsible for holding css reference for memcg. + */ +static int mem_cgroup_force_empty(struct mem_cgroup *memcg) +{ + int nr_retries = MAX_RECLAIM_RETRIES; + + /* we call try-to-free pages for make this cgroup empty */ + lru_add_drain_all(); + + drain_all_stock(memcg); + + /* try to free all pages in this cgroup */ + while (nr_retries && page_counter_read(&memcg->memory)) { + if (signal_pending(current)) + return -EINTR; + + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, + MEMCG_RECLAIM_MAY_SWAP, NULL)) + nr_retries--; + } + + return 0; +} + +static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + if (mem_cgroup_is_root(memcg)) + return -EINVAL; + return mem_cgroup_force_empty(memcg) ?: nbytes; +} + +static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return 1; +} + +static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + if (val == 1) + return 0; + + pr_warn_once("Non-hierarchical mode is deprecated. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); + + return -EINVAL; +} + +static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct page_counter *counter; + + switch (MEMFILE_TYPE(cft->private)) { + case _MEM: + counter = &memcg->memory; + break; + case _MEMSWAP: + counter = &memcg->memsw; + break; + case _KMEM: + counter = &memcg->kmem; + break; + case _TCP: + counter = &memcg->tcpmem; + break; + default: + BUG(); + } + + switch (MEMFILE_ATTR(cft->private)) { + case RES_USAGE: + if (counter == &memcg->memory) + return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE; + if (counter == &memcg->memsw) + return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE; + return (u64)page_counter_read(counter) * PAGE_SIZE; + case RES_LIMIT: + return (u64)counter->max * PAGE_SIZE; + case RES_MAX_USAGE: + return (u64)counter->watermark * PAGE_SIZE; + case RES_FAILCNT: + return counter->failcnt; + case RES_SOFT_LIMIT: + return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE; + default: + BUG(); + } +} + +/* + * This function doesn't do anything useful. Its only job is to provide a read + * handler for a file so that cgroup_file_mode() will add read permissions. + */ +static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m, + __always_unused void *v) +{ + return -EINVAL; +} + +static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max) +{ + int ret; + + mutex_lock(&memcg_max_mutex); + + ret = page_counter_set_max(&memcg->tcpmem, max); + if (ret) + goto out; + + if (!memcg->tcpmem_active) { + /* + * The active flag needs to be written after the static_key + * update. This is what guarantees that the socket activation + * function is the last one to run. See mem_cgroup_sk_alloc() + * for details, and note that we don't mark any socket as + * belonging to this memcg until that flag is up. + * + * We need to do this, because static_keys will span multiple + * sites, but we can't control their order. If we mark a socket + * as accounted, but the accounting functions are not patched in + * yet, we'll lose accounting. + * + * We never race with the readers in mem_cgroup_sk_alloc(), + * because when this value change, the code to process it is not + * patched in yet. + */ + static_branch_inc(&memcg_sockets_enabled_key); + memcg->tcpmem_active = true; + } +out: + mutex_unlock(&memcg_max_mutex); + return ret; +} + +/* + * The user of this function is... + * RES_LIMIT. + */ +static ssize_t mem_cgroup_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + unsigned long nr_pages; + int ret; + + buf = strstrip(buf); + ret = page_counter_memparse(buf, "-1", &nr_pages); + if (ret) + return ret; + + switch (MEMFILE_ATTR(of_cft(of)->private)) { + case RES_LIMIT: + if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */ + ret = -EINVAL; + break; + } + switch (MEMFILE_TYPE(of_cft(of)->private)) { + case _MEM: + ret = mem_cgroup_resize_max(memcg, nr_pages, false); + break; + case _MEMSWAP: + ret = mem_cgroup_resize_max(memcg, nr_pages, true); + break; + case _KMEM: + pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. " + "Writing any value to this file has no effect. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); + ret = 0; + break; + case _TCP: + ret = memcg_update_tcp_max(memcg, nr_pages); + break; + } + break; + case RES_SOFT_LIMIT: + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + ret = -EOPNOTSUPP; + } else { + WRITE_ONCE(memcg->soft_limit, nr_pages); + ret = 0; + } + break; + } + return ret ?: nbytes; +} + +static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct page_counter *counter; + + switch (MEMFILE_TYPE(of_cft(of)->private)) { + case _MEM: + counter = &memcg->memory; + break; + case _MEMSWAP: + counter = &memcg->memsw; + break; + case _KMEM: + counter = &memcg->kmem; + break; + case _TCP: + counter = &memcg->tcpmem; + break; + default: + BUG(); + } + + switch (MEMFILE_ATTR(of_cft(of)->private)) { + case RES_MAX_USAGE: + page_counter_reset_watermark(counter); + break; + case RES_FAILCNT: + counter->failcnt = 0; + break; + default: + BUG(); + } + + return nbytes; +} + +#ifdef CONFIG_NUMA + +#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)) +#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)) +#define LRU_ALL ((1 << NR_LRU_LISTS) - 1) + +static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, + int nid, unsigned int lru_mask, bool tree) +{ + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + unsigned long nr = 0; + enum lru_list lru; + + VM_BUG_ON((unsigned)nid >= nr_node_ids); + + for_each_lru(lru) { + if (!(BIT(lru) & lru_mask)) + continue; + if (tree) + nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); + else + nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); + } + return nr; +} + +static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, + unsigned int lru_mask, + bool tree) +{ + unsigned long nr = 0; + enum lru_list lru; + + for_each_lru(lru) { + if (!(BIT(lru) & lru_mask)) + continue; + if (tree) + nr += memcg_page_state(memcg, NR_LRU_BASE + lru); + else + nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru); + } + return nr; +} + +static int memcg_numa_stat_show(struct seq_file *m, void *v) +{ + struct numa_stat { + const char *name; + unsigned int lru_mask; + }; + + static const struct numa_stat stats[] = { + { "total", LRU_ALL }, + { "file", LRU_ALL_FILE }, + { "anon", LRU_ALL_ANON }, + { "unevictable", BIT(LRU_UNEVICTABLE) }, + }; + const struct numa_stat *stat; + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + mem_cgroup_flush_stats(memcg); + + for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { + seq_printf(m, "%s=%lu", stat->name, + mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, + false)); + for_each_node_state(nid, N_MEMORY) + seq_printf(m, " N%d=%lu", nid, + mem_cgroup_node_nr_lru_pages(memcg, nid, + stat->lru_mask, false)); + seq_putc(m, '\n'); + } + + for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { + + seq_printf(m, "hierarchical_%s=%lu", stat->name, + mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, + true)); + for_each_node_state(nid, N_MEMORY) + seq_printf(m, " N%d=%lu", nid, + mem_cgroup_node_nr_lru_pages(memcg, nid, + stat->lru_mask, true)); + seq_putc(m, '\n'); + } + + return 0; +} +#endif /* CONFIG_NUMA */ + +static const unsigned int memcg1_stats[] = { + NR_FILE_PAGES, + NR_ANON_MAPPED, +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + NR_ANON_THPS, +#endif + NR_SHMEM, + NR_FILE_MAPPED, + NR_FILE_DIRTY, + NR_WRITEBACK, + WORKINGSET_REFAULT_ANON, + WORKINGSET_REFAULT_FILE, +#ifdef CONFIG_SWAP + MEMCG_SWAP, + NR_SWAPCACHE, +#endif +}; + +static const char *const memcg1_stat_names[] = { + "cache", + "rss", +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + "rss_huge", +#endif + "shmem", + "mapped_file", + "dirty", + "writeback", + "workingset_refault_anon", + "workingset_refault_file", +#ifdef CONFIG_SWAP + "swap", + "swapcached", +#endif +}; + +/* Universal VM events cgroup1 shows, original sort order */ +static const unsigned int memcg1_events[] = { + PGPGIN, + PGPGOUT, + PGFAULT, + PGMAJFAULT, +}; + +void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) +{ + unsigned long memory, memsw; + struct mem_cgroup *mi; + unsigned int i; + + BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); + + mem_cgroup_flush_stats(memcg); + + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { + unsigned long nr; + + nr = memcg_page_state_local_output(memcg, memcg1_stats[i]); + seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr); + } + + for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) + seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]), + memcg_events_local(memcg, memcg1_events[i])); + + for (i = 0; i < NR_LRU_LISTS; i++) + seq_buf_printf(s, "%s %lu\n", lru_list_name(i), + memcg_page_state_local(memcg, NR_LRU_BASE + i) * + PAGE_SIZE); + + /* Hierarchical information */ + memory = memsw = PAGE_COUNTER_MAX; + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) { + memory = min(memory, READ_ONCE(mi->memory.max)); + memsw = min(memsw, READ_ONCE(mi->memsw.max)); + } + seq_buf_printf(s, "hierarchical_memory_limit %llu\n", + (u64)memory * PAGE_SIZE); + seq_buf_printf(s, "hierarchical_memsw_limit %llu\n", + (u64)memsw * PAGE_SIZE); + + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { + unsigned long nr; + + nr = memcg_page_state_output(memcg, memcg1_stats[i]); + seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i], + (u64)nr); + } + + for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) + seq_buf_printf(s, "total_%s %llu\n", + vm_event_name(memcg1_events[i]), + (u64)memcg_events(memcg, memcg1_events[i])); + + for (i = 0; i < NR_LRU_LISTS; i++) + seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i), + (u64)memcg_page_state(memcg, NR_LRU_BASE + i) * + PAGE_SIZE); + +#ifdef CONFIG_DEBUG_VM + { + pg_data_t *pgdat; + struct mem_cgroup_per_node *mz; + unsigned long anon_cost = 0; + unsigned long file_cost = 0; + + for_each_online_pgdat(pgdat) { + mz = memcg->nodeinfo[pgdat->node_id]; + + anon_cost += mz->lruvec.anon_cost; + file_cost += mz->lruvec.file_cost; + } + seq_buf_printf(s, "anon_cost %lu\n", anon_cost); + seq_buf_printf(s, "file_cost %lu\n", file_cost); + } +#endif +} + +static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return mem_cgroup_swappiness(memcg); +} + +static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + if (val > MAX_SWAPPINESS) + return -EINVAL; + + if (!mem_cgroup_is_root(memcg)) + WRITE_ONCE(memcg->swappiness, val); + else + WRITE_ONCE(vm_swappiness, val); + + return 0; +} + +static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_seq(sf); + + seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable)); + seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); + seq_printf(sf, "oom_kill %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); + return 0; +} + +static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + /* cannot set to root cgroup and only 0 and 1 are allowed */ + if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) + return -EINVAL; + + WRITE_ONCE(memcg->oom_kill_disable, val); + if (!val) + memcg1_oom_recover(memcg); + + return 0; +} + +#ifdef CONFIG_SLUB_DEBUG +static int mem_cgroup_slab_show(struct seq_file *m, void *p) +{ + /* + * Deprecated. + * Please, take a look at tools/cgroup/memcg_slabinfo.py . + */ + return 0; +} +#endif + +struct cftype mem_cgroup_legacy_files[] = { + { + .name = "usage_in_bytes", + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "max_usage_in_bytes", + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "limit_in_bytes", + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT), + .write = mem_cgroup_write, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "soft_limit_in_bytes", + .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT), + .write = mem_cgroup_write, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "failcnt", + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "stat", + .seq_show = memory_stat_show, + }, + { + .name = "force_empty", + .write = mem_cgroup_force_empty_write, + }, + { + .name = "use_hierarchy", + .write_u64 = mem_cgroup_hierarchy_write, + .read_u64 = mem_cgroup_hierarchy_read, + }, + { + .name = "cgroup.event_control", /* XXX: for compat */ + .write = memcg_write_event_control, + .flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE, + }, + { + .name = "swappiness", + .read_u64 = mem_cgroup_swappiness_read, + .write_u64 = mem_cgroup_swappiness_write, + }, + { + .name = "move_charge_at_immigrate", + .read_u64 = mem_cgroup_move_charge_read, + .write_u64 = mem_cgroup_move_charge_write, + }, + { + .name = "oom_control", + .seq_show = mem_cgroup_oom_control_read, + .write_u64 = mem_cgroup_oom_control_write, + }, + { + .name = "pressure_level", + .seq_show = mem_cgroup_dummy_seq_show, + }, +#ifdef CONFIG_NUMA + { + .name = "numa_stat", + .seq_show = memcg_numa_stat_show, + }, +#endif + { + .name = "kmem.limit_in_bytes", + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), + .write = mem_cgroup_write, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.usage_in_bytes", + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.failcnt", + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.max_usage_in_bytes", + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, +#ifdef CONFIG_SLUB_DEBUG + { + .name = "kmem.slabinfo", + .seq_show = mem_cgroup_slab_show, + }, +#endif + { + .name = "kmem.tcp.limit_in_bytes", + .private = MEMFILE_PRIVATE(_TCP, RES_LIMIT), + .write = mem_cgroup_write, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.tcp.usage_in_bytes", + .private = MEMFILE_PRIVATE(_TCP, RES_USAGE), + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.tcp.failcnt", + .private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "kmem.tcp.max_usage_in_bytes", + .private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { }, /* terminate */ +}; + +struct cftype memsw_files[] = { + { + .name = "memsw.usage_in_bytes", + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE), + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "memsw.max_usage_in_bytes", + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "memsw.limit_in_bytes", + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT), + .write = mem_cgroup_write, + .read_u64 = mem_cgroup_read_u64, + }, + { + .name = "memsw.failcnt", + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT), + .write = mem_cgroup_reset, + .read_u64 = mem_cgroup_read_u64, + }, + { }, /* terminate */ +}; + +void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages) +{ + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + if (nr_pages > 0) + page_counter_charge(&memcg->kmem, nr_pages); + else + page_counter_uncharge(&memcg->kmem, -nr_pages); + } +} + +bool memcg1_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, + gfp_t gfp_mask) +{ + struct page_counter *fail; + + if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) { + memcg->tcpmem_pressure = 0; + return true; + } + memcg->tcpmem_pressure = 1; + if (gfp_mask & __GFP_NOFAIL) { + page_counter_charge(&memcg->tcpmem, nr_pages); + return true; + } + return false; +} + +static int __init memcg1_init(void) +{ + int node; + + for_each_node(node) { + struct mem_cgroup_tree_per_node *rtpn; + + rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node); + + rtpn->rb_root = RB_ROOT; + rtpn->rb_rightmost = NULL; + spin_lock_init(&rtpn->lock); + soft_limit_tree.rb_tree_per_node[node] = rtpn; + } + + return 0; +} +subsys_initcall(memcg1_init); diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h new file mode 100644 index 000000000000..56d7eaa98274 --- /dev/null +++ b/mm/memcontrol-v1.h @@ -0,0 +1,147 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef __MM_MEMCONTROL_V1_H +#define __MM_MEMCONTROL_V1_H + +#include + +/* Cgroup v1 and v2 common declarations */ + +void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages); +int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, + unsigned int nr_pages); + +static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, + unsigned int nr_pages) +{ + if (mem_cgroup_is_root(memcg)) + return 0; + + return try_charge_memcg(memcg, gfp_mask, nr_pages); +} + +void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n); +void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n); + +/* + * Iteration constructs for visiting all cgroups (under a tree). If + * loops are exited prematurely (break), mem_cgroup_iter_break() must + * be used for reference counting. + */ +#define for_each_mem_cgroup_tree(iter, root) \ + for (iter = mem_cgroup_iter(root, NULL, NULL); \ + iter != NULL; \ + iter = mem_cgroup_iter(root, iter, NULL)) + +#define for_each_mem_cgroup(iter) \ + for (iter = mem_cgroup_iter(NULL, NULL, NULL); \ + iter != NULL; \ + iter = mem_cgroup_iter(NULL, iter, NULL)) + +/* Whether legacy memory+swap accounting is active */ +static bool do_memsw_account(void) +{ + return !cgroup_subsys_on_dfl(memory_cgrp_subsys); +} + +/* + * Per memcg event counter is incremented at every pagein/pageout. With THP, + * it will be incremented by the number of pages. This counter is used + * to trigger some periodic events. This is straightforward and better + * than using jiffies etc. to handle periodic memcg event. + */ +enum mem_cgroup_events_target { + MEM_CGROUP_TARGET_THRESH, + MEM_CGROUP_TARGET_SOFTLIMIT, + MEM_CGROUP_NTARGETS, +}; + +bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, + enum mem_cgroup_events_target target); +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap); + +void drain_all_stock(struct mem_cgroup *root_memcg); + +unsigned long memcg_events(struct mem_cgroup *memcg, int event); +unsigned long memcg_events_local(struct mem_cgroup *memcg, int event); +unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx); +unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item); +unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item); +int memory_stat_show(struct seq_file *m, void *v); + +/* Cgroup v1-specific declarations */ +#ifdef CONFIG_MEMCG_V1 +void memcg1_memcg_init(struct mem_cgroup *memcg); +void memcg1_remove_from_trees(struct mem_cgroup *memcg); + +static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg) +{ + WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); +} + +bool memcg1_wait_acct_move(struct mem_cgroup *memcg); + +struct cgroup_taskset; +int memcg1_can_attach(struct cgroup_taskset *tset); +void memcg1_cancel_attach(struct cgroup_taskset *tset); +void memcg1_move_task(void); +void memcg1_css_offline(struct mem_cgroup *memcg); + +/* for encoding cft->private value on file */ +enum res_type { + _MEM, + _MEMSWAP, + _KMEM, + _TCP, +}; + +bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked); +void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked); +void memcg1_oom_recover(struct mem_cgroup *memcg); + +void memcg1_check_events(struct mem_cgroup *memcg, int nid); + +void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s); + +void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages); +static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg) +{ + return memcg->tcpmem_active; +} +bool memcg1_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, + gfp_t gfp_mask); +static inline void memcg1_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) +{ + page_counter_uncharge(&memcg->tcpmem, nr_pages); +} + +extern struct cftype memsw_files[]; +extern struct cftype mem_cgroup_legacy_files[]; + +#else /* CONFIG_MEMCG_V1 */ + +static inline void memcg1_memcg_init(struct mem_cgroup *memcg) {} +static inline void memcg1_remove_from_trees(struct mem_cgroup *memcg) {} +static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg) {} +static inline bool memcg1_wait_acct_move(struct mem_cgroup *memcg) { return false; } +static inline void memcg1_css_offline(struct mem_cgroup *memcg) {} + +static inline bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) { return true; } +static inline void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) {} +static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {} + +static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {} + +static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {} + +static inline void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages) {} +static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg) { return false; } +static inline bool memcg1_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, + gfp_t gfp_mask) { return true; } +static inline void memcg1_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) {} + +extern struct cftype memsw_files[]; +extern struct cftype mem_cgroup_legacy_files[]; +#endif /* CONFIG_MEMCG_V1 */ + +#endif /* __MM_MEMCONTROL_V1_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8f2f1bb18c9c..960371788687 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include @@ -45,14 +44,11 @@ #include #include #include -#include #include #include -#include -#include -#include #include #include +#include #include #include #include @@ -60,7 +56,6 @@ #include #include #include -#include #include #include #include @@ -70,7 +65,7 @@ #include #include #include "slab.h" -#include "swap.h" +#include "memcontrol-v1.h" #include @@ -98,140 +93,9 @@ static bool cgroup_memory_nobpf __ro_after_init; static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); #endif -/* Whether legacy memory+swap accounting is active */ -static bool do_memsw_account(void) -{ - return !cgroup_subsys_on_dfl(memory_cgrp_subsys); -} - #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024 -/* - * Cgroups above their limits are maintained in a RB-Tree, independent of - * their hierarchy representation - */ - -struct mem_cgroup_tree_per_node { - struct rb_root rb_root; - struct rb_node *rb_rightmost; - spinlock_t lock; -}; - -struct mem_cgroup_tree { - struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES]; -}; - -static struct mem_cgroup_tree soft_limit_tree __read_mostly; - -/* for OOM */ -struct mem_cgroup_eventfd_list { - struct list_head list; - struct eventfd_ctx *eventfd; -}; - -/* - * cgroup_event represents events which userspace want to receive. - */ -struct mem_cgroup_event { - /* - * memcg which the event belongs to. - */ - struct mem_cgroup *memcg; - /* - * eventfd to signal userspace about the event. - */ - struct eventfd_ctx *eventfd; - /* - * Each of these stored in a list by the cgroup. - */ - struct list_head list; - /* - * register_event() callback will be used to add new userspace - * waiter for changes related to this event. Use eventfd_signal() - * on eventfd to send notification to userspace. - */ - int (*register_event)(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, const char *args); - /* - * unregister_event() callback will be called when userspace closes - * the eventfd or on cgroup removing. This callback must be set, - * if you want provide notification functionality. - */ - void (*unregister_event)(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd); - /* - * All fields below needed to unregister event when - * userspace closes eventfd. - */ - poll_table pt; - wait_queue_head_t *wqh; - wait_queue_entry_t wait; - struct work_struct remove; -}; - -static void mem_cgroup_threshold(struct mem_cgroup *memcg); -static void mem_cgroup_oom_notify(struct mem_cgroup *memcg); - -/* Stuffs for move charges at task migration. */ -/* - * Types of charges to be moved. - */ -#define MOVE_ANON 0x1U -#define MOVE_FILE 0x2U -#define MOVE_MASK (MOVE_ANON | MOVE_FILE) - -/* "mc" and its members are protected by cgroup_mutex */ -static struct move_charge_struct { - spinlock_t lock; /* for from, to */ - struct mm_struct *mm; - struct mem_cgroup *from; - struct mem_cgroup *to; - unsigned long flags; - unsigned long precharge; - unsigned long moved_charge; - unsigned long moved_swap; - struct task_struct *moving_task; /* a task moving charges */ - wait_queue_head_t waitq; /* a waitq for other context */ -} mc = { - .lock = __SPIN_LOCK_UNLOCKED(mc.lock), - .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq), -}; - -/* - * Maximum loops in mem_cgroup_soft_reclaim(), used for soft - * limit reclaim to prevent infinite loops, if they ever occur. - */ -#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 -#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 - -/* for encoding cft->private value on file */ -enum res_type { - _MEM, - _MEMSWAP, - _KMEM, - _TCP, -}; - -#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val)) -#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff) -#define MEMFILE_ATTR(val) ((val) & 0xffff) - -/* - * Iteration constructs for visiting all cgroups (under a tree). If - * loops are exited prematurely (break), mem_cgroup_iter_break() must - * be used for reference counting. - */ -#define for_each_mem_cgroup_tree(iter, root) \ - for (iter = mem_cgroup_iter(root, NULL, NULL); \ - iter != NULL; \ - iter = mem_cgroup_iter(root, iter, NULL)) - -#define for_each_mem_cgroup(iter) \ - for (iter = mem_cgroup_iter(NULL, NULL, NULL); \ - iter != NULL; \ - iter = mem_cgroup_iter(NULL, iter, NULL)) - static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || @@ -254,7 +118,6 @@ struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr) #define CURRENT_OBJCG_UPDATE_BIT 0 #define CURRENT_OBJCG_UPDATE_FLAG (1UL << CURRENT_OBJCG_UPDATE_BIT) -#ifdef CONFIG_MEMCG_KMEM static DEFINE_SPINLOCK(objcg_lock); bool mem_cgroup_kmem_disabled(void) @@ -359,7 +222,6 @@ EXPORT_SYMBOL(memcg_kmem_online_key); DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key); EXPORT_SYMBOL(memcg_bpf_enabled_key); -#endif /** * mem_cgroup_css_from_folio - css of the memcg associated with a folio @@ -412,169 +274,6 @@ ino_t page_cgroup_ino(struct page *page) return ino; } -static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz, - struct mem_cgroup_tree_per_node *mctz, - unsigned long new_usage_in_excess) -{ - struct rb_node **p = &mctz->rb_root.rb_node; - struct rb_node *parent = NULL; - struct mem_cgroup_per_node *mz_node; - bool rightmost = true; - - if (mz->on_tree) - return; - - mz->usage_in_excess = new_usage_in_excess; - if (!mz->usage_in_excess) - return; - while (*p) { - parent = *p; - mz_node = rb_entry(parent, struct mem_cgroup_per_node, - tree_node); - if (mz->usage_in_excess < mz_node->usage_in_excess) { - p = &(*p)->rb_left; - rightmost = false; - } else { - p = &(*p)->rb_right; - } - } - - if (rightmost) - mctz->rb_rightmost = &mz->tree_node; - - rb_link_node(&mz->tree_node, parent, p); - rb_insert_color(&mz->tree_node, &mctz->rb_root); - mz->on_tree = true; -} - -static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz, - struct mem_cgroup_tree_per_node *mctz) -{ - if (!mz->on_tree) - return; - - if (&mz->tree_node == mctz->rb_rightmost) - mctz->rb_rightmost = rb_prev(&mz->tree_node); - - rb_erase(&mz->tree_node, &mctz->rb_root); - mz->on_tree = false; -} - -static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz, - struct mem_cgroup_tree_per_node *mctz) -{ - unsigned long flags; - - spin_lock_irqsave(&mctz->lock, flags); - __mem_cgroup_remove_exceeded(mz, mctz); - spin_unlock_irqrestore(&mctz->lock, flags); -} - -static unsigned long soft_limit_excess(struct mem_cgroup *memcg) -{ - unsigned long nr_pages = page_counter_read(&memcg->memory); - unsigned long soft_limit = READ_ONCE(memcg->soft_limit); - unsigned long excess = 0; - - if (nr_pages > soft_limit) - excess = nr_pages - soft_limit; - - return excess; -} - -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid) -{ - unsigned long excess; - struct mem_cgroup_per_node *mz; - struct mem_cgroup_tree_per_node *mctz; - - if (lru_gen_enabled()) { - if (soft_limit_excess(memcg)) - lru_gen_soft_reclaim(memcg, nid); - return; - } - - mctz = soft_limit_tree.rb_tree_per_node[nid]; - if (!mctz) - return; - /* - * Necessary to update all ancestors when hierarchy is used. - * because their event counter is not touched. - */ - for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = memcg->nodeinfo[nid]; - excess = soft_limit_excess(memcg); - /* - * We have to update the tree if mz is on RB-tree or - * mem is over its softlimit. - */ - if (excess || mz->on_tree) { - unsigned long flags; - - spin_lock_irqsave(&mctz->lock, flags); - /* if on-tree, remove it */ - if (mz->on_tree) - __mem_cgroup_remove_exceeded(mz, mctz); - /* - * Insert again. mz->usage_in_excess will be updated. - * If excess is 0, no tree ops. - */ - __mem_cgroup_insert_exceeded(mz, mctz, excess); - spin_unlock_irqrestore(&mctz->lock, flags); - } - } -} - -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) -{ - struct mem_cgroup_tree_per_node *mctz; - struct mem_cgroup_per_node *mz; - int nid; - - for_each_node(nid) { - mz = memcg->nodeinfo[nid]; - mctz = soft_limit_tree.rb_tree_per_node[nid]; - if (mctz) - mem_cgroup_remove_exceeded(mz, mctz); - } -} - -static struct mem_cgroup_per_node * -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) -{ - struct mem_cgroup_per_node *mz; - -retry: - mz = NULL; - if (!mctz->rb_rightmost) - goto done; /* Nothing to reclaim from */ - - mz = rb_entry(mctz->rb_rightmost, - struct mem_cgroup_per_node, tree_node); - /* - * Remove the node now but someone else can add it back, - * we will to add it back at the end of reclaim to its correct - * position in the tree. - */ - __mem_cgroup_remove_exceeded(mz, mctz); - if (!soft_limit_excess(mz->memcg) || - !css_tryget(&mz->memcg->css)) - goto retry; -done: - return mz; -} - -static struct mem_cgroup_per_node * -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) -{ - struct mem_cgroup_per_node *mz; - - spin_lock_irq(&mctz->lock); - mz = __mem_cgroup_largest_soft_limit_node(mctz); - spin_unlock_irq(&mctz->lock); - return mz; -} - /* Subset of node_stat_item for memcg stats */ static const unsigned int memcg_node_stat_items[] = { NR_INACTIVE_ANON, @@ -722,7 +421,7 @@ static const unsigned int memcg_vm_event_stat[] = { PGDEACTIVATE, PGLAZYFREE, PGLAZYFREED, -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#ifdef CONFIG_ZSWAP ZSWPIN, ZSWPOUT, ZSWPWB, @@ -971,7 +670,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, } /* idx can be of type enum memcg_stat_item or node_stat_item. */ -static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) +unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) { long x; int i = memcg_stats_index(idx); @@ -1120,7 +819,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, memcg_stats_unlock(); } -static unsigned long memcg_events(struct mem_cgroup *memcg, int event) +unsigned long memcg_events(struct mem_cgroup *memcg, int event) { int i = memcg_events_index(event); @@ -1130,7 +829,7 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event) return READ_ONCE(memcg->vmstats->events[i]); } -static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) +unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) { int i = memcg_events_index(event); @@ -1140,8 +839,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) return READ_ONCE(memcg->vmstats->events_local[i]); } -static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, - int nr_pages) +void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages) { /* pagein of a big page is an event. So, ignore page size */ if (nr_pages > 0) @@ -1154,8 +852,8 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages); } -static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, - enum mem_cgroup_events_target target) +bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, + enum mem_cgroup_events_target target) { unsigned long val, next; @@ -1179,28 +877,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, return false; } -/* - * Check events in order. - * - */ -static void memcg_check_events(struct mem_cgroup *memcg, int nid) -{ - if (IS_ENABLED(CONFIG_PREEMPT_RT)) - return; - - /* threshold event is triggered in finer grain than soft limit */ - if (unlikely(mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; - - do_softlimit = mem_cgroup_event_ratelimit(memcg, - MEM_CGROUP_TARGET_SOFTLIMIT); - mem_cgroup_threshold(memcg); - if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, nid); - } -} - struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) { /* @@ -1652,51 +1328,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) return margin; } -/* - * A routine for checking "mem" is under move_account() or not. - * - * Checking a cgroup is mc.from or mc.to or under hierarchy of - * moving cgroups. This is for waiting at high-memory pressure - * caused by "move". - */ -static bool mem_cgroup_under_move(struct mem_cgroup *memcg) -{ - struct mem_cgroup *from; - struct mem_cgroup *to; - bool ret = false; - /* - * Unlike task_move routines, we access mc.to, mc.from not under - * mutual exclusion by cgroup_mutex. Here, we take spinlock instead. - */ - spin_lock(&mc.lock); - from = mc.from; - to = mc.to; - if (!from) - goto unlock; - - ret = mem_cgroup_is_descendant(from, memcg) || - mem_cgroup_is_descendant(to, memcg); -unlock: - spin_unlock(&mc.lock); - return ret; -} - -static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg) -{ - if (mc.moving_task && current != mc.moving_task) { - if (mem_cgroup_under_move(memcg)) { - DEFINE_WAIT(wait); - prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE); - /* moving charge context might have finished. */ - if (mc.moving_task) - schedule(); - finish_wait(&mc.waitq, &wait); - return true; - } - } - return false; -} - struct memory_stat { const char *name; unsigned int idx; @@ -1713,7 +1344,7 @@ static const struct memory_stat memory_stats[] = { { "sock", MEMCG_SOCK }, { "vmalloc", MEMCG_VMALLOC }, { "shmem", NR_SHMEM }, -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#ifdef CONFIG_ZSWAP { "zswap", MEMCG_ZSWAP_B }, { "zswapped", MEMCG_ZSWAPPED }, #endif @@ -1783,15 +1414,13 @@ static int memcg_page_state_output_unit(int item) } } -static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg, - int item) +unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item) { return memcg_page_state(memcg, item) * memcg_page_state_output_unit(item); } -static inline unsigned long memcg_page_state_local_output( - struct mem_cgroup *memcg, int item) +unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item) { return memcg_page_state_local(memcg, item) * memcg_page_state_output_unit(item); @@ -1845,20 +1474,16 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) vm_event_name(memcg_vm_event_stat[i]), memcg_events(memcg, memcg_vm_event_stat[i])); } - - /* The above should easily fit into one page */ - WARN_ON_ONCE(seq_buf_has_overflowed(s)); } -static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s); - static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) { if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) memcg_stat_format(memcg, s); else memcg1_stat_format(memcg, s); - WARN_ON_ONCE(seq_buf_has_overflowed(s)); + if (seq_buf_has_overflowed(s)) + pr_warn("%s: Warning, stat buffer overflow, please report\n", __func__); } /** @@ -1906,6 +1531,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) pr_info("swap: usage %llukB, limit %llukB, failcnt %lu\n", K((u64)page_counter_read(&memcg->swap)), K((u64)READ_ONCE(memcg->swap.max)), memcg->swap.failcnt); +#ifdef CONFIG_MEMCG_V1 else { pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n", K((u64)page_counter_read(&memcg->memsw)), @@ -1914,6 +1540,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) K((u64)page_counter_read(&memcg->kmem)), K((u64)memcg->kmem.max), memcg->kmem.failcnt); } +#endif pr_info("Memory cgroup stats for "); pr_cont_cgroup_path(memcg->css.cgroup); @@ -1979,180 +1606,6 @@ unlock: return ret; } -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, - pg_data_t *pgdat, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - struct mem_cgroup *victim = NULL; - int total = 0; - int loop = 0; - unsigned long excess; - unsigned long nr_scanned; - struct mem_cgroup_reclaim_cookie reclaim = { - .pgdat = pgdat, - }; - - excess = soft_limit_excess(root_memcg); - - while (1) { - victim = mem_cgroup_iter(root_memcg, victim, &reclaim); - if (!victim) { - loop++; - if (loop >= 2) { - /* - * If we have not been able to reclaim - * anything, it might because there are - * no reclaimable pages under this hierarchy - */ - if (!total) - break; - /* - * We want to do more targeted reclaim. - * excess >> 2 is not to excessive so as to - * reclaim too much, nor too less that we keep - * coming back to reclaim from this cgroup - */ - if (total >= (excess >> 2) || - (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) - break; - } - continue; - } - total += mem_cgroup_shrink_node(victim, gfp_mask, false, - pgdat, &nr_scanned); - *total_scanned += nr_scanned; - if (!soft_limit_excess(root_memcg)) - break; - } - mem_cgroup_iter_break(root_memcg, victim); - return total; -} - -#ifdef CONFIG_LOCKDEP -static struct lockdep_map memcg_oom_lock_dep_map = { - .name = "memcg_oom_lock", -}; -#endif - -static DEFINE_SPINLOCK(memcg_oom_lock); - -/* - * Check OOM-Killer is already running under our hierarchy. - * If someone is running, return false. - */ -static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg) -{ - struct mem_cgroup *iter, *failed = NULL; - - spin_lock(&memcg_oom_lock); - - for_each_mem_cgroup_tree(iter, memcg) { - if (iter->oom_lock) { - /* - * this subtree of our hierarchy is already locked - * so we cannot give a lock. - */ - failed = iter; - mem_cgroup_iter_break(memcg, iter); - break; - } else - iter->oom_lock = true; - } - - if (failed) { - /* - * OK, we failed to lock the whole subtree so we have - * to clean up what we set up to the failing subtree - */ - for_each_mem_cgroup_tree(iter, memcg) { - if (iter == failed) { - mem_cgroup_iter_break(memcg, iter); - break; - } - iter->oom_lock = false; - } - } else - mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_); - - spin_unlock(&memcg_oom_lock); - - return !failed; -} - -static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg) -{ - struct mem_cgroup *iter; - - spin_lock(&memcg_oom_lock); - mutex_release(&memcg_oom_lock_dep_map, _RET_IP_); - for_each_mem_cgroup_tree(iter, memcg) - iter->oom_lock = false; - spin_unlock(&memcg_oom_lock); -} - -static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg) -{ - struct mem_cgroup *iter; - - spin_lock(&memcg_oom_lock); - for_each_mem_cgroup_tree(iter, memcg) - iter->under_oom++; - spin_unlock(&memcg_oom_lock); -} - -static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg) -{ - struct mem_cgroup *iter; - - /* - * Be careful about under_oom underflows because a child memcg - * could have been added after mem_cgroup_mark_under_oom. - */ - spin_lock(&memcg_oom_lock); - for_each_mem_cgroup_tree(iter, memcg) - if (iter->under_oom > 0) - iter->under_oom--; - spin_unlock(&memcg_oom_lock); -} - -static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq); - -struct oom_wait_info { - struct mem_cgroup *memcg; - wait_queue_entry_t wait; -}; - -static int memcg_oom_wake_function(wait_queue_entry_t *wait, - unsigned mode, int sync, void *arg) -{ - struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg; - struct mem_cgroup *oom_wait_memcg; - struct oom_wait_info *oom_wait_info; - - oom_wait_info = container_of(wait, struct oom_wait_info, wait); - oom_wait_memcg = oom_wait_info->memcg; - - if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) && - !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg)) - return 0; - return autoremove_wake_function(wait, mode, sync, arg); -} - -static void memcg_oom_recover(struct mem_cgroup *memcg) -{ - /* - * For the following lockless ->under_oom test, the only required - * guarantee is that it must see the state asserted by an OOM when - * this function is called as a result of userland actions - * triggered by the notification of the OOM. This is trivially - * achieved by invoking mem_cgroup_mark_under_oom() before - * triggering notification. - */ - if (memcg && memcg->under_oom) - __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); -} - /* * Returns true if successfully killed one or more processes. Though in some * corner cases it can return true even without killing any process. @@ -2166,104 +1619,16 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) memcg_memory_event(memcg, MEMCG_OOM); - /* - * We are in the middle of the charge context here, so we - * don't want to block when potentially sitting on a callstack - * that holds all kinds of filesystem and mm locks. - * - * cgroup1 allows disabling the OOM killer and waiting for outside - * handling until the charge can succeed; remember the context and put - * the task to sleep at the end of the page fault when all locks are - * released. - * - * On the other hand, in-kernel OOM killer allows for an async victim - * memory reclaim (oom_reaper) and that means that we are not solely - * relying on the oom victim to make a forward progress and we can - * invoke the oom killer here. - * - * Please note that mem_cgroup_out_of_memory might fail to find a - * victim and then we have to bail out from the charge path. - */ - if (READ_ONCE(memcg->oom_kill_disable)) { - if (current->in_user_fault) { - css_get(&memcg->css); - current->memcg_in_oom = memcg; - } + if (!memcg1_oom_prepare(memcg, &locked)) return false; - } - mem_cgroup_mark_under_oom(memcg); - - locked = mem_cgroup_oom_trylock(memcg); - - if (locked) - mem_cgroup_oom_notify(memcg); - - mem_cgroup_unmark_under_oom(memcg); ret = mem_cgroup_out_of_memory(memcg, mask, order); - if (locked) - mem_cgroup_oom_unlock(memcg); + memcg1_oom_finish(memcg, locked); return ret; } -/** - * mem_cgroup_oom_synchronize - complete memcg OOM handling - * @handle: actually kill/wait or just clean up the OOM state - * - * This has to be called at the end of a page fault if the memcg OOM - * handler was enabled. - * - * Memcg supports userspace OOM handling where failed allocations must - * sleep on a waitqueue until the userspace task resolves the - * situation. Sleeping directly in the charge context with all kinds - * of locks held is not a good idea, instead we remember an OOM state - * in the task and mem_cgroup_oom_synchronize() has to be called at - * the end of the page fault to complete the OOM handling. - * - * Returns %true if an ongoing memcg OOM situation was detected and - * completed, %false otherwise. - */ -bool mem_cgroup_oom_synchronize(bool handle) -{ - struct mem_cgroup *memcg = current->memcg_in_oom; - struct oom_wait_info owait; - bool locked; - - /* OOM is global, do not handle */ - if (!memcg) - return false; - - if (!handle) - goto cleanup; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.entry); - - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); - mem_cgroup_mark_under_oom(memcg); - - locked = mem_cgroup_oom_trylock(memcg); - - if (locked) - mem_cgroup_oom_notify(memcg); - - schedule(); - mem_cgroup_unmark_under_oom(memcg); - finish_wait(&memcg_oom_waitq, &owait.wait); - - if (locked) - mem_cgroup_oom_unlock(memcg); -cleanup: - current->memcg_in_oom = NULL; - css_put(&memcg->css); - return true; -} - /** * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM * @victim: task to be killed by the OOM killer @@ -2328,99 +1693,16 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) pr_cont(" are going to be killed due to memory.oom.group set\n"); } -/** - * folio_memcg_lock - Bind a folio to its memcg. - * @folio: The folio. - * - * This function prevents unlocked LRU folios from being moved to - * another cgroup. - * - * It ensures lifetime of the bound memcg. The caller is responsible - * for the lifetime of the folio. - */ -void folio_memcg_lock(struct folio *folio) -{ - struct mem_cgroup *memcg; - unsigned long flags; - - /* - * The RCU lock is held throughout the transaction. The fast - * path can get away without acquiring the memcg->move_lock - * because page moving starts with an RCU grace period. - */ - rcu_read_lock(); - - if (mem_cgroup_disabled()) - return; -again: - memcg = folio_memcg(folio); - if (unlikely(!memcg)) - return; - -#ifdef CONFIG_PROVE_LOCKING - local_irq_save(flags); - might_lock(&memcg->move_lock); - local_irq_restore(flags); -#endif - - if (atomic_read(&memcg->moving_account) <= 0) - return; - - spin_lock_irqsave(&memcg->move_lock, flags); - if (memcg != folio_memcg(folio)) { - spin_unlock_irqrestore(&memcg->move_lock, flags); - goto again; - } - - /* - * When charge migration first begins, we can have multiple - * critical sections holding the fast-path RCU lock and one - * holding the slowpath move_lock. Track the task who has the - * move_lock for folio_memcg_unlock(). - */ - memcg->move_lock_task = current; - memcg->move_lock_flags = flags; -} - -static void __folio_memcg_unlock(struct mem_cgroup *memcg) -{ - if (memcg && memcg->move_lock_task == current) { - unsigned long flags = memcg->move_lock_flags; - - memcg->move_lock_task = NULL; - memcg->move_lock_flags = 0; - - spin_unlock_irqrestore(&memcg->move_lock, flags); - } - - rcu_read_unlock(); -} - -/** - * folio_memcg_unlock - Release the binding between a folio and its memcg. - * @folio: The folio. - * - * This releases the binding created by folio_memcg_lock(). This does - * not change the accounting of this folio to its memcg, but it does - * permit others to change it. - */ -void folio_memcg_unlock(struct folio *folio) -{ - __folio_memcg_unlock(folio_memcg(folio)); -} - struct memcg_stock_pcp { local_lock_t stock_lock; struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; -#ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; -#endif struct work_struct work; unsigned long flags; @@ -2431,26 +1713,9 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { }; static DEFINE_MUTEX(percpu_charge_mutex); -#ifdef CONFIG_MEMCG_KMEM static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock); static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg); -static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages); - -#else -static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) -{ - return NULL; -} -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, - struct mem_cgroup *root_memcg) -{ - return false; -} -static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages) -{ -} -#endif /** * consume_stock: Try to consume stocked charge on this cpu. @@ -2567,7 +1832,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) * Drains all per-CPU charge caches for given root_memcg resp. subtree * of the hierarchy under it. */ -static void drain_all_stock(struct mem_cgroup *root_memcg) +void drain_all_stock(struct mem_cgroup *root_memcg) { int cpu, curcpu; @@ -2636,7 +1901,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP); + MEMCG_RECLAIM_MAY_SWAP, + NULL); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2887,8 +2153,8 @@ out: css_put(&memcg->css); } -static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages) +int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, + unsigned int nr_pages) { unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages); int nr_retries = MAX_RECLAIM_RETRIES; @@ -2942,7 +2208,7 @@ retry: psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options); + gfp_mask, reclaim_options, NULL); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -2971,7 +2237,7 @@ retry: * At task move, charge accounts can be doubly counted. So, it's * better to wait until the end of task_move if something is going on. */ - if (mem_cgroup_wait_acct_move(mem_over_limit)) + if (memcg1_wait_acct_move(mem_over_limit)) goto retry; if (nr_retries--) @@ -3083,15 +2349,6 @@ done_restock: return 0; } -static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages) -{ - if (mem_cgroup_is_root(memcg)) - return 0; - - return try_charge_memcg(memcg, gfp_mask, nr_pages); -} - /** * mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call. * @memcg: memcg previously charged. @@ -3134,12 +2391,10 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg) local_irq_disable(); mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio)); - memcg_check_events(memcg, folio_nid(folio)); + memcg1_check_events(memcg, folio_nid(folio)); local_irq_enable(); } -#ifdef CONFIG_MEMCG_KMEM - static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) @@ -3367,18 +2622,6 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) return objcg; } -static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages) -{ - mod_memcg_state(memcg, MEMCG_KMEM, nr_pages); - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { - if (nr_pages > 0) - page_counter_charge(&memcg->kmem, nr_pages); - else - page_counter_uncharge(&memcg->kmem, -nr_pages); - } -} - - /* * obj_cgroup_uncharge_pages: uncharge a number of kernel pages from a objcg * @objcg: object cgroup to uncharge @@ -3391,7 +2634,8 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, memcg = get_mem_cgroup_from_objcg(objcg); - memcg_account_kmem(memcg, -nr_pages); + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); + memcg1_account_kmem(memcg, -nr_pages); refill_stock(memcg, nr_pages); css_put(&memcg->css); @@ -3417,7 +2661,8 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp, if (ret) goto out; - memcg_account_kmem(memcg, nr_pages); + mod_memcg_state(memcg, MEMCG_KMEM, nr_pages); + memcg1_account_kmem(memcg, nr_pages); out: css_put(&memcg->css); @@ -3570,7 +2815,8 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) memcg = get_mem_cgroup_from_objcg(old); - memcg_account_kmem(memcg, -nr_pages); + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); + memcg1_account_kmem(memcg, -nr_pages); __refill_stock(memcg, nr_pages); css_put(&memcg->css); @@ -3804,10 +3050,9 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, obj_cgroup_put(objcg); } } -#endif /* CONFIG_MEMCG_KMEM */ /* - * Because page_memcg(head) is not set on tails, set it now. + * Because folio_memcg(head) is not set on tails, set it now. */ void split_page_memcg(struct page *head, int old_order, int new_order) { @@ -3829,240 +3074,7 @@ void split_page_memcg(struct page *head, int old_order, int new_order) css_get_many(&memcg->css, old_nr / new_nr - 1); } -#ifdef CONFIG_SWAP -/** - * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record. - * @entry: swap entry to be moved - * @from: mem_cgroup which the entry is moved from - * @to: mem_cgroup which the entry is moved to - * - * It succeeds only when the swap_cgroup's record for this entry is the same - * as the mem_cgroup's id of @from. - * - * Returns 0 on success, -EINVAL on failure. - * - * The caller must have charged to @to, IOW, called page_counter_charge() about - * both res and memsw, and called css_get(). - */ -static int mem_cgroup_move_swap_account(swp_entry_t entry, - struct mem_cgroup *from, struct mem_cgroup *to) -{ - unsigned short old_id, new_id; - - old_id = mem_cgroup_id(from); - new_id = mem_cgroup_id(to); - - if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) { - mod_memcg_state(from, MEMCG_SWAP, -1); - mod_memcg_state(to, MEMCG_SWAP, 1); - return 0; - } - return -EINVAL; -} -#else -static inline int mem_cgroup_move_swap_account(swp_entry_t entry, - struct mem_cgroup *from, struct mem_cgroup *to) -{ - return -EINVAL; -} -#endif - -static DEFINE_MUTEX(memcg_max_mutex); - -static int mem_cgroup_resize_max(struct mem_cgroup *memcg, - unsigned long max, bool memsw) -{ - bool enlarge = false; - bool drained = false; - int ret; - bool limits_invariant; - struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory; - - do { - if (signal_pending(current)) { - ret = -EINTR; - break; - } - - mutex_lock(&memcg_max_mutex); - /* - * Make sure that the new limit (memsw or memory limit) doesn't - * break our basic invariant rule memory.max <= memsw.max. - */ - limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) : - max <= memcg->memsw.max; - if (!limits_invariant) { - mutex_unlock(&memcg_max_mutex); - ret = -EINVAL; - break; - } - if (max > counter->max) - enlarge = true; - ret = page_counter_set_max(counter, max); - mutex_unlock(&memcg_max_mutex); - - if (!ret) - break; - - if (!drained) { - drain_all_stock(memcg); - drained = true; - continue; - } - - if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { - ret = -EBUSY; - break; - } - } while (true); - - if (!ret && enlarge) - memcg_oom_recover(memcg); - - return ret; -} - -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, - gfp_t gfp_mask, - unsigned long *total_scanned) -{ - unsigned long nr_reclaimed = 0; - struct mem_cgroup_per_node *mz, *next_mz = NULL; - unsigned long reclaimed; - int loop = 0; - struct mem_cgroup_tree_per_node *mctz; - unsigned long excess; - - if (lru_gen_enabled()) - return 0; - - if (order > 0) - return 0; - - mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id]; - - /* - * Do not even bother to check the largest node if the root - * is empty. Do it lockless to prevent lock bouncing. Races - * are acceptable as soft limit is best effort anyway. - */ - if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root)) - return 0; - - /* - * This loop can run a while, specially if mem_cgroup's continuously - * keep exceeding their soft limit and putting the system under - * pressure - */ - do { - if (next_mz) - mz = next_mz; - else - mz = mem_cgroup_largest_soft_limit_node(mctz); - if (!mz) - break; - - reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat, - gfp_mask, total_scanned); - nr_reclaimed += reclaimed; - spin_lock_irq(&mctz->lock); - - /* - * If we failed to reclaim anything from this memory cgroup - * it is time to move on to the next cgroup - */ - next_mz = NULL; - if (!reclaimed) - next_mz = __mem_cgroup_largest_soft_limit_node(mctz); - - excess = soft_limit_excess(mz->memcg); - /* - * One school of thought says that we should not add - * back the node to the tree if reclaim returns 0. - * But our reclaim could return 0, simply because due - * to priority we are exposing a smaller subset of - * memory to reclaim from. Consider this as a longer - * term TODO. - */ - /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz, mctz, excess); - spin_unlock_irq(&mctz->lock); - css_put(&mz->memcg->css); - loop++; - /* - * Could not reclaim anything and there are no more - * mem cgroups to try or we seem to be looping without - * reclaiming anything. - */ - if (!nr_reclaimed && - (next_mz == NULL || - loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS)) - break; - } while (!nr_reclaimed); - if (next_mz) - css_put(&next_mz->memcg->css); - return nr_reclaimed; -} - -/* - * Reclaims as many pages from the given memcg as possible. - * - * Caller is responsible for holding css reference for memcg. - */ -static int mem_cgroup_force_empty(struct mem_cgroup *memcg) -{ - int nr_retries = MAX_RECLAIM_RETRIES; - - /* we call try-to-free pages for make this cgroup empty */ - lru_add_drain_all(); - - drain_all_stock(memcg); - - /* try to free all pages in this cgroup */ - while (nr_retries && page_counter_read(&memcg->memory)) { - if (signal_pending(current)) - return -EINTR; - - if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP)) - nr_retries--; - } - - return 0; -} - -static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); - - if (mem_cgroup_is_root(memcg)) - return -EINVAL; - return mem_cgroup_force_empty(memcg) ?: nbytes; -} - -static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css, - struct cftype *cft) -{ - return 1; -} - -static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, - struct cftype *cft, u64 val) -{ - if (val == 1) - return 0; - - pr_warn_once("Non-hierarchical mode is deprecated. " - "Please report your usecase to linux-mm@kvack.org if you " - "depend on this functionality.\n"); - - return -EINVAL; -} - -static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) { unsigned long val; @@ -4084,68 +3096,6 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) return val; } -enum { - RES_USAGE, - RES_LIMIT, - RES_MAX_USAGE, - RES_FAILCNT, - RES_SOFT_LIMIT, -}; - -static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, - struct cftype *cft) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct page_counter *counter; - - switch (MEMFILE_TYPE(cft->private)) { - case _MEM: - counter = &memcg->memory; - break; - case _MEMSWAP: - counter = &memcg->memsw; - break; - case _KMEM: - counter = &memcg->kmem; - break; - case _TCP: - counter = &memcg->tcpmem; - break; - default: - BUG(); - } - - switch (MEMFILE_ATTR(cft->private)) { - case RES_USAGE: - if (counter == &memcg->memory) - return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE; - if (counter == &memcg->memsw) - return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE; - return (u64)page_counter_read(counter) * PAGE_SIZE; - case RES_LIMIT: - return (u64)counter->max * PAGE_SIZE; - case RES_MAX_USAGE: - return (u64)counter->watermark * PAGE_SIZE; - case RES_FAILCNT: - return counter->failcnt; - case RES_SOFT_LIMIT: - return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE; - default: - BUG(); - } -} - -/* - * This function doesn't do anything useful. Its only job is to provide a read - * handler for a file so that cgroup_file_mode() will add read permissions. - */ -static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m, - __always_unused void *v) -{ - return -EINVAL; -} - -#ifdef CONFIG_MEMCG_KMEM static int memcg_online_kmem(struct mem_cgroup *memcg) { struct obj_cgroup *objcg; @@ -4196,760 +3146,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) */ memcg_reparent_list_lrus(memcg, parent); } -#else -static int memcg_online_kmem(struct mem_cgroup *memcg) -{ - return 0; -} -static void memcg_offline_kmem(struct mem_cgroup *memcg) -{ -} -#endif /* CONFIG_MEMCG_KMEM */ - -static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max) -{ - int ret; - - mutex_lock(&memcg_max_mutex); - - ret = page_counter_set_max(&memcg->tcpmem, max); - if (ret) - goto out; - - if (!memcg->tcpmem_active) { - /* - * The active flag needs to be written after the static_key - * update. This is what guarantees that the socket activation - * function is the last one to run. See mem_cgroup_sk_alloc() - * for details, and note that we don't mark any socket as - * belonging to this memcg until that flag is up. - * - * We need to do this, because static_keys will span multiple - * sites, but we can't control their order. If we mark a socket - * as accounted, but the accounting functions are not patched in - * yet, we'll lose accounting. - * - * We never race with the readers in mem_cgroup_sk_alloc(), - * because when this value change, the code to process it is not - * patched in yet. - */ - static_branch_inc(&memcg_sockets_enabled_key); - memcg->tcpmem_active = true; - } -out: - mutex_unlock(&memcg_max_mutex); - return ret; -} - -/* - * The user of this function is... - * RES_LIMIT. - */ -static ssize_t mem_cgroup_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); - unsigned long nr_pages; - int ret; - - buf = strstrip(buf); - ret = page_counter_memparse(buf, "-1", &nr_pages); - if (ret) - return ret; - - switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_LIMIT: - if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */ - ret = -EINVAL; - break; - } - switch (MEMFILE_TYPE(of_cft(of)->private)) { - case _MEM: - ret = mem_cgroup_resize_max(memcg, nr_pages, false); - break; - case _MEMSWAP: - ret = mem_cgroup_resize_max(memcg, nr_pages, true); - break; - case _KMEM: - pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. " - "Writing any value to this file has no effect. " - "Please report your usecase to linux-mm@kvack.org if you " - "depend on this functionality.\n"); - ret = 0; - break; - case _TCP: - ret = memcg_update_tcp_max(memcg, nr_pages); - break; - } - break; - case RES_SOFT_LIMIT: - if (IS_ENABLED(CONFIG_PREEMPT_RT)) { - ret = -EOPNOTSUPP; - } else { - WRITE_ONCE(memcg->soft_limit, nr_pages); - ret = 0; - } - break; - } - return ret ?: nbytes; -} - -static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); - struct page_counter *counter; - - switch (MEMFILE_TYPE(of_cft(of)->private)) { - case _MEM: - counter = &memcg->memory; - break; - case _MEMSWAP: - counter = &memcg->memsw; - break; - case _KMEM: - counter = &memcg->kmem; - break; - case _TCP: - counter = &memcg->tcpmem; - break; - default: - BUG(); - } - - switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_MAX_USAGE: - page_counter_reset_watermark(counter); - break; - case RES_FAILCNT: - counter->failcnt = 0; - break; - default: - BUG(); - } - - return nbytes; -} - -static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css, - struct cftype *cft) -{ - return mem_cgroup_from_css(css)->move_charge_at_immigrate; -} - -#ifdef CONFIG_MMU -static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, - struct cftype *cft, u64 val) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - - pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. " - "Please report your usecase to linux-mm@kvack.org if you " - "depend on this functionality.\n"); - - if (val & ~MOVE_MASK) - return -EINVAL; - - /* - * No kind of locking is needed in here, because ->can_attach() will - * check this value once in the beginning of the process, and then carry - * on with stale data. This means that changes to this value will only - * affect task migrations starting after the change. - */ - memcg->move_charge_at_immigrate = val; - return 0; -} -#else -static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, - struct cftype *cft, u64 val) -{ - return -ENOSYS; -} -#endif - -#ifdef CONFIG_NUMA - -#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)) -#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)) -#define LRU_ALL ((1 << NR_LRU_LISTS) - 1) - -static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, - int nid, unsigned int lru_mask, bool tree) -{ - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); - unsigned long nr = 0; - enum lru_list lru; - - VM_BUG_ON((unsigned)nid >= nr_node_ids); - - for_each_lru(lru) { - if (!(BIT(lru) & lru_mask)) - continue; - if (tree) - nr += lruvec_page_state(lruvec, NR_LRU_BASE + lru); - else - nr += lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); - } - return nr; -} - -static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg, - unsigned int lru_mask, - bool tree) -{ - unsigned long nr = 0; - enum lru_list lru; - - for_each_lru(lru) { - if (!(BIT(lru) & lru_mask)) - continue; - if (tree) - nr += memcg_page_state(memcg, NR_LRU_BASE + lru); - else - nr += memcg_page_state_local(memcg, NR_LRU_BASE + lru); - } - return nr; -} - -static int memcg_numa_stat_show(struct seq_file *m, void *v) -{ - struct numa_stat { - const char *name; - unsigned int lru_mask; - }; - - static const struct numa_stat stats[] = { - { "total", LRU_ALL }, - { "file", LRU_ALL_FILE }, - { "anon", LRU_ALL_ANON }, - { "unevictable", BIT(LRU_UNEVICTABLE) }, - }; - const struct numa_stat *stat; - int nid; - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - - mem_cgroup_flush_stats(memcg); - - for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { - seq_printf(m, "%s=%lu", stat->name, - mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, - false)); - for_each_node_state(nid, N_MEMORY) - seq_printf(m, " N%d=%lu", nid, - mem_cgroup_node_nr_lru_pages(memcg, nid, - stat->lru_mask, false)); - seq_putc(m, '\n'); - } - - for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { - - seq_printf(m, "hierarchical_%s=%lu", stat->name, - mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, - true)); - for_each_node_state(nid, N_MEMORY) - seq_printf(m, " N%d=%lu", nid, - mem_cgroup_node_nr_lru_pages(memcg, nid, - stat->lru_mask, true)); - seq_putc(m, '\n'); - } - - return 0; -} -#endif /* CONFIG_NUMA */ - -static const unsigned int memcg1_stats[] = { - NR_FILE_PAGES, - NR_ANON_MAPPED, -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - NR_ANON_THPS, -#endif - NR_SHMEM, - NR_FILE_MAPPED, - NR_FILE_DIRTY, - NR_WRITEBACK, - WORKINGSET_REFAULT_ANON, - WORKINGSET_REFAULT_FILE, -#ifdef CONFIG_SWAP - MEMCG_SWAP, - NR_SWAPCACHE, -#endif -}; - -static const char *const memcg1_stat_names[] = { - "cache", - "rss", -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - "rss_huge", -#endif - "shmem", - "mapped_file", - "dirty", - "writeback", - "workingset_refault_anon", - "workingset_refault_file", -#ifdef CONFIG_SWAP - "swap", - "swapcached", -#endif -}; - -/* Universal VM events cgroup1 shows, original sort order */ -static const unsigned int memcg1_events[] = { - PGPGIN, - PGPGOUT, - PGFAULT, - PGMAJFAULT, -}; - -static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) -{ - unsigned long memory, memsw; - struct mem_cgroup *mi; - unsigned int i; - - BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); - - mem_cgroup_flush_stats(memcg); - - for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { - unsigned long nr; - - nr = memcg_page_state_local_output(memcg, memcg1_stats[i]); - seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr); - } - - for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) - seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]), - memcg_events_local(memcg, memcg1_events[i])); - - for (i = 0; i < NR_LRU_LISTS; i++) - seq_buf_printf(s, "%s %lu\n", lru_list_name(i), - memcg_page_state_local(memcg, NR_LRU_BASE + i) * - PAGE_SIZE); - - /* Hierarchical information */ - memory = memsw = PAGE_COUNTER_MAX; - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) { - memory = min(memory, READ_ONCE(mi->memory.max)); - memsw = min(memsw, READ_ONCE(mi->memsw.max)); - } - seq_buf_printf(s, "hierarchical_memory_limit %llu\n", - (u64)memory * PAGE_SIZE); - seq_buf_printf(s, "hierarchical_memsw_limit %llu\n", - (u64)memsw * PAGE_SIZE); - - for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { - unsigned long nr; - - nr = memcg_page_state_output(memcg, memcg1_stats[i]); - seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i], - (u64)nr); - } - - for (i = 0; i < ARRAY_SIZE(memcg1_events); i++) - seq_buf_printf(s, "total_%s %llu\n", - vm_event_name(memcg1_events[i]), - (u64)memcg_events(memcg, memcg1_events[i])); - - for (i = 0; i < NR_LRU_LISTS; i++) - seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i), - (u64)memcg_page_state(memcg, NR_LRU_BASE + i) * - PAGE_SIZE); - -#ifdef CONFIG_DEBUG_VM - { - pg_data_t *pgdat; - struct mem_cgroup_per_node *mz; - unsigned long anon_cost = 0; - unsigned long file_cost = 0; - - for_each_online_pgdat(pgdat) { - mz = memcg->nodeinfo[pgdat->node_id]; - - anon_cost += mz->lruvec.anon_cost; - file_cost += mz->lruvec.file_cost; - } - seq_buf_printf(s, "anon_cost %lu\n", anon_cost); - seq_buf_printf(s, "file_cost %lu\n", file_cost); - } -#endif -} - -static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css, - struct cftype *cft) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - - return mem_cgroup_swappiness(memcg); -} - -static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css, - struct cftype *cft, u64 val) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - - if (val > 200) - return -EINVAL; - - if (!mem_cgroup_is_root(memcg)) - WRITE_ONCE(memcg->swappiness, val); - else - WRITE_ONCE(vm_swappiness, val); - - return 0; -} - -static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) -{ - struct mem_cgroup_threshold_ary *t; - unsigned long usage; - int i; - - rcu_read_lock(); - if (!swap) - t = rcu_dereference(memcg->thresholds.primary); - else - t = rcu_dereference(memcg->memsw_thresholds.primary); - - if (!t) - goto unlock; - - usage = mem_cgroup_usage(memcg, swap); - - /* - * current_threshold points to threshold just below or equal to usage. - * If it's not true, a threshold was crossed after last - * call of __mem_cgroup_threshold(). - */ - i = t->current_threshold; - - /* - * Iterate backward over array of thresholds starting from - * current_threshold and check if a threshold is crossed. - * If none of thresholds below usage is crossed, we read - * only one element of the array here. - */ - for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--) - eventfd_signal(t->entries[i].eventfd); - - /* i = current_threshold + 1 */ - i++; - - /* - * Iterate forward over array of thresholds starting from - * current_threshold+1 and check if a threshold is crossed. - * If none of thresholds above usage is crossed, we read - * only one element of the array here. - */ - for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++) - eventfd_signal(t->entries[i].eventfd); - - /* Update current_threshold */ - t->current_threshold = i - 1; -unlock: - rcu_read_unlock(); -} - -static void mem_cgroup_threshold(struct mem_cgroup *memcg) -{ - while (memcg) { - __mem_cgroup_threshold(memcg, false); - if (do_memsw_account()) - __mem_cgroup_threshold(memcg, true); - - memcg = parent_mem_cgroup(memcg); - } -} - -static int compare_thresholds(const void *a, const void *b) -{ - const struct mem_cgroup_threshold *_a = a; - const struct mem_cgroup_threshold *_b = b; - - if (_a->threshold > _b->threshold) - return 1; - - if (_a->threshold < _b->threshold) - return -1; - - return 0; -} - -static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) -{ - struct mem_cgroup_eventfd_list *ev; - - spin_lock(&memcg_oom_lock); - - list_for_each_entry(ev, &memcg->oom_notify, list) - eventfd_signal(ev->eventfd); - - spin_unlock(&memcg_oom_lock); - return 0; -} - -static void mem_cgroup_oom_notify(struct mem_cgroup *memcg) -{ - struct mem_cgroup *iter; - - for_each_mem_cgroup_tree(iter, memcg) - mem_cgroup_oom_notify_cb(iter); -} - -static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, const char *args, enum res_type type) -{ - struct mem_cgroup_thresholds *thresholds; - struct mem_cgroup_threshold_ary *new; - unsigned long threshold; - unsigned long usage; - int i, size, ret; - - ret = page_counter_memparse(args, "-1", &threshold); - if (ret) - return ret; - - mutex_lock(&memcg->thresholds_lock); - - if (type == _MEM) { - thresholds = &memcg->thresholds; - usage = mem_cgroup_usage(memcg, false); - } else if (type == _MEMSWAP) { - thresholds = &memcg->memsw_thresholds; - usage = mem_cgroup_usage(memcg, true); - } else - BUG(); - - /* Check if a threshold crossed before adding a new one */ - if (thresholds->primary) - __mem_cgroup_threshold(memcg, type == _MEMSWAP); - - size = thresholds->primary ? thresholds->primary->size + 1 : 1; - - /* Allocate memory for new array of thresholds */ - new = kmalloc(struct_size(new, entries, size), GFP_KERNEL); - if (!new) { - ret = -ENOMEM; - goto unlock; - } - new->size = size; - - /* Copy thresholds (if any) to new array */ - if (thresholds->primary) - memcpy(new->entries, thresholds->primary->entries, - flex_array_size(new, entries, size - 1)); - - /* Add new threshold */ - new->entries[size - 1].eventfd = eventfd; - new->entries[size - 1].threshold = threshold; - - /* Sort thresholds. Registering of new threshold isn't time-critical */ - sort(new->entries, size, sizeof(*new->entries), - compare_thresholds, NULL); - - /* Find current threshold */ - new->current_threshold = -1; - for (i = 0; i < size; i++) { - if (new->entries[i].threshold <= usage) { - /* - * new->current_threshold will not be used until - * rcu_assign_pointer(), so it's safe to increment - * it here. - */ - ++new->current_threshold; - } else - break; - } - - /* Free old spare buffer and save old primary buffer as spare */ - kfree(thresholds->spare); - thresholds->spare = thresholds->primary; - - rcu_assign_pointer(thresholds->primary, new); - - /* To be sure that nobody uses thresholds */ - synchronize_rcu(); - -unlock: - mutex_unlock(&memcg->thresholds_lock); - - return ret; -} - -static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, const char *args) -{ - return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM); -} - -static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, const char *args) -{ - return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP); -} - -static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, enum res_type type) -{ - struct mem_cgroup_thresholds *thresholds; - struct mem_cgroup_threshold_ary *new; - unsigned long usage; - int i, j, size, entries; - - mutex_lock(&memcg->thresholds_lock); - - if (type == _MEM) { - thresholds = &memcg->thresholds; - usage = mem_cgroup_usage(memcg, false); - } else if (type == _MEMSWAP) { - thresholds = &memcg->memsw_thresholds; - usage = mem_cgroup_usage(memcg, true); - } else - BUG(); - - if (!thresholds->primary) - goto unlock; - - /* Check if a threshold crossed before removing */ - __mem_cgroup_threshold(memcg, type == _MEMSWAP); - - /* Calculate new number of threshold */ - size = entries = 0; - for (i = 0; i < thresholds->primary->size; i++) { - if (thresholds->primary->entries[i].eventfd != eventfd) - size++; - else - entries++; - } - - new = thresholds->spare; - - /* If no items related to eventfd have been cleared, nothing to do */ - if (!entries) - goto unlock; - - /* Set thresholds array to NULL if we don't have thresholds */ - if (!size) { - kfree(new); - new = NULL; - goto swap_buffers; - } - - new->size = size; - - /* Copy thresholds and find current threshold */ - new->current_threshold = -1; - for (i = 0, j = 0; i < thresholds->primary->size; i++) { - if (thresholds->primary->entries[i].eventfd == eventfd) - continue; - - new->entries[j] = thresholds->primary->entries[i]; - if (new->entries[j].threshold <= usage) { - /* - * new->current_threshold will not be used - * until rcu_assign_pointer(), so it's safe to increment - * it here. - */ - ++new->current_threshold; - } - j++; - } - -swap_buffers: - /* Swap primary and spare array */ - thresholds->spare = thresholds->primary; - - rcu_assign_pointer(thresholds->primary, new); - - /* To be sure that nobody uses thresholds */ - synchronize_rcu(); - - /* If all events are unregistered, free the spare array */ - if (!new) { - kfree(thresholds->spare); - thresholds->spare = NULL; - } -unlock: - mutex_unlock(&memcg->thresholds_lock); -} - -static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd) -{ - return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM); -} - -static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd) -{ - return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP); -} - -static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd, const char *args) -{ - struct mem_cgroup_eventfd_list *event; - - event = kmalloc(sizeof(*event), GFP_KERNEL); - if (!event) - return -ENOMEM; - - spin_lock(&memcg_oom_lock); - - event->eventfd = eventfd; - list_add(&event->list, &memcg->oom_notify); - - /* already in OOM ? */ - if (memcg->under_oom) - eventfd_signal(eventfd); - spin_unlock(&memcg_oom_lock); - - return 0; -} - -static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg, - struct eventfd_ctx *eventfd) -{ - struct mem_cgroup_eventfd_list *ev, *tmp; - - spin_lock(&memcg_oom_lock); - - list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) { - if (ev->eventfd == eventfd) { - list_del(&ev->list); - kfree(ev); - } - } - - spin_unlock(&memcg_oom_lock); -} - -static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) -{ - struct mem_cgroup *memcg = mem_cgroup_from_seq(sf); - - seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable)); - seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); - seq_printf(sf, "oom_kill %lu\n", - atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); - return 0; -} - -static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, - struct cftype *cft, u64 val) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - - /* cannot set to root cgroup and only 0 and 1 are allowed */ - if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) - return -EINVAL; - - WRITE_ONCE(memcg->oom_kill_disable, val); - if (!val) - memcg_oom_recover(memcg); - - return 0; -} #ifdef CONFIG_CGROUP_WRITEBACK @@ -5164,384 +3360,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) #endif /* CONFIG_CGROUP_WRITEBACK */ -/* - * DO NOT USE IN NEW FILES. - * - * "cgroup.event_control" implementation. - * - * This is way over-engineered. It tries to support fully configurable - * events for each user. Such level of flexibility is completely - * unnecessary especially in the light of the planned unified hierarchy. - * - * Please deprecate this and replace with something simpler if at all - * possible. - */ - -/* - * Unregister event and free resources. - * - * Gets called from workqueue. - */ -static void memcg_event_remove(struct work_struct *work) -{ - struct mem_cgroup_event *event = - container_of(work, struct mem_cgroup_event, remove); - struct mem_cgroup *memcg = event->memcg; - - remove_wait_queue(event->wqh, &event->wait); - - event->unregister_event(memcg, event->eventfd); - - /* Notify userspace the event is going away. */ - eventfd_signal(event->eventfd); - - eventfd_ctx_put(event->eventfd); - kfree(event); - css_put(&memcg->css); -} - -/* - * Gets called on EPOLLHUP on eventfd when user closes it. - * - * Called with wqh->lock held and interrupts disabled. - */ -static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode, - int sync, void *key) -{ - struct mem_cgroup_event *event = - container_of(wait, struct mem_cgroup_event, wait); - struct mem_cgroup *memcg = event->memcg; - __poll_t flags = key_to_poll(key); - - if (flags & EPOLLHUP) { - /* - * If the event has been detached at cgroup removal, we - * can simply return knowing the other side will cleanup - * for us. - * - * We can't race against event freeing since the other - * side will require wqh->lock via remove_wait_queue(), - * which we hold. - */ - spin_lock(&memcg->event_list_lock); - if (!list_empty(&event->list)) { - list_del_init(&event->list); - /* - * We are in atomic context, but cgroup_event_remove() - * may sleep, so we have to call it in workqueue. - */ - schedule_work(&event->remove); - } - spin_unlock(&memcg->event_list_lock); - } - - return 0; -} - -static void memcg_event_ptable_queue_proc(struct file *file, - wait_queue_head_t *wqh, poll_table *pt) -{ - struct mem_cgroup_event *event = - container_of(pt, struct mem_cgroup_event, pt); - - event->wqh = wqh; - add_wait_queue(wqh, &event->wait); -} - -/* - * DO NOT USE IN NEW FILES. - * - * Parse input and register new cgroup event handler. - * - * Input must be in format ' '. - * Interpretation of args is defined by control file implementation. - */ -static ssize_t memcg_write_event_control(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct cgroup_subsys_state *css = of_css(of); - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct mem_cgroup_event *event; - struct cgroup_subsys_state *cfile_css; - unsigned int efd, cfd; - struct fd efile; - struct fd cfile; - struct dentry *cdentry; - const char *name; - char *endp; - int ret; - - if (IS_ENABLED(CONFIG_PREEMPT_RT)) - return -EOPNOTSUPP; - - buf = strstrip(buf); - - efd = simple_strtoul(buf, &endp, 10); - if (*endp != ' ') - return -EINVAL; - buf = endp + 1; - - cfd = simple_strtoul(buf, &endp, 10); - if ((*endp != ' ') && (*endp != '\0')) - return -EINVAL; - buf = endp + 1; - - event = kzalloc(sizeof(*event), GFP_KERNEL); - if (!event) - return -ENOMEM; - - event->memcg = memcg; - INIT_LIST_HEAD(&event->list); - init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc); - init_waitqueue_func_entry(&event->wait, memcg_event_wake); - INIT_WORK(&event->remove, memcg_event_remove); - - efile = fdget(efd); - if (!efile.file) { - ret = -EBADF; - goto out_kfree; - } - - event->eventfd = eventfd_ctx_fileget(efile.file); - if (IS_ERR(event->eventfd)) { - ret = PTR_ERR(event->eventfd); - goto out_put_efile; - } - - cfile = fdget(cfd); - if (!cfile.file) { - ret = -EBADF; - goto out_put_eventfd; - } - - /* the process need read permission on control file */ - /* AV: shouldn't we check that it's been opened for read instead? */ - ret = file_permission(cfile.file, MAY_READ); - if (ret < 0) - goto out_put_cfile; - - /* - * The control file must be a regular cgroup1 file. As a regular cgroup - * file can't be renamed, it's safe to access its name afterwards. - */ - cdentry = cfile.file->f_path.dentry; - if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) { - ret = -EINVAL; - goto out_put_cfile; - } - - /* - * Determine the event callbacks and set them in @event. This used - * to be done via struct cftype but cgroup core no longer knows - * about these events. The following is crude but the whole thing - * is for compatibility anyway. - * - * DO NOT ADD NEW FILES. - */ - name = cdentry->d_name.name; - - if (!strcmp(name, "memory.usage_in_bytes")) { - event->register_event = mem_cgroup_usage_register_event; - event->unregister_event = mem_cgroup_usage_unregister_event; - } else if (!strcmp(name, "memory.oom_control")) { - event->register_event = mem_cgroup_oom_register_event; - event->unregister_event = mem_cgroup_oom_unregister_event; - } else if (!strcmp(name, "memory.pressure_level")) { - event->register_event = vmpressure_register_event; - event->unregister_event = vmpressure_unregister_event; - } else if (!strcmp(name, "memory.memsw.usage_in_bytes")) { - event->register_event = memsw_cgroup_usage_register_event; - event->unregister_event = memsw_cgroup_usage_unregister_event; - } else { - ret = -EINVAL; - goto out_put_cfile; - } - - /* - * Verify @cfile should belong to @css. Also, remaining events are - * automatically removed on cgroup destruction but the removal is - * asynchronous, so take an extra ref on @css. - */ - cfile_css = css_tryget_online_from_dir(cdentry->d_parent, - &memory_cgrp_subsys); - ret = -EINVAL; - if (IS_ERR(cfile_css)) - goto out_put_cfile; - if (cfile_css != css) { - css_put(cfile_css); - goto out_put_cfile; - } - - ret = event->register_event(memcg, event->eventfd, buf); - if (ret) - goto out_put_css; - - vfs_poll(efile.file, &event->pt); - - spin_lock_irq(&memcg->event_list_lock); - list_add(&event->list, &memcg->event_list); - spin_unlock_irq(&memcg->event_list_lock); - - fdput(cfile); - fdput(efile); - - return nbytes; - -out_put_css: - css_put(css); -out_put_cfile: - fdput(cfile); -out_put_eventfd: - eventfd_ctx_put(event->eventfd); -out_put_efile: - fdput(efile); -out_kfree: - kfree(event); - - return ret; -} - -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG) -static int mem_cgroup_slab_show(struct seq_file *m, void *p) -{ - /* - * Deprecated. - * Please, take a look at tools/cgroup/memcg_slabinfo.py . - */ - return 0; -} -#endif - -static int memory_stat_show(struct seq_file *m, void *v); - -static struct cftype mem_cgroup_legacy_files[] = { - { - .name = "usage_in_bytes", - .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "max_usage_in_bytes", - .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "limit_in_bytes", - .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT), - .write = mem_cgroup_write, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "soft_limit_in_bytes", - .private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT), - .write = mem_cgroup_write, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "failcnt", - .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "stat", - .seq_show = memory_stat_show, - }, - { - .name = "force_empty", - .write = mem_cgroup_force_empty_write, - }, - { - .name = "use_hierarchy", - .write_u64 = mem_cgroup_hierarchy_write, - .read_u64 = mem_cgroup_hierarchy_read, - }, - { - .name = "cgroup.event_control", /* XXX: for compat */ - .write = memcg_write_event_control, - .flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE, - }, - { - .name = "swappiness", - .read_u64 = mem_cgroup_swappiness_read, - .write_u64 = mem_cgroup_swappiness_write, - }, - { - .name = "move_charge_at_immigrate", - .read_u64 = mem_cgroup_move_charge_read, - .write_u64 = mem_cgroup_move_charge_write, - }, - { - .name = "oom_control", - .seq_show = mem_cgroup_oom_control_read, - .write_u64 = mem_cgroup_oom_control_write, - }, - { - .name = "pressure_level", - .seq_show = mem_cgroup_dummy_seq_show, - }, -#ifdef CONFIG_NUMA - { - .name = "numa_stat", - .seq_show = memcg_numa_stat_show, - }, -#endif - { - .name = "kmem.limit_in_bytes", - .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), - .write = mem_cgroup_write, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.usage_in_bytes", - .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.failcnt", - .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.max_usage_in_bytes", - .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG) - { - .name = "kmem.slabinfo", - .seq_show = mem_cgroup_slab_show, - }, -#endif - { - .name = "kmem.tcp.limit_in_bytes", - .private = MEMFILE_PRIVATE(_TCP, RES_LIMIT), - .write = mem_cgroup_write, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.tcp.usage_in_bytes", - .private = MEMFILE_PRIVATE(_TCP, RES_USAGE), - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.tcp.failcnt", - .private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "kmem.tcp.max_usage_in_bytes", - .private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { }, /* terminate */ -}; - /* * Private memory cgroup IDR * @@ -5577,13 +3395,13 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg) } } -static void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg, - unsigned int n) +void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg, + unsigned int n) { refcount_add(n, &memcg->id.ref); } -static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n) +void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n) { if (refcount_sub_and_test(n, &memcg->id.ref)) { mem_cgroup_id_remove(memcg); @@ -5739,17 +3557,11 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) goto fail; INIT_WORK(&memcg->high_work, high_work_func); - INIT_LIST_HEAD(&memcg->oom_notify); - mutex_init(&memcg->thresholds_lock); - spin_lock_init(&memcg->move_lock); vmpressure_init(&memcg->vmpressure); - INIT_LIST_HEAD(&memcg->event_list); - spin_lock_init(&memcg->event_list_lock); memcg->socket_pressure = jiffies; -#ifdef CONFIG_MEMCG_KMEM + memcg1_memcg_init(memcg); memcg->kmemcg_id = -1; INIT_LIST_HEAD(&memcg->objcg_list); -#endif #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) @@ -5782,8 +3594,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) return ERR_CAST(memcg); page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); - WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) + memcg1_soft_limit_reset(memcg); +#ifdef CONFIG_ZSWAP memcg->zswap_max = PAGE_COUNTER_MAX; WRITE_ONCE(memcg->zswap_writeback, !parent || READ_ONCE(parent->zswap_writeback)); @@ -5791,20 +3603,23 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); if (parent) { WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent)); - WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable)); page_counter_init(&memcg->memory, &parent->memory); page_counter_init(&memcg->swap, &parent->swap); +#ifdef CONFIG_MEMCG_V1 + WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable)); page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); +#endif } else { init_memcg_stats(); init_memcg_events(); page_counter_init(&memcg->memory, NULL); page_counter_init(&memcg->swap, NULL); +#ifdef CONFIG_MEMCG_V1 page_counter_init(&memcg->kmem, NULL); page_counter_init(&memcg->tcpmem, NULL); - +#endif root_mem_cgroup = memcg; return &memcg->css; } @@ -5812,10 +3627,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_inc(&memcg_sockets_enabled_key); -#if defined(CONFIG_MEMCG_KMEM) if (!cgroup_memory_nobpf) static_branch_inc(&memcg_bpf_enabled_key); -#endif return &memcg->css; } @@ -5867,19 +3680,8 @@ remove_id: static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct mem_cgroup_event *event, *tmp; - /* - * Unregister events and notify userspace. - * Notify userspace about cgroup removing only after rmdir of cgroup - * directory to avoid race between userspace and kernelspace. - */ - spin_lock_irq(&memcg->event_list_lock); - list_for_each_entry_safe(event, tmp, &memcg->event_list, list) { - list_del_init(&event->list); - schedule_work(&event->remove); - } - spin_unlock_irq(&memcg->event_list_lock); + memcg1_css_offline(memcg); page_counter_set_min(&memcg->memory, 0); page_counter_set_low(&memcg->memory, 0); @@ -5916,17 +3718,15 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_dec(&memcg_sockets_enabled_key); - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_active) + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg1_tcpmem_active(memcg)) static_branch_dec(&memcg_sockets_enabled_key); -#if defined(CONFIG_MEMCG_KMEM) if (!cgroup_memory_nobpf) static_branch_dec(&memcg_bpf_enabled_key); -#endif vmpressure_cleanup(&memcg->vmpressure); cancel_work_sync(&memcg->high_work); - mem_cgroup_remove_from_trees(memcg); + memcg1_remove_from_trees(memcg); free_shrinker_info(memcg); mem_cgroup_free(memcg); } @@ -5950,12 +3750,14 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) page_counter_set_max(&memcg->memory, PAGE_COUNTER_MAX); page_counter_set_max(&memcg->swap, PAGE_COUNTER_MAX); +#ifdef CONFIG_MEMCG_V1 page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX); page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX); +#endif page_counter_set_min(&memcg->memory, 0); page_counter_set_low(&memcg->memory, 0); page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); - WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); + memcg1_soft_limit_reset(memcg); page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); memcg_wb_domain_size_changed(memcg); } @@ -6063,758 +3865,6 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) atomic64_set(&memcg->vmstats->stats_updates, 0); } -#ifdef CONFIG_MMU -/* Handlers for move charge at task migration. */ -static int mem_cgroup_do_precharge(unsigned long count) -{ - int ret; - - /* Try a single bulk charge without reclaim first, kswapd may wake */ - ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count); - if (!ret) { - mc.precharge += count; - return ret; - } - - /* Try charges one by one with reclaim, but do not retry */ - while (count--) { - ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1); - if (ret) - return ret; - mc.precharge++; - cond_resched(); - } - return 0; -} - -union mc_target { - struct folio *folio; - swp_entry_t ent; -}; - -enum mc_target_type { - MC_TARGET_NONE = 0, - MC_TARGET_PAGE, - MC_TARGET_SWAP, - MC_TARGET_DEVICE, -}; - -static struct page *mc_handle_present_pte(struct vm_area_struct *vma, - unsigned long addr, pte_t ptent) -{ - struct page *page = vm_normal_page(vma, addr, ptent); - - if (!page) - return NULL; - if (PageAnon(page)) { - if (!(mc.flags & MOVE_ANON)) - return NULL; - } else { - if (!(mc.flags & MOVE_FILE)) - return NULL; - } - get_page(page); - - return page; -} - -#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE) -static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, - pte_t ptent, swp_entry_t *entry) -{ - struct page *page = NULL; - swp_entry_t ent = pte_to_swp_entry(ptent); - - if (!(mc.flags & MOVE_ANON)) - return NULL; - - /* - * Handle device private pages that are not accessible by the CPU, but - * stored as special swap entries in the page table. - */ - if (is_device_private_entry(ent)) { - page = pfn_swap_entry_to_page(ent); - if (!get_page_unless_zero(page)) - return NULL; - return page; - } - - if (non_swap_entry(ent)) - return NULL; - - /* - * Because swap_cache_get_folio() updates some statistics counter, - * we call find_get_page() with swapper_space directly. - */ - page = find_get_page(swap_address_space(ent), swp_offset(ent)); - entry->val = ent.val; - - return page; -} -#else -static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, - pte_t ptent, swp_entry_t *entry) -{ - return NULL; -} -#endif - -static struct page *mc_handle_file_pte(struct vm_area_struct *vma, - unsigned long addr, pte_t ptent) -{ - unsigned long index; - struct folio *folio; - - if (!vma->vm_file) /* anonymous vma */ - return NULL; - if (!(mc.flags & MOVE_FILE)) - return NULL; - - /* folio is moved even if it's not RSS of this task(page-faulted). */ - /* shmem/tmpfs may report page out on swap: account for that too. */ - index = linear_page_index(vma, addr); - folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index); - if (IS_ERR(folio)) - return NULL; - return folio_file_page(folio, index); -} - -/** - * mem_cgroup_move_account - move account of the folio - * @folio: The folio. - * @compound: charge the page as compound or small page - * @from: mem_cgroup which the folio is moved from. - * @to: mem_cgroup which the folio is moved to. @from != @to. - * - * The folio must be locked and not on the LRU. - * - * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" - * from old cgroup. - */ -static int mem_cgroup_move_account(struct folio *folio, - bool compound, - struct mem_cgroup *from, - struct mem_cgroup *to) -{ - struct lruvec *from_vec, *to_vec; - struct pglist_data *pgdat; - unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1; - int nid, ret; - - VM_BUG_ON(from == to); - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - VM_BUG_ON(compound && !folio_test_large(folio)); - - ret = -EINVAL; - if (folio_memcg(folio) != from) - goto out; - - pgdat = folio_pgdat(folio); - from_vec = mem_cgroup_lruvec(from, pgdat); - to_vec = mem_cgroup_lruvec(to, pgdat); - - folio_memcg_lock(folio); - - if (folio_test_anon(folio)) { - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); - if (folio_test_pmd_mappable(folio)) { - __mod_lruvec_state(from_vec, NR_ANON_THPS, - -nr_pages); - __mod_lruvec_state(to_vec, NR_ANON_THPS, - nr_pages); - } - } - } else { - __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); - - if (folio_test_swapbacked(folio)) { - __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); - __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); - } - - if (folio_mapped(folio)) { - __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); - } - - if (folio_test_dirty(folio)) { - struct address_space *mapping = folio_mapping(folio); - - if (mapping_can_writeback(mapping)) { - __mod_lruvec_state(from_vec, NR_FILE_DIRTY, - -nr_pages); - __mod_lruvec_state(to_vec, NR_FILE_DIRTY, - nr_pages); - } - } - } - -#ifdef CONFIG_SWAP - if (folio_test_swapcache(folio)) { - __mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages); - __mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages); - } -#endif - if (folio_test_writeback(folio)) { - __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages); - __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); - } - - /* - * All state has been migrated, let's switch to the new memcg. - * - * It is safe to change page's memcg here because the page - * is referenced, charged, isolated, and locked: we can't race - * with (un)charging, migration, LRU putback, or anything else - * that would rely on a stable page's memory cgroup. - * - * Note that folio_memcg_lock is a memcg lock, not a page lock, - * to save space. As soon as we switch page's memory cgroup to a - * new memcg that isn't locked, the above state can change - * concurrently again. Make sure we're truly done with it. - */ - smp_mb(); - - css_get(&to->css); - css_put(&from->css); - - folio->memcg_data = (unsigned long)to; - - __folio_memcg_unlock(from); - - ret = 0; - nid = folio_nid(folio); - - local_irq_disable(); - mem_cgroup_charge_statistics(to, nr_pages); - memcg_check_events(to, nid); - mem_cgroup_charge_statistics(from, -nr_pages); - memcg_check_events(from, nid); - local_irq_enable(); -out: - return ret; -} - -/** - * get_mctgt_type - get target type of moving charge - * @vma: the vma the pte to be checked belongs - * @addr: the address corresponding to the pte to be checked - * @ptent: the pte to be checked - * @target: the pointer the target page or swap ent will be stored(can be NULL) - * - * Context: Called with pte lock held. - * Return: - * * MC_TARGET_NONE - If the pte is not a target for move charge. - * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for - * move charge. If @target is not NULL, the folio is stored in target->folio - * with extra refcnt taken (Caller should release it). - * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a - * target for charge migration. If @target is not NULL, the entry is - * stored in target->ent. - * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and - * thus not on the lru. For now such page is charged like a regular page - * would be as it is just special memory taking the place of a regular page. - * See Documentations/vm/hmm.txt and include/linux/hmm.h - */ -static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, - unsigned long addr, pte_t ptent, union mc_target *target) -{ - struct page *page = NULL; - struct folio *folio; - enum mc_target_type ret = MC_TARGET_NONE; - swp_entry_t ent = { .val = 0 }; - - if (pte_present(ptent)) - page = mc_handle_present_pte(vma, addr, ptent); - else if (pte_none_mostly(ptent)) - /* - * PTE markers should be treated as a none pte here, separated - * from other swap handling below. - */ - page = mc_handle_file_pte(vma, addr, ptent); - else if (is_swap_pte(ptent)) - page = mc_handle_swap_pte(vma, ptent, &ent); - - if (page) - folio = page_folio(page); - if (target && page) { - if (!folio_trylock(folio)) { - folio_put(folio); - return ret; - } - /* - * page_mapped() must be stable during the move. This - * pte is locked, so if it's present, the page cannot - * become unmapped. If it isn't, we have only partial - * control over the mapped state: the page lock will - * prevent new faults against pagecache and swapcache, - * so an unmapped page cannot become mapped. However, - * if the page is already mapped elsewhere, it can - * unmap, and there is nothing we can do about it. - * Alas, skip moving the page in this case. - */ - if (!pte_present(ptent) && page_mapped(page)) { - folio_unlock(folio); - folio_put(folio); - return ret; - } - } - - if (!page && !ent.val) - return ret; - if (page) { - /* - * Do only loose check w/o serialization. - * mem_cgroup_move_account() checks the page is valid or - * not under LRU exclusion. - */ - if (folio_memcg(folio) == mc.from) { - ret = MC_TARGET_PAGE; - if (folio_is_device_private(folio) || - folio_is_device_coherent(folio)) - ret = MC_TARGET_DEVICE; - if (target) - target->folio = folio; - } - if (!ret || !target) { - if (target) - folio_unlock(folio); - folio_put(folio); - } - } - /* - * There is a swap entry and a page doesn't exist or isn't charged. - * But we cannot move a tail-page in a THP. - */ - if (ent.val && !ret && (!page || !PageTransCompound(page)) && - mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) { - ret = MC_TARGET_SWAP; - if (target) - target->ent = ent; - } - return ret; -} - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE -/* - * We don't consider PMD mapped swapping or file mapped pages because THP does - * not support them for now. - * Caller should make sure that pmd_trans_huge(pmd) is true. - */ -static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, - unsigned long addr, pmd_t pmd, union mc_target *target) -{ - struct page *page = NULL; - struct folio *folio; - enum mc_target_type ret = MC_TARGET_NONE; - - if (unlikely(is_swap_pmd(pmd))) { - VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(pmd)); - return ret; - } - page = pmd_page(pmd); - VM_BUG_ON_PAGE(!page || !PageHead(page), page); - folio = page_folio(page); - if (!(mc.flags & MOVE_ANON)) - return ret; - if (folio_memcg(folio) == mc.from) { - ret = MC_TARGET_PAGE; - if (target) { - folio_get(folio); - if (!folio_trylock(folio)) { - folio_put(folio); - return MC_TARGET_NONE; - } - target->folio = folio; - } - } - return ret; -} -#else -static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, - unsigned long addr, pmd_t pmd, union mc_target *target) -{ - return MC_TARGET_NONE; -} -#endif - -static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, - unsigned long addr, unsigned long end, - struct mm_walk *walk) -{ - struct vm_area_struct *vma = walk->vma; - pte_t *pte; - spinlock_t *ptl; - - ptl = pmd_trans_huge_lock(pmd, vma); - if (ptl) { - /* - * Note their can not be MC_TARGET_DEVICE for now as we do not - * support transparent huge page with MEMORY_DEVICE_PRIVATE but - * this might change. - */ - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) - mc.precharge += HPAGE_PMD_NR; - spin_unlock(ptl); - return 0; - } - - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - if (!pte) - return 0; - for (; addr != end; pte++, addr += PAGE_SIZE) - if (get_mctgt_type(vma, addr, ptep_get(pte), NULL)) - mc.precharge++; /* increment precharge temporarily */ - pte_unmap_unlock(pte - 1, ptl); - cond_resched(); - - return 0; -} - -static const struct mm_walk_ops precharge_walk_ops = { - .pmd_entry = mem_cgroup_count_precharge_pte_range, - .walk_lock = PGWALK_RDLOCK, -}; - -static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) -{ - unsigned long precharge; - - mmap_read_lock(mm); - walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL); - mmap_read_unlock(mm); - - precharge = mc.precharge; - mc.precharge = 0; - - return precharge; -} - -static int mem_cgroup_precharge_mc(struct mm_struct *mm) -{ - unsigned long precharge = mem_cgroup_count_precharge(mm); - - VM_BUG_ON(mc.moving_task); - mc.moving_task = current; - return mem_cgroup_do_precharge(precharge); -} - -/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */ -static void __mem_cgroup_clear_mc(void) -{ - struct mem_cgroup *from = mc.from; - struct mem_cgroup *to = mc.to; - - /* we must uncharge all the leftover precharges from mc.to */ - if (mc.precharge) { - mem_cgroup_cancel_charge(mc.to, mc.precharge); - mc.precharge = 0; - } - /* - * we didn't uncharge from mc.from at mem_cgroup_move_account(), so - * we must uncharge here. - */ - if (mc.moved_charge) { - mem_cgroup_cancel_charge(mc.from, mc.moved_charge); - mc.moved_charge = 0; - } - /* we must fixup refcnts and charges */ - if (mc.moved_swap) { - /* uncharge swap account from the old cgroup */ - if (!mem_cgroup_is_root(mc.from)) - page_counter_uncharge(&mc.from->memsw, mc.moved_swap); - - mem_cgroup_id_put_many(mc.from, mc.moved_swap); - - /* - * we charged both to->memory and to->memsw, so we - * should uncharge to->memory. - */ - if (!mem_cgroup_is_root(mc.to)) - page_counter_uncharge(&mc.to->memory, mc.moved_swap); - - mc.moved_swap = 0; - } - memcg_oom_recover(from); - memcg_oom_recover(to); - wake_up_all(&mc.waitq); -} - -static void mem_cgroup_clear_mc(void) -{ - struct mm_struct *mm = mc.mm; - - /* - * we must clear moving_task before waking up waiters at the end of - * task migration. - */ - mc.moving_task = NULL; - __mem_cgroup_clear_mc(); - spin_lock(&mc.lock); - mc.from = NULL; - mc.to = NULL; - mc.mm = NULL; - spin_unlock(&mc.lock); - - mmput(mm); -} - -static int mem_cgroup_can_attach(struct cgroup_taskset *tset) -{ - struct cgroup_subsys_state *css; - struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */ - struct mem_cgroup *from; - struct task_struct *leader, *p; - struct mm_struct *mm; - unsigned long move_flags; - int ret = 0; - - /* charge immigration isn't supported on the default hierarchy */ - if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) - return 0; - - /* - * Multi-process migrations only happen on the default hierarchy - * where charge immigration is not used. Perform charge - * immigration if @tset contains a leader and whine if there are - * multiple. - */ - p = NULL; - cgroup_taskset_for_each_leader(leader, css, tset) { - WARN_ON_ONCE(p); - p = leader; - memcg = mem_cgroup_from_css(css); - } - if (!p) - return 0; - - /* - * We are now committed to this value whatever it is. Changes in this - * tunable will only affect upcoming migrations, not the current one. - * So we need to save it, and keep it going. - */ - move_flags = READ_ONCE(memcg->move_charge_at_immigrate); - if (!move_flags) - return 0; - - from = mem_cgroup_from_task(p); - - VM_BUG_ON(from == memcg); - - mm = get_task_mm(p); - if (!mm) - return 0; - /* We move charges only when we move a owner of the mm */ - if (mm->owner == p) { - VM_BUG_ON(mc.from); - VM_BUG_ON(mc.to); - VM_BUG_ON(mc.precharge); - VM_BUG_ON(mc.moved_charge); - VM_BUG_ON(mc.moved_swap); - - spin_lock(&mc.lock); - mc.mm = mm; - mc.from = from; - mc.to = memcg; - mc.flags = move_flags; - spin_unlock(&mc.lock); - /* We set mc.moving_task later */ - - ret = mem_cgroup_precharge_mc(mm); - if (ret) - mem_cgroup_clear_mc(); - } else { - mmput(mm); - } - return ret; -} - -static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset) -{ - if (mc.to) - mem_cgroup_clear_mc(); -} - -static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, - unsigned long addr, unsigned long end, - struct mm_walk *walk) -{ - int ret = 0; - struct vm_area_struct *vma = walk->vma; - pte_t *pte; - spinlock_t *ptl; - enum mc_target_type target_type; - union mc_target target; - struct folio *folio; - - ptl = pmd_trans_huge_lock(pmd, vma); - if (ptl) { - if (mc.precharge < HPAGE_PMD_NR) { - spin_unlock(ptl); - return 0; - } - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); - if (target_type == MC_TARGET_PAGE) { - folio = target.folio; - if (folio_isolate_lru(folio)) { - if (!mem_cgroup_move_account(folio, true, - mc.from, mc.to)) { - mc.precharge -= HPAGE_PMD_NR; - mc.moved_charge += HPAGE_PMD_NR; - } - folio_putback_lru(folio); - } - folio_unlock(folio); - folio_put(folio); - } else if (target_type == MC_TARGET_DEVICE) { - folio = target.folio; - if (!mem_cgroup_move_account(folio, true, - mc.from, mc.to)) { - mc.precharge -= HPAGE_PMD_NR; - mc.moved_charge += HPAGE_PMD_NR; - } - folio_unlock(folio); - folio_put(folio); - } - spin_unlock(ptl); - return 0; - } - -retry: - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - if (!pte) - return 0; - for (; addr != end; addr += PAGE_SIZE) { - pte_t ptent = ptep_get(pte++); - bool device = false; - swp_entry_t ent; - - if (!mc.precharge) - break; - - switch (get_mctgt_type(vma, addr, ptent, &target)) { - case MC_TARGET_DEVICE: - device = true; - fallthrough; - case MC_TARGET_PAGE: - folio = target.folio; - /* - * We can have a part of the split pmd here. Moving it - * can be done but it would be too convoluted so simply - * ignore such a partial THP and keep it in original - * memcg. There should be somebody mapping the head. - */ - if (folio_test_large(folio)) - goto put; - if (!device && !folio_isolate_lru(folio)) - goto put; - if (!mem_cgroup_move_account(folio, false, - mc.from, mc.to)) { - mc.precharge--; - /* we uncharge from mc.from later. */ - mc.moved_charge++; - } - if (!device) - folio_putback_lru(folio); -put: /* get_mctgt_type() gets & locks the page */ - folio_unlock(folio); - folio_put(folio); - break; - case MC_TARGET_SWAP: - ent = target.ent; - if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { - mc.precharge--; - mem_cgroup_id_get_many(mc.to, 1); - /* we fixup other refcnts and charges later. */ - mc.moved_swap++; - } - break; - default: - break; - } - } - pte_unmap_unlock(pte - 1, ptl); - cond_resched(); - - if (addr != end) { - /* - * We have consumed all precharges we got in can_attach(). - * We try charge one by one, but don't do any additional - * charges to mc.to if we have failed in charge once in attach() - * phase. - */ - ret = mem_cgroup_do_precharge(1); - if (!ret) - goto retry; - } - - return ret; -} - -static const struct mm_walk_ops charge_walk_ops = { - .pmd_entry = mem_cgroup_move_charge_pte_range, - .walk_lock = PGWALK_RDLOCK, -}; - -static void mem_cgroup_move_charge(void) -{ - lru_add_drain_all(); - /* - * Signal folio_memcg_lock() to take the memcg's move_lock - * while we're moving its pages to another memcg. Then wait - * for already started RCU-only updates to finish. - */ - atomic_inc(&mc.from->moving_account); - synchronize_rcu(); -retry: - if (unlikely(!mmap_read_trylock(mc.mm))) { - /* - * Someone who are holding the mmap_lock might be waiting in - * waitq. So we cancel all extra charges, wake up all waiters, - * and retry. Because we cancel precharges, we might not be able - * to move enough charges, but moving charge is a best-effort - * feature anyway, so it wouldn't be a big problem. - */ - __mem_cgroup_clear_mc(); - cond_resched(); - goto retry; - } - /* - * When we have consumed all precharges and failed in doing - * additional charge, the page walk just aborts. - */ - walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL); - mmap_read_unlock(mc.mm); - atomic_dec(&mc.from->moving_account); -} - -static void mem_cgroup_move_task(void) -{ - if (mc.to) { - mem_cgroup_move_charge(); - mem_cgroup_clear_mc(); - } -} - -#else /* !CONFIG_MMU */ -static int mem_cgroup_can_attach(struct cgroup_taskset *tset) -{ - return 0; -} -static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset) -{ -} -static void mem_cgroup_move_task(void) -{ -} -#endif - -#ifdef CONFIG_MEMCG_KMEM static void mem_cgroup_fork(struct task_struct *task) { /* @@ -6842,7 +3892,6 @@ static void mem_cgroup_exit(struct task_struct *task) */ task->objcg = NULL; } -#endif #ifdef CONFIG_LRU_GEN static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) @@ -6866,7 +3915,6 @@ static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) {} #endif /* CONFIG_LRU_GEN */ -#ifdef CONFIG_MEMCG_KMEM static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) { struct task_struct *task; @@ -6877,17 +3925,12 @@ static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) set_bit(CURRENT_OBJCG_UPDATE_BIT, (unsigned long *)&task->objcg); } } -#else -static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {} -#endif /* CONFIG_MEMCG_KMEM */ -#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM) static void mem_cgroup_attach(struct cgroup_taskset *tset) { mem_cgroup_lru_gen_attach(tset); mem_cgroup_kmem_attach(tset); } -#endif static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { @@ -7000,7 +4043,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); if (!reclaimed && !nr_retries--) break; @@ -7049,7 +4092,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) nr_reclaims--; continue; } @@ -7095,7 +4138,7 @@ static int memory_events_local_show(struct seq_file *m, void *v) return 0; } -static int memory_stat_show(struct seq_file *m, void *v) +int memory_stat_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL); @@ -7179,19 +4222,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } +enum { + MEMORY_RECLAIM_SWAPPINESS = 0, + MEMORY_RECLAIM_NULL, +}; + +static const match_table_t tokens = { + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, + { MEMORY_RECLAIM_NULL, NULL }, +}; + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; + int swappiness = -1; unsigned int reclaim_options; - int err; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); - if (err) - return err; + + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + switch (match_token(start, tokens, args)) { + case MEMORY_RECLAIM_SWAPPINESS: + if (match_int(&args[0], &swappiness)) + return -EINVAL; + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) + return -EINVAL; + break; + default: + return -EINVAL; + } + } reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { @@ -7211,7 +4285,9 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, lru_add_drain_all(); reclaimed = try_to_free_mem_cgroup_pages(memcg, - batch_size, GFP_KERNEL, reclaim_options); + batch_size, GFP_KERNEL, + reclaim_options, + swappiness == -1 ? NULL : &swappiness); if (!reclaimed && !nr_retries--) return -EAGAIN; @@ -7301,137 +4377,19 @@ struct cgroup_subsys memory_cgrp_subsys = { .css_free = mem_cgroup_css_free, .css_reset = mem_cgroup_css_reset, .css_rstat_flush = mem_cgroup_css_rstat_flush, - .can_attach = mem_cgroup_can_attach, -#if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM) .attach = mem_cgroup_attach, -#endif - .cancel_attach = mem_cgroup_cancel_attach, - .post_attach = mem_cgroup_move_task, -#ifdef CONFIG_MEMCG_KMEM .fork = mem_cgroup_fork, .exit = mem_cgroup_exit, -#endif .dfl_cftypes = memory_files, +#ifdef CONFIG_MEMCG_V1 + .can_attach = memcg1_can_attach, + .cancel_attach = memcg1_cancel_attach, + .post_attach = memcg1_move_task, .legacy_cftypes = mem_cgroup_legacy_files, +#endif .early_init = 0, }; -/* - * This function calculates an individual cgroup's effective - * protection which is derived from its own memory.min/low, its - * parent's and siblings' settings, as well as the actual memory - * distribution in the tree. - * - * The following rules apply to the effective protection values: - * - * 1. At the first level of reclaim, effective protection is equal to - * the declared protection in memory.min and memory.low. - * - * 2. To enable safe delegation of the protection configuration, at - * subsequent levels the effective protection is capped to the - * parent's effective protection. - * - * 3. To make complex and dynamic subtrees easier to configure, the - * user is allowed to overcommit the declared protection at a given - * level. If that is the case, the parent's effective protection is - * distributed to the children in proportion to how much protection - * they have declared and how much of it they are utilizing. - * - * This makes distribution proportional, but also work-conserving: - * if one cgroup claims much more protection than it uses memory, - * the unused remainder is available to its siblings. - * - * 4. Conversely, when the declared protection is undercommitted at a - * given level, the distribution of the larger parental protection - * budget is NOT proportional. A cgroup's protection from a sibling - * is capped to its own memory.min/low setting. - * - * 5. However, to allow protecting recursive subtrees from each other - * without having to declare each individual cgroup's fixed share - * of the ancestor's claim to protection, any unutilized - - * "floating" - protection from up the tree is distributed in - * proportion to each cgroup's *usage*. This makes the protection - * neutral wrt sibling cgroups and lets them compete freely over - * the shared parental protection budget, but it protects the - * subtree as a whole from neighboring subtrees. - * - * Note that 4. and 5. are not in conflict: 4. is about protecting - * against immediate siblings whereas 5. is about protecting against - * neighboring subtrees. - */ -static unsigned long effective_protection(unsigned long usage, - unsigned long parent_usage, - unsigned long setting, - unsigned long parent_effective, - unsigned long siblings_protected) -{ - unsigned long protected; - unsigned long ep; - - protected = min(usage, setting); - /* - * If all cgroups at this level combined claim and use more - * protection than what the parent affords them, distribute - * shares in proportion to utilization. - * - * We are using actual utilization rather than the statically - * claimed protection in order to be work-conserving: claimed - * but unused protection is available to siblings that would - * otherwise get a smaller chunk than what they claimed. - */ - if (siblings_protected > parent_effective) - return protected * parent_effective / siblings_protected; - - /* - * Ok, utilized protection of all children is within what the - * parent affords them, so we know whatever this child claims - * and utilizes is effectively protected. - * - * If there is unprotected usage beyond this value, reclaim - * will apply pressure in proportion to that amount. - * - * If there is unutilized protection, the cgroup will be fully - * shielded from reclaim, but we do return a smaller value for - * protection than what the group could enjoy in theory. This - * is okay. With the overcommit distribution above, effective - * protection is always dependent on how memory is actually - * consumed among the siblings anyway. - */ - ep = protected; - - /* - * If the children aren't claiming (all of) the protection - * afforded to them by the parent, distribute the remainder in - * proportion to the (unprotected) memory of each cgroup. That - * way, cgroups that aren't explicitly prioritized wrt each - * other compete freely over the allowance, but they are - * collectively protected from neighboring trees. - * - * We're using unprotected memory for the weight so that if - * some cgroups DO claim explicit protection, we don't protect - * the same bytes twice. - * - * Check both usage and parent_usage against the respective - * protected values. One should imply the other, but they - * aren't read atomically - make sure the division is sane. - */ - if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) - return ep; - if (parent_effective > siblings_protected && - parent_usage > siblings_protected && - usage > protected) { - unsigned long unclaimed; - - unclaimed = parent_effective - siblings_protected; - unclaimed *= usage - protected; - unclaimed /= parent_usage - siblings_protected; - - ep += unclaimed; - } - - return ep; -} - /** * mem_cgroup_calculate_protection - check if memory consumption is in the normal range * @root: the top ancestor of the sub-tree being checked @@ -7443,8 +4401,8 @@ static unsigned long effective_protection(unsigned long usage, void mem_cgroup_calculate_protection(struct mem_cgroup *root, struct mem_cgroup *memcg) { - unsigned long usage, parent_usage; - struct mem_cgroup *parent; + bool recursive_protection = + cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT; if (mem_cgroup_disabled()) return; @@ -7452,39 +4410,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root, if (!root) root = root_mem_cgroup; - /* - * Effective values of the reclaim targets are ignored so they - * can be stale. Have a look at mem_cgroup_protection for more - * details. - * TODO: calculation should be more robust so that we do not need - * that special casing. - */ - if (memcg == root) - return; - - usage = page_counter_read(&memcg->memory); - if (!usage) - return; - - parent = parent_mem_cgroup(memcg); - - if (parent == root) { - memcg->memory.emin = READ_ONCE(memcg->memory.min); - memcg->memory.elow = READ_ONCE(memcg->memory.low); - return; - } - - parent_usage = page_counter_read(&parent->memory); - - WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage, - READ_ONCE(memcg->memory.min), - READ_ONCE(parent->memory.emin), - atomic_long_read(&parent->memory.children_min_usage))); - - WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage, - READ_ONCE(memcg->memory.low), - READ_ONCE(parent->memory.elow), - atomic_long_read(&parent->memory.children_low_usage))); + page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection); } static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, @@ -7637,15 +4563,17 @@ static void uncharge_batch(const struct uncharge_gather *ug) page_counter_uncharge(&ug->memcg->memory, ug->nr_memory); if (do_memsw_account()) page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory); - if (ug->nr_kmem) - memcg_account_kmem(ug->memcg, -ug->nr_kmem); - memcg_oom_recover(ug->memcg); + if (ug->nr_kmem) { + mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem); + memcg1_account_kmem(ug->memcg, -ug->nr_kmem); + } + memcg1_oom_recover(ug->memcg); } local_irq_save(flags); __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); - memcg_check_events(ug->memcg, ug->nid); + memcg1_check_events(ug->memcg, ug->nid); local_irq_restore(flags); /* drop reference from uncharge_folio */ @@ -7784,7 +4712,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) local_irq_save(flags); mem_cgroup_charge_statistics(memcg, nr_pages); - memcg_check_events(memcg, folio_nid(new)); + memcg1_check_events(memcg, folio_nid(new)); local_irq_restore(flags); } @@ -7807,6 +4735,7 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) VM_BUG_ON_FOLIO(!folio_test_locked(new), new); VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new); VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new); + VM_BUG_ON_FOLIO(folio_test_lru(old), old); if (mem_cgroup_disabled()) return; @@ -7844,7 +4773,7 @@ void mem_cgroup_sk_alloc(struct sock *sk) memcg = mem_cgroup_from_task(current); if (mem_cgroup_is_root(memcg)) goto out; - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active) + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg1_tcpmem_active(memcg)) goto out; if (css_tryget(&memcg->css)) sk->sk_memcg = memcg; @@ -7870,20 +4799,8 @@ void mem_cgroup_sk_free(struct sock *sk) bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask) { - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { - struct page_counter *fail; - - if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) { - memcg->tcpmem_pressure = 0; - return true; - } - memcg->tcpmem_pressure = 1; - if (gfp_mask & __GFP_NOFAIL) { - page_counter_charge(&memcg->tcpmem, nr_pages); - return true; - } - return false; - } + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return memcg1_charge_skmem(memcg, nr_pages, gfp_mask); if (try_charge(memcg, gfp_mask, nr_pages) == 0) { mod_memcg_state(memcg, MEMCG_SOCK, nr_pages); @@ -7901,7 +4818,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { - page_counter_uncharge(&memcg->tcpmem, nr_pages); + memcg1_uncharge_skmem(memcg, nr_pages); return; } @@ -7938,7 +4855,7 @@ __setup("cgroup.memory=", cgroup_memory); */ static int __init mem_cgroup_init(void) { - int cpu, node; + int cpu; /* * Currently s32 type (can refer to struct batched_lruvec_stat) is @@ -7955,17 +4872,6 @@ static int __init mem_cgroup_init(void) INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, drain_local_stock); - for_each_node(node) { - struct mem_cgroup_tree_per_node *rtpn; - - rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node); - - rtpn->rb_root = RB_ROOT; - rtpn->rb_rightmost = NULL; - spin_lock_init(&rtpn->lock); - soft_limit_tree.rb_tree_per_node[node] = rtpn; - } - return 0; } subsys_initcall(mem_cgroup_init); @@ -8052,7 +4958,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) memcg_stats_lock(); mem_cgroup_charge_statistics(memcg, -nr_entries); memcg_stats_unlock(); - memcg_check_events(memcg, folio_nid(folio)); + memcg1_check_events(memcg, folio_nid(folio)); css_put(&memcg->css); } @@ -8293,34 +5199,7 @@ static struct cftype swap_files[] = { { } /* terminate */ }; -static struct cftype memsw_files[] = { - { - .name = "memsw.usage_in_bytes", - .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE), - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "memsw.max_usage_in_bytes", - .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "memsw.limit_in_bytes", - .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT), - .write = mem_cgroup_write, - .read_u64 = mem_cgroup_read_u64, - }, - { - .name = "memsw.failcnt", - .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT), - .write = mem_cgroup_reset, - .read_u64 = mem_cgroup_read_u64, - }, - { }, /* terminate */ -}; - -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#ifdef CONFIG_ZSWAP /** * obj_cgroup_may_zswap - check if this cgroup can zswap * @objcg: the object cgroup @@ -8423,7 +5302,7 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) { /* if zswap is disabled, do not block pages going to the swapping device */ - return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback); + return !zswap_is_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback); } static u64 zswap_current_read(struct cgroup_subsys_state *css, @@ -8502,7 +5381,7 @@ static struct cftype zswap_files[] = { }, { } /* terminate */ }; -#endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */ +#endif /* CONFIG_ZSWAP */ static int __init mem_cgroup_swap_init(void) { @@ -8510,8 +5389,10 @@ static int __init mem_cgroup_swap_init(void) return 0; WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files)); +#ifdef CONFIG_MEMCG_V1 WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files)); -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) +#endif +#ifdef CONFIG_ZSWAP WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files)); #endif return 0; diff --git a/mm/memfd.c b/mm/memfd.c index 7d8d3ab3fa37..e7b7c5294d59 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -59,6 +59,51 @@ static void memfd_tag_pins(struct xa_state *xas) xas_unlock_irq(xas); } +/* + * This is a helper function used by memfd_pin_user_pages() in GUP (gup.c). + * It is mainly called to allocate a folio in a memfd when the caller + * (memfd_pin_folios()) cannot find a folio in the page cache at a given + * index in the mapping. + */ +struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct folio *folio; + gfp_t gfp_mask; + int err; + + if (is_file_hugepages(memfd)) { + /* + * The folio would most likely be accessed by a DMA driver, + * therefore, we have zone memory constraints where we can + * alloc from. Also, the folio will be pinned for an indefinite + * amount of time, so it is not expected to be migrated away. + */ + gfp_mask = htlb_alloc_mask(hstate_file(memfd)); + gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); + + folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd), + numa_node_id(), + NULL, + gfp_mask, + false); + if (folio && folio_try_get(folio)) { + err = hugetlb_add_to_page_cache(folio, + memfd->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + return ERR_PTR(err); + } + return folio; + } + return ERR_PTR(-ENOMEM); + } +#endif + return shmem_read_folio(memfd->f_mapping, idx); +} + /* * Setting SEAL_WRITE requires us to verify there's no pending writer. However, * via get_user_pages(), drivers might have some pending I/O without any active diff --git a/mm/memory-failure.c b/mm/memory-failure.c index d3c830e817e3..581d3e5c9117 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -68,6 +68,8 @@ static int sysctl_memory_failure_early_kill __read_mostly; static int sysctl_memory_failure_recovery __read_mostly = 1; +static int sysctl_enable_soft_offline __read_mostly = 1; + atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0); static bool hw_memory_failure __read_mostly = false; @@ -141,6 +143,15 @@ static struct ctl_table memory_failure_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, + { + .procname = "enable_soft_offline", + .data = &sysctl_enable_soft_offline, + .maxlen = sizeof(sysctl_enable_soft_offline), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + } }; /* @@ -294,6 +305,7 @@ int hwpoison_filter(struct page *p) return 0; } +EXPORT_SYMBOL_GPL(hwpoison_filter); #else int hwpoison_filter(struct page *p) { @@ -301,8 +313,6 @@ int hwpoison_filter(struct page *p) } #endif -EXPORT_SYMBOL_GPL(hwpoison_filter); - /* * Kill all processes that have a poisoned page mapped and then isolate * the page. @@ -344,7 +354,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags) int ret = 0; pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", - pfn, t->comm, t->pid); + pfn, t->comm, task_pid_nr(t)); if ((flags & MF_ACTION_REQUIRED) && (t == current)) ret = force_sig_mceerr(BUS_MCEERR_AR, @@ -355,14 +365,12 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags) * PF_MCE_EARLY set. * Don't use force here, it's convenient if the signal * can be temporarily blocked. - * This could cause a loop when the user sets SIGBUS - * to SIG_IGN, but hopefully no one will do that? */ ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr, addr_lsb, t); if (ret < 0) pr_info("Error sending signal to %s:%d: %d\n", - t->comm, t->pid, ret); + t->comm, task_pid_nr(t), ret); return ret; } @@ -514,24 +522,17 @@ void add_to_kill_ksm(struct task_struct *tsk, struct page *p, * * Only do anything when FORCEKILL is set, otherwise just free the * list (this is used for clean pages which do not need killing) - * Also when FAIL is set do a force kill because something went - * wrong earlier. */ -static void kill_procs(struct list_head *to_kill, int forcekill, bool fail, +static void kill_procs(struct list_head *to_kill, int forcekill, unsigned long pfn, int flags) { struct to_kill *tk, *next; list_for_each_entry_safe(tk, next, to_kill, nd) { if (forcekill) { - /* - * In case something went wrong with munmapping - * make sure the process doesn't catch the - * signal and then access the memory. Just kill it. - */ - if (fail || tk->addr == -EFAULT) { + if (tk->addr == -EFAULT) { pr_err("%#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", - pfn, tk->tsk->comm, tk->tsk->pid); + pfn, tk->tsk->comm, task_pid_nr(tk->tsk)); do_send_sig_info(SIGKILL, SEND_SIG_PRIV, tk->tsk, PIDTYPE_PID); } @@ -544,7 +545,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail, */ else if (kill_proc(tk, pfn, flags) < 0) pr_err("%#lx: Cannot send advisory machine check signal to %s:%d\n", - pfn, tk->tsk->comm, tk->tsk->pid); + pfn, tk->tsk->comm, task_pid_nr(tk->tsk)); } list_del(&tk->nd); put_task_struct(tk->tsk); @@ -834,7 +835,7 @@ static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask, struct mm_walk *walk) { struct hwpoison_walk *hwp = walk->private; - pte_t pte = huge_ptep_get(ptep); + pte_t pte = huge_ptep_get(walk->mm, addr, ptep); struct hstate *h = hstate_vma(walk->vma); return check_hwpoisoned_entry(pte, addr, huge_page_shift(h), @@ -886,6 +887,28 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, return ret > 0 ? -EHWPOISON : -EFAULT; } +/* + * MF_IGNORED - The m-f() handler marks the page as PG_hwpoisoned'ed. + * But it could not do more to isolate the page from being accessed again, + * nor does it kill the process. This is extremely rare and one of the + * potential causes is that the page state has been changed due to + * underlying race condition. This is the most severe outcomes. + * + * MF_FAILED - The m-f() handler marks the page as PG_hwpoisoned'ed. + * It should have killed the process, but it can't isolate the page, + * due to conditions such as extra pin, unmap failure, etc. Accessing + * the page again may trigger another MCE and the process will be killed + * by the m-f() handler immediately. + * + * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed. + * The page is unmapped, and is removed from the LRU or file mapping. + * An attempt to access the page again will trigger page fault and the + * PF handler will kill the process. + * + * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed. + * The page has been completely isolated, that is, unmapped, taken out of + * the buddy system, or hole-punnched out of the file mapping. + */ static const char *action_name[] = { [MF_IGNORED] = "Ignored", [MF_FAILED] = "Failed", @@ -896,10 +919,9 @@ static const char *action_name[] = { static const char * const action_page_types[] = { [MF_MSG_KERNEL] = "reserved kernel page", [MF_MSG_KERNEL_HIGH_ORDER] = "high-order kernel page", - [MF_MSG_SLAB] = "kernel slab page", - [MF_MSG_DIFFERENT_COMPOUND] = "different compound page after locking", [MF_MSG_HUGE] = "huge page", [MF_MSG_FREE_HUGE] = "free huge page", + [MF_MSG_GET_HWPOISON] = "get hwpoison page", [MF_MSG_UNMAP_FAILED] = "unmapping failed page", [MF_MSG_DIRTY_SWAPCACHE] = "dirty swapcache page", [MF_MSG_CLEAN_SWAPCACHE] = "clean swapcache page", @@ -913,6 +935,7 @@ static const char * const action_page_types[] = { [MF_MSG_BUDDY] = "free buddy page", [MF_MSG_DAX] = "dax page", [MF_MSG_UNSPLIT_THP] = "unsplit thp", + [MF_MSG_ALREADY_POISONED] = "already poisoned", [MF_MSG_UNKNOWN] = "unknown page", }; @@ -1020,12 +1043,13 @@ static int me_kernel(struct page_state *ps, struct page *p) /* * Page in unknown state. Do nothing. + * This is a catch-all in case we fail to make sense of the page state. */ static int me_unknown(struct page_state *ps, struct page *p) { pr_err("%#lx: Unknown page state\n", page_to_pfn(p)); unlock_page(p); - return MF_FAILED; + return MF_IGNORED; } /* @@ -1094,7 +1118,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p) struct folio *folio = page_folio(p); struct address_space *mapping = folio_mapping(folio); - SetPageError(p); /* TBD: print more information about the file. */ if (mapping) { /* @@ -1102,34 +1125,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p) * who check the mapping. * This way the application knows that something went * wrong with its dirty file data. - * - * There's one open issue: - * - * The EIO will be only reported on the next IO - * operation and then cleared through the IO map. - * Normally Linux has two mechanisms to pass IO error - * first through the AS_EIO flag in the address space - * and then through the PageError flag in the page. - * Since we drop pages on memory failure handling the - * only mechanism open to use is through AS_AIO. - * - * This has the disadvantage that it gets cleared on - * the first operation that returns an error, while - * the PageError bit is more sticky and only cleared - * when the page is reread or dropped. If an - * application assumes it will always get error on - * fsync, but does other operations on the fd before - * and the page is dropped between then the error - * will not be properly reported. - * - * This can already happen even without hwpoisoned - * pages: first on metadata IO errors (which only - * report through AS_EIO) or when the page is dropped - * at the wrong time. - * - * So right now we assume that the application DTRT on - * the first EIO, but we're not worse than other parts - * of the kernel. */ mapping_set_error(mapping, -EIO); } @@ -1141,7 +1136,7 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p) * Clean and dirty swap cache. * * Dirty swap cache page is tricky to handle. The page could live both in page - * cache and swap cache(ie. page is freshly swapped in). So it could be + * table and swap cache(ie. page is freshly swapped in). So it could be * referenced concurrently by 2 types of PTEs: * normal PTEs and swap PTEs. We try to handle them consistently by calling * try_to_unmap(!TTU_HWPOISON) to convert the normal PTEs to swap PTEs, @@ -1429,6 +1424,8 @@ static int __get_hwpoison_page(struct page *page, unsigned long flags) return 0; } +#define GET_PAGE_MAX_RETRY_NUM 3 + static int get_any_page(struct page *p, unsigned long flags) { int ret = 0, pass = 0; @@ -1443,12 +1440,12 @@ try_again: if (!ret) { if (page_count(p)) { /* We raced with an allocation, retry. */ - if (pass++ < 3) + if (pass++ < GET_PAGE_MAX_RETRY_NUM) goto try_again; ret = -EBUSY; } else if (!PageHuge(p) && !is_free_buddy_page(p)) { /* We raced with put_page, retry. */ - if (pass++ < 3) + if (pass++ < GET_PAGE_MAX_RETRY_NUM) goto try_again; ret = -EIO; } @@ -1474,7 +1471,7 @@ try_again: * A page we cannot handle. Check whether we can turn * it into something we can handle. */ - if (pass++ < 3) { + if (pass++ < GET_PAGE_MAX_RETRY_NUM) { put_page(p); shake_page(p); count_increased = false; @@ -1536,7 +1533,7 @@ static int __get_unpoison_page(struct page *page) * the given page has PG_hwpoison. So it's never reused for other page * allocations, and __get_unpoison_page() never races with them. * - * Return: 0 on failure, + * Return: 0 on failure or free buddy (hugetlb) page, * 1 on success for in-use pages in a well-defined state, * -EIO for pages on which we can not handle memory errors, * -EBUSY when get_hwpoison_page() has raced with page lifecycle @@ -1585,7 +1582,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p, * This check implies we don't kill processes if their pages * are in the swap cache early. Those are always late kills. */ - if (!page_mapped(p)) + if (!folio_mapped(folio)) return true; if (folio_test_swapcache(folio)) { @@ -1636,10 +1633,10 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p, try_to_unmap(folio, ttu); } - unmap_success = !page_mapped(p); + unmap_success = !folio_mapped(folio); if (!unmap_success) pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n", - pfn, folio_mapcount(page_folio(p))); + pfn, folio_mapcount(folio)); /* * try_to_unmap() might put mlocked page in lru cache, so call @@ -1660,7 +1657,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p, */ forcekill = folio_test_dirty(folio) || (flags & MF_MUST_KILL) || !unmap_success; - kill_procs(&tokill, forcekill, !unmap_success, pfn, flags); + kill_procs(&tokill, forcekill, pfn, flags); return unmap_success; } @@ -1688,7 +1685,12 @@ static int identify_page_state(unsigned long pfn, struct page *p, return page_action(ps, p, pfn); } -static int try_to_split_thp_page(struct page *page) +/* + * When 'release' is 'false', it means that if thp split has failed, + * there is still more to do, hence the page refcount we took earlier + * is still needed. + */ +static int try_to_split_thp_page(struct page *page, bool release) { int ret; @@ -1696,7 +1698,7 @@ static int try_to_split_thp_page(struct page *page) ret = split_huge_page(page); unlock_page(page); - if (unlikely(ret)) + if (ret && release) put_page(page); return ret; @@ -1724,7 +1726,7 @@ static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn, unmap_mapping_range(mapping, start, size, 0); } - kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags); + kill_procs(to_kill, flags & MF_MUST_KILL, pfn, flags); } /* @@ -1912,7 +1914,7 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) { struct llist_head *head; struct raw_hwp_page *raw_hwp; - struct raw_hwp_page *p, *next; + struct raw_hwp_page *p; int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0; /* @@ -1923,7 +1925,7 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page) if (folio_test_hugetlb_raw_hwp_unreliable(folio)) return -EHWPOISON; head = raw_hwp_list_head(folio); - llist_for_each_entry_safe(p, next, head->first, node) { + llist_for_each_entry(p, head->first, node) { if (p->page == page) return -EHWPOISON; } @@ -2062,6 +2064,7 @@ retry: if (flags & MF_ACTION_REQUIRED) { folio = page_folio(p); res = kill_accessing_process(current, folio_pfn(folio), flags); + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED); } return res; } else if (res == -EBUSY) { @@ -2069,7 +2072,7 @@ retry: flags |= MF_NO_RETRY; goto retry; } - return action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); + return action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED); } folio = page_folio(p); @@ -2104,7 +2107,7 @@ retry: if (!hwpoison_user_mappings(folio, p, pfn, flags)) { folio_unlock(folio); - return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); + return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED); } return identify_page_state(pfn, p, page_flags); @@ -2125,14 +2128,10 @@ static inline unsigned long folio_free_raw_hwp(struct folio *folio, bool flag) /* Drop the extra refcount in case we come from madvise() */ static void put_ref_page(unsigned long pfn, int flags) { - struct page *page; - if (!(flags & MF_COUNT_INCREASED)) return; - page = pfn_to_page(pfn); - if (page) - put_page(page); + put_page(pfn_to_page(pfn)); } static int memory_failure_dev_pagemap(unsigned long pfn, int flags, @@ -2167,6 +2166,22 @@ out: return rc; } +/* + * The calling condition is as such: thp split failed, page might have + * been RDMA pinned, not much can be done for recovery. + * But a SIGBUS should be delivered with vaddr provided so that the user + * application has a chance to recover. Also, application processes' + * election for MCE early killed will be honored. + */ +static void kill_procs_now(struct page *p, unsigned long pfn, int flags, + struct folio *folio) +{ + LIST_HEAD(tokill); + + collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED); + kill_procs(&tokill, true, pfn, flags); +} + /** * memory_failure - Handle memory failure of a page. * @pfn: Page Number of the corrupted page @@ -2238,6 +2253,7 @@ try_again: res = kill_accessing_process(current, pfn, flags); if (flags & MF_COUNT_INCREASED) put_page(p); + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED); goto unlock_mutex; } @@ -2274,12 +2290,24 @@ try_again: } goto unlock_mutex; } else if (res < 0) { - res = action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); + res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED); goto unlock_mutex; } } folio = page_folio(p); + + /* filter pages that are protected from hwpoison test by users */ + folio_lock(folio); + if (hwpoison_filter(p)) { + ClearPageHWPoison(p); + folio_unlock(folio); + folio_put(folio); + res = -EOPNOTSUPP; + goto unlock_mutex; + } + folio_unlock(folio); + if (folio_test_large(folio)) { /* * The flag must be set after the refcount is bumped @@ -2295,8 +2323,11 @@ try_again: * page is a valid handlable page. */ folio_set_has_hwpoisoned(folio); - if (try_to_split_thp_page(p) < 0) { - res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); + if (try_to_split_thp_page(p, false) < 0) { + res = -EHWPOISON; + kill_procs_now(p, pfn, flags, folio); + put_page(p); + action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED); goto unlock_mutex; } VM_BUG_ON_PAGE(!page_count(p), p); @@ -2317,22 +2348,10 @@ try_again: /* * We're only intended to deal with the non-Compound page here. - * However, the page could have changed compound pages due to - * race window. If this happens, we could try again to hopefully - * handle the page next round. + * The page cannot become compound pages again as folio has been + * splited and extra refcnt is held. */ - if (folio_test_large(folio)) { - if (retry) { - ClearPageHWPoison(p); - folio_unlock(folio); - folio_put(folio); - flags &= ~MF_COUNT_INCREASED; - retry = false; - goto try_again; - } - res = action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED); - goto unlock_page; - } + WARN_ON(folio_test_large(folio)); /* * We use page flags to determine what action should be taken, but @@ -2343,14 +2362,6 @@ try_again: */ page_flags = folio->flags; - if (hwpoison_filter(p)) { - ClearPageHWPoison(p); - folio_unlock(folio); - folio_put(folio); - res = -EOPNOTSUPP; - goto unlock_mutex; - } - /* * __munlock_folio() may clear a writeback folio's LRU flag without * the folio lock. We need to wait for writeback completion for this @@ -2370,7 +2381,7 @@ try_again: * Abort on fail: __filemap_remove_folio() assumes unmapped page. */ if (!hwpoison_user_mappings(folio, p, pfn, flags)) { - res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); + res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED); goto unlock_page; } @@ -2502,7 +2513,7 @@ static int __init memory_failure_init(void) core_initcall(memory_failure_init); #undef pr_fmt -#define pr_fmt(fmt) "" fmt +#define pr_fmt(fmt) "Unpoison: " fmt #define unpoison_pr_info(fmt, pfn, rs) \ ({ \ if (__ratelimit(rs)) \ @@ -2526,7 +2537,7 @@ int unpoison_memory(unsigned long pfn) struct folio *folio; struct page *p; int ret = -EBUSY, ghp; - unsigned long count = 1; + unsigned long count; bool huge = false; static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); @@ -2540,27 +2551,27 @@ int unpoison_memory(unsigned long pfn) mutex_lock(&mf_mutex); if (hw_memory_failure) { - unpoison_pr_info("Unpoison: Disabled after HW memory failure %#lx\n", + unpoison_pr_info("%#lx: disabled after HW memory failure\n", pfn, &unpoison_rs); ret = -EOPNOTSUPP; goto unlock_mutex; } if (is_huge_zero_folio(folio)) { - unpoison_pr_info("Unpoison: huge zero page is not supported %#lx\n", + unpoison_pr_info("%#lx: huge zero page is not supported\n", pfn, &unpoison_rs); ret = -EOPNOTSUPP; goto unlock_mutex; } if (!PageHWPoison(p)) { - unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n", + unpoison_pr_info("%#lx: page was already unpoisoned\n", pfn, &unpoison_rs); goto unlock_mutex; } if (folio_ref_count(folio) > 1) { - unpoison_pr_info("Unpoison: Someone grabs the hwpoison page %#lx\n", + unpoison_pr_info("%#lx: someone grabs the hwpoison page\n", pfn, &unpoison_rs); goto unlock_mutex; } @@ -2569,18 +2580,14 @@ int unpoison_memory(unsigned long pfn) folio_test_reserved(folio) || folio_test_offline(folio)) goto unlock_mutex; - /* - * Note that folio->_mapcount is overloaded in SLAB, so the simple test - * in folio_mapped() has to be done after folio_test_slab() is checked. - */ if (folio_mapped(folio)) { - unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n", + unpoison_pr_info("%#lx: someone maps the hwpoison page\n", pfn, &unpoison_rs); goto unlock_mutex; } if (folio_mapping(folio)) { - unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n", + unpoison_pr_info("%#lx: the hwpoison page has non-NULL mapping\n", pfn, &unpoison_rs); goto unlock_mutex; } @@ -2599,7 +2606,7 @@ int unpoison_memory(unsigned long pfn) ret = put_page_back_buddy(p) ? 0 : -EBUSY; } else { ret = ghp; - unpoison_pr_info("Unpoison: failed to grab page %#lx\n", + unpoison_pr_info("%#lx: failed to grab page\n", pfn, &unpoison_rs); } } else { @@ -2624,13 +2631,16 @@ unlock_mutex: if (!ret) { if (!huge) num_poisoned_pages_sub(pfn, 1); - unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n", + unpoison_pr_info("%#lx: software-unpoisoned page\n", page_to_pfn(p), &unpoison_rs); } return ret; } EXPORT_SYMBOL(unpoison_memory); +#undef pr_fmt +#define pr_fmt(fmt) "Soft offline: " fmt + static bool mf_isolate_folio(struct folio *folio, struct list_head *pagelist) { bool isolated = false; @@ -2685,8 +2695,8 @@ static int soft_offline_in_use_page(struct page *page) }; if (!huge && folio_test_large(folio)) { - if (try_to_split_thp_page(page)) { - pr_info("soft offline: %#lx: thp split failed\n", pfn); + if (try_to_split_thp_page(page, true)) { + pr_info("%#lx: thp split failed\n", pfn); return -EBUSY; } folio = page_folio(page); @@ -2698,7 +2708,7 @@ static int soft_offline_in_use_page(struct page *page) if (PageHWPoison(page)) { folio_unlock(folio); folio_put(folio); - pr_info("soft offline: %#lx page already poisoned\n", pfn); + pr_info("%#lx: page already poisoned\n", pfn); return 0; } @@ -2711,7 +2721,7 @@ static int soft_offline_in_use_page(struct page *page) folio_unlock(folio); if (ret) { - pr_info("soft_offline: %#lx: invalidated\n", pfn); + pr_info("%#lx: invalidated\n", pfn); page_handle_poison(page, false, true); return 0; } @@ -2728,13 +2738,13 @@ static int soft_offline_in_use_page(struct page *page) if (!list_empty(&pagelist)) putback_movable_pages(&pagelist); - pr_info("soft offline: %#lx: %s migration failed %ld, type %pGp\n", + pr_info("%#lx: %s migration failed %ld, type %pGp\n", pfn, msg_page[huge], ret, &page->flags); if (ret > 0) ret = -EBUSY; } } else { - pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %pGp\n", + pr_info("%#lx: %s isolation failed, page count %d, type %pGp\n", pfn, msg_page[huge], page_count(page), &page->flags); ret = -EBUSY; } @@ -2746,8 +2756,9 @@ static int soft_offline_in_use_page(struct page *page) * @pfn: pfn to soft-offline * @flags: flags. Same as memory_failure(). * - * Returns 0 on success - * -EOPNOTSUPP for hwpoison_filter() filtered the error event + * Returns 0 on success, + * -EOPNOTSUPP for hwpoison_filter() filtered the error event, or + * disabled by /proc/sys/vm/enable_soft_offline, * < 0 otherwise negated errno. * * Soft offline a page, by migration or invalidation, @@ -2783,10 +2794,16 @@ int soft_offline_page(unsigned long pfn, int flags) return -EIO; } + if (!sysctl_enable_soft_offline) { + pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n"); + put_ref_page(pfn, flags); + return -EOPNOTSUPP; + } + mutex_lock(&mf_mutex); if (PageHWPoison(page)) { - pr_info("%s: %#lx page already poisoned\n", __func__, pfn); + pr_info("%#lx: page already poisoned\n", pfn); put_ref_page(pfn, flags); mutex_unlock(&mf_mutex); return 0; diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 6632102bd5c9..4775b3a3dabe 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -43,6 +43,7 @@ static LIST_HEAD(memory_tiers); static LIST_HEAD(default_memory_types); static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; struct memory_dev_type *default_dram_type; +nodemask_t default_dram_nodes __initdata = NODE_MASK_NONE; static const struct bus_type memory_tier_subsys = { .name = "memory_tiering", @@ -671,28 +672,35 @@ EXPORT_SYMBOL_GPL(mt_put_memory_types); /* * This is invoked via `late_initcall()` to initialize memory tiers for - * CPU-less memory nodes after driver initialization, which is - * expected to provide `adistance` algorithms. + * memory nodes, both with and without CPUs. After the initialization of + * firmware and devices, adistance algorithms are expected to be provided. */ static int __init memory_tier_late_init(void) { int nid; + struct memory_tier *memtier; + get_online_mems(); guard(mutex)(&memory_tier_lock); + + /* Assign each uninitialized N_MEMORY node to a memory tier. */ for_each_node_state(nid, N_MEMORY) { /* - * Some device drivers may have initialized memory tiers - * between `memory_tier_init()` and `memory_tier_late_init()`, - * potentially bringing online memory nodes and - * configuring memory tiers. Exclude them here. + * Some device drivers may have initialized + * memory tiers, potentially bringing memory nodes + * online and configuring memory tiers. + * Exclude them here. */ if (node_memory_types[nid].memtype) continue; - set_node_memory_tier(nid); + memtier = set_node_memory_tier(nid); + if (IS_ERR(memtier)) + continue; } establish_demotion_targets(); + put_online_mems(); return 0; } @@ -875,8 +883,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self, static int __init memory_tier_init(void) { - int ret, node; - struct memory_tier *memtier; + int ret; ret = subsys_virtual_register(&memory_tier_subsys, NULL); if (ret) @@ -887,7 +894,8 @@ static int __init memory_tier_init(void) GFP_KERNEL); WARN_ON(!node_demotion); #endif - mutex_lock(&memory_tier_lock); + + guard(mutex)(&memory_tier_lock); /* * For now we can have 4 faster memory tiers with smaller adistance * than default DRAM tier. @@ -897,29 +905,9 @@ static int __init memory_tier_init(void) if (IS_ERR(default_dram_type)) panic("%s() failed to allocate default DRAM tier\n", __func__); - /* - * Look at all the existing N_MEMORY nodes and add them to - * default memory tier or to a tier if we already have memory - * types assigned. - */ - for_each_node_state(node, N_MEMORY) { - if (!node_state(node, N_CPU)) - /* - * Defer memory tier initialization on - * CPUless numa nodes. These will be initialized - * after firmware and devices are initialized. - */ - continue; - - memtier = set_node_memory_tier(node); - if (IS_ERR(memtier)) - /* - * Continue with memtiers we are able to setup - */ - break; - } - establish_demotion_targets(); - mutex_unlock(&memory_tier_lock); + /* Record nodes with memory and CPU to set default DRAM performance. */ + nodes_and(default_dram_nodes, node_states[N_MEMORY], + node_states[N_CPU]); hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI); return 0; diff --git a/mm/memory.c b/mm/memory.c index d10e616d7389..4bcd79619574 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -365,6 +365,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling, bool mm_wr_locked) { + struct unlink_vma_file_batch vb; + do { unsigned long addr = vma->vm_start; struct vm_area_struct *next; @@ -384,12 +386,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, if (mm_wr_locked) vma_start_write(vma); unlink_anon_vmas(vma); - unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { + unlink_file_vma(vma); hugetlb_free_pgd_range(tlb, addr, vma->vm_end, floor, next ? next->vm_start : ceiling); } else { + unlink_file_vma_batch_init(&vb); + unlink_file_vma_batch_add(&vb, vma); + /* * Optimization: gather nearby vmas into one call down */ @@ -402,8 +407,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, if (mm_wr_locked) vma_start_write(vma); unlink_anon_vmas(vma); - unlink_file_vma(vma); + unlink_file_vma_batch_add(&vb, vma); } + unlink_file_vma_batch_final(&vb); free_pgd_range(tlb, addr, vma->vm_end, floor, next ? next->vm_start : ceiling); } @@ -575,10 +581,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, * VM_MIXEDMAP mappings can likewise contain memory with or without "struct * page" backing, however the difference is that _all_ pages with a struct * page (that is, those where pfn_valid is true) are refcounted and considered - * normal pages by the VM. The disadvantage is that pages are refcounted - * (which can be slower and simply not an option for some PFNMAP users). The - * advantage is that we don't have to follow the strict linearity rule of - * PFNMAP mappings in order to support COWable mappings. + * normal pages by the VM. The only exception are zeropages, which are + * *never* refcounted. + * + * The disadvantage is that pages are refcounted (which can be slower and + * simply not an option for some PFNMAP users). The advantage is that we + * don't have to follow the strict linearity rule of PFNMAP mappings in + * order to support COWable mappings. * */ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, @@ -616,6 +625,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, if (vma->vm_flags & VM_MIXEDMAP) { if (!pfn_valid(pfn)) return NULL; + if (is_zero_pfn(pfn)) + return NULL; goto out; } else { unsigned long off; @@ -641,6 +652,7 @@ check_pfn: * eg. VDSO mappings can cause them to exist. */ out: + VM_WARN_ON_ONCE(is_zero_pfn(pfn)); return pfn_to_page(pfn); } @@ -918,7 +930,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma *prealloc = NULL; copy_user_highpage(&new_folio->page, page, addr, src_vma); __folio_mark_uptodate(new_folio); - folio_add_new_anon_rmap(new_folio, dst_vma, addr); + folio_add_new_anon_rmap(new_folio, dst_vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, dst_vma); rss[MM_ANONPAGES]++; @@ -1977,10 +1989,48 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, return pte_alloc_map_lock(mm, pmd, addr, ptl); } -static int validate_page_before_insert(struct page *page) +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma) +{ + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP); + /* + * Whoever wants to forbid the zeropage after some zeropages + * might already have been mapped has to scan the page tables and + * bail out on any zeropages. Zeropages in COW mappings can + * be unshared using FAULT_FLAG_UNSHARE faults. + */ + if (mm_forbids_zeropage(vma->vm_mm)) + return false; + /* zeropages in COW mappings are common and unproblematic. */ + if (is_cow_mapping(vma->vm_flags)) + return true; + /* Mappings that do not allow for writable PTEs are unproblematic. */ + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) + return true; + /* + * Why not allow any VMA that has vm_ops->pfn_mkwrite? GUP could + * find the shared zeropage and longterm-pin it, which would + * be problematic as soon as the zeropage gets replaced by a different + * page due to vma->vm_ops->pfn_mkwrite, because what's mapped would + * now differ to what GUP looked up. FSDAX is incompatible to + * FOLL_LONGTERM and VM_IO is incompatible to GUP completely (see + * check_vma_flags). + */ + return vma->vm_ops && vma->vm_ops->pfn_mkwrite && + (vma_is_fsdax(vma) || vma->vm_flags & VM_IO); +} + +static int validate_page_before_insert(struct vm_area_struct *vma, + struct page *page) { struct folio *folio = page_folio(page); + if (!folio_ref_count(folio)) + return -EINVAL; + if (unlikely(is_zero_folio(folio))) { + if (!vm_mixed_zeropage_allowed(vma)) + return -EINVAL; + return 0; + } if (folio_test_anon(folio) || folio_test_slab(folio) || page_has_type(page)) return -EINVAL; @@ -1992,24 +2042,23 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { struct folio *folio = page_folio(page); + pte_t pteval; if (!pte_none(ptep_get(pte))) return -EBUSY; /* Ok, finally just insert the thing.. */ - folio_get(folio); - inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); - folio_add_file_rmap_pte(folio, page, vma); - set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); + pteval = mk_pte(page, prot); + if (unlikely(is_zero_folio(folio))) { + pteval = pte_mkspecial(pteval); + } else { + folio_get(folio); + inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); + folio_add_file_rmap_pte(folio, page, vma); + } + set_pte_at(vma->vm_mm, addr, pte, pteval); return 0; } -/* - * This is the old fallback for page remapping. - * - * For historical reasons, it only allows reserved pages. Only - * old drivers should use this, and they needed to mark their - * pages reserved for the old functions anyway. - */ static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot) { @@ -2017,7 +2066,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, pte_t *pte; spinlock_t *ptl; - retval = validate_page_before_insert(page); + retval = validate_page_before_insert(vma, page); if (retval) goto out; retval = -ENOMEM; @@ -2035,9 +2084,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, { int err; - if (!page_count(page)) - return -EINVAL; - err = validate_page_before_insert(page); + err = validate_page_before_insert(vma, page); if (err) return err; return insert_page_into_pte_locked(vma, pte, addr, page, prot); @@ -2143,7 +2190,8 @@ EXPORT_SYMBOL(vm_insert_pages); * @page: source kernel page * * This allows drivers to insert individual pages they've allocated - * into a user vma. + * into a user vma. The zeropage is supported in some VMAs, + * see vm_mixed_zeropage_allowed(). * * The page has to be a nice clean _individual_ kernel allocation. * If you allocate a compound page, you need to have marked it as @@ -2170,8 +2218,6 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, { if (addr < vma->vm_start || addr >= vma->vm_end) return -EFAULT; - if (!page_count(page)) - return -EINVAL; if (!(vma->vm_flags & VM_MIXEDMAP)) { BUG_ON(mmap_read_trylock(vma->vm_mm)); BUG_ON(vma->vm_flags & VM_PFNMAP); @@ -2189,6 +2235,8 @@ EXPORT_SYMBOL(vm_insert_page); * @offset: user's requested vm_pgoff * * This allows drivers to map range of kernel pages into a user vma. + * The zeropage is supported in some VMAs, see + * vm_mixed_zeropage_allowed(). * * Return: 0 on success and error code otherwise. */ @@ -2404,8 +2452,11 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vmf_insert_pfn); -static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn) +static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite) { + if (unlikely(is_zero_pfn(pfn_t_to_pfn(pfn))) && + (mkwrite || !vm_mixed_zeropage_allowed(vma))) + return false; /* these checks mirror the abort conditions in vm_normal_page */ if (vma->vm_flags & VM_MIXEDMAP) return true; @@ -2424,7 +2475,8 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, pgprot_t pgprot = vma->vm_page_prot; int err; - BUG_ON(!vm_mixed_ok(vma, pfn)); + if (!vm_mixed_ok(vma, pfn, mkwrite)) + return VM_FAULT_SIGBUS; if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; @@ -2481,7 +2533,6 @@ vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma, { return __vm_insert_mixed(vma, addr, pfn, true); } -EXPORT_SYMBOL(vmf_insert_mixed_mkwrite); /* * maps a range of physical memory into the requested pages. the old @@ -2970,10 +3021,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, unsigned long addr = vmf->address; if (likely(src)) { - if (copy_mc_user_highpage(dst, src, addr, vma)) { - memory_failure_queue(page_to_pfn(src), 0); + if (copy_mc_user_highpage(dst, src, addr, vma)) return -EHWPOISON; - } return 0; } @@ -3172,6 +3221,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio) pte_t entry; VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE)); + VM_WARN_ON(is_zero_pfn(pte_pfn(vmf->orig_pte))); if (folio) { VM_BUG_ON(folio_test_anon(folio) && @@ -3349,7 +3399,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * some TLBs while the old PTE remains in others. */ ptep_clear_flush(vma, vmf->address, vmf->pte); - folio_add_new_anon_rmap(new_folio, vma, vmf->address); + folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, vma); BUG_ON(unshare && pte_write(entry)); set_pte_at(mm, vmf->address, vmf->pte, entry); @@ -3866,7 +3916,7 @@ static inline bool should_try_to_free_swap(struct folio *folio, * reference only in case it's likely that we'll be the exlusive user. */ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) == 2; + folio_ref_count(folio) == (1 + folio_nr_pages(folio)); } static vm_fault_t pte_marker_clear(struct vm_fault *vmf) @@ -3957,6 +4007,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte_t pte; vm_fault_t ret = 0; void *shadow = NULL; + int nr_pages; + unsigned long page_idx; + unsigned long address; + pte_t *ptep; if (!pte_unmap_same(vmf)) goto out; @@ -4058,7 +4112,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* To provide entry to swap_read_folio() */ folio->swap = entry; - swap_read_folio(folio, true, NULL); + swap_read_folio(folio, NULL); folio->private = NULL; } } else { @@ -4155,6 +4209,38 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + nr_pages = 1; + page_idx = 0; + address = vmf->address; + ptep = vmf->pte; + if (folio_test_large(folio) && folio_test_swapcache(folio)) { + int nr = folio_nr_pages(folio); + unsigned long idx = folio_page_idx(folio, page); + unsigned long folio_start = address - idx * PAGE_SIZE; + unsigned long folio_end = folio_start + nr * PAGE_SIZE; + pte_t *folio_ptep; + pte_t folio_pte; + + if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) + goto check_folio; + if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) + goto check_folio; + + folio_ptep = vmf->pte - idx; + folio_pte = ptep_get(folio_ptep); + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || + swap_pte_batch(folio_ptep, nr, folio_pte) != nr) + goto check_folio; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + nr_pages = nr; + entry = folio->swap; + page = &folio->page; + } + +check_folio: /* * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte * must never point at an anonymous page in the swapcache that is @@ -4214,13 +4300,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * We're already holding a reference on the page but haven't mapped it * yet. */ - swap_free(entry); + swap_free_nr(entry, nr_pages); if (should_try_to_free_swap(folio, vma, vmf->flags)) folio_free_swap(folio); - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte = mk_pte(page, vma->vm_page_prot); + if (pte_swp_soft_dirty(vmf->orig_pte)) + pte = pte_mksoft_dirty(pte); + if (pte_swp_uffd_wp(vmf->orig_pte)) + pte = pte_mkuffd_wp(pte); /* * Same logic as in do_wp_page(); however, optimize for pages that are @@ -4230,32 +4320,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { - if (vmf->flags & FAULT_FLAG_WRITE) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; + if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && + !pte_needs_soft_dirty_wp(vma, pte)) { + pte = pte_mkwrite(pte, vma); + if (vmf->flags & FAULT_FLAG_WRITE) { + pte = pte_mkdirty(pte); + vmf->flags &= ~FAULT_FLAG_WRITE; + } } rmap_flags |= RMAP_EXCLUSIVE; } - flush_icache_page(vma, page); - if (pte_swp_soft_dirty(vmf->orig_pte)) - pte = pte_mksoft_dirty(pte); - if (pte_swp_uffd_wp(vmf->orig_pte)) - pte = pte_mkuffd_wp(pte); - vmf->orig_pte = pte; + folio_ref_add(folio, nr_pages - 1); + flush_icache_pages(vma, page, nr_pages); + vmf->orig_pte = pte_advance_pfn(pte, page_idx); /* ksm created a completely new copy */ if (unlikely(folio != swapcache && swapcache)) { - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); + } else if (!folio_test_anon(folio)) { + /* + * We currently only expect small !anon folios, which are either + * fully exclusive or fully shared. If we ever get large folios + * here, we have to be careful. + */ + VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { - folio_add_anon_rmap_pte(folio, page, vma, vmf->address, + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, rmap_flags); } VM_BUG_ON(!folio_test_anon(folio) || (pte_write(pte) && !PageAnonExclusive(page))); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); + set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); + arch_do_swap_page_nr(vma->vm_mm, vma, address, + pte, pte, nr_pages); folio_unlock(folio); if (folio != swapcache && swapcache) { @@ -4279,7 +4380,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + update_mmu_cache_range(vmf, vma, address, ptep, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -4384,7 +4485,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) goto next; } folio_throttle_swaprate(folio, gfp); - clear_huge_page(&folio->page, vmf->address, 1 << order); + folio_zero_user(folio, vmf->address); return folio; } next: @@ -4410,7 +4511,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vm_fault_t ret = 0; int nr_pages = 1; pte_t entry; - int i; /* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) @@ -4480,8 +4580,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) update_mmu_tlb(vma, addr, vmf->pte); goto release; } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { - for (i = 0; i < nr_pages; i++) - update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages); goto release; } @@ -4501,7 +4600,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) #ifdef CONFIG_TRANSPARENT_HUGEPAGE count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); #endif - folio_add_new_anon_rmap(folio, vma, addr); + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); setpte: if (vmf_orig_pte_uffd_wp(vmf)) @@ -4541,7 +4640,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) * lock_page(B) * lock_page(B) * pte_alloc_one - * shrink_page_list + * shrink_folio_list * wait_on_page_writeback(A) * SetPageWriteback(B) * unlock_page(B) @@ -4699,7 +4798,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio, /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { VM_BUG_ON_FOLIO(nr != 1, folio); - folio_add_new_anon_rmap(folio, vma, addr); + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else { folio_add_file_rmap_ptes(folio, page, nr, vma); @@ -4737,9 +4836,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page; + struct folio *folio; vm_fault_t ret; bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED); + int type, nr_pages; + unsigned long addr = vmf->address; /* Did we COW the page? */ if (is_cow) @@ -4770,24 +4872,62 @@ vm_fault_t finish_fault(struct vm_fault *vmf) return VM_FAULT_OOM; } + folio = page_folio(page); + nr_pages = folio_nr_pages(folio); + + /* + * Using per-page fault to maintain the uffd semantics, and same + * approach also applies to non-anonymous-shmem faults to avoid + * inflating the RSS of the process. + */ + if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) { + nr_pages = 1; + } else if (nr_pages > 1) { + pgoff_t idx = folio_page_idx(folio, page); + /* The page offset of vmf->address within the VMA. */ + pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff; + /* The index of the entry in the pagetable for fault page. */ + pgoff_t pte_off = pte_index(vmf->address); + + /* + * Fallback to per-page fault in case the folio size in page + * cache beyond the VMA limits and PMD pagetable limits. + */ + if (unlikely(vma_off < idx || + vma_off + (nr_pages - idx) > vma_pages(vma) || + pte_off < idx || + pte_off + (nr_pages - idx) > PTRS_PER_PTE)) { + nr_pages = 1; + } else { + /* Now we can set mappings for the whole large folio. */ + addr = vmf->address - idx * PAGE_SIZE; + page = &folio->page; + } + } + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, - vmf->address, &vmf->ptl); + addr, &vmf->ptl); if (!vmf->pte) return VM_FAULT_NOPAGE; /* Re-check under ptl */ - if (likely(!vmf_pte_changed(vmf))) { - struct folio *folio = page_folio(page); - int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); - - set_pte_range(vmf, folio, page, 1, vmf->address); - add_mm_counter(vma->vm_mm, type, 1); - ret = 0; - } else { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) { + update_mmu_tlb(vma, addr, vmf->pte); ret = VM_FAULT_NOPAGE; + goto unlock; + } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { + update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages); + ret = VM_FAULT_NOPAGE; + goto unlock; } + folio_ref_add(folio, nr_pages - 1); + set_pte_range(vmf, folio, page, nr_pages, addr); + type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); + add_mm_counter(vma->vm_mm, type, nr_pages); + ret = 0; + +unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; } @@ -5067,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf, { struct vm_area_struct *vma = vmf->vma; - folio_get(folio); - /* Record the current PID acceesing VMA */ vma_set_access_pid_bit(vma); @@ -5205,16 +5343,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) else last_cpupid = folio_last_cpupid(folio); target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags); - if (target_nid == NUMA_NO_NODE) { - folio_put(folio); + if (target_nid == NUMA_NO_NODE) + goto out_map; + if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { + flags |= TNF_MIGRATE_FAIL; goto out_map; } + /* The folio is isolated and isolation code holds a folio reference. */ pte_unmap_unlock(vmf->pte, vmf->ptl); writable = false; ignore_writable = true; /* Migrate to the requested node */ - if (migrate_misplaced_folio(folio, vma, target_nid)) { + if (!migrate_misplaced_folio(folio, vma, target_nid)) { nid = target_nid; flags |= TNF_MIGRATED; } else { @@ -6244,23 +6385,23 @@ EXPORT_SYMBOL(__might_fault); * cache lines hot. */ static inline int process_huge_page( - unsigned long addr_hint, unsigned int pages_per_huge_page, + unsigned long addr_hint, unsigned int nr_pages, int (*process_subpage)(unsigned long addr, int idx, void *arg), void *arg) { int i, n, base, l, ret; unsigned long addr = addr_hint & - ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); /* Process target subpage last to keep its cache lines hot */ might_sleep(); n = (addr_hint - addr) / PAGE_SIZE; - if (2 * n <= pages_per_huge_page) { + if (2 * n <= nr_pages) { /* If target subpage in first half of huge page */ base = 0; l = n; /* Process subpages at the end of huge page */ - for (i = pages_per_huge_page - 1; i >= 2 * n; i--) { + for (i = nr_pages - 1; i >= 2 * n; i--) { cond_resched(); ret = process_subpage(addr + i * PAGE_SIZE, i, arg); if (ret) @@ -6268,8 +6409,8 @@ static inline int process_huge_page( } } else { /* If target subpage in second half of huge page */ - base = pages_per_huge_page - 2 * (pages_per_huge_page - n); - l = pages_per_huge_page - n; + base = nr_pages - 2 * (nr_pages - n); + l = nr_pages - n; /* Process subpages at the begin of huge page */ for (i = 0; i < base; i++) { cond_resched(); @@ -6298,102 +6439,93 @@ static inline int process_huge_page( return 0; } -static void clear_gigantic_page(struct page *page, - unsigned long addr, - unsigned int pages_per_huge_page) +static void clear_gigantic_page(struct folio *folio, unsigned long addr, + unsigned int nr_pages) { int i; - struct page *p; might_sleep(); - for (i = 0; i < pages_per_huge_page; i++) { - p = nth_page(page, i); + for (i = 0; i < nr_pages; i++) { cond_resched(); - clear_user_highpage(p, addr + i * PAGE_SIZE); + clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); } } static int clear_subpage(unsigned long addr, int idx, void *arg) { - struct page *page = arg; + struct folio *folio = arg; - clear_user_highpage(nth_page(page, idx), addr); + clear_user_highpage(folio_page(folio, idx), addr); return 0; } -void clear_huge_page(struct page *page, - unsigned long addr_hint, unsigned int pages_per_huge_page) +/** + * folio_zero_user - Zero a folio which will be mapped to userspace. + * @folio: The folio to zero. + * @addr_hint: The address will be accessed or the base address if uncelar. + */ +void folio_zero_user(struct folio *folio, unsigned long addr_hint) { - unsigned long addr = addr_hint & - ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); + unsigned int nr_pages = folio_nr_pages(folio); - if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { - clear_gigantic_page(page, addr, pages_per_huge_page); - return; - } - - process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page); + if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) + clear_gigantic_page(folio, addr_hint, nr_pages); + else + process_huge_page(addr_hint, nr_pages, clear_subpage, folio); } static int copy_user_gigantic_page(struct folio *dst, struct folio *src, - unsigned long addr, - struct vm_area_struct *vma, - unsigned int pages_per_huge_page) + unsigned long addr, + struct vm_area_struct *vma, + unsigned int nr_pages) { int i; struct page *dst_page; struct page *src_page; - for (i = 0; i < pages_per_huge_page; i++) { + for (i = 0; i < nr_pages; i++) { dst_page = folio_page(dst, i); src_page = folio_page(src, i); cond_resched(); if (copy_mc_user_highpage(dst_page, src_page, - addr + i*PAGE_SIZE, vma)) { - memory_failure_queue(page_to_pfn(src_page), 0); + addr + i*PAGE_SIZE, vma)) return -EHWPOISON; - } } return 0; } struct copy_subpage_arg { - struct page *dst; - struct page *src; + struct folio *dst; + struct folio *src; struct vm_area_struct *vma; }; static int copy_subpage(unsigned long addr, int idx, void *arg) { struct copy_subpage_arg *copy_arg = arg; - struct page *dst = nth_page(copy_arg->dst, idx); - struct page *src = nth_page(copy_arg->src, idx); + struct page *dst = folio_page(copy_arg->dst, idx); + struct page *src = folio_page(copy_arg->src, idx); - if (copy_mc_user_highpage(dst, src, addr, copy_arg->vma)) { - memory_failure_queue(page_to_pfn(src), 0); + if (copy_mc_user_highpage(dst, src, addr, copy_arg->vma)) return -EHWPOISON; - } return 0; } int copy_user_large_folio(struct folio *dst, struct folio *src, unsigned long addr_hint, struct vm_area_struct *vma) { - unsigned int pages_per_huge_page = folio_nr_pages(dst); - unsigned long addr = addr_hint & - ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); + unsigned int nr_pages = folio_nr_pages(dst); struct copy_subpage_arg arg = { - .dst = &dst->page, - .src = &src->page, + .dst = dst, + .src = src, .vma = vma, }; - if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) - return copy_user_gigantic_page(dst, src, addr, vma, - pages_per_huge_page); + if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) + return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages); - return process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg); + return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg); } long copy_folio_from_user(struct folio *dst_folio, diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 431b1f6753c0..66267c26ca1b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -628,16 +628,10 @@ int restore_online_page_callback(online_page_callback_t callback) } EXPORT_SYMBOL_GPL(restore_online_page_callback); -void generic_online_page(struct page *page, unsigned int order) +/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ +void __ref generic_online_page(struct page *page, unsigned int order) { - /* - * Freeing the page with debug_pagealloc enabled will try to unmap it, - * so we should map it first. This is better than introducing a special - * case in page freeing fast path. - */ - debug_pagealloc_map_pages(page, 1 << order); - __free_pages_core(page, order); - totalram_pages_add(1UL << order); + __free_pages_core(page, order, MEMINIT_HOTPLUG); } EXPORT_SYMBOL_GPL(generic_online_page); @@ -741,7 +735,7 @@ static inline void section_taint_zone_device(unsigned long pfn) /* * Associate the pfn range with the given zone, initializing the memmaps * and resizing the pgdat/zone data to span the added pages. After this - * call, all affected pages are PG_reserved. + * call, all affected pages are PageOffline(). * * All aligned pageblocks are initialized to the specified migratetype * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related @@ -846,7 +840,6 @@ static bool auto_movable_can_online_movable(int nid, struct memory_group *group, unsigned long kernel_early_pages, movable_pages; struct auto_movable_group_stats group_stats = {}; struct auto_movable_stats stats = {}; - pg_data_t *pgdat = NODE_DATA(nid); struct zone *zone; int i; @@ -857,6 +850,8 @@ static bool auto_movable_can_online_movable(int nid, struct memory_group *group, auto_movable_stats_account_zone(&stats, zone); } else { for (i = 0; i < MAX_NR_ZONES; i++) { + pg_data_t *pgdat = NODE_DATA(nid); + zone = pgdat->node_zones + i; if (populated_zone(zone)) auto_movable_stats_account_zone(&stats, zone); @@ -1107,8 +1102,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); - for (i = 0; i < nr_pages; i++) - SetPageVmemmapSelfHosted(pfn_to_page(pfn + i)); + for (i = 0; i < nr_pages; i++) { + struct page *page = pfn_to_page(pfn + i); + + __ClearPageOffline(page); + SetPageVmemmapSelfHosted(page); + } /* * It might be that the vmemmap_pages fully span sections. If that is @@ -1731,8 +1730,8 @@ static int scan_movable_pages(unsigned long start, unsigned long end, unsigned long pfn; for (pfn = start; pfn < end; pfn++) { - struct page *page, *head; - unsigned long skip; + struct page *page; + struct folio *folio; if (!pfn_valid(pfn)) continue; @@ -1753,7 +1752,7 @@ static int scan_movable_pages(unsigned long start, unsigned long end, if (!PageHuge(page)) continue; - head = compound_head(page); + folio = page_folio(page); /* * This test is racy as we hold no reference or lock. The * hugetlb page could have been free'ed and head is no longer @@ -1761,10 +1760,9 @@ static int scan_movable_pages(unsigned long start, unsigned long end, * cases false positives and negatives are possible. Calling * code must deal with these scenarios. */ - if (HPageMigratable(head)) + if (folio_test_hugetlb_migratable(folio)) goto found; - skip = compound_nr(head) - (pfn - page_to_pfn(head)); - pfn += skip - 1; + pfn |= folio_nr_pages(folio) - 1; } return -ENOENT; found: @@ -1945,7 +1943,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { const unsigned long end_pfn = start_pfn + nr_pages; - unsigned long pfn, system_ram_pages = 0; + unsigned long pfn, managed_pages, system_ram_pages = 0; const int node = zone_to_nid(zone); unsigned long flags; struct memory_notify arg; @@ -1967,9 +1965,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, * Don't allow to offline memory blocks that contain holes. * Consequently, memory blocks with holes can never get onlined * via the hotplug path - online_pages() - as hotplugged memory has - * no holes. This way, we e.g., don't have to worry about marking - * memory holes PG_reserved, don't need pfn_valid() checks, and can - * avoid using walk_system_ram_range() later. + * no holes. This way, we don't have to worry about memory holes, + * don't need pfn_valid() checks, and can avoid using + * walk_system_ram_range() later. */ walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages, count_system_ram_pages_cb); @@ -2066,7 +2064,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, } while (ret); /* Mark all sections offline and remove free pages from the buddy. */ - __offline_isolated_pages(start_pfn, end_pfn); + managed_pages = __offline_isolated_pages(start_pfn, end_pfn); pr_debug("Offlined Pages %ld\n", nr_pages); /* @@ -2082,7 +2080,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, zone_pcp_enable(zone); /* removal success */ - adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); + adjust_managed_page_count(pfn_to_page(start_pfn), -managed_pages); adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages); /* reinitialise watermarks and update pcp limits */ @@ -2283,10 +2281,8 @@ static int __ref try_remove_memory(u64 start, u64 size) remove_memory_blocks_and_altmaps(start, size); } - if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { - memblock_phys_free(start, size); + if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) memblock_remove(start, size); - } release_mem_region_adjustable(start, size); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index aec756ae5637..327a19b0883d 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -624,7 +624,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, pte_t entry; ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); - entry = huge_ptep_get(pte); + entry = huge_ptep_get(walk->mm, addr, pte); if (!pte_present(entry)) { if (unlikely(is_hugetlb_entry_migration(entry))) qp->nr_failed++; @@ -1211,7 +1211,6 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src, struct migration_mpol *mmpol = (struct migration_mpol *)private; struct mempolicy *pol = mmpol->pol; pgoff_t ilx = mmpol->ilx; - struct page *page; unsigned int order; int nid = numa_node_id(); gfp_t gfp; @@ -1235,8 +1234,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src, else gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP; - page = alloc_pages_mpol(gfp, order, pol, ilx, nid); - return page_rmappable_folio(page); + return folio_alloc_mpol(gfp, order, pol, ilx, nid); } #else @@ -2277,6 +2275,13 @@ struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order, return page; } +struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *pol, pgoff_t ilx, int nid) +{ + return page_rmappable_folio(alloc_pages_mpol_noprof(gfp | __GFP_COMP, + order, pol, ilx, nid)); +} + /** * vma_alloc_folio - Allocate a folio for a VMA. * @gfp: GFP flags. @@ -2298,13 +2303,12 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct { struct mempolicy *pol; pgoff_t ilx; - struct page *page; + struct folio *folio; pol = get_vma_policy(vma, addr, order, &ilx); - page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order, - pol, ilx, numa_node_id()); + folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id()); mpol_cond_put(pol); - return page_rmappable_folio(page); + return folio; } EXPORT_SYMBOL(vma_alloc_folio_noprof); @@ -3293,8 +3297,9 @@ out: * @pol: pointer to mempolicy to be formatted * * Convert @pol into a string. If @buffer is too short, truncate the string. - * Recommend a @maxlen of at least 32 for the longest mode, "interleave", the - * longest flag, "relative", and to display at least a few node ids. + * Recommend a @maxlen of at least 51 for the longest mode, "weighted + * interleave", plus the longest flag flags, "relative|balancing", and to + * display at least a few node ids. */ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) { @@ -3303,7 +3308,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) unsigned short mode = MPOL_DEFAULT; unsigned short flags = 0; - if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) { + if (pol && + pol != &default_policy && + !(pol >= &preferred_node_policy[0] && + pol <= &preferred_node_policy[ARRAY_SIZE(preferred_node_policy) - 1])) { mode = pol->mode; flags = pol->flags; } @@ -3331,12 +3339,18 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) p += snprintf(p, buffer + maxlen - p, "="); /* - * Currently, the only defined flags are mutually exclusive + * Static and relative are mutually exclusive. */ if (flags & MPOL_F_STATIC_NODES) p += snprintf(p, buffer + maxlen - p, "static"); else if (flags & MPOL_F_RELATIVE_NODES) p += snprintf(p, buffer + maxlen - p, "relative"); + + if (flags & MPOL_F_NUMA_BALANCING) { + if (!is_power_of_2(flags & MPOL_MODE_FLAGS)) + p += snprintf(p, buffer + maxlen - p, "|"); + p += snprintf(p, buffer + maxlen - p, "balancing"); + } } if (!nodes_empty(nodes)) diff --git a/mm/migrate.c b/mm/migrate.c index ed3aac90cf4f..e7296c0fb5d5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -338,14 +338,14 @@ out: * * This function will release the vma lock before returning. */ -void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep) +void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep); pte_t pte; hugetlb_vma_assert_locked(vma); spin_lock(ptl); - pte = huge_ptep_get(ptep); + pte = huge_ptep_get(vma->vm_mm, addr, ptep); if (unlikely(!is_hugetlb_entry_migration(pte))) { spin_unlock(ptl); @@ -393,28 +393,23 @@ static int folio_expected_refs(struct address_space *mapping, } /* - * Replace the page in the mapping. + * Replace the folio in the mapping. * * The number of remaining references must be: - * 1 for anonymous pages without a mapping - * 2 for pages with a mapping - * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. + * 1 for anonymous folios without a mapping + * 2 for folios with a mapping + * 3 for folios with a mapping and PagePrivate/PagePrivate2 set. */ -int folio_migrate_mapping(struct address_space *mapping, - struct folio *newfolio, struct folio *folio, int extra_count) +static int __folio_migrate_mapping(struct address_space *mapping, + struct folio *newfolio, struct folio *folio, int expected_count) { XA_STATE(xas, &mapping->i_pages, folio_index(folio)); struct zone *oldzone, *newzone; int dirty; - int expected_count = folio_expected_refs(mapping, folio) + extra_count; long nr = folio_nr_pages(folio); long entries, i; if (!mapping) { - /* Anonymous page without mapping */ - if (folio_ref_count(folio) != expected_count) - return -EAGAIN; - /* Take off deferred split queue while frozen and memcg set */ if (folio_test_large(folio) && folio_test_large_rmappable(folio)) { @@ -443,8 +438,7 @@ int folio_migrate_mapping(struct address_space *mapping, } /* Take off deferred split queue while frozen and memcg set */ - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); /* * Now we know that no one else is looking at the folio: @@ -465,7 +459,7 @@ int folio_migrate_mapping(struct address_space *mapping, entries = 1; } - /* Move dirty while page refs frozen and newpage not yet exposed */ + /* Move dirty while folio refs frozen and newfolio not yet exposed */ dirty = folio_test_dirty(folio); if (dirty) { folio_clear_dirty(folio); @@ -479,7 +473,7 @@ int folio_migrate_mapping(struct address_space *mapping, } /* - * Drop cache reference from old page by unfreezing + * Drop cache reference from old folio by unfreezing * to one less reference. * We know this isn't the last reference. */ @@ -490,11 +484,11 @@ int folio_migrate_mapping(struct address_space *mapping, /* * If moved to a different zone then also account - * the page for that zone. Other VM counters will be + * the folio for that zone. Other VM counters will be * taken care of when we establish references to the - * new page and drop references to the old page. + * new folio and drop references to the old folio. * - * Note that anonymous pages are accounted for + * Note that anonymous folios are accounted for * via NR_FILE_PAGES and NR_ANON_MAPPED if they * are mapped to swap space. */ @@ -534,6 +528,17 @@ int folio_migrate_mapping(struct address_space *mapping, return MIGRATEPAGE_SUCCESS; } + +int folio_migrate_mapping(struct address_space *mapping, + struct folio *newfolio, struct folio *folio, int extra_count) +{ + int expected_count = folio_expected_refs(mapping, folio) + extra_count; + + if (folio_ref_count(folio) != expected_count) + return -EAGAIN; + + return __folio_migrate_mapping(mapping, newfolio, folio, expected_count); +} EXPORT_SYMBOL(folio_migrate_mapping); /* @@ -544,10 +549,16 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, struct folio *dst, struct folio *src) { XA_STATE(xas, &mapping->i_pages, folio_index(src)); - int expected_count; + int rc, expected_count = folio_expected_refs(mapping, src); + + if (folio_ref_count(src) != expected_count) + return -EAGAIN; + + rc = folio_mc_copy(dst, src); + if (unlikely(rc)) + return rc; xas_lock_irq(&xas); - expected_count = folio_expected_refs(mapping, src); if (!folio_ref_freeze(src, expected_count)) { xas_unlock_irq(&xas); return -EAGAIN; @@ -660,33 +671,32 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) } EXPORT_SYMBOL(folio_migrate_flags); -void folio_migrate_copy(struct folio *newfolio, struct folio *folio) -{ - folio_copy(newfolio, folio); - folio_migrate_flags(newfolio, folio); -} -EXPORT_SYMBOL(folio_migrate_copy); - /************************************************************ * Migration functions ***********************************************************/ -int migrate_folio_extra(struct address_space *mapping, struct folio *dst, - struct folio *src, enum migrate_mode mode, int extra_count) +static int __migrate_folio(struct address_space *mapping, struct folio *dst, + struct folio *src, void *src_private, + enum migrate_mode mode) { - int rc; + int rc, expected_count = folio_expected_refs(mapping, src); - BUG_ON(folio_test_writeback(src)); /* Writeback must be complete */ + /* Check whether src does not have extra refs before we do more work */ + if (folio_ref_count(src) != expected_count) + return -EAGAIN; - rc = folio_migrate_mapping(mapping, dst, src, extra_count); + rc = folio_mc_copy(dst, src); + if (unlikely(rc)) + return rc; + rc = __folio_migrate_mapping(mapping, dst, src, expected_count); if (rc != MIGRATEPAGE_SUCCESS) return rc; - if (mode != MIGRATE_SYNC_NO_COPY) - folio_migrate_copy(dst, src); - else - folio_migrate_flags(dst, src); + if (src_private) + folio_attach_private(dst, folio_detach_private(src)); + + folio_migrate_flags(dst, src); return MIGRATEPAGE_SUCCESS; } @@ -703,9 +713,10 @@ int migrate_folio_extra(struct address_space *mapping, struct folio *dst, * Folios are locked upon entry and exit. */ int migrate_folio(struct address_space *mapping, struct folio *dst, - struct folio *src, enum migrate_mode mode) + struct folio *src, enum migrate_mode mode) { - return migrate_folio_extra(mapping, dst, src, mode, 0); + BUG_ON(folio_test_writeback(src)); /* Writeback must be complete */ + return __migrate_folio(mapping, dst, src, NULL, mode); } EXPORT_SYMBOL(migrate_folio); @@ -790,24 +801,16 @@ recheck_buffers: } } - rc = folio_migrate_mapping(mapping, dst, src, 0); + rc = filemap_migrate_folio(mapping, dst, src, mode); if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; - folio_attach_private(dst, folio_detach_private(src)); - bh = head; do { folio_set_bh(bh, dst, bh_offset(bh)); bh = bh->b_this_page; } while (bh != head); - if (mode != MIGRATE_SYNC_NO_COPY) - folio_migrate_copy(dst, src); - else - folio_migrate_flags(dst, src); - - rc = MIGRATEPAGE_SUCCESS; unlock_buffers: if (check_refs) spin_unlock(&mapping->i_private_lock); @@ -867,20 +870,7 @@ EXPORT_SYMBOL_GPL(buffer_migrate_folio_norefs); int filemap_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode) { - int ret; - - ret = folio_migrate_mapping(mapping, dst, src, 0); - if (ret != MIGRATEPAGE_SUCCESS) - return ret; - - if (folio_get_private(src)) - folio_attach_private(dst, folio_detach_private(src)); - - if (mode != MIGRATE_SYNC_NO_COPY) - folio_migrate_copy(dst, src); - else - folio_migrate_flags(dst, src); - return MIGRATEPAGE_SUCCESS; + return __migrate_folio(mapping, dst, src, folio_get_private(src), mode); } EXPORT_SYMBOL_GPL(filemap_migrate_folio); @@ -935,7 +925,6 @@ static int fallback_migrate_folio(struct address_space *mapping, /* Only writeback folios in full synchronous migration */ switch (mode) { case MIGRATE_SYNC: - case MIGRATE_SYNC_NO_COPY: break; default: return -EBUSY; @@ -1193,7 +1182,6 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, */ switch (mode) { case MIGRATE_SYNC: - case MIGRATE_SYNC_NO_COPY: break; default: rc = -EBUSY; @@ -1404,7 +1392,6 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio, goto out; switch (mode) { case MIGRATE_SYNC: - case MIGRATE_SYNC_NO_COPY: break; default: goto out; @@ -2557,16 +2544,44 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src, return __folio_alloc_node(gfp, order, nid); } -static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) +/* + * Prepare for calling migrate_misplaced_folio() by isolating the folio if + * permitted. Must be called with the PTL still held. + */ +int migrate_misplaced_folio_prepare(struct folio *folio, + struct vm_area_struct *vma, int node) { int nr_pages = folio_nr_pages(folio); + pg_data_t *pgdat = NODE_DATA(node); + + if (folio_is_file_lru(folio)) { + /* + * Do not migrate file folios that are mapped in multiple + * processes with execute permissions as they are probably + * shared libraries. + * + * See folio_likely_mapped_shared() on possible imprecision + * when we cannot easily detect if a folio is shared. + */ + if ((vma->vm_flags & VM_EXEC) && + folio_likely_mapped_shared(folio)) + return -EACCES; + + /* + * Do not migrate dirty folios as not all filesystems can move + * dirty folios in MIGRATE_ASYNC mode which is a waste of + * cycles. + */ + if (folio_test_dirty(folio)) + return -EAGAIN; + } /* Avoid migrating to a node that is nearly full */ if (!migrate_balanced_pgdat(pgdat, nr_pages)) { int z; if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)) - return 0; + return -EAGAIN; for (z = pgdat->nr_zones - 1; z >= 0; z--) { if (managed_zone(pgdat->node_zones + z)) break; @@ -2577,78 +2592,42 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) * further. */ if (z < 0) - return 0; + return -EAGAIN; wakeup_kswapd(pgdat->node_zones + z, 0, folio_order(folio), ZONE_MOVABLE); - return 0; + return -EAGAIN; } if (!folio_isolate_lru(folio)) - return 0; + return -EAGAIN; node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), nr_pages); - - /* - * Isolating the folio has taken another reference, so the - * caller's reference can be safely dropped without the folio - * disappearing underneath us during migration. - */ - folio_put(folio); - return 1; + return 0; } /* * Attempt to migrate a misplaced folio to the specified destination - * node. Caller is expected to have an elevated reference count on - * the folio that will be dropped by this function before returning. + * node. Caller is expected to have isolated the folio by calling + * migrate_misplaced_folio_prepare(), which will result in an + * elevated reference count on the folio. This function will un-isolate the + * folio, dereferencing the folio before returning. */ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int node) { pg_data_t *pgdat = NODE_DATA(node); - int isolated; int nr_remaining; unsigned int nr_succeeded; LIST_HEAD(migratepages); - int nr_pages = folio_nr_pages(folio); - - /* - * Don't migrate file folios that are mapped in multiple processes - * with execute permissions as they are probably shared libraries. - * - * See folio_likely_mapped_shared() on possible imprecision when we - * cannot easily detect if a folio is shared. - */ - if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) && - (vma->vm_flags & VM_EXEC)) - goto out; - - /* - * Also do not migrate dirty folios as not all filesystems can move - * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles. - */ - if (folio_is_file_lru(folio) && folio_test_dirty(folio)) - goto out; - - isolated = numamigrate_isolate_folio(pgdat, folio); - if (!isolated) - goto out; list_add(&folio->lru, &migratepages); nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio, NULL, node, MIGRATE_ASYNC, MR_NUMA_MISPLACED, &nr_succeeded); - if (nr_remaining) { - if (!list_empty(&migratepages)) { - list_del(&folio->lru); - node_stat_mod_folio(folio, NR_ISOLATED_ANON + - folio_is_file_lru(folio), -nr_pages); - folio_putback_lru(folio); - } - isolated = 0; - } + if (nr_remaining && !list_empty(&migratepages)) + putback_movable_pages(&migratepages); if (nr_succeeded) { count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node)) @@ -2656,11 +2635,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, nr_succeeded); } BUG_ON(!list_empty(&migratepages)); - return isolated; - -out: - folio_put(folio); - return 0; + return nr_remaining ? -EAGAIN : 0; } #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_NUMA */ diff --git a/mm/migrate_device.c b/mm/migrate_device.c index aecc71972a87..6d66dc1c6ffa 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -658,7 +658,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto unlock_abort; inc_mm_counter(mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, addr); + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); if (!folio_is_zone_device(folio)) folio_add_lru_vma(folio, vma); folio_get(folio); @@ -692,8 +692,8 @@ static void __migrate_device_pages(unsigned long *src_pfns, struct page *newpage = migrate_pfn_to_page(dst_pfns[i]); struct page *page = migrate_pfn_to_page(src_pfns[i]); struct address_space *mapping; - struct folio *folio; - int r; + struct folio *newfolio, *folio; + int r, extra_cnt = 0; if (!newpage) { src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; @@ -727,11 +727,12 @@ static void __migrate_device_pages(unsigned long *src_pfns, continue; } + newfolio = page_folio(newpage); folio = page_folio(page); mapping = folio_mapping(folio); - if (is_device_private_page(newpage) || - is_device_coherent_page(newpage)) { + if (folio_is_device_private(newfolio) || + folio_is_device_coherent(newfolio)) { if (mapping) { /* * For now only support anonymous memory migrating to @@ -745,7 +746,7 @@ static void __migrate_device_pages(unsigned long *src_pfns, continue; } } - } else if (is_zone_device_page(newpage)) { + } else if (folio_is_zone_device(newfolio)) { /* * Other types of ZONE_DEVICE page are not supported. */ @@ -753,14 +754,15 @@ static void __migrate_device_pages(unsigned long *src_pfns, continue; } + BUG_ON(folio_test_writeback(folio)); + if (migrate && migrate->fault_page == page) - r = migrate_folio_extra(mapping, page_folio(newpage), - folio, MIGRATE_SYNC_NO_COPY, 1); - else - r = migrate_folio(mapping, page_folio(newpage), - folio, MIGRATE_SYNC_NO_COPY); + extra_cnt = 1; + r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt); if (r != MIGRATEPAGE_SUCCESS) src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; + else + folio_migrate_flags(newfolio, folio); } if (notified) diff --git a/mm/mincore.c b/mm/mincore.c index dad3622cc963..d6bd19e520fc 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -33,7 +33,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, * Hugepages under user process are always in RAM and never * swapped out, but theoretically it needs to be checked. */ - present = pte && !huge_pte_none_mostly(huge_ptep_get(pte)); + present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; walk->private = vec; @@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } else { #ifdef CONFIG_SWAP *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); + swap_cache_index(entry)); #else WARN_ON(1); *vec = 1; diff --git a/mm/mlock.c b/mm/mlock.c index 30b51cdea89d..52d6e401ad67 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -307,26 +307,15 @@ void munlock_folio(struct folio *folio) static inline unsigned int folio_mlock_step(struct folio *folio, pte_t *pte, unsigned long addr, unsigned long end) { - unsigned int count, i, nr = folio_nr_pages(folio); - unsigned long pfn = folio_pfn(folio); + const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; + unsigned int count = (end - addr) >> PAGE_SHIFT; pte_t ptent = ptep_get(pte); if (!folio_test_large(folio)) return 1; - count = pfn + nr - pte_pfn(ptent); - count = min_t(unsigned int, count, (end - addr) >> PAGE_SHIFT); - - for (i = 0; i < count; i++, pte++) { - pte_t entry = ptep_get(pte); - - if (!pte_present(entry)) - break; - if (pte_pfn(entry) - pfn >= nr) - break; - } - - return i; + return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL, + NULL, NULL); } static inline bool allow_mlock_munlock(struct folio *folio, diff --git a/mm/mm_init.c b/mm/mm_init.c index 804df0309257..75c3bd42799b 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "internal.h" #include "slab.h" #include "shuffle.h" @@ -53,7 +54,6 @@ void __init mminit_verify_zonelist(void) struct zonelist *zonelist; int i, listid, zoneid; - BUILD_BUG_ON(MAX_ZONELISTS > 2); for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) { /* Identify the zone and nodelist */ @@ -568,7 +568,7 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn, mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); - page_mapcount_reset(page); + atomic_set(&page->_mapcount, -1); page_cpupid_reset_last(page); page_kasan_tag_reset(page); @@ -891,8 +891,14 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone page = pfn_to_page(pfn); __init_single_page(page, pfn, zone, nid); - if (context == MEMINIT_HOTPLUG) - __SetPageReserved(page); + if (context == MEMINIT_HOTPLUG) { +#ifdef CONFIG_ZONE_DEVICE + if (zone == ZONE_DEVICE) + __SetPageReserved(page); + else +#endif + __SetPageOffline(page); + } /* * Usually, we want to mark the pageblock MIGRATE_MOVABLE, @@ -1617,6 +1623,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat) panic("Failed to allocate %ld bytes for node %d memory map\n", size, pgdat->node_id); pgdat->node_mem_map = map + offset; + mod_node_early_perpage_metadata(pgdat->node_id, + DIV_ROUND_UP(size, PAGE_SIZE)); pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n", __func__, pgdat->node_id, (unsigned long)pgdat, (unsigned long)pgdat->node_mem_map); @@ -1912,8 +1920,8 @@ unsigned long __init node_map_pfn_alignment(void) } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT -static void __init deferred_free_range(unsigned long pfn, - unsigned long nr_pages) +static void __init deferred_free_pages(unsigned long pfn, + unsigned long nr_pages) { struct page *page; unsigned long i; @@ -1927,7 +1935,7 @@ static void __init deferred_free_range(unsigned long pfn, if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) { for (i = 0; i < nr_pages; i += pageblock_nr_pages) set_pageblock_migratetype(page + i, MIGRATE_MOVABLE); - __free_pages_core(page, MAX_PAGE_ORDER); + __free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY); return; } @@ -1937,7 +1945,7 @@ static void __init deferred_free_range(unsigned long pfn, for (i = 0; i < nr_pages; i++, page++, pfn++) { if (pageblock_aligned(pfn)) set_pageblock_migratetype(page, MIGRATE_MOVABLE); - __free_pages_core(page, 0); + __free_pages_core(page, 0, MEMINIT_EARLY); } } @@ -1951,69 +1959,21 @@ static inline void __init pgdat_init_report_one_done(void) complete(&pgdat_init_all_done_comp); } -/* - * Returns true if page needs to be initialized or freed to buddy allocator. - * - * We check if a current MAX_PAGE_ORDER block is valid by only checking the - * validity of the head pfn. - */ -static inline bool __init deferred_pfn_valid(unsigned long pfn) -{ - if (IS_MAX_ORDER_ALIGNED(pfn) && !pfn_valid(pfn)) - return false; - return true; -} - -/* - * Free pages to buddy allocator. Try to free aligned pages in - * MAX_ORDER_NR_PAGES sizes. - */ -static void __init deferred_free_pages(unsigned long pfn, - unsigned long end_pfn) -{ - unsigned long nr_free = 0; - - for (; pfn < end_pfn; pfn++) { - if (!deferred_pfn_valid(pfn)) { - deferred_free_range(pfn - nr_free, nr_free); - nr_free = 0; - } else if (IS_MAX_ORDER_ALIGNED(pfn)) { - deferred_free_range(pfn - nr_free, nr_free); - nr_free = 1; - } else { - nr_free++; - } - } - /* Free the last block of pages to allocator */ - deferred_free_range(pfn - nr_free, nr_free); -} - /* * Initialize struct pages. We minimize pfn page lookups and scheduler checks * by performing it only once every MAX_ORDER_NR_PAGES. * Return number of pages initialized. */ -static unsigned long __init deferred_init_pages(struct zone *zone, - unsigned long pfn, - unsigned long end_pfn) +static unsigned long __init deferred_init_pages(struct zone *zone, + unsigned long pfn, unsigned long end_pfn) { int nid = zone_to_nid(zone); - unsigned long nr_pages = 0; + unsigned long nr_pages = end_pfn - pfn; int zid = zone_idx(zone); - struct page *page = NULL; + struct page *page = pfn_to_page(pfn); - for (; pfn < end_pfn; pfn++) { - if (!deferred_pfn_valid(pfn)) { - page = NULL; - continue; - } else if (!page || IS_MAX_ORDER_ALIGNED(pfn)) { - page = pfn_to_page(pfn); - } else { - page++; - } + for (; pfn < end_pfn; pfn++, page++) __init_single_page(page, pfn, zid, nid); - nr_pages++; - } return nr_pages; } @@ -2097,7 +2057,7 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn, break; t = min(mo_pfn, epfn); - deferred_free_pages(spfn, t); + deferred_free_pages(spfn, t - spfn); if (mo_pfn <= epfn) break; @@ -2126,11 +2086,10 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn, } } -/* An arch may override for more concurrency. */ -__weak int __init +static unsigned int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask) { - return 1; + return max(cpumask_weight(node_cpumask), 1U); } /* Initialise remaining memory on a node */ @@ -2315,6 +2274,7 @@ void set_zone_contiguous(struct zone *zone) zone->contiguous = true; } +static void __init mem_init_print_info(void); void __init page_alloc_init_late(void) { struct zone *zone; @@ -2341,6 +2301,8 @@ void __init page_alloc_init_late(void) files_maxfiles_init(); #endif + /* Accounting of total+free memory is stable at this point. */ + mem_init_print_info(); buffer_init(); /* Discard memblock private memory */ @@ -2507,7 +2469,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn, } } - __free_pages_core(page, order); + __free_pages_core(page, order, MEMINIT_EARLY); } DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc); @@ -2687,6 +2649,7 @@ static void __init mem_init_print_info(void) void __init mm_core_init(void) { /* Initializations relying on SMP setup */ + BUILD_BUG_ON(MAX_ZONELISTS > 2); build_all_zonelists(NULL); page_alloc_init_cpuhp(); @@ -2701,7 +2664,6 @@ void __init mm_core_init(void) kmsan_init_shadow(); stack_depot_early_init(); mem_init(); - mem_init_print_info(); kmem_cache_init(); /* * page_owner must be initialized after buddy is ready, and also after diff --git a/mm/mmap.c b/mm/mmap.c index 83b4682ec85c..e42d89f98071 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -131,6 +131,47 @@ void unlink_file_vma(struct vm_area_struct *vma) } } +void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb) +{ + vb->count = 0; +} + +static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb) +{ + struct address_space *mapping; + int i; + + mapping = vb->vmas[0]->vm_file->f_mapping; + i_mmap_lock_write(mapping); + for (i = 0; i < vb->count; i++) { + VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping); + __remove_shared_vm_struct(vb->vmas[i], mapping); + } + i_mmap_unlock_write(mapping); + + unlink_file_vma_batch_init(vb); +} + +void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb, + struct vm_area_struct *vma) +{ + if (vma->vm_file == NULL) + return; + + if ((vb->count > 0 && vb->vmas[0]->vm_file != vma->vm_file) || + vb->count == ARRAY_SIZE(vb->vmas)) + unlink_file_vma_batch_process(vb); + + vb->vmas[vb->count] = vma; + vb->count++; +} + +void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb) +{ + if (vb->count > 0) + unlink_file_vma_batch_process(vb); +} + /* * Close a vm structure and free it. */ diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c index 1854850b4b89..368b840e7508 100644 --- a/mm/mmap_lock.c +++ b/mm/mmap_lock.c @@ -19,14 +19,7 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released); #ifdef CONFIG_MEMCG -/* - * Our various events all share the same buffer (because we don't want or need - * to allocate a set of buffers *per event type*), so we need to protect against - * concurrent _reg() and _unreg() calls, and count how many _reg() calls have - * been made. - */ -static DEFINE_MUTEX(reg_lock); -static int reg_refcount; /* Protected by reg_lock. */ +static atomic_t reg_refcount; /* * Size of the buffer for memcg path names. Ignoring stack trace support, @@ -34,136 +27,22 @@ static int reg_refcount; /* Protected by reg_lock. */ */ #define MEMCG_PATH_BUF_SIZE MAX_FILTER_STR_VAL -/* - * How many contexts our trace events might be called in: normal, softirq, irq, - * and NMI. - */ -#define CONTEXT_COUNT 4 - -struct memcg_path { - local_lock_t lock; - char __rcu *buf; - local_t buf_idx; -}; -static DEFINE_PER_CPU(struct memcg_path, memcg_paths) = { - .lock = INIT_LOCAL_LOCK(lock), - .buf_idx = LOCAL_INIT(0), -}; - -static char **tmp_bufs; - -/* Called with reg_lock held. */ -static void free_memcg_path_bufs(void) -{ - struct memcg_path *memcg_path; - int cpu; - char **old = tmp_bufs; - - for_each_possible_cpu(cpu) { - memcg_path = per_cpu_ptr(&memcg_paths, cpu); - *(old++) = rcu_dereference_protected(memcg_path->buf, - lockdep_is_held(®_lock)); - rcu_assign_pointer(memcg_path->buf, NULL); - } - - /* Wait for inflight memcg_path_buf users to finish. */ - synchronize_rcu(); - - old = tmp_bufs; - for_each_possible_cpu(cpu) { - kfree(*(old++)); - } - - kfree(tmp_bufs); - tmp_bufs = NULL; -} - int trace_mmap_lock_reg(void) { - int cpu; - char *new; - - mutex_lock(®_lock); - - /* If the refcount is going 0->1, proceed with allocating buffers. */ - if (reg_refcount++) - goto out; - - tmp_bufs = kmalloc_array(num_possible_cpus(), sizeof(*tmp_bufs), - GFP_KERNEL); - if (tmp_bufs == NULL) - goto out_fail; - - for_each_possible_cpu(cpu) { - new = kmalloc(MEMCG_PATH_BUF_SIZE * CONTEXT_COUNT, GFP_KERNEL); - if (new == NULL) - goto out_fail_free; - rcu_assign_pointer(per_cpu_ptr(&memcg_paths, cpu)->buf, new); - /* Don't need to wait for inflights, they'd have gotten NULL. */ - } - -out: - mutex_unlock(®_lock); + atomic_inc(®_refcount); return 0; - -out_fail_free: - free_memcg_path_bufs(); -out_fail: - /* Since we failed, undo the earlier ref increment. */ - --reg_refcount; - - mutex_unlock(®_lock); - return -ENOMEM; } void trace_mmap_lock_unreg(void) { - mutex_lock(®_lock); - - /* If the refcount is going 1->0, proceed with freeing buffers. */ - if (--reg_refcount) - goto out; - - free_memcg_path_bufs(); - -out: - mutex_unlock(®_lock); + atomic_dec(®_refcount); } -static inline char *get_memcg_path_buf(void) -{ - struct memcg_path *memcg_path = this_cpu_ptr(&memcg_paths); - char *buf; - int idx; - - rcu_read_lock(); - buf = rcu_dereference(memcg_path->buf); - if (buf == NULL) { - rcu_read_unlock(); - return NULL; - } - idx = local_add_return(MEMCG_PATH_BUF_SIZE, &memcg_path->buf_idx) - - MEMCG_PATH_BUF_SIZE; - return &buf[idx]; -} - -static inline void put_memcg_path_buf(void) -{ - local_sub(MEMCG_PATH_BUF_SIZE, &this_cpu_ptr(&memcg_paths)->buf_idx); - rcu_read_unlock(); -} - -#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ - do { \ - const char *memcg_path; \ - local_lock(&memcg_paths.lock); \ - memcg_path = get_mm_memcg_path(mm); \ - trace_mmap_lock_##type(mm, \ - memcg_path != NULL ? memcg_path : "", \ - ##__VA_ARGS__); \ - if (likely(memcg_path != NULL)) \ - put_memcg_path_buf(); \ - local_unlock(&memcg_paths.lock); \ +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ + do { \ + char buf[MEMCG_PATH_BUF_SIZE]; \ + get_mm_memcg_path(mm, buf, sizeof(buf)); \ + trace_mmap_lock_##type(mm, buf, ##__VA_ARGS__); \ } while (0) #else /* !CONFIG_MEMCG */ @@ -185,37 +64,23 @@ void trace_mmap_lock_unreg(void) #ifdef CONFIG_TRACING #ifdef CONFIG_MEMCG /* - * Write the given mm_struct's memcg path to a percpu buffer, and return a - * pointer to it. If the path cannot be determined, or no buffer was available - * (because the trace event is being unregistered), NULL is returned. - * - * Note: buffers are allocated per-cpu to avoid locking, so preemption must be - * disabled by the caller before calling us, and re-enabled only after the - * caller is done with the pointer. - * - * The caller must call put_memcg_path_buf() once the buffer is no longer - * needed. This must be done while preemption is still disabled. + * Write the given mm_struct's memcg path to a buffer. If the path cannot be + * determined or the trace event is being unregistered, empty string is written. */ -static const char *get_mm_memcg_path(struct mm_struct *mm) +static void get_mm_memcg_path(struct mm_struct *mm, char *buf, size_t buflen) { - char *buf = NULL; - struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + struct mem_cgroup *memcg; + buf[0] = '\0'; + /* No need to get path if no trace event is registered. */ + if (!atomic_read(®_refcount)) + return; + memcg = get_mem_cgroup_from_mm(mm); if (memcg == NULL) - goto out; - if (unlikely(memcg->css.cgroup == NULL)) - goto out_put; - - buf = get_memcg_path_buf(); - if (buf == NULL) - goto out_put; - - cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE); - -out_put: + return; + if (memcg->css.cgroup) + cgroup_path(memcg->css.cgroup, buf, buflen); css_put(&memcg->css); -out: - return buf; } #endif /* CONFIG_MEMCG */ diff --git a/mm/mprotect.c b/mm/mprotect.c index 8c6cd8825273..222ab434da54 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -53,7 +53,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return false; /* Do we need write faults for softdirty tracking? */ - if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) + if (pte_needs_soft_dirty_wp(vma, pte)) return false; /* Do we need write faults for uffd-wp tracking? */ @@ -71,6 +71,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return page && PageAnon(page) && PageAnonExclusive(page); } + VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte)); + /* * Writable MAP_SHARED mapping: "clean" might indicate that the FS still * needs a real write-fault for writenotify diff --git a/mm/mremap.c b/mm/mremap.c index 5f96bc5ee918..e7ae140fc640 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -198,7 +198,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, * PTE. * * NOTE! Both old and new PTL matter: the old one - * for racing with page_mkclean(), the new one to + * for racing with folio_mkclean(), the new one to * make sure the physical page stays valid until * the TLB entry for the old mapping has been * flushed. diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 8a1c92090129..acff24e9fae4 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -139,6 +139,8 @@ struct dirty_throttle_control { unsigned long wb_bg_thresh; unsigned long pos_ratio; + bool freerun; + bool dirty_exceeded; }; /* @@ -859,6 +861,34 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, mdtc->avail = filepages + min(headroom, other_clean); } +static inline bool dtc_is_global(struct dirty_throttle_control *dtc) +{ + return mdtc_gdtc(dtc) == NULL; +} + +/* + * Dirty background will ignore pages being written as we're trying to + * decide whether to put more under writeback. + */ +static void domain_dirty_avail(struct dirty_throttle_control *dtc, + bool include_writeback) +{ + if (dtc_is_global(dtc)) { + dtc->avail = global_dirtyable_memory(); + dtc->dirty = global_node_page_state(NR_FILE_DIRTY); + if (include_writeback) + dtc->dirty += global_node_page_state(NR_WRITEBACK); + } else { + unsigned long filepages = 0, headroom = 0, writeback = 0; + + mem_cgroup_wb_stats(dtc->wb, &filepages, &headroom, &dtc->dirty, + &writeback); + if (include_writeback) + dtc->dirty += writeback; + mdtc_calc_avail(dtc, filepages, headroom); + } +} + /** * __wb_calc_thresh - @wb's share of dirty threshold * @dtc: dirty_throttle_context of interest @@ -921,16 +951,9 @@ unsigned long cgwb_calc_thresh(struct bdi_writeback *wb) { struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB }; struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) }; - unsigned long filepages = 0, headroom = 0, writeback = 0; - gdtc.avail = global_dirtyable_memory(); - gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_WRITEBACK); - - mem_cgroup_wb_stats(wb, &filepages, &headroom, - &mdtc.dirty, &writeback); - mdtc.dirty += writeback; - mdtc_calc_avail(&mdtc, filepages, headroom); + domain_dirty_avail(&gdtc, true); + domain_dirty_avail(&mdtc, true); domain_dirty_limits(&mdtc); return __wb_calc_thresh(&mdtc, mdtc.thresh); @@ -1703,6 +1726,100 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc) } } +static unsigned long domain_poll_intv(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + unsigned long dirty, thresh; + + if (strictlimit) { + dirty = dtc->wb_dirty; + thresh = dtc->wb_thresh; + } else { + dirty = dtc->dirty; + thresh = dtc->thresh; + } + + return dirty_poll_interval(dirty, thresh); +} + +/* + * Throttle it only when the background writeback cannot catch-up. This avoids + * (excessively) small writeouts when the wb limits are ramping up in case of + * !strictlimit. + * + * In strictlimit case make decision based on the wb counters and limits. Small + * writeouts when the wb limits are ramping up are the price we consciously pay + * for strictlimit-ing. + */ +static void domain_dirty_freerun(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + unsigned long dirty, thresh, bg_thresh; + + if (unlikely(strictlimit)) { + wb_dirty_limits(dtc); + dirty = dtc->wb_dirty; + thresh = dtc->wb_thresh; + bg_thresh = dtc->wb_bg_thresh; + } else { + dirty = dtc->dirty; + thresh = dtc->thresh; + bg_thresh = dtc->bg_thresh; + } + dtc->freerun = dirty <= dirty_freerun_ceiling(thresh, bg_thresh); +} + +static void balance_domain_limits(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + domain_dirty_avail(dtc, true); + domain_dirty_limits(dtc); + domain_dirty_freerun(dtc, strictlimit); +} + +static void wb_dirty_freerun(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + dtc->freerun = false; + + /* was already handled in domain_dirty_freerun */ + if (strictlimit) + return; + + wb_dirty_limits(dtc); + /* + * LOCAL_THROTTLE tasks must not be throttled when below the per-wb + * freerun ceiling. + */ + if (!(current->flags & PF_LOCAL_THROTTLE)) + return; + + dtc->freerun = dtc->wb_dirty < + dirty_freerun_ceiling(dtc->wb_thresh, dtc->wb_bg_thresh); +} + +static inline void wb_dirty_exceeded(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + dtc->dirty_exceeded = (dtc->wb_dirty > dtc->wb_thresh) && + ((dtc->dirty > dtc->thresh) || strictlimit); +} + +/* + * The limits fields dirty_exceeded and pos_ratio won't be updated if wb is + * in freerun state. Please don't use these invalid fields in freerun case. + */ +static void balance_wb_limits(struct dirty_throttle_control *dtc, + bool strictlimit) +{ + wb_dirty_freerun(dtc, strictlimit); + if (dtc->freerun) + return; + + wb_dirty_exceeded(dtc, strictlimit); + wb_position_ratio(dtc); +} + /* * balance_dirty_pages() must be called by processes which are generating dirty * data. It looks at the number of dirty pages in the machine and will force @@ -1725,7 +1842,6 @@ static int balance_dirty_pages(struct bdi_writeback *wb, long max_pause; long min_pause; int nr_dirtied_pause; - bool dirty_exceeded = false; unsigned long task_ratelimit; unsigned long dirty_ratelimit; struct backing_dev_info *bdi = wb->bdi; @@ -1735,53 +1851,16 @@ static int balance_dirty_pages(struct bdi_writeback *wb, for (;;) { unsigned long now = jiffies; - unsigned long dirty, thresh, bg_thresh; - unsigned long m_dirty = 0; /* stop bogus uninit warnings */ - unsigned long m_thresh = 0; - unsigned long m_bg_thresh = 0; nr_dirty = global_node_page_state(NR_FILE_DIRTY); - gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = nr_dirty + global_node_page_state(NR_WRITEBACK); - - domain_dirty_limits(gdtc); - - if (unlikely(strictlimit)) { - wb_dirty_limits(gdtc); - - dirty = gdtc->wb_dirty; - thresh = gdtc->wb_thresh; - bg_thresh = gdtc->wb_bg_thresh; - } else { - dirty = gdtc->dirty; - thresh = gdtc->thresh; - bg_thresh = gdtc->bg_thresh; - } + balance_domain_limits(gdtc, strictlimit); if (mdtc) { - unsigned long filepages, headroom, writeback; - /* * If @wb belongs to !root memcg, repeat the same * basic calculations for the memcg domain. */ - mem_cgroup_wb_stats(wb, &filepages, &headroom, - &mdtc->dirty, &writeback); - mdtc->dirty += writeback; - mdtc_calc_avail(mdtc, filepages, headroom); - - domain_dirty_limits(mdtc); - - if (unlikely(strictlimit)) { - wb_dirty_limits(mdtc); - m_dirty = mdtc->wb_dirty; - m_thresh = mdtc->wb_thresh; - m_bg_thresh = mdtc->wb_bg_thresh; - } else { - m_dirty = mdtc->dirty; - m_thresh = mdtc->thresh; - m_bg_thresh = mdtc->bg_thresh; - } + balance_domain_limits(mdtc, strictlimit); } /* @@ -1798,31 +1877,21 @@ static int balance_dirty_pages(struct bdi_writeback *wb, wb_start_background_writeback(wb); /* - * Throttle it only when the background writeback cannot - * catch-up. This avoids (excessively) small writeouts - * when the wb limits are ramping up in case of !strictlimit. - * - * In strictlimit case make decision based on the wb counters - * and limits. Small writeouts when the wb limits are ramping - * up are the price we consciously pay for strictlimit-ing. - * * If memcg domain is in effect, @dirty should be under * both global and memcg freerun ceilings. */ - if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && - (!mdtc || - m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { + if (gdtc->freerun && (!mdtc || mdtc->freerun)) { unsigned long intv; unsigned long m_intv; free_running: - intv = dirty_poll_interval(dirty, thresh); + intv = domain_poll_intv(gdtc, strictlimit); m_intv = ULONG_MAX; current->dirty_paused_when = now; current->nr_dirtied = 0; if (mdtc) - m_intv = dirty_poll_interval(m_dirty, m_thresh); + m_intv = domain_poll_intv(mdtc, strictlimit); current->nr_dirtied_pause = min(intv, m_intv); break; } @@ -1837,24 +1906,9 @@ free_running: * Calculate global domain's pos_ratio and select the * global dtc by default. */ - if (!strictlimit) { - wb_dirty_limits(gdtc); - - if ((current->flags & PF_LOCAL_THROTTLE) && - gdtc->wb_dirty < - dirty_freerun_ceiling(gdtc->wb_thresh, - gdtc->wb_bg_thresh)) - /* - * LOCAL_THROTTLE tasks must not be throttled - * when below the per-wb freerun ceiling. - */ - goto free_running; - } - - dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && - ((gdtc->dirty > gdtc->thresh) || strictlimit); - - wb_position_ratio(gdtc); + balance_wb_limits(gdtc, strictlimit); + if (gdtc->freerun) + goto free_running; sdtc = gdtc; if (mdtc) { @@ -1864,31 +1918,15 @@ free_running: * both global and memcg domains. Choose the one * w/ lower pos_ratio. */ - if (!strictlimit) { - wb_dirty_limits(mdtc); - - if ((current->flags & PF_LOCAL_THROTTLE) && - mdtc->wb_dirty < - dirty_freerun_ceiling(mdtc->wb_thresh, - mdtc->wb_bg_thresh)) - /* - * LOCAL_THROTTLE tasks must not be - * throttled when below the per-wb - * freerun ceiling. - */ - goto free_running; - } - dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && - ((mdtc->dirty > mdtc->thresh) || strictlimit); - - wb_position_ratio(mdtc); + balance_wb_limits(mdtc, strictlimit); + if (mdtc->freerun) + goto free_running; if (mdtc->pos_ratio < gdtc->pos_ratio) sdtc = mdtc; } - if (dirty_exceeded != wb->dirty_exceeded) - wb->dirty_exceeded = dirty_exceeded; - + wb->dirty_exceeded = gdtc->dirty_exceeded || + (mdtc && mdtc->dirty_exceeded); if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) + BANDWIDTH_INTERVAL)) __wb_update_bandwidth(gdtc, mdtc, true); @@ -2109,6 +2147,35 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) } EXPORT_SYMBOL(balance_dirty_pages_ratelimited); +/* + * Similar to wb_dirty_limits, wb_bg_dirty_limits also calculates dirty + * and thresh, but it's for background writeback. + */ +static void wb_bg_dirty_limits(struct dirty_throttle_control *dtc) +{ + struct bdi_writeback *wb = dtc->wb; + + dtc->wb_bg_thresh = __wb_calc_thresh(dtc, dtc->bg_thresh); + if (dtc->wb_bg_thresh < 2 * wb_stat_error()) + dtc->wb_dirty = wb_stat_sum(wb, WB_RECLAIMABLE); + else + dtc->wb_dirty = wb_stat(wb, WB_RECLAIMABLE); +} + +static bool domain_over_bg_thresh(struct dirty_throttle_control *dtc) +{ + domain_dirty_avail(dtc, false); + domain_dirty_limits(dtc); + if (dtc->dirty > dtc->bg_thresh) + return true; + + wb_bg_dirty_limits(dtc); + if (dtc->wb_dirty > dtc->wb_bg_thresh) + return true; + + return false; +} + /** * wb_over_bg_thresh - does @wb need to be written back? * @wb: bdi_writeback of interest @@ -2120,54 +2187,14 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited); */ bool wb_over_bg_thresh(struct bdi_writeback *wb) { - struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; - struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) }; - struct dirty_throttle_control * const gdtc = &gdtc_stor; - struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? - &mdtc_stor : NULL; - unsigned long reclaimable; - unsigned long thresh; + struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; + struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) }; - /* - * Similar to balance_dirty_pages() but ignores pages being written - * as we're trying to decide whether to put more under writeback. - */ - gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); - domain_dirty_limits(gdtc); - - if (gdtc->dirty > gdtc->bg_thresh) + if (domain_over_bg_thresh(&gdtc)) return true; - thresh = __wb_calc_thresh(gdtc, gdtc->bg_thresh); - if (thresh < 2 * wb_stat_error()) - reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); - else - reclaimable = wb_stat(wb, WB_RECLAIMABLE); - - if (reclaimable > thresh) - return true; - - if (mdtc) { - unsigned long filepages, headroom, writeback; - - mem_cgroup_wb_stats(wb, &filepages, &headroom, &mdtc->dirty, - &writeback); - mdtc_calc_avail(mdtc, filepages, headroom); - domain_dirty_limits(mdtc); /* ditto, ignore writeback */ - - if (mdtc->dirty > mdtc->bg_thresh) - return true; - - thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh); - if (thresh < 2 * wb_stat_error()) - reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); - else - reclaimable = wb_stat(wb, WB_RECLAIMABLE); - - if (reclaimable > thresh) - return true; - } + if (mdtc_valid(&mdtc)) + return domain_over_bg_thresh(&mdtc); return false; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9ecf99190ea2..3398d914ed83 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -498,7 +498,8 @@ static void bad_page(struct page *page, const char *reason) dump_stack(); out: /* Leave bad fields for debug, except PageBuddy could make trouble */ - page_mapcount_reset(page); /* remove PageBuddy */ + if (PageBuddy(page)) + __ClearPageBuddy(page); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); } @@ -711,12 +712,12 @@ static inline struct page *get_page_from_free_area(struct free_area *area, } /* - * If this is not the largest possible page, check if the buddy - * of the next-highest order is free. If it is, it's possible + * If this is less than the 2nd largest possible page, check if the buddy + * of the next-higher order is free. If it is, it's possible * that pages are being freed that will coalesce soon. In case, * that is happening, add the free page to the tail of the list * so it's less likely to be used soon and more likely to be merged - * as a higher order page + * as a 2-level higher order page */ static inline bool buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, @@ -1218,7 +1219,8 @@ static void __free_pages_ok(struct page *page, unsigned int order, __count_vm_events(PGFREE, 1 << order); } -void __free_pages_core(struct page *page, unsigned int order) +void __meminit __free_pages_core(struct page *page, unsigned int order, + enum meminit_context context) { unsigned int nr_pages = 1 << order; struct page *p = page; @@ -1228,17 +1230,34 @@ void __free_pages_core(struct page *page, unsigned int order) * When initializing the memmap, __init_single_page() sets the refcount * of all pages to 1 ("allocated"/"not free"). We have to set the * refcount of all involved pages to 0. + * + * Note that hotplugged memory pages are initialized to PageOffline(). + * Pages freed from memblock might be marked as reserved. */ - prefetchw(p); - for (loop = 0; loop < (nr_pages - 1); loop++, p++) { - prefetchw(p + 1); - __ClearPageReserved(p); - set_page_count(p, 0); - } - __ClearPageReserved(p); - set_page_count(p, 0); + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) && + unlikely(context == MEMINIT_HOTPLUG)) { + for (loop = 0; loop < nr_pages; loop++, p++) { + VM_WARN_ON_ONCE(PageReserved(p)); + __ClearPageOffline(p); + set_page_count(p, 0); + } - atomic_long_add(nr_pages, &page_zone(page)->managed_pages); + /* + * Freeing the page with debug_pagealloc enabled will try to + * unmap it; some archs don't like double-unmappings, so + * map it first. + */ + debug_pagealloc_map_pages(page, nr_pages); + adjust_managed_page_count(page, nr_pages); + } else { + for (loop = 0; loop < nr_pages; loop++, p++) { + __ClearPageReserved(p); + set_page_count(p, 0); + } + + /* memblock adjusts totalram_pages() manually. */ + atomic_long_add(nr_pages, &page_zone(page)->managed_pages); + } if (page_contains_unaccepted(page, order)) { if (order == MAX_PAGE_ORDER && __free_unaccepted(page)) @@ -1351,7 +1370,8 @@ static void check_new_page_bad(struct page *page) { if (unlikely(page->flags & __PG_HWPOISON)) { /* Don't complain about hwpoisoned pages */ - page_mapcount_reset(page); /* remove PageBuddy */ + if (PageBuddy(page)) + __ClearPageBuddy(page); return; } @@ -2632,8 +2652,7 @@ void free_unref_folios(struct folio_batch *folios) unsigned long pfn = folio_pfn(folio); unsigned int order = folio_order(folio); - if (order > 0 && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (!free_pages_prepare(&folio->page, order)) continue; /* @@ -3031,12 +3050,6 @@ out: return page; } -noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) -{ - return __should_fail_alloc_page(gfp_mask, order); -} -ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); - static inline long __zone_watermark_unusable_free(struct zone *z, unsigned int order, unsigned int alloc_flags) { @@ -5213,7 +5226,7 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, } /* - * Build gfp_thisnode zonelists + * Build __GFP_THISNODE zonelists */ static void build_thisnode_zonelists(pg_data_t *pgdat) { @@ -5738,6 +5751,7 @@ void __init setup_per_cpu_pageset(void) for_each_online_pgdat(pgdat) pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat); + store_early_perpage_metadata(); } __meminit void zone_pcp_init(struct zone *zone) @@ -5762,10 +5776,6 @@ void adjust_managed_page_count(struct page *page, long count) { atomic_long_add(count, &page_zone(page)->managed_pages); totalram_pages_add(count); -#ifdef CONFIG_HIGHMEM - if (PageHighMem(page)) - totalhigh_pages_add(count); -#endif } EXPORT_SYMBOL(adjust_managed_page_count); @@ -6690,14 +6700,19 @@ void zone_pcp_reset(struct zone *zone) /* * All pages in the range must be in a single zone, must not contain holes, * must span full sections, and must be isolated before calling this function. + * + * Returns the number of managed (non-PageOffline()) pages in the range: the + * number of pages for which memory offlining code must adjust managed page + * counters using adjust_managed_page_count(). */ -void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) +unsigned long __offline_isolated_pages(unsigned long start_pfn, + unsigned long end_pfn) { + unsigned long already_offline = 0, flags; unsigned long pfn = start_pfn; struct page *page; struct zone *zone; unsigned int order; - unsigned long flags; offline_mem_sections(pfn, end_pfn); zone = page_zone(pfn_to_page(pfn)); @@ -6719,6 +6734,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) if (PageOffline(page)) { BUG_ON(page_count(page)); BUG_ON(PageBuddy(page)); + already_offline++; pfn++; continue; } @@ -6731,6 +6747,8 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); + + return end_pfn - start_pfn - already_offline; } #endif diff --git a/mm/page_counter.c b/mm/page_counter.c index db20d6452b71..0153f5bb3161 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -262,3 +262,176 @@ int page_counter_memparse(const char *buf, const char *max, return 0; } + + +/* + * This function calculates an individual page counter's effective + * protection which is derived from its own memory.min/low, its + * parent's and siblings' settings, as well as the actual memory + * distribution in the tree. + * + * The following rules apply to the effective protection values: + * + * 1. At the first level of reclaim, effective protection is equal to + * the declared protection in memory.min and memory.low. + * + * 2. To enable safe delegation of the protection configuration, at + * subsequent levels the effective protection is capped to the + * parent's effective protection. + * + * 3. To make complex and dynamic subtrees easier to configure, the + * user is allowed to overcommit the declared protection at a given + * level. If that is the case, the parent's effective protection is + * distributed to the children in proportion to how much protection + * they have declared and how much of it they are utilizing. + * + * This makes distribution proportional, but also work-conserving: + * if one counter claims much more protection than it uses memory, + * the unused remainder is available to its siblings. + * + * 4. Conversely, when the declared protection is undercommitted at a + * given level, the distribution of the larger parental protection + * budget is NOT proportional. A counter's protection from a sibling + * is capped to its own memory.min/low setting. + * + * 5. However, to allow protecting recursive subtrees from each other + * without having to declare each individual counter's fixed share + * of the ancestor's claim to protection, any unutilized - + * "floating" - protection from up the tree is distributed in + * proportion to each counter's *usage*. This makes the protection + * neutral wrt sibling cgroups and lets them compete freely over + * the shared parental protection budget, but it protects the + * subtree as a whole from neighboring subtrees. + * + * Note that 4. and 5. are not in conflict: 4. is about protecting + * against immediate siblings whereas 5. is about protecting against + * neighboring subtrees. + */ +static unsigned long effective_protection(unsigned long usage, + unsigned long parent_usage, + unsigned long setting, + unsigned long parent_effective, + unsigned long siblings_protected, + bool recursive_protection) +{ + unsigned long protected; + unsigned long ep; + + protected = min(usage, setting); + /* + * If all cgroups at this level combined claim and use more + * protection than what the parent affords them, distribute + * shares in proportion to utilization. + * + * We are using actual utilization rather than the statically + * claimed protection in order to be work-conserving: claimed + * but unused protection is available to siblings that would + * otherwise get a smaller chunk than what they claimed. + */ + if (siblings_protected > parent_effective) + return protected * parent_effective / siblings_protected; + + /* + * Ok, utilized protection of all children is within what the + * parent affords them, so we know whatever this child claims + * and utilizes is effectively protected. + * + * If there is unprotected usage beyond this value, reclaim + * will apply pressure in proportion to that amount. + * + * If there is unutilized protection, the cgroup will be fully + * shielded from reclaim, but we do return a smaller value for + * protection than what the group could enjoy in theory. This + * is okay. With the overcommit distribution above, effective + * protection is always dependent on how memory is actually + * consumed among the siblings anyway. + */ + ep = protected; + + /* + * If the children aren't claiming (all of) the protection + * afforded to them by the parent, distribute the remainder in + * proportion to the (unprotected) memory of each cgroup. That + * way, cgroups that aren't explicitly prioritized wrt each + * other compete freely over the allowance, but they are + * collectively protected from neighboring trees. + * + * We're using unprotected memory for the weight so that if + * some cgroups DO claim explicit protection, we don't protect + * the same bytes twice. + * + * Check both usage and parent_usage against the respective + * protected values. One should imply the other, but they + * aren't read atomically - make sure the division is sane. + */ + if (!recursive_protection) + return ep; + + if (parent_effective > siblings_protected && + parent_usage > siblings_protected && + usage > protected) { + unsigned long unclaimed; + + unclaimed = parent_effective - siblings_protected; + unclaimed *= usage - protected; + unclaimed /= parent_usage - siblings_protected; + + ep += unclaimed; + } + + return ep; +} + + +/** + * page_counter_calculate_protection - check if memory consumption is in the normal range + * @root: the top ancestor of the sub-tree being checked + * @counter: the page_counter the counter to update + * @recursive_protection: Whether to use memory_recursiveprot behavior. + * + * Calculates elow/emin thresholds for given page_counter. + * + * WARNING: This function is not stateless! It can only be used as part + * of a top-down tree iteration, not for isolated queries. + */ +void page_counter_calculate_protection(struct page_counter *root, + struct page_counter *counter, + bool recursive_protection) +{ + unsigned long usage, parent_usage; + struct page_counter *parent = counter->parent; + + /* + * Effective values of the reclaim targets are ignored so they + * can be stale. Have a look at mem_cgroup_protection for more + * details. + * TODO: calculation should be more robust so that we do not need + * that special casing. + */ + if (root == counter) + return; + + usage = page_counter_read(counter); + if (!usage) + return; + + if (parent == root) { + counter->emin = READ_ONCE(counter->min); + counter->elow = READ_ONCE(counter->low); + return; + } + + parent_usage = page_counter_read(parent); + + WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage, + READ_ONCE(counter->min), + READ_ONCE(parent->emin), + atomic_long_read(&parent->children_min_usage), + recursive_protection)); + + WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage, + READ_ONCE(counter->low), + READ_ONCE(parent->elow), + atomic_long_read(&parent->children_low_usage), + recursive_protection)); +} diff --git a/mm/page_ext.c b/mm/page_ext.c index 95dd8ffeaf81..c191e490c401 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -214,6 +214,8 @@ static int __init alloc_node_page_ext(int nid) return -ENOMEM; NODE_DATA(nid)->node_page_ext = base; total_usage += table_size; + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT, + DIV_ROUND_UP(table_size, PAGE_SIZE)); return 0; } @@ -268,12 +270,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid) void *addr = NULL; addr = alloc_pages_exact_nid(nid, size, flags); - if (addr) { + if (addr) kmemleak_alloc(addr, size, 1, flags); - return addr; - } + else + addr = vzalloc_node(size, nid); - addr = vzalloc_node(size, nid); + if (addr) { + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, + DIV_ROUND_UP(size, PAGE_SIZE)); + } return addr; } @@ -316,18 +321,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid) static void free_page_ext(void *addr) { + size_t table_size; + struct page *page; + struct pglist_data *pgdat; + + table_size = page_ext_size * PAGES_PER_SECTION; + if (is_vmalloc_addr(addr)) { + page = vmalloc_to_page(addr); + pgdat = page_pgdat(page); vfree(addr); } else { - struct page *page = virt_to_page(addr); - size_t table_size; - - table_size = page_ext_size * PAGES_PER_SECTION; - + page = virt_to_page(addr); + pgdat = page_pgdat(page); BUG_ON(PageReserved(page)); kmemleak_free(addr); free_pages_exact(addr, table_size); } + + mod_node_page_state(pgdat, NR_MEMMAP, + -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE))); + } static void __free_page_ext(unsigned long pfn) diff --git a/mm/page_io.c b/mm/page_io.c index 0a150c240bf4..ff8c99ee3af7 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -196,9 +196,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return ret; } if (zswap_store(folio)) { - folio_start_writeback(folio); folio_unlock(folio); - folio_end_writeback(folio); return 0; } if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { @@ -280,7 +278,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret) * be temporary. */ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n", - ret, page_file_offset(page)); + ret, swap_dev_pos(page_swap_entry(page))); for (p = 0; p < sio->pages; p++) { page = sio->bvec[p].bv_page; set_page_dirty(page); @@ -299,7 +297,7 @@ static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc struct swap_iocb *sio = NULL; struct swap_info_struct *sis = swp_swap_info(folio->swap); struct file *swap_file = sis->swap_file; - loff_t pos = folio_file_pos(folio); + loff_t pos = swap_dev_pos(folio->swap); count_swpout_vm_event(folio); folio_start_writeback(folio); @@ -384,7 +382,12 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc) */ if (data_race(sis->flags & SWP_FS_OPS)) swap_writepage_fs(folio, wbc); - else if (sis->flags & SWP_SYNCHRONOUS_IO) + /* + * ->flags can be updated non-atomicially (scan_swap_map_slots), + * but that will never affect SWP_SYNCHRONOUS_IO, so the data_race + * is safe. + */ + else if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) swap_writepage_bdev_sync(folio, wbc, sis); else swap_writepage_bdev_async(folio, wbc, sis); @@ -430,7 +433,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); struct swap_iocb *sio = NULL; - loff_t pos = folio_file_pos(folio); + loff_t pos = swap_dev_pos(folio->swap); if (plug) sio = *plug; @@ -493,10 +496,10 @@ static void swap_read_folio_bdev_async(struct folio *folio, submit_bio(bio); } -void swap_read_folio(struct folio *folio, bool synchronous, - struct swap_iocb **plug) +void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); + bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; bool workingset = folio_test_workingset(folio); unsigned long pflags; bool in_thrashing; @@ -517,11 +520,10 @@ void swap_read_folio(struct folio *folio, bool synchronous, delayacct_swapin_start(); if (zswap_load(folio)) { - folio_mark_uptodate(folio); folio_unlock(folio); } else if (data_race(sis->flags & SWP_FS_OPS)) { swap_read_folio_fs(folio, plug); - } else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) { + } else if (synchronous) { swap_read_folio_bdev_sync(folio, sis); } else { swap_read_folio_bdev_async(folio, sis); diff --git a/mm/pagewalk.c b/mm/pagewalk.c index f46c80b18ce4..ae2f08ce991b 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -73,45 +73,6 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return err; } -#ifdef CONFIG_ARCH_HAS_HUGEPD -static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr, - unsigned long end, struct mm_walk *walk, int pdshift) -{ - int err = 0; - const struct mm_walk_ops *ops = walk->ops; - int shift = hugepd_shift(*phpd); - int page_size = 1 << shift; - - if (!ops->pte_entry) - return 0; - - if (addr & (page_size - 1)) - return 0; - - for (;;) { - pte_t *pte; - - spin_lock(&walk->mm->page_table_lock); - pte = hugepte_offset(*phpd, addr, pdshift); - err = ops->pte_entry(pte, addr, addr + page_size, walk); - spin_unlock(&walk->mm->page_table_lock); - - if (err) - break; - if (addr >= end - page_size) - break; - addr += page_size; - } - return err; -} -#else -static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr, - unsigned long end, struct mm_walk *walk, int pdshift) -{ - return 0; -} -#endif - static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -159,10 +120,7 @@ again: if (walk->vma) split_huge_pmd(walk->vma, pmd, addr); - if (is_hugepd(__hugepd(pmd_val(*pmd)))) - err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT); - else - err = walk_pte_range(pmd, addr, next, walk); + err = walk_pte_range(pmd, addr, next, walk); if (err) break; @@ -215,10 +173,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, if (pud_none(*pud)) goto again; - if (is_hugepd(__hugepd(pud_val(*pud)))) - err = walk_hugepd_range((hugepd_t *)pud, addr, next, walk, PUD_SHIFT); - else - err = walk_pmd_range(pud, addr, next, walk); + err = walk_pmd_range(pud, addr, next, walk); if (err) break; } while (pud++, addr = next, addr != end); @@ -250,9 +205,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, if (err) break; } - if (is_hugepd(__hugepd(p4d_val(*p4d)))) - err = walk_hugepd_range((hugepd_t *)p4d, addr, next, walk, P4D_SHIFT); - else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) err = walk_pud_range(p4d, addr, next, walk); if (err) break; @@ -287,9 +240,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, if (err) break; } - if (is_hugepd(__hugepd(pgd_val(*pgd)))) - err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT); - else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) err = walk_p4d_range(pgd, addr, next, walk); if (err) break; diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 7e42f0ca3b7b..4b3d6ec43703 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -33,7 +33,7 @@ struct pcpu_block_md { }; struct pcpuobj_ext { -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG struct obj_cgroup *cgroup; #endif #ifdef CONFIG_MEM_ALLOC_PROFILING @@ -41,7 +41,7 @@ struct pcpuobj_ext { #endif }; -#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEM_ALLOC_PROFILING) +#if defined(CONFIG_MEMCG) || defined(CONFIG_MEM_ALLOC_PROFILING) #define NEED_PCPUOBJ_EXT #endif @@ -154,7 +154,7 @@ static inline size_t pcpu_obj_full_size(size_t size) { size_t extra_size = 0; -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG if (!mem_cgroup_kmem_disabled()) extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *); #endif diff --git a/mm/percpu.c b/mm/percpu.c index 474e3683b74d..20d91af8c033 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1619,7 +1619,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) return pcpu_get_page_chunk(pcpu_addr_to_page(addr)); } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp) { @@ -1681,7 +1681,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) obj_cgroup_put(objcg); } -#else /* CONFIG_MEMCG_KMEM */ +#else /* CONFIG_MEMCG */ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp) { @@ -1697,7 +1697,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) { } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */ #ifdef CONFIG_MEM_ALLOC_PROFILING static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off, diff --git a/mm/readahead.c b/mm/readahead.c index 817b2a352d78..517c0be7ce66 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -313,7 +313,7 @@ void force_page_cache_ra(struct readahead_control *ractl, struct address_space *mapping = ractl->mapping; struct file_ra_state *ra = ractl->ra; struct backing_dev_info *bdi = inode_to_bdi(mapping->host); - unsigned long max_pages, index; + unsigned long max_pages; if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead)) return; @@ -322,7 +322,6 @@ void force_page_cache_ra(struct readahead_control *ractl, * If the request exceeds the readahead window, allow the read to * be up to the optimal hardware IO size */ - index = readahead_index(ractl); max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages); nr_to_read = min_t(unsigned long, nr_to_read, max_pages); while (nr_to_read) { @@ -330,10 +329,8 @@ void force_page_cache_ra(struct readahead_control *ractl, if (this_chunk > nr_to_read) this_chunk = nr_to_read; - ractl->_index = index; do_page_cache_ra(ractl, this_chunk, 0); - index += this_chunk; nr_to_read -= this_chunk; } } @@ -413,58 +410,6 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra, * it approaches max_readhead. */ -/* - * Count contiguously cached pages from @index-1 to @index-@max, - * this count is a conservative estimation of - * - length of the sequential read sequence, or - * - thrashing threshold in memory tight systems - */ -static pgoff_t count_history_pages(struct address_space *mapping, - pgoff_t index, unsigned long max) -{ - pgoff_t head; - - rcu_read_lock(); - head = page_cache_prev_miss(mapping, index - 1, max); - rcu_read_unlock(); - - return index - 1 - head; -} - -/* - * page cache context based readahead - */ -static int try_context_readahead(struct address_space *mapping, - struct file_ra_state *ra, - pgoff_t index, - unsigned long req_size, - unsigned long max) -{ - pgoff_t size; - - size = count_history_pages(mapping, index, max); - - /* - * not enough history pages: - * it could be a random read - */ - if (size <= req_size) - return 0; - - /* - * starts from beginning of file: - * it is a strong indication of long-run stream (or whole-file-read) - */ - if (size >= index) - size *= 2; - - ra->start = index; - ra->size = min(size + req_size, max); - ra->async_size = 1; - - return 1; -} - static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index, pgoff_t mark, unsigned int order, gfp_t gfp) { @@ -491,7 +436,8 @@ void page_cache_ra_order(struct readahead_control *ractl, struct file_ra_state *ra, unsigned int new_order) { struct address_space *mapping = ractl->mapping; - pgoff_t index = readahead_index(ractl); + pgoff_t start = readahead_index(ractl); + pgoff_t index = start; pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; pgoff_t mark = index + ra->size - ra->async_size; unsigned int nofs; @@ -527,11 +473,6 @@ void page_cache_ra_order(struct readahead_control *ractl, index += 1UL << order; } - if (index > limit) { - ra->size += index - limit - 1; - ra->async_size += index - limit - 1; - } - read_pages(ractl); filemap_invalidate_unlock_shared(mapping); memalloc_nofs_restore(nofs); @@ -544,22 +485,14 @@ void page_cache_ra_order(struct readahead_control *ractl, if (!err) return; fallback: - do_page_cache_ra(ractl, ra->size, ra->async_size); + do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size); } -/* - * A minimal readahead algorithm for trivial sequential/random reads. - */ -static void ondemand_readahead(struct readahead_control *ractl, - struct folio *folio, unsigned long req_size) +static unsigned long ractl_max_pages(struct readahead_control *ractl, + unsigned long req_size) { struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host); - struct file_ra_state *ra = ractl->ra; - unsigned long max_pages = ra->ra_pages; - unsigned long add_pages; - pgoff_t index = readahead_index(ractl); - pgoff_t expected, prev_index; - unsigned int order = folio ? folio_order(folio) : 0; + unsigned long max_pages = ractl->ra->ra_pages; /* * If the request exceeds the readahead window, allow the read to @@ -567,112 +500,17 @@ static void ondemand_readahead(struct readahead_control *ractl, */ if (req_size > max_pages && bdi->io_pages > max_pages) max_pages = min(req_size, bdi->io_pages); - - /* - * start of file - */ - if (!index) - goto initial_readahead; - - /* - * It's the expected callback index, assume sequential access. - * Ramp up sizes, and push forward the readahead window. - */ - expected = round_down(ra->start + ra->size - ra->async_size, - 1UL << order); - if (index == expected || index == (ra->start + ra->size)) { - ra->start += ra->size; - ra->size = get_next_ra_size(ra, max_pages); - ra->async_size = ra->size; - goto readit; - } - - /* - * Hit a marked folio without valid readahead state. - * E.g. interleaved reads. - * Query the pagecache for async_size, which normally equals to - * readahead size. Ramp it up and use it as the new readahead size. - */ - if (folio) { - pgoff_t start; - - rcu_read_lock(); - start = page_cache_next_miss(ractl->mapping, index + 1, - max_pages); - rcu_read_unlock(); - - if (!start || start - index > max_pages) - return; - - ra->start = start; - ra->size = start - index; /* old async_size */ - ra->size += req_size; - ra->size = get_next_ra_size(ra, max_pages); - ra->async_size = ra->size; - goto readit; - } - - /* - * oversize read - */ - if (req_size > max_pages) - goto initial_readahead; - - /* - * sequential cache miss - * trivial case: (index - prev_index) == 1 - * unaligned reads: (index - prev_index) == 0 - */ - prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; - if (index - prev_index <= 1UL) - goto initial_readahead; - - /* - * Query the page cache and look for the traces(cached history pages) - * that a sequential stream would leave behind. - */ - if (try_context_readahead(ractl->mapping, ra, index, req_size, - max_pages)) - goto readit; - - /* - * standalone, small random read - * Read as is, and do not pollute the readahead state. - */ - do_page_cache_ra(ractl, req_size, 0); - return; - -initial_readahead: - ra->start = index; - ra->size = get_init_ra_size(req_size, max_pages); - ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size; - -readit: - /* - * Will this read hit the readahead marker made by itself? - * If so, trigger the readahead marker hit now, and merge - * the resulted next readahead window into the current one. - * Take care of maximum IO pages as above. - */ - if (index == ra->start && ra->size == ra->async_size) { - add_pages = get_next_ra_size(ra, max_pages); - if (ra->size + add_pages <= max_pages) { - ra->async_size = add_pages; - ra->size += add_pages; - } else { - ra->size = max_pages; - ra->async_size = max_pages >> 1; - } - } - - ractl->_index = ra->start; - page_cache_ra_order(ractl, ra, order); + return max_pages; } void page_cache_sync_ra(struct readahead_control *ractl, unsigned long req_count) { + pgoff_t index = readahead_index(ractl); bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM); + struct file_ra_state *ra = ractl->ra; + unsigned long max_pages, contig_count; + pgoff_t prev_index, miss; /* * Even if readahead is disabled, issue this request as readahead @@ -680,7 +518,7 @@ void page_cache_sync_ra(struct readahead_control *ractl, * readahead will do the right thing and limit the read to just the * requested range, which we'll set to 1 page for this case. */ - if (!ractl->ra->ra_pages || blk_cgroup_congested()) { + if (!ra->ra_pages || blk_cgroup_congested()) { if (!ractl->file) return; req_count = 1; @@ -693,15 +531,63 @@ void page_cache_sync_ra(struct readahead_control *ractl, return; } - ondemand_readahead(ractl, NULL, req_count); + max_pages = ractl_max_pages(ractl, req_count); + prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; + /* + * A start of file, oversized read, or sequential cache miss: + * trivial case: (index - prev_index) == 1 + * unaligned reads: (index - prev_index) == 0 + */ + if (!index || req_count > max_pages || index - prev_index <= 1UL) { + ra->start = index; + ra->size = get_init_ra_size(req_count, max_pages); + ra->async_size = ra->size > req_count ? ra->size - req_count : + ra->size >> 1; + goto readit; + } + + /* + * Query the page cache and look for the traces(cached history pages) + * that a sequential stream would leave behind. + */ + rcu_read_lock(); + miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages); + rcu_read_unlock(); + contig_count = index - miss - 1; + /* + * Standalone, small random read. Read as is, and do not pollute the + * readahead state. + */ + if (contig_count <= req_count) { + do_page_cache_ra(ractl, req_count, 0); + return; + } + /* + * File cached from the beginning: + * it is a strong indication of long-run stream (or whole-file-read) + */ + if (miss == ULONG_MAX) + contig_count *= 2; + ra->start = index; + ra->size = min(contig_count + req_count, max_pages); + ra->async_size = 1; +readit: + ractl->_index = ra->start; + page_cache_ra_order(ractl, ra, 0); } EXPORT_SYMBOL_GPL(page_cache_sync_ra); void page_cache_async_ra(struct readahead_control *ractl, struct folio *folio, unsigned long req_count) { + unsigned long max_pages; + struct file_ra_state *ra = ractl->ra; + pgoff_t index = readahead_index(ractl); + pgoff_t expected, start; + unsigned int order = folio_order(folio); + /* no readahead */ - if (!ractl->ra->ra_pages) + if (!ra->ra_pages) return; /* @@ -715,7 +601,41 @@ void page_cache_async_ra(struct readahead_control *ractl, if (blk_cgroup_congested()) return; - ondemand_readahead(ractl, folio, req_count); + max_pages = ractl_max_pages(ractl, req_count); + /* + * It's the expected callback index, assume sequential access. + * Ramp up sizes, and push forward the readahead window. + */ + expected = round_down(ra->start + ra->size - ra->async_size, + 1UL << order); + if (index == expected) { + ra->start += ra->size; + ra->size = get_next_ra_size(ra, max_pages); + ra->async_size = ra->size; + goto readit; + } + + /* + * Hit a marked folio without valid readahead state. + * E.g. interleaved reads. + * Query the pagecache for async_size, which normally equals to + * readahead size. Ramp it up and use it as the new readahead size. + */ + rcu_read_lock(); + start = page_cache_next_miss(ractl->mapping, index + 1, max_pages); + rcu_read_unlock(); + + if (!start || start - index > max_pages) + return; + + ra->start = start; + ra->size = start - index; /* old async_size */ + ra->size += req_count; + ra->size = get_next_ra_size(ra, max_pages); + ra->async_size = ra->size; +readit: + ractl->_index = ra->start; + page_cache_ra_order(ractl, ra, order); } EXPORT_SYMBOL_GPL(page_cache_async_ra); diff --git a/mm/rmap.c b/mm/rmap.c index e8fc5ecb59b2..8616308610b9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1269,33 +1269,42 @@ static void __page_check_anon_rmap(struct folio *folio, struct page *page, page); } +static void __folio_mod_stat(struct folio *folio, int nr, int nr_pmdmapped) +{ + int idx; + + if (nr) { + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED; + __lruvec_stat_mod_folio(folio, idx, nr); + } + if (nr_pmdmapped) { + if (folio_test_anon(folio)) { + idx = NR_ANON_THPS; + __lruvec_stat_mod_folio(folio, idx, nr_pmdmapped); + } else { + /* NR_*_PMDMAPPED are not maintained per-memcg */ + idx = folio_test_swapbacked(folio) ? + NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED; + __mod_node_page_state(folio_pgdat(folio), idx, + nr_pmdmapped); + } + } +} + static __always_inline void __folio_add_anon_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma, unsigned long address, rmap_t flags, enum rmap_level level) { int i, nr, nr_pmdmapped = 0; - nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped); - if (nr_pmdmapped) - __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); - if (nr) - __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); - if (unlikely(!folio_test_anon(folio))) { - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); - /* - * For a PTE-mapped large folio, we only know that the single - * PTE is exclusive. Further, __folio_set_anon() might not get - * folio->index right when not given the address of the head - * page. - */ - VM_WARN_ON_FOLIO(folio_test_large(folio) && - level != RMAP_LEVEL_PMD, folio); - __folio_set_anon(folio, vma, address, - !!(flags & RMAP_EXCLUSIVE)); - } else if (likely(!folio_test_ksm(folio))) { + nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped); + + if (likely(!folio_test_ksm(folio))) __page_check_anon_rmap(folio, page, vma, address); - } + + __folio_mod_stat(folio, nr, nr_pmdmapped); if (flags & RMAP_EXCLUSIVE) { switch (level) { @@ -1381,29 +1390,37 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page, * @folio: The folio to add the mapping to. * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped + * @flags: The rmap flags * * Like folio_add_anon_rmap_*() but must only be called on *new* folios. * This means the inc-and-test can be bypassed. - * The folio does not have to be locked. + * The folio doesn't necessarily need to be locked while it's exclusive + * unless two threads map it concurrently. However, the folio must be + * locked if it's shared. * - * If the folio is pmd-mappable, it is accounted as a THP. As the folio - * is new, it's assumed to be mapped exclusively by a single process. + * If the folio is pmd-mappable, it is accounted as a THP. */ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, - unsigned long address) + unsigned long address, rmap_t flags) { - int nr = folio_nr_pages(folio); + const int nr = folio_nr_pages(folio); + const bool exclusive = flags & RMAP_EXCLUSIVE; + int nr_pmdmapped = 0; VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + VM_WARN_ON_FOLIO(!exclusive && !folio_test_locked(folio), folio); VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma); - __folio_set_swapbacked(folio); - __folio_set_anon(folio, vma, address, true); + + if (!folio_test_swapbacked(folio)) + __folio_set_swapbacked(folio); + __folio_set_anon(folio, vma, address, exclusive); if (likely(!folio_test_large(folio))) { /* increment count (starts at -1) */ atomic_set(&folio->_mapcount, 0); - SetPageAnonExclusive(&folio->page); + if (exclusive) + SetPageAnonExclusive(&folio->page); } else if (!folio_test_pmd_mappable(folio)) { int i; @@ -1412,7 +1429,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, /* increment count (starts at -1) */ atomic_set(&page->_mapcount, 0); - SetPageAnonExclusive(page); + if (exclusive) + SetPageAnonExclusive(page); } /* increment count (starts at -1) */ @@ -1424,28 +1442,24 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, /* increment count (starts at -1) */ atomic_set(&folio->_large_mapcount, 0); atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED); - SetPageAnonExclusive(&folio->page); - __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); + if (exclusive) + SetPageAnonExclusive(&folio->page); + nr_pmdmapped = nr; } - __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); + __folio_mod_stat(folio, nr, nr_pmdmapped); } static __always_inline void __folio_add_file_rmap(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma, enum rmap_level level) { - pg_data_t *pgdat = folio_pgdat(folio); int nr, nr_pmdmapped = 0; VM_WARN_ON_FOLIO(folio_test_anon(folio), folio); nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped); - if (nr_pmdmapped) - __mod_node_page_state(pgdat, folio_test_swapbacked(folio) ? - NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped); - if (nr) - __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr); + __folio_mod_stat(folio, nr, nr_pmdmapped); /* See comments in folio_add_anon_rmap_*() */ if (!folio_test_large(folio)) @@ -1494,10 +1508,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, enum rmap_level level) { atomic_t *mapped = &folio->_nr_pages_mapped; - pg_data_t *pgdat = folio_pgdat(folio); int last, nr = 0, nr_pmdmapped = 0; bool partially_mapped = false; - enum node_stat_item idx; __folio_rmap_sanity_checks(folio, page, nr_pages, level); @@ -1541,20 +1553,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, break; } - if (nr_pmdmapped) { - /* NR_{FILE/SHMEM}_PMDMAPPED are not maintained per-memcg */ - if (folio_test_anon(folio)) - __lruvec_stat_mod_folio(folio, NR_ANON_THPS, -nr_pmdmapped); - else - __mod_node_page_state(pgdat, - folio_test_swapbacked(folio) ? - NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, - -nr_pmdmapped); - } if (nr) { - idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED; - __lruvec_stat_mod_folio(folio, idx, -nr); - /* * Queue anon large folio for deferred split if at least one * page of the folio is unmapped and at least one page @@ -1566,6 +1565,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, list_empty(&folio->_deferred_list)) deferred_split_folio(folio); } + __folio_mod_stat(folio, -nr, -nr_pmdmapped); /* * It would be tidy to reset folio_test_anon mapping when fully @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; - if (flags & TTU_SPLIT_HUGE_PMD) - split_huge_pmd_address(vma, address, false, folio); - /* * For THP, we have to assume the worse case ie pmd for invalidation. * For hugetlb, it could be much worse if we need to do pud @@ -1668,9 +1665,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { - /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_FOLIO(!pvmw.pte, folio); - /* * If the folio is in an mlock()d vma, we must not swap it out. */ @@ -1679,11 +1673,30 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, /* Restore the mlock which got missed */ if (!folio_test_large(folio)) mlock_vma_folio(folio, vma); - page_vma_mapped_walk_done(&pvmw); - ret = false; - break; + goto walk_abort; } + if (!pvmw.pte) { + if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, + folio)) + goto walk_done; + + if (flags & TTU_SPLIT_HUGE_PMD) { + /* + * We temporarily have to drop the PTL and + * restart so we can process the PTE-mapped THP. + */ + split_huge_pmd_locked(vma, pvmw.address, + pvmw.pmd, false, folio); + flags &= ~TTU_SPLIT_HUGE_PMD; + page_vma_mapped_walk_restart(&pvmw); + continue; + } + } + + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_FOLIO(!pvmw.pte, folio); + pfn = pte_pfn(ptep_get(pvmw.pte)); subpage = folio_page(folio, pfn - folio_pfn(folio)); address = pvmw.address; @@ -1719,11 +1732,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, */ if (!anon) { VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); - if (!hugetlb_vma_trylock_write(vma)) { - page_vma_mapped_walk_done(&pvmw); - ret = false; - break; - } + if (!hugetlb_vma_trylock_write(vma)) + goto walk_abort; if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) { hugetlb_vma_unlock_write(vma); flush_tlb_range(vma, @@ -1738,8 +1748,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * actual page and drop map count * to zero. */ - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_done; } hugetlb_vma_unlock_write(vma); } @@ -1811,9 +1820,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, if (unlikely(folio_test_swapbacked(folio) != folio_test_swapcache(folio))) { WARN_ON_ONCE(1); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_abort; } /* MADV_FREE page check */ @@ -1852,23 +1859,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, */ set_pte_at(mm, address, pvmw.pte, pteval); folio_set_swapbacked(folio); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_abort; } if (swap_duplicate(entry) < 0) { set_pte_at(mm, address, pvmw.pte, pteval); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_abort; } if (arch_unmap_one(mm, vma, address, pteval) < 0) { swap_free(entry); set_pte_at(mm, address, pvmw.pte, pteval); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_abort; } /* See folio_try_share_anon_rmap(): clear PTE first. */ @@ -1876,9 +1877,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, folio_try_share_anon_rmap_pte(folio, subpage)) { swap_free(entry); set_pte_at(mm, address, pvmw.pte, pteval); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; + goto walk_abort; } if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); @@ -1918,6 +1917,12 @@ discard: if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); folio_put(folio); + continue; +walk_abort: + ret = false; +walk_done: + page_vma_mapped_walk_done(&pvmw); + break; } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/shmem.c b/mm/shmem.c index 831b52dfd56e..2faa9daaf54b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -131,6 +131,13 @@ struct shmem_options { #define SHMEM_SEEN_QUOTA 32 }; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static unsigned long huge_shmem_orders_always __read_mostly; +static unsigned long huge_shmem_orders_madvise __read_mostly; +static unsigned long huge_shmem_orders_inherit __read_mostly; +static unsigned long huge_shmem_orders_within_size __read_mostly; +#endif + #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { @@ -1614,73 +1621,174 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) return result; } -static struct folio *shmem_alloc_hugefolio(gfp_t gfp, +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +unsigned long shmem_allowable_huge_orders(struct inode *inode, + struct vm_area_struct *vma, pgoff_t index, + bool global_huge) +{ + unsigned long mask = READ_ONCE(huge_shmem_orders_always); + unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); + unsigned long vm_flags = vma->vm_flags; + /* + * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that + * are enabled for this vma. + */ + unsigned long orders = BIT(PMD_ORDER + 1) - 1; + loff_t i_size; + int order; + + if ((vm_flags & VM_NOHUGEPAGE) || + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + return 0; + + /* If the hardware/firmware marked hugepage support disabled. */ + if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) + return 0; + + /* + * Following the 'deny' semantics of the top level, force the huge + * option off from all mounts. + */ + if (shmem_huge == SHMEM_HUGE_DENY) + return 0; + + /* + * Only allow inherit orders if the top-level value is 'force', which + * means non-PMD sized THP can not override 'huge' mount option now. + */ + if (shmem_huge == SHMEM_HUGE_FORCE) + return READ_ONCE(huge_shmem_orders_inherit); + + /* Allow mTHP that will be fully within i_size. */ + order = highest_order(within_size_orders); + while (within_size_orders) { + index = round_up(index + 1, order); + i_size = round_up(i_size_read(inode), PAGE_SIZE); + if (i_size >> PAGE_SHIFT >= index) { + mask |= within_size_orders; + break; + } + + order = next_order(&within_size_orders, order); + } + + if (vm_flags & VM_HUGEPAGE) + mask |= READ_ONCE(huge_shmem_orders_madvise); + + if (global_huge) + mask |= READ_ONCE(huge_shmem_orders_inherit); + + return orders & mask; +} + +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, + struct address_space *mapping, pgoff_t index, + unsigned long orders) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pages; + int order; + + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + if (!orders) + return 0; + + /* Find the highest order that can add into the page cache */ + order = highest_order(orders); + while (orders) { + pages = 1UL << order; + index = round_down(index, pages); + if (!xa_find(&mapping->i_pages, &index, + index + pages - 1, XA_PRESENT)) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf, + struct address_space *mapping, pgoff_t index, + unsigned long orders) +{ + return 0; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + +static struct folio *shmem_alloc_folio(gfp_t gfp, int order, struct shmem_inode_info *info, pgoff_t index) { struct mempolicy *mpol; pgoff_t ilx; - struct page *page; + struct folio *folio; - mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx); - page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id()); + mpol = shmem_get_pgoff_policy(info, index, order, &ilx); + folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id()); mpol_cond_put(mpol); - return page_rmappable_folio(page); + return folio; } -static struct folio *shmem_alloc_folio(gfp_t gfp, - struct shmem_inode_info *info, pgoff_t index) -{ - struct mempolicy *mpol; - pgoff_t ilx; - struct page *page; - - mpol = shmem_get_pgoff_policy(info, index, 0, &ilx); - page = alloc_pages_mpol(gfp, 0, mpol, ilx, numa_node_id()); - mpol_cond_put(mpol); - - return (struct folio *)page; -} - -static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, - struct inode *inode, pgoff_t index, - struct mm_struct *fault_mm, bool huge) +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, + gfp_t gfp, struct inode *inode, pgoff_t index, + struct mm_struct *fault_mm, unsigned long orders) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); - struct folio *folio; + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; + unsigned long suitable_orders = 0; + struct folio *folio = NULL; long pages; - int error; + int error, order; if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) - huge = false; + orders = 0; - if (huge) { - pages = HPAGE_PMD_NR; - index = round_down(index, HPAGE_PMD_NR); + if (orders > 0) { + if (vma && vma_is_anon_shmem(vma)) { + suitable_orders = shmem_suitable_orders(inode, vmf, + mapping, index, orders); + } else if (orders & BIT(HPAGE_PMD_ORDER)) { + pages = HPAGE_PMD_NR; + suitable_orders = BIT(HPAGE_PMD_ORDER); + index = round_down(index, HPAGE_PMD_NR); - /* - * Check for conflict before waiting on a huge allocation. - * Conflict might be that a huge page has just been allocated - * and added to page cache by a racing thread, or that there - * is already at least one small page in the huge extent. - * Be careful to retry when appropriate, but not forever! - * Elsewhere -EEXIST would be the right code, but not here. - */ - if (xa_find(&mapping->i_pages, &index, - index + HPAGE_PMD_NR - 1, XA_PRESENT)) - return ERR_PTR(-E2BIG); + /* + * Check for conflict before waiting on a huge allocation. + * Conflict might be that a huge page has just been allocated + * and added to page cache by a racing thread, or that there + * is already at least one small page in the huge extent. + * Be careful to retry when appropriate, but not forever! + * Elsewhere -EEXIST would be the right code, but not here. + */ + if (xa_find(&mapping->i_pages, &index, + index + HPAGE_PMD_NR - 1, XA_PRESENT)) + return ERR_PTR(-E2BIG); + } - folio = shmem_alloc_hugefolio(gfp, info, index); - if (!folio) - count_vm_event(THP_FILE_FALLBACK); + order = highest_order(suitable_orders); + while (suitable_orders) { + pages = 1UL << order; + index = round_down(index, pages); + folio = shmem_alloc_folio(gfp, order, info, index); + if (folio) + goto allocated; + + if (pages == HPAGE_PMD_NR) + count_vm_event(THP_FILE_FALLBACK); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK); +#endif + order = next_order(&suitable_orders, order); + } } else { pages = 1; - folio = shmem_alloc_folio(gfp, info, index); + folio = shmem_alloc_folio(gfp, 0, info, index); } if (!folio) return ERR_PTR(-ENOMEM); +allocated: __folio_set_locked(folio); __folio_set_swapbacked(folio); @@ -1690,9 +1798,15 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp, if (xa_find(&mapping->i_pages, &index, index + pages - 1, XA_PRESENT)) { error = -EEXIST; - } else if (huge) { - count_vm_event(THP_FILE_FALLBACK); - count_vm_event(THP_FILE_FALLBACK_CHARGE); + } else if (pages > 1) { + if (pages == HPAGE_PMD_NR) { + count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK); + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK_CHARGE); +#endif } goto unlock; } @@ -1767,7 +1881,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, old = *foliop; entry = old->swap; - swap_index = swp_offset(entry); + swap_index = swap_cache_index(entry); swap_mapping = swap_address_space(entry); /* @@ -1776,7 +1890,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, */ gfp &= ~GFP_CONSTRAINT_MASK; VM_BUG_ON_FOLIO(folio_test_large(old), old); - new = shmem_alloc_folio(gfp, info, index); + new = shmem_alloc_folio(gfp, 0, info, index); if (!new) return -ENOMEM; @@ -1975,7 +2089,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct mm_struct *fault_mm; struct folio *folio; int error; - bool alloced; + bool alloced, huge; + unsigned long orders = 0; if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping))) return -EINVAL; @@ -2047,23 +2162,34 @@ repeat: return 0; } - if (shmem_is_huge(inode, index, false, fault_mm, - vma ? vma->vm_flags : 0)) { + huge = shmem_is_huge(inode, index, false, fault_mm, + vma ? vma->vm_flags : 0); + /* Find hugepage orders that are allowed for anonymous shmem. */ + if (vma && vma_is_anon_shmem(vma)) + orders = shmem_allowable_huge_orders(inode, vma, index, huge); + else if (huge) + orders = BIT(HPAGE_PMD_ORDER); + + if (orders > 0) { gfp_t huge_gfp; huge_gfp = vma_thp_gfp_mask(vma); huge_gfp = limit_gfp_mask(huge_gfp, gfp); - folio = shmem_alloc_and_add_folio(huge_gfp, - inode, index, fault_mm, true); + folio = shmem_alloc_and_add_folio(vmf, huge_gfp, + inode, index, fault_mm, orders); if (!IS_ERR(folio)) { - count_vm_event(THP_FILE_ALLOC); + if (folio_test_pmd_mappable(folio)) + count_vm_event(THP_FILE_ALLOC); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_ALLOC); +#endif goto alloced; } if (PTR_ERR(folio) == -EEXIST) goto repeat; } - folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false); + folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0); if (IS_ERR(folio)) { error = PTR_ERR(folio); if (error == -EEXIST) @@ -2074,7 +2200,7 @@ repeat: alloced: alloced = true; - if (folio_test_pmd_mappable(folio) && + if (folio_test_large(folio) && DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < folio_next_index(folio) - 1) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); @@ -2283,6 +2409,7 @@ unsigned long shmem_get_unmapped_area(struct file *file, unsigned long inflated_len; unsigned long inflated_addr; unsigned long inflated_offset; + unsigned long hpage_size; if (len > TASK_SIZE) return -ENOMEM; @@ -2301,8 +2428,6 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (shmem_huge == SHMEM_HUGE_DENY) return addr; - if (len < HPAGE_PMD_SIZE) - return addr; if (flags & MAP_FIXED) return addr; /* @@ -2314,8 +2439,11 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (uaddr == addr) return addr; + hpage_size = HPAGE_PMD_SIZE; if (shmem_huge != SHMEM_HUGE_FORCE) { struct super_block *sb; + unsigned long __maybe_unused hpage_orders; + int order = 0; if (file) { VM_BUG_ON(file->f_op != &shmem_file_operations); @@ -2328,18 +2456,38 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (IS_ERR(shm_mnt)) return addr; sb = shm_mnt->mnt_sb; + + /* + * Find the highest mTHP order used for anonymous shmem to + * provide a suitable alignment address. + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + hpage_orders = READ_ONCE(huge_shmem_orders_always); + hpage_orders |= READ_ONCE(huge_shmem_orders_within_size); + hpage_orders |= READ_ONCE(huge_shmem_orders_madvise); + if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER) + hpage_orders |= READ_ONCE(huge_shmem_orders_inherit); + + if (hpage_orders > 0) { + order = highest_order(hpage_orders); + hpage_size = PAGE_SIZE << order; + } +#endif } - if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER) + if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order) return addr; } - offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1); - if (offset && offset + len < 2 * HPAGE_PMD_SIZE) - return addr; - if ((addr & (HPAGE_PMD_SIZE-1)) == offset) + if (len < hpage_size) return addr; - inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE; + offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1); + if (offset && offset + len < 2 * hpage_size) + return addr; + if ((addr & (hpage_size - 1)) == offset) + return addr; + + inflated_len = len + hpage_size - PAGE_SIZE; if (inflated_len > TASK_SIZE) return addr; if (inflated_len < len) @@ -2352,10 +2500,10 @@ unsigned long shmem_get_unmapped_area(struct file *file, if (inflated_addr & ~PAGE_MASK) return addr; - inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1); + inflated_offset = inflated_addr & (hpage_size - 1); inflated_addr += offset - inflated_offset; if (inflated_offset > offset) - inflated_addr += HPAGE_PMD_SIZE; + inflated_addr += hpage_size; if (inflated_addr > TASK_SIZE - len) return addr; @@ -2644,7 +2792,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, if (!*foliop) { ret = -ENOMEM; - folio = shmem_alloc_folio(gfp, info, pgoff); + folio = shmem_alloc_folio(gfp, 0, info, pgoff); if (!folio) goto out_unacct_blocks; @@ -4695,6 +4843,12 @@ void __init shmem_init(void) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ + + /* + * Default to setting PMD-sized THP to inherit the global setting and + * disable all other multi-size THPs. + */ + huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); #endif return; @@ -4754,6 +4908,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY) return -EINVAL; + /* Do not override huge allocation policy with non-PMD sized mTHP */ + if (huge == SHMEM_HUGE_FORCE && + huge_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER)) + return -EINVAL; + shmem_huge = huge; if (shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; @@ -4761,6 +4920,84 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, } struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled); +static DEFINE_SPINLOCK(huge_shmem_orders_lock); + +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int order = to_thpsize(kobj)->order; + const char *output; + + if (test_bit(order, &huge_shmem_orders_always)) + output = "[always] inherit within_size advise never"; + else if (test_bit(order, &huge_shmem_orders_inherit)) + output = "always [inherit] within_size advise never"; + else if (test_bit(order, &huge_shmem_orders_within_size)) + output = "always inherit [within_size] advise never"; + else if (test_bit(order, &huge_shmem_orders_madvise)) + output = "always inherit within_size [advise] never"; + else + output = "always inherit within_size advise [never]"; + + return sysfs_emit(buf, "%s\n", output); +} + +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int order = to_thpsize(kobj)->order; + ssize_t ret = count; + + if (sysfs_streq(buf, "always")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_madvise); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_always); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "inherit")) { + /* Do not override huge allocation policy with non-PMD sized mTHP */ + if (shmem_huge == SHMEM_HUGE_FORCE && + order != HPAGE_PMD_ORDER) + return -EINVAL; + + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_madvise); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_inherit); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "within_size")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_madvise); + set_bit(order, &huge_shmem_orders_within_size); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "advise")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_within_size); + set_bit(order, &huge_shmem_orders_madvise); + spin_unlock(&huge_shmem_orders_lock); + } else if (sysfs_streq(buf, "never")) { + spin_lock(&huge_shmem_orders_lock); + clear_bit(order, &huge_shmem_orders_always); + clear_bit(order, &huge_shmem_orders_inherit); + clear_bit(order, &huge_shmem_orders_within_size); + clear_bit(order, &huge_shmem_orders_madvise); + spin_unlock(&huge_shmem_orders_lock); + } else { + ret = -EINVAL; + } + + return ret; +} + +struct kobj_attribute thpsize_shmem_enabled_attr = + __ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store); #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ #else /* !CONFIG_SHMEM */ diff --git a/mm/slab.h b/mm/slab.h index ece18ef5dd04..dcdb56b8e7f5 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -577,7 +577,7 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, gfp_t flags, size_t size, void **p); void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, diff --git a/mm/slab_common.c b/mm/slab_common.c index 70943a4c1c4b..40b582a014b8 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -817,7 +817,7 @@ EXPORT_SYMBOL(kmalloc_size_roundup); #define KMALLOC_DMA_NAME(sz) #endif -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG #define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz, #else #define KMALLOC_CGROUP_NAME(sz) @@ -959,7 +959,7 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type) if ((KMALLOC_RECLAIM != KMALLOC_NORMAL) && (type == KMALLOC_RECLAIM)) { flags |= SLAB_RECLAIM_ACCOUNT; - } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) { + } else if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_CGROUP)) { if (mem_cgroup_kmem_disabled()) { kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx]; return; @@ -975,10 +975,10 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type) #endif /* - * If CONFIG_MEMCG_KMEM is enabled, disable cache merging for + * If CONFIG_MEMCG is enabled, disable cache merging for * KMALLOC_NORMAL caches. */ - if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) + if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_NORMAL)) flags |= SLAB_NO_MERGE; if (minalign > ARCH_KMALLOC_MINALIGN) { @@ -1005,7 +1005,7 @@ void __init create_kmalloc_caches(void) enum kmalloc_cache_type type; /* - * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined + * Including KMALLOC_CGROUP if CONFIG_MEMCG defined */ for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) { /* Caches that are NOT of the two-to-the-power-of size. */ diff --git a/mm/slub.c b/mm/slub.c index 829a1f08e8a2..3520acaf9afa 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -845,10 +845,12 @@ static int disable_higher_order_debug; static inline void metadata_access_enable(void) { kasan_disable_current(); + kmsan_disable_current(); } static inline void metadata_access_disable(void) { + kmsan_enable_current(); kasan_enable_current(); } @@ -1153,7 +1155,13 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) unsigned int poison_size = s->object_size; if (s->flags & SLAB_RED_ZONE) { - memset(p - s->red_left_pad, val, s->red_left_pad); + /* + * Here and below, avoid overwriting the KMSAN shadow. Keeping + * the shadow makes it possible to distinguish uninit-value + * from use-after-free. + */ + memset_no_sanitize_memory(p - s->red_left_pad, val, + s->red_left_pad); if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) { /* @@ -1166,12 +1174,13 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) } if (s->flags & __OBJECT_POISON) { - memset(p, POISON_FREE, poison_size - 1); - p[poison_size - 1] = POISON_END; + memset_no_sanitize_memory(p, POISON_FREE, poison_size - 1); + memset_no_sanitize_memory(p + poison_size - 1, POISON_END, 1); } if (s->flags & SLAB_RED_ZONE) - memset(p + poison_size, val, s->inuse - poison_size); + memset_no_sanitize_memory(p + poison_size, val, + s->inuse - poison_size); } static void restore_bytes(struct kmem_cache *s, char *message, u8 data, @@ -1181,9 +1190,16 @@ static void restore_bytes(struct kmem_cache *s, char *message, u8 data, memset(from, data, to - from); } -static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab, - u8 *object, char *what, - u8 *start, unsigned int value, unsigned int bytes) +#ifdef CONFIG_KMSAN +#define pad_check_attributes noinline __no_kmsan_checks +#else +#define pad_check_attributes +#endif + +static pad_check_attributes int +check_bytes_and_report(struct kmem_cache *s, struct slab *slab, + u8 *object, char *what, + u8 *start, unsigned int value, unsigned int bytes) { u8 *fault; u8 *end; @@ -1273,7 +1289,8 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p) } /* Check the pad bytes at the end of a slab page */ -static void slab_pad_check(struct kmem_cache *s, struct slab *slab) +static pad_check_attributes void +slab_pad_check(struct kmem_cache *s, struct slab *slab) { u8 *start; u8 *fault; @@ -2021,7 +2038,7 @@ static inline bool need_slab_obj_ext(void) return true; /* - * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally + * CONFIG_MEMCG creates vector of obj_cgroup objects conditionally * inside memcg_slab_post_alloc_hook. No other users for now. */ return false; @@ -2126,7 +2143,7 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, #endif /* CONFIG_MEM_ALLOC_PROFILING */ -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG static void memcg_alloc_abort_single(struct kmem_cache *s, void *object); @@ -2168,7 +2185,7 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, __memcg_slab_free_hook(s, slab, p, objects, obj_exts); } -#else /* CONFIG_MEMCG_KMEM */ +#else /* CONFIG_MEMCG */ static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, gfp_t flags, size_t size, @@ -2181,7 +2198,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, int objects) { } -#endif /* CONFIG_MEMCG_KMEM */ +#endif /* CONFIG_MEMCG */ /* * Hooks for other subsystems that check memory allocations. In a typical @@ -3914,14 +3931,6 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s, 0, sizeof(void *)); } -noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) -{ - if (__should_failslab(s, gfpflags)) - return -ENOMEM; - return 0; -} -ALLOW_ERROR_INJECTION(should_failslab, ERRNO); - static __fastpath_inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) { @@ -4465,7 +4474,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object, do_slab_free(s, slab, object, object, 1, addr); } -#ifdef CONFIG_MEMCG_KMEM +#ifdef CONFIG_MEMCG /* Do not inline the rare memcg charging failed path into the allocation path */ static noinline void memcg_alloc_abort_single(struct kmem_cache *s, void *object) diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index a2cbe44c48e1..1dda6c53370b 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -469,5 +469,13 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn, if (r < 0) return NULL; + if (system_state == SYSTEM_BOOTING) { + mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(end - start, + PAGE_SIZE)); + } else { + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP, + DIV_ROUND_UP(end - start, PAGE_SIZE)); + } + return pfn_to_page(pfn); } diff --git a/mm/sparse.c b/mm/sparse.c index de40b2c73406..e4b830091d13 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -14,7 +14,7 @@ #include #include #include - +#include #include "internal.h" #include @@ -192,13 +192,10 @@ static void subsection_mask_set(unsigned long *map, unsigned long pfn, void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages) { - int end_sec = pfn_to_section_nr(pfn + nr_pages - 1); - unsigned long nr, start_sec = pfn_to_section_nr(pfn); + int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1); + unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn); - if (!nr_pages) - return; - - for (nr = start_sec; nr <= end_sec; nr++) { + for (nr = start_sec_nr; nr <= end_sec_nr; nr++) { struct mem_section *ms; unsigned long pfns; @@ -229,17 +226,17 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en start &= PAGE_SECTION_MASK; mminit_validate_memmodel_limits(&start, &end); for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { - unsigned long section = pfn_to_section_nr(pfn); + unsigned long section_nr = pfn_to_section_nr(pfn); struct mem_section *ms; - sparse_index_init(section, nid); - set_section_nid(section, nid); + sparse_index_init(section_nr, nid); + set_section_nid(section_nr, nid); - ms = __nr_to_section(section); + ms = __nr_to_section(section_nr); if (!ms->section_mem_map) { ms->section_mem_map = sparse_encode_early_nid(nid) | SECTION_IS_ONLINE; - __section_mark_present(ms, section); + __section_mark_present(ms, section_nr); } } } @@ -351,7 +348,7 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat, again: usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid); if (!usage && limit) { - limit = 0; + limit = MEMBLOCK_ALLOC_ACCESSIBLE; goto again; } return usage; @@ -465,6 +462,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid) */ sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true); sparsemap_buf_end = sparsemap_buf + size; +#ifndef CONFIG_SPARSEMEM_VMEMMAP + mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE)); +#endif } static void __init sparse_buffer_fini(void) @@ -643,6 +643,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages, unsigned long start = (unsigned long) pfn_to_page(pfn); unsigned long end = start + nr_pages * sizeof(struct page); + mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_MEMMAP, + -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE))); vmemmap_free(start, end, altmap); } static void free_map_bootmem(struct page *memmap) diff --git a/mm/swap.c b/mm/swap.c index 67786cb77130..9caf6b017cf0 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -123,8 +123,7 @@ void __folio_put(struct folio *folio) } page_cache_release(folio); - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); mem_cgroup_uncharge(folio); free_unref_page(&folio->page, folio_order(folio)); } @@ -212,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn) for (i = 0; i < folio_batch_count(fbatch); i++) { struct folio *folio = fbatch->folios[i]; - /* block memcg migration while the folio moves between lru */ - if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) - continue; - folio_lruvec_relock_irqsave(folio, &lruvec, &flags); move_fn(lruvec, folio); @@ -256,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio) void folio_rotate_reclaimable(struct folio *folio) { if (!folio_test_locked(folio) && !folio_test_dirty(folio) && - !folio_test_unevictable(folio) && folio_test_lru(folio)) { + !folio_test_unevictable(folio)) { struct folio_batch *fbatch; unsigned long flags; folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + local_lock_irqsave(&lru_rotate.lock, flags); fbatch = this_cpu_ptr(&lru_rotate.fbatch); folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); @@ -353,11 +353,15 @@ static void folio_activate_drain(int cpu) void folio_activate(struct folio *folio) { - if (folio_test_lru(folio) && !folio_test_active(folio) && - !folio_test_unevictable(folio)) { + if (!folio_test_active(folio) && !folio_test_unevictable(folio)) { struct folio_batch *fbatch; folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.activate); folio_batch_add_and_move(fbatch, folio, folio_activate_fn); @@ -701,6 +705,11 @@ void deactivate_file_folio(struct folio *folio) return; folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file); folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn); @@ -717,11 +726,16 @@ void deactivate_file_folio(struct folio *folio) */ void folio_deactivate(struct folio *folio) { - if (folio_test_lru(folio) && !folio_test_unevictable(folio) && - (folio_test_active(folio) || lru_gen_enabled())) { + if (!folio_test_unevictable(folio) && (folio_test_active(folio) || + lru_gen_enabled())) { struct folio_batch *fbatch; folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn); @@ -738,12 +752,16 @@ void folio_deactivate(struct folio *folio) */ void folio_mark_lazyfree(struct folio *folio) { - if (folio_test_lru(folio) && folio_test_anon(folio) && - folio_test_swapbacked(folio) && !folio_test_swapcache(folio) && - !folio_test_unevictable(folio)) { + if (folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) { struct folio_batch *fbatch; folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn); @@ -1002,10 +1020,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) free_huge_folio(folio); continue; } - if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); - + folio_undo_large_rmappable(folio); __page_cache_release(folio, &lruvec, &flags); if (j != i) diff --git a/mm/swap.h b/mm/swap.h index fc2f6ade7f80..baa1fa946b34 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -5,13 +5,13 @@ struct mempolicy; #ifdef CONFIG_SWAP +#include /* for swp_offset */ #include /* for bio_end_io_t */ /* linux/mm/page_io.c */ int sio_pool_init(void); struct swap_iocb; -void swap_read_folio(struct folio *folio, bool do_poll, - struct swap_iocb **plug); +void swap_read_folio(struct folio *folio, struct swap_iocb **plug); void __swap_read_unplug(struct swap_iocb *plug); static inline void swap_read_unplug(struct swap_iocb *plug) { @@ -26,11 +26,29 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc); /* One swap address space for each 64M swap space */ #define SWAP_ADDRESS_SPACE_SHIFT 14 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) +#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1) extern struct address_space *swapper_spaces[]; #define swap_address_space(entry) \ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) +/* + * Return the swap device position of the swap entry. + */ +static inline loff_t swap_dev_pos(swp_entry_t entry) +{ + return ((loff_t)swp_offset(entry)) << PAGE_SHIFT; +} + +/* + * Return the swap cache index of the swap entry. + */ +static inline pgoff_t swap_cache_index(swp_entry_t entry) +{ + BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK); + return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK; +} + void show_swap_cache_info(void); bool add_to_swap(struct folio *folio); void *get_shadow_from_swap_cache(swp_entry_t entry); @@ -64,8 +82,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio) } #else /* CONFIG_SWAP */ struct swap_iocb; -static inline void swap_read_folio(struct folio *folio, bool do_poll, - struct swap_iocb **plug) +static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { } static inline void swap_write_unplug(struct swap_iocb *sio) @@ -77,6 +94,11 @@ static inline struct address_space *swap_address_space(swp_entry_t entry) return NULL; } +static inline pgoff_t swap_cache_index(swp_entry_t entry) +{ + return 0; +} + static inline void show_swap_cache_info(void) { } diff --git a/mm/swap_state.c b/mm/swap_state.c index 642c30d8376c..a1726e49a5eb 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -28,7 +28,7 @@ /* * swapper_space is a fiction, retained to simplify the path through - * vmscan's shrink_page_list. + * vmscan's shrink_folio_list. */ static const struct address_space_operations swap_aops = { .writepage = swap_writepage, @@ -42,6 +42,8 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; static bool enable_vma_readahead __read_mostly = true; +#define SWAP_RA_ORDER_CEILING 5 + #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) #define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK @@ -72,7 +74,7 @@ void show_swap_cache_info(void) void *get_shadow_from_swap_cache(swp_entry_t entry) { struct address_space *address_space = swap_address_space(entry); - pgoff_t idx = swp_offset(entry); + pgoff_t idx = swap_cache_index(entry); void *shadow; shadow = xa_load(&address_space->i_pages, idx); @@ -89,7 +91,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp) { struct address_space *address_space = swap_address_space(entry); - pgoff_t idx = swp_offset(entry); + pgoff_t idx = swap_cache_index(entry); XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio)); unsigned long i, nr = folio_nr_pages(folio); void *old; @@ -144,7 +146,7 @@ void __delete_from_swap_cache(struct folio *folio, struct address_space *address_space = swap_address_space(entry); int i; long nr = folio_nr_pages(folio); - pgoff_t idx = swp_offset(entry); + pgoff_t idx = swap_cache_index(entry); XA_STATE(xas, &address_space->i_pages, idx); xas_set_update(&xas, workingset_update_node); @@ -253,13 +255,14 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin, for (;;) { swp_entry_t entry = swp_entry(type, curr); + unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK; struct address_space *address_space = swap_address_space(entry); - XA_STATE(xas, &address_space->i_pages, curr); + XA_STATE(xas, &address_space->i_pages, index); xas_set_update(&xas, workingset_update_node); xa_lock_irq(&address_space->i_pages); - xas_for_each(&xas, old, end) { + xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) { if (!xa_is_value(old)) continue; xas_store(&xas, NULL); @@ -350,7 +353,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry, { struct folio *folio; - folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry)); + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); if (!IS_ERR(folio)) { bool vma_ra = swap_use_vma_readahead(); bool readahead; @@ -420,7 +423,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping, si = get_swap_device(swp); if (!si) return ERR_PTR(-ENOENT); - index = swp_offset(swp); + index = swap_cache_index(swp); folio = filemap_get_folio(swap_address_space(swp), index); put_swap_device(si); return folio; @@ -447,7 +450,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * that would confuse statistics. */ folio = filemap_get_folio(swap_address_space(entry), - swp_offset(entry)); + swap_cache_index(entry)); if (!IS_ERR(folio)) goto got_folio; @@ -467,8 +470,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will * cause any racers to loop around until we add it to cache. */ - folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, - mpol, ilx, numa_node_id()); + folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); if (!folio) goto fail_put_swap; @@ -564,7 +566,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, mpol_cond_put(mpol); if (page_allocated) - swap_read_folio(folio, false, plug); + swap_read_folio(folio, plug); return folio; } @@ -681,7 +683,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, false, &splug); + swap_read_folio(folio, &splug); if (offset != entry_offset) { folio_set_readahead(folio); count_vm_event(SWAP_RA); @@ -698,7 +700,7 @@ skip: &page_allocated, false); if (unlikely(page_allocated)) { zswap_folio_swapin(folio); - swap_read_folio(folio, false, NULL); + swap_read_folio(folio, NULL); } return folio; } @@ -738,62 +740,42 @@ void exit_swap_address_space(unsigned int type) swapper_spaces[type] = NULL; } -#define SWAP_RA_ORDER_CEILING 5 - -struct vma_swap_readahead { - unsigned short win; - unsigned short offset; - unsigned short nr_pte; -}; - -static void swap_ra_info(struct vm_fault *vmf, - struct vma_swap_readahead *ra_info) +static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, + unsigned long *end) { struct vm_area_struct *vma = vmf->vma; unsigned long ra_val; - unsigned long faddr, pfn, fpfn, lpfn, rpfn; - unsigned long start, end; + unsigned long faddr, prev_faddr, left, right; unsigned int max_win, hits, prev_win, win; - max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster), - SWAP_RA_ORDER_CEILING); - if (max_win == 1) { - ra_info->win = 1; - return; - } + max_win = 1 << min(READ_ONCE(page_cluster), SWAP_RA_ORDER_CEILING); + if (max_win == 1) + return 1; faddr = vmf->address; - fpfn = PFN_DOWN(faddr); ra_val = GET_SWAP_RA_VAL(vma); - pfn = PFN_DOWN(SWAP_RA_ADDR(ra_val)); + prev_faddr = SWAP_RA_ADDR(ra_val); prev_win = SWAP_RA_WIN(ra_val); hits = SWAP_RA_HITS(ra_val); - ra_info->win = win = __swapin_nr_pages(pfn, fpfn, hits, - max_win, prev_win); - atomic_long_set(&vma->swap_readahead_info, - SWAP_RA_VAL(faddr, win, 0)); + win = __swapin_nr_pages(PFN_DOWN(prev_faddr), PFN_DOWN(faddr), hits, + max_win, prev_win); + atomic_long_set(&vma->swap_readahead_info, SWAP_RA_VAL(faddr, win, 0)); if (win == 1) - return; + return 1; - if (fpfn == pfn + 1) { - lpfn = fpfn; - rpfn = fpfn + win; - } else if (pfn == fpfn + 1) { - lpfn = fpfn - win + 1; - rpfn = fpfn + 1; - } else { - unsigned int left = (win - 1) / 2; + if (faddr == prev_faddr + PAGE_SIZE) + left = faddr; + else if (prev_faddr == faddr + PAGE_SIZE) + left = faddr - (win << PAGE_SHIFT) + PAGE_SIZE; + else + left = faddr - (((win - 1) / 2) << PAGE_SHIFT); + right = left + (win << PAGE_SHIFT); + if ((long)left < 0) + left = 0; + *start = max3(left, vma->vm_start, faddr & PMD_MASK); + *end = min3(right, vma->vm_end, (faddr & PMD_MASK) + PMD_SIZE); - lpfn = fpfn - left; - rpfn = fpfn + win - left; - } - start = max3(lpfn, PFN_DOWN(vma->vm_start), - PFN_DOWN(faddr & PMD_MASK)); - end = min3(rpfn, PFN_DOWN(vma->vm_end), - PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); - - ra_info->nr_pte = end - start; - ra_info->offset = fpfn - start; + return win; } /** @@ -819,24 +801,20 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, struct swap_iocb *splug = NULL; struct folio *folio; pte_t *pte = NULL, pentry; - unsigned long addr; + int win; + unsigned long start, end, addr; swp_entry_t entry; pgoff_t ilx; - unsigned int i; bool page_allocated; - struct vma_swap_readahead ra_info = { - .win = 1, - }; - swap_ra_info(vmf, &ra_info); - if (ra_info.win == 1) + win = swap_vma_ra_win(vmf, &start, &end); + if (win == 1) goto skip; - addr = vmf->address - (ra_info.offset * PAGE_SIZE); - ilx = targ_ilx - ra_info.offset; + ilx = targ_ilx - PFN_DOWN(vmf->address - start); blk_start_plug(&plug); - for (i = 0; i < ra_info.nr_pte; i++, ilx++, addr += PAGE_SIZE) { + for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) { if (!pte++) { pte = pte_offset_map(vmf->pmd, addr); if (!pte) @@ -855,8 +833,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, false, &splug); - if (i != ra_info.offset) { + swap_read_folio(folio, &splug); + if (addr != vmf->address) { folio_set_readahead(folio); count_vm_event(SWAP_RA); } @@ -874,7 +852,7 @@ skip: &page_allocated, false); if (unlikely(page_allocated)) { zswap_folio_swapin(folio); - swap_read_folio(folio, false, NULL); + swap_read_folio(folio, NULL); } return folio; } diff --git a/mm/swapfile.c b/mm/swapfile.c index b3e5e384e330..38bdc439651a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -142,7 +142,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, struct folio *folio; int ret = 0; - folio = filemap_get_folio(swap_address_space(entry), offset); + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); if (IS_ERR(folio)) return 0; /* @@ -1343,17 +1343,55 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) swap_range_free(p, offset, 1); } +static void cluster_swap_free_nr(struct swap_info_struct *sis, + unsigned long offset, int nr_pages) +{ + struct swap_cluster_info *ci; + DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 }; + int i, nr; + + ci = lock_cluster_or_swap_info(sis, offset); + while (nr_pages) { + nr = min(BITS_PER_LONG, nr_pages); + for (i = 0; i < nr; i++) { + if (!__swap_entry_free_locked(sis, offset + i, 1)) + bitmap_set(to_free, i, 1); + } + if (!bitmap_empty(to_free, BITS_PER_LONG)) { + unlock_cluster_or_swap_info(sis, ci); + for_each_set_bit(i, to_free, BITS_PER_LONG) + free_swap_slot(swp_entry(sis->type, offset + i)); + if (nr == nr_pages) + return; + bitmap_clear(to_free, 0, BITS_PER_LONG); + ci = lock_cluster_or_swap_info(sis, offset); + } + offset += nr; + nr_pages -= nr; + } + unlock_cluster_or_swap_info(sis, ci); +} + /* * Caller has made sure that the swap device corresponding to entry * is still around or has not been recycled. */ -void swap_free(swp_entry_t entry) +void swap_free_nr(swp_entry_t entry, int nr_pages) { - struct swap_info_struct *p; + int nr; + struct swap_info_struct *sis; + unsigned long offset = swp_offset(entry); - p = _swap_info_get(entry); - if (p) - __swap_entry_free(p, entry); + sis = _swap_info_get(entry); + if (!sis) + return; + + while (nr_pages) { + nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + cluster_swap_free_nr(sis, offset, nr); + offset += nr; + nr_pages -= nr; + } } /* @@ -1870,10 +1908,20 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); if (pte_swp_exclusive(old_pte)) rmap_flags |= RMAP_EXCLUSIVE; - - folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags); + /* + * We currently only expect small !anon folios, which are either + * fully exclusive or fully shared. If we ever get large folios + * here, we have to be careful. + */ + if (!folio_test_anon(folio)) { + VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + folio_add_new_anon_rmap(folio, vma, addr, rmap_flags); + } else { + folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags); + } } else { /* ksm created a completely new copy */ - folio_add_new_anon_rmap(folio, vma, addr); + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot)); @@ -2158,7 +2206,7 @@ retry: (i = find_next_to_unuse(si, i)) != 0) { entry = swp_entry(type, i); - folio = filemap_get_folio(swap_address_space(entry), i); + folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); if (IS_ERR(folio)) continue; @@ -3449,12 +3497,11 @@ struct address_space *swapcache_mapping(struct folio *folio) } EXPORT_SYMBOL_GPL(swapcache_mapping); -pgoff_t __page_file_index(struct page *page) +pgoff_t __folio_swap_cache_index(struct folio *folio) { - swp_entry_t swap = page_swap_entry(page); - return swp_offset(swap); + return swap_cache_index(folio->swap); } -EXPORT_SYMBOL_GPL(__page_file_index); +EXPORT_SYMBOL_GPL(__folio_swap_cache_index); /* * add_swap_count_continuation - called when a swap count is duplicated diff --git a/mm/truncate.c b/mm/truncate.c index 581977d2356f..4d61fbdd4b2f 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -39,12 +39,25 @@ static inline void __clear_shadow_entry(struct address_space *mapping, xas_store(&xas, NULL); } -static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, - void *entry) +static void clear_shadow_entries(struct address_space *mapping, + struct folio_batch *fbatch, pgoff_t *indices) { + int i; + + /* Handled by shmem itself, or for DAX we do nothing. */ + if (shmem_mapping(mapping) || dax_mapping(mapping)) + return; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); + + for (i = 0; i < folio_batch_count(fbatch); i++) { + struct folio *folio = fbatch->folios[i]; + + if (xa_is_value(folio)) + __clear_shadow_entry(mapping, indices[i], folio); + } + xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) inode_add_lru(mapping->host); @@ -105,36 +118,6 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping, fbatch->nr = j; } -/* - * Invalidate exceptional entry if easily possible. This handles exceptional - * entries for invalidate_inode_pages(). - */ -static int invalidate_exceptional_entry(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself, or for DAX we do nothing. */ - if (shmem_mapping(mapping) || dax_mapping(mapping)) - return 1; - clear_shadow_entry(mapping, index, entry); - return 1; -} - -/* - * Invalidate exceptional entry if clean. This handles exceptional entries for - * invalidate_inode_pages2() so for DAX it evicts only clean entries. - */ -static int invalidate_exceptional_entry2(struct address_space *mapping, - pgoff_t index, void *entry) -{ - /* Handled by shmem itself */ - if (shmem_mapping(mapping)) - return 1; - if (dax_mapping(mapping)) - return dax_invalidate_mapping_entry_sync(mapping, index); - clear_shadow_entry(mapping, index, entry); - return 1; -} - /** * folio_invalidate - Invalidate part or all of a folio. * @folio: The folio which is affected. @@ -495,6 +478,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, unsigned long ret; unsigned long count = 0; int i; + bool xa_has_values = false; folio_batch_init(&fbatch); while (find_lock_entries(mapping, &index, end, &fbatch, indices)) { @@ -504,8 +488,8 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - count += invalidate_exceptional_entry(mapping, - indices[i], folio); + xa_has_values = true; + count++; continue; } @@ -523,6 +507,10 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, } count += ret; } + + if (xa_has_values) + clear_shadow_entries(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); @@ -555,7 +543,7 @@ EXPORT_SYMBOL(invalidate_mapping_pages); * This is like mapping_evict_folio(), except it ignores the folio's * refcount. We do this because invalidate_inode_pages2() needs stronger * invalidation guarantees, and cannot afford to leave folios behind because - * shrink_page_list() has a temp ref on them, or because they're transiently + * shrink_folio_list() has a temp ref on them, or because they're transiently * sitting in the folio_add_lru() caches. */ static int invalidate_complete_folio2(struct address_space *mapping, @@ -617,6 +605,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, int ret = 0; int ret2 = 0; int did_range_unmap = 0; + bool xa_has_values = false; if (mapping_empty(mapping)) return 0; @@ -630,8 +619,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping, /* We rely upon deletion not changing folio->index */ if (xa_is_value(folio)) { - if (!invalidate_exceptional_entry2(mapping, - indices[i], folio)) + xa_has_values = true; + if (dax_mapping(mapping) && + !dax_invalidate_mapping_entry_sync(mapping, indices[i])) ret = -EBUSY; continue; } @@ -667,6 +657,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping, ret = ret2; folio_unlock(folio); } + + if (xa_has_values) + clear_shadow_entries(mapping, &fbatch, indices); + folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); cond_resched(); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index defa5109cc62..e54e5c8907fa 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -216,7 +216,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, folio_add_lru(folio); folio_add_file_rmap_pte(folio, page, dst_vma); } else { - folio_add_new_anon_rmap(folio, dst_vma, dst_addr); + folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, dst_vma); } @@ -587,7 +587,7 @@ retry: } if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && - !huge_pte_none_mostly(huge_ptep_get(dst_pte))) { + !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) { err = -EEXIST; hugetlb_vma_unlock_read(dst_vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -995,14 +995,8 @@ void double_pt_lock(spinlock_t *ptl1, __acquires(ptl1) __acquires(ptl2) { - spinlock_t *ptl_tmp; - - if (ptl1 > ptl2) { - /* exchange ptl1 and ptl2 */ - ptl_tmp = ptl1; - ptl1 = ptl2; - ptl2 = ptl_tmp; - } + if (ptl1 > ptl2) + swap(ptl1, ptl2); /* lock in virtual address order to avoid lock inversion */ spin_lock(ptl1); if (ptl1 != ptl2) diff --git a/mm/util.c b/mm/util.c index c6ad21ee6695..bc488f0121a7 100644 --- a/mm/util.c +++ b/mm/util.c @@ -844,6 +844,23 @@ void folio_copy(struct folio *dst, struct folio *src) } EXPORT_SYMBOL(folio_copy); +int folio_mc_copy(struct folio *dst, struct folio *src) +{ + long nr = folio_nr_pages(src); + long i = 0; + + for (;;) { + if (copy_mc_highpage(folio_page(dst, i), folio_page(src, i))) + return -EHWPOISON; + if (++i == nr) + break; + cond_resched(); + } + + return 0; +} +EXPORT_SYMBOL(folio_mc_copy); + int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; int sysctl_overcommit_ratio __read_mostly = 50; unsigned long sysctl_overcommit_kbytes __read_mostly; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e34ea860153f..6b783baf12a1 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1816,7 +1816,7 @@ static void free_vmap_area(struct vmap_area *va) static inline void preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) { - struct vmap_area *va = NULL; + struct vmap_area *va = NULL, *tmp; /* * Preload this CPU with one extra vmap_area object. It is used @@ -1832,7 +1832,8 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) spin_lock(lock); - if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va)) + tmp = NULL; + if (va && !__this_cpu_try_cmpxchg(ne_fit_preload_node, &tmp, va)) kmem_cache_free(vmap_area_cachep, va); } @@ -2055,8 +2056,8 @@ overflow: } if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) - pr_warn("vmap allocation for size %lu failed: use vmalloc= to increase size\n", - size); + pr_warn("vmalloc_node_range for size %lu failed: Address range restricted to %#lx - %#lx\n", + size, vstart, vend); kmem_cache_free(vmap_area_cachep, va); return ERR_PTR(-EBUSY); diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e34de9cd0d4..525d3ffa8451 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -92,6 +92,11 @@ struct scan_control { unsigned long anon_cost; unsigned long file_cost; +#ifdef CONFIG_MEMCG + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ + int *proactive_swappiness; +#endif + /* Can active folios be deactivated as part of reclaim? */ #define DEACTIVATE_ANON 1 #define DEACTIVATE_FILE 2 @@ -128,6 +133,9 @@ struct scan_control { unsigned int memcg_low_reclaim:1; unsigned int memcg_low_skipped:1; + /* Shared cgroup tree walk failed, rescan the whole tree */ + unsigned int memcg_full_walk:1; + unsigned int hibernation_mode:1; /* One of the zones is ready for compaction */ @@ -189,7 +197,7 @@ struct scan_control { #endif /* - * From 0 .. 200. Higher means more swappy. + * From 0 .. MAX_SWAPPINESS. Higher means more swappy. */ int vm_swappiness = 60; @@ -233,6 +241,13 @@ static bool writeback_throttling_sane(struct scan_control *sc) #endif return false; } + +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) +{ + if (sc->proactive && sc->proactive_swappiness) + return *sc->proactive_swappiness; + return mem_cgroup_swappiness(memcg); +} #else static bool cgroup_reclaim(struct scan_control *sc) { @@ -248,6 +263,11 @@ static bool writeback_throttling_sane(struct scan_control *sc) { return true; } + +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) +{ + return READ_ONCE(vm_swappiness); +} #endif static void set_task_reclaim_state(struct task_struct *task, @@ -916,8 +936,7 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct folio *alloc_demote_folio(struct folio *src, - unsigned long private) +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private) { struct folio *dst; nodemask_t *allowed_mask; @@ -980,7 +999,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, node_get_allowed_targets(pgdat, &allowed_mask); /* Demotion ignores all cpuset and mempolicy settings */ - migrate_pages(demote_folios, alloc_demote_folio, NULL, + migrate_pages(demote_folios, alloc_migrate_folio, NULL, (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded); @@ -1272,7 +1291,7 @@ retry: * try_to_unmap acquire PTL from the first PTE, * eliminating the influence of temporary PTE values. */ - if (folio_test_large(folio) && list_empty(&folio->_deferred_list)) + if (folio_test_large(folio)) flags |= TTU_SYNC; try_to_unmap(folio, flags); @@ -1437,9 +1456,7 @@ free_it: */ nr_reclaimed += nr_pages; - if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (folio_batch_add(&free_folios, folio) == 0) { mem_cgroup_uncharge_folios(&free_folios); try_to_unmap_flush(); @@ -1846,9 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec, if (unlikely(folio_put_testzero(folio))) { __folio_clear_lru_flags(folio); - if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (folio_batch_add(&free_folios, folio) == 0) { spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_folios(&free_folios); @@ -2353,7 +2368,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup *memcg = lruvec_memcg(lruvec); unsigned long anon_cost, file_cost, total_cost; - int swappiness = mem_cgroup_swappiness(memcg); + int swappiness = sc_swappiness(sc, memcg); u64 fraction[ANON_AND_FILE]; u64 denominator = 0; /* gcc */ enum scan_balance scan_balance; @@ -2429,7 +2444,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, ap = swappiness * (total_cost + 1); ap /= anon_cost + 1; - fp = (200 - swappiness) * (total_cost + 1); + fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1); fp /= file_cost + 1; fraction[0] = ap; @@ -2634,7 +2649,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) return 0; - return mem_cgroup_swappiness(memcg); + return sc_swappiness(sc, memcg); } static int get_nr_gens(struct lruvec *lruvec, int type) @@ -3900,6 +3915,32 @@ done: * working set protection ******************************************************************************/ +static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc) +{ + int priority; + unsigned long reclaimable; + + if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH) + return; + /* + * Determine the initial priority based on + * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, + * where reclaimed_to_scanned_ratio = inactive / total. + */ + reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE); + if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) + reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON); + + /* round down reclaimable and round up sc->nr_to_reclaim */ + priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); + + /* + * The estimation is based on LRU pages only, so cap it to prevent + * overshoots of shrinker objects by large margins. + */ + sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY); +} + static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc) { int gen, type, zone; @@ -3933,19 +3974,17 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc struct mem_cgroup *memcg = lruvec_memcg(lruvec); DEFINE_MIN_SEQ(lruvec); - /* see the comment on lru_gen_folio */ - gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); - birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); - - if (time_is_after_jiffies(birth + min_ttl)) + if (mem_cgroup_below_min(NULL, memcg)) return false; if (!lruvec_is_sizable(lruvec, sc)) return false; - mem_cgroup_calculate_protection(NULL, memcg); + /* see the comment on lru_gen_folio */ + gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); + birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); - return !mem_cgroup_below_min(NULL, memcg); + return time_is_before_jiffies(birth + min_ttl); } /* to protect the working set of the last N jiffies */ @@ -3955,23 +3994,20 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) { struct mem_cgroup *memcg; unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + bool reclaimable = !min_ttl; VM_WARN_ON_ONCE(!current_is_kswapd()); - /* check the order to exclude compaction-induced reclaim */ - if (!min_ttl || sc->order || sc->priority == DEF_PRIORITY) - return; + set_initial_priority(pgdat, sc); memcg = mem_cgroup_iter(NULL, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - if (lruvec_is_reclaimable(lruvec, sc, min_ttl)) { - mem_cgroup_iter_break(NULL, memcg); - return; - } + mem_cgroup_calculate_protection(NULL, memcg); - cond_resched(); + if (!reclaimable) + reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); /* @@ -3979,7 +4015,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) * younger than min_ttl. However, another possibility is all memcgs are * either too small or below min. */ - if (mutex_trylock(&oom_lock)) { + if (!reclaimable && mutex_trylock(&oom_lock)) { struct oom_control oc = { .gfp_mask = sc->gfp_mask, }; @@ -4449,7 +4485,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx { int type, tier; struct ctrl_pos sp, pv; - int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; + int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness }; /* * Compare the first tier of anon with that of file to determine which @@ -4496,7 +4532,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw type = LRU_GEN_ANON; else if (swappiness == 1) type = LRU_GEN_FILE; - else if (swappiness == 200) + else if (swappiness == MAX_SWAPPINESS) type = LRU_GEN_ANON; else if (!(sc->gfp_mask & __GFP_IO)) type = LRU_GEN_FILE; @@ -4582,7 +4618,6 @@ retry: /* retry folios that may have missed folio_rotate_reclaimable() */ list_move(&folio->lru, &clean); - sc->nr_scanned -= folio_nr_pages(folio); } spin_lock_irq(&lruvec->lru_lock); @@ -4772,8 +4807,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) struct mem_cgroup *memcg = lruvec_memcg(lruvec); struct pglist_data *pgdat = lruvec_pgdat(lruvec); - mem_cgroup_calculate_protection(NULL, memcg); - + /* lru_gen_age_node() called mem_cgroup_calculate_protection() */ if (mem_cgroup_below_min(NULL, memcg)) return MEMCG_LRU_YOUNG; @@ -4897,28 +4931,6 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc blk_finish_plug(&plug); } -static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc) -{ - int priority; - unsigned long reclaimable; - - if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH) - return; - /* - * Determine the initial priority based on - * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, - * where reclaimed_to_scanned_ratio = inactive / total. - */ - reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) - reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON); - - /* round down reclaimable and round up sc->nr_to_reclaim */ - priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); - - sc->priority = clamp(priority, 0, DEF_PRIORITY); -} - static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc) { struct blk_plug plug; @@ -5430,9 +5442,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq, lruvec = get_lruvec(memcg, nid); - if (swappiness < 0) + if (swappiness < MIN_SWAPPINESS) swappiness = get_swappiness(lruvec, sc); - else if (swappiness > 200) + else if (swappiness > MAX_SWAPPINESS) goto done; switch (cmd) { @@ -5845,9 +5857,25 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat, static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) { struct mem_cgroup *target_memcg = sc->target_mem_cgroup; + struct mem_cgroup_reclaim_cookie reclaim = { + .pgdat = pgdat, + }; + struct mem_cgroup_reclaim_cookie *partial = &reclaim; struct mem_cgroup *memcg; - memcg = mem_cgroup_iter(target_memcg, NULL, NULL); + /* + * In most cases, direct reclaimers can do partial walks + * through the cgroup tree, using an iterator state that + * persists across invocations. This strikes a balance between + * fairness and allocation latency. + * + * For kswapd, reliable forward progress is more important + * than a quick return to idle. Always do full walks. + */ + if (current_is_kswapd() || sc->memcg_full_walk) + partial = NULL; + + memcg = mem_cgroup_iter(target_memcg, NULL, partial); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); unsigned long reclaimed; @@ -5897,7 +5925,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) sc->nr_scanned - scanned, sc->nr_reclaimed - reclaimed); - } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); + /* If partial walks are allowed, bail once goal is reached */ + if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) { + mem_cgroup_iter_break(target_memcg, memcg); + break; + } + } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial))); } static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) @@ -6150,9 +6183,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) * and balancing, not for a memcg's limit. */ nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat, - sc->order, sc->gfp_mask, - &nr_soft_scanned); + nr_soft_reclaimed = memcg1_soft_limit_reclaim(zone->zone_pgdat, + sc->order, sc->gfp_mask, + &nr_soft_scanned); sc->nr_reclaimed += nr_soft_reclaimed; sc->nr_scanned += nr_soft_scanned; /* need some check for avoid more shrink_zone() */ @@ -6270,6 +6303,21 @@ retry: if (sc->compaction_ready) return 1; + /* + * In most cases, direct reclaimers can do partial walks + * through the cgroup tree to meet the reclaim goal while + * keeping latency low. Since the iterator state is shared + * among all direct reclaim invocations (to retain fairness + * among cgroups), though, high concurrency can result in + * individual threads not seeing enough cgroups to make + * meaningful forward progress. Avoid false OOMs in this case. + */ + if (!sc->memcg_full_walk) { + sc->priority = initial_priority; + sc->memcg_full_walk = 1; + goto retry; + } + /* * We make inactive:active ratio decisions based on the node's * composition of memory, but a restrictive reclaim_idx or a @@ -6515,12 +6563,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options) + unsigned int reclaim_options, + int *swappiness) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; struct scan_control sc = { .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), + .proactive_swappiness = swappiness, .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), .reclaim_idx = MAX_NR_ZONES - 1, @@ -6702,6 +6752,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat, { struct zone *zone; int z; + unsigned long nr_reclaimed = sc->nr_reclaimed; /* Reclaim a number of pages proportional to the number of zones */ sc->nr_to_reclaim = 0; @@ -6729,7 +6780,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat, if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) sc->order = 0; - return sc->nr_scanned >= sc->nr_to_reclaim; + /* account for progress from mm_account_reclaimed_pages() */ + return max(sc->nr_scanned, sc->nr_reclaimed - nr_reclaimed) >= sc->nr_to_reclaim; } /* Page allocator PCP high watermark is lowered if reclaim is active. */ @@ -6899,8 +6951,8 @@ restart: /* Call soft limit reclaim before calling shrink_node. */ sc.nr_scanned = 0; nr_soft_scanned = 0; - nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order, - sc.gfp_mask, &nr_soft_scanned); + nr_soft_reclaimed = memcg1_soft_limit_reclaim(pgdat, sc.order, + sc.gfp_mask, &nr_soft_scanned); sc.nr_reclaimed += nr_soft_reclaimed; /* diff --git a/mm/vmstat.c b/mm/vmstat.c index 8507c497218b..73d791d1caad 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1255,7 +1255,8 @@ const char * const vmstat_text[] = { "pgdemote_kswapd", "pgdemote_direct", "pgdemote_khugepaged", - + "nr_memmap", + "nr_memmap_boot", /* enum writeback_stat_item counters */ "nr_dirty_threshold", "nr_dirty_background_threshold", @@ -2282,4 +2283,27 @@ static int __init extfrag_debug_init(void) } module_init(extfrag_debug_init); + #endif + +/* + * Page metadata size (struct page and page_ext) in pages + */ +static unsigned long early_perpage_metadata[MAX_NUMNODES] __meminitdata; + +void __meminit mod_node_early_perpage_metadata(int nid, long delta) +{ + early_perpage_metadata[nid] += delta; +} + +void __meminit store_early_perpage_metadata(void) +{ + int nid; + struct pglist_data *pgdat; + + for_each_online_pgdat(pgdat) { + nid = pgdat->node_id; + mod_node_page_state(NODE_DATA(nid), NR_MEMMAP_BOOT, + early_perpage_metadata[nid]); + } +} diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b42d3545ca85..5d6581ab7c07 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -20,7 +20,8 @@ * page->index: links together all component pages of a zspage * For the huge page, this is always 0, so we use this field * to store handle. - * page->page_type: first object offset in a subpage of zspage + * page->page_type: PG_zsmalloc, lower 16 bit locate the first object + * offset in a subpage of a zspage * * Usage of struct page flags: * PG_private: identifies the first component page @@ -33,7 +34,8 @@ /* * lock ordering: * page_lock - * pool->lock + * pool->migrate_lock + * class->lock * zspage->lock */ @@ -182,6 +184,7 @@ static struct dentry *zs_stat_root; static size_t huge_class_size; struct size_class { + spinlock_t lock; struct list_head fullness_list[NR_FULLNESS_GROUPS]; /* * Size of objects stored in this class. Must be multiple @@ -236,7 +239,8 @@ struct zs_pool { #ifdef CONFIG_COMPACTION struct work_struct free_work; #endif - spinlock_t lock; + /* protect page/zspage migration */ + rwlock_t migrate_lock; atomic_t compaction_in_progress; }; @@ -335,7 +339,7 @@ static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage) kmem_cache_free(pool->zspage_cachep, zspage); } -/* pool->lock(which owns the handle) synchronizes races */ +/* class->lock(which owns the handle) synchronizes races */ static void record_obj(unsigned long handle, unsigned long obj) { *(unsigned long *)handle = obj; @@ -430,7 +434,7 @@ static __maybe_unused int is_first_page(struct page *page) return PagePrivate(page); } -/* Protected by pool->lock */ +/* Protected by class->lock */ static inline int get_zspage_inuse(struct zspage *zspage) { return zspage->inuse; @@ -450,14 +454,28 @@ static inline struct page *get_first_page(struct zspage *zspage) return first_page; } +#define FIRST_OBJ_PAGE_TYPE_MASK 0xffff + +static inline void reset_first_obj_offset(struct page *page) +{ + VM_WARN_ON_ONCE(!PageZsmalloc(page)); + page->page_type |= FIRST_OBJ_PAGE_TYPE_MASK; +} + static inline unsigned int get_first_obj_offset(struct page *page) { - return page->page_type; + VM_WARN_ON_ONCE(!PageZsmalloc(page)); + return page->page_type & FIRST_OBJ_PAGE_TYPE_MASK; } static inline void set_first_obj_offset(struct page *page, unsigned int offset) { - page->page_type = offset; + /* With 16 bit available, we can support offsets into 64 KiB pages. */ + BUILD_BUG_ON(PAGE_SIZE > SZ_64K); + VM_WARN_ON_ONCE(!PageZsmalloc(page)); + VM_WARN_ON_ONCE(offset & ~FIRST_OBJ_PAGE_TYPE_MASK); + page->page_type &= ~FIRST_OBJ_PAGE_TYPE_MASK; + page->page_type |= offset & FIRST_OBJ_PAGE_TYPE_MASK; } static inline unsigned int get_freeobj(struct zspage *zspage) @@ -494,19 +512,19 @@ static int get_size_class_index(int size) return min_t(int, ZS_SIZE_CLASSES - 1, idx); } -static inline void class_stat_inc(struct size_class *class, - int type, unsigned long cnt) +static inline void class_stat_add(struct size_class *class, int type, + unsigned long cnt) { class->stats.objs[type] += cnt; } -static inline void class_stat_dec(struct size_class *class, - int type, unsigned long cnt) +static inline void class_stat_sub(struct size_class *class, int type, + unsigned long cnt) { class->stats.objs[type] -= cnt; } -static inline unsigned long zs_stat_get(struct size_class *class, int type) +static inline unsigned long class_stat_read(struct size_class *class, int type) { return class->stats.objs[type]; } @@ -554,18 +572,18 @@ static int zs_stats_size_show(struct seq_file *s, void *v) if (class->index != i) continue; - spin_lock(&pool->lock); + spin_lock(&class->lock); seq_printf(s, " %5u %5u ", i, class->size); for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) { - inuse_totals[fg] += zs_stat_get(class, fg); - seq_printf(s, "%9lu ", zs_stat_get(class, fg)); + inuse_totals[fg] += class_stat_read(class, fg); + seq_printf(s, "%9lu ", class_stat_read(class, fg)); } - obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED); - obj_used = zs_stat_get(class, ZS_OBJS_INUSE); + obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED); + obj_used = class_stat_read(class, ZS_OBJS_INUSE); freeable = zs_can_compact(class); - spin_unlock(&pool->lock); + spin_unlock(&class->lock); objs_per_zspage = class->objs_per_zspage; pages_used = obj_allocated / objs_per_zspage * @@ -668,7 +686,7 @@ static void insert_zspage(struct size_class *class, struct zspage *zspage, int fullness) { - class_stat_inc(class, fullness, 1); + class_stat_add(class, fullness, 1); list_add(&zspage->list, &class->fullness_list[fullness]); zspage->fullness = fullness; } @@ -684,7 +702,7 @@ static void remove_zspage(struct size_class *class, struct zspage *zspage) VM_BUG_ON(list_empty(&class->fullness_list[fullness])); list_del_init(&zspage->list); - class_stat_dec(class, fullness, 1); + class_stat_sub(class, fullness, 1); } /* @@ -791,8 +809,9 @@ static void reset_page(struct page *page) __ClearPageMovable(page); ClearPagePrivate(page); set_page_private(page, 0); - page_mapcount_reset(page); page->index = 0; + reset_first_obj_offset(page); + __ClearPageZsmalloc(page); } static int trylock_zspage(struct zspage *zspage) @@ -821,7 +840,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class, { struct page *page, *next; - assert_spin_locked(&pool->lock); + assert_spin_locked(&class->lock); VM_BUG_ON(get_zspage_inuse(zspage)); VM_BUG_ON(zspage->fullness != ZS_INUSE_RATIO_0); @@ -839,7 +858,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class, cache_free_zspage(pool, zspage); - class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage); + class_stat_sub(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage); atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated); } @@ -965,11 +984,13 @@ static struct zspage *alloc_zspage(struct zs_pool *pool, if (!page) { while (--i >= 0) { dec_zone_page_state(pages[i], NR_ZSPAGES); + __ClearPageZsmalloc(pages[i]); __free_page(pages[i]); } cache_free_zspage(pool, zspage); return NULL; } + __SetPageZsmalloc(page); inc_zone_page_state(page, NR_ZSPAGES); pages[i] = page; @@ -1178,19 +1199,19 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, BUG_ON(in_interrupt()); /* It guarantees it can get zspage from handle safely */ - spin_lock(&pool->lock); + read_lock(&pool->migrate_lock); obj = handle_to_obj(handle); obj_to_location(obj, &page, &obj_idx); zspage = get_zspage(page); /* - * migration cannot move any zpages in this zspage. Here, pool->lock + * migration cannot move any zpages in this zspage. Here, class->lock * is too heavy since callers would take some time until they calls * zs_unmap_object API so delegate the locking from class to zspage * which is smaller granularity. */ migrate_read_lock(zspage); - spin_unlock(&pool->lock); + read_unlock(&pool->migrate_lock); class = zspage_class(pool, zspage); off = offset_in_page(class->size * obj_idx); @@ -1285,7 +1306,6 @@ static unsigned long obj_malloc(struct zs_pool *pool, void *vaddr; class = pool->size_class[zspage->class]; - handle |= OBJ_ALLOCATED_TAG; obj = get_freeobj(zspage); offset = obj * class->size; @@ -1301,15 +1321,16 @@ static unsigned long obj_malloc(struct zs_pool *pool, set_freeobj(zspage, link->next >> OBJ_TAG_BITS); if (likely(!ZsHugePage(zspage))) /* record handle in the header of allocated chunk */ - link->handle = handle; + link->handle = handle | OBJ_ALLOCATED_TAG; else /* record handle to page->index */ - zspage->first_page->index = handle; + zspage->first_page->index = handle | OBJ_ALLOCATED_TAG; kunmap_atomic(vaddr); mod_zspage_inuse(zspage, 1); obj = location_to_obj(m_page, obj); + record_obj(handle, obj); return obj; } @@ -1327,7 +1348,7 @@ static unsigned long obj_malloc(struct zs_pool *pool, */ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) { - unsigned long handle, obj; + unsigned long handle; struct size_class *class; int newfg; struct zspage *zspage; @@ -1346,20 +1367,19 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) size += ZS_HANDLE_SIZE; class = pool->size_class[get_size_class_index(size)]; - /* pool->lock effectively protects the zpage migration */ - spin_lock(&pool->lock); + /* class->lock effectively protects the zpage migration */ + spin_lock(&class->lock); zspage = find_get_zspage(class); if (likely(zspage)) { - obj = obj_malloc(pool, zspage, handle); + obj_malloc(pool, zspage, handle); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(class, zspage); - record_obj(handle, obj); - class_stat_inc(class, ZS_OBJS_INUSE, 1); + class_stat_add(class, ZS_OBJS_INUSE, 1); goto out; } - spin_unlock(&pool->lock); + spin_unlock(&class->lock); zspage = alloc_zspage(pool, class, gfp); if (!zspage) { @@ -1367,19 +1387,18 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) return (unsigned long)ERR_PTR(-ENOMEM); } - spin_lock(&pool->lock); - obj = obj_malloc(pool, zspage, handle); + spin_lock(&class->lock); + obj_malloc(pool, zspage, handle); newfg = get_fullness_group(class, zspage); insert_zspage(class, zspage, newfg); - record_obj(handle, obj); atomic_long_add(class->pages_per_zspage, &pool->pages_allocated); - class_stat_inc(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage); - class_stat_inc(class, ZS_OBJS_INUSE, 1); + class_stat_add(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage); + class_stat_add(class, ZS_OBJS_INUSE, 1); /* We completely set up zspage so mark them as movable */ SetZsPageMovable(pool, zspage); out: - spin_unlock(&pool->lock); + spin_unlock(&class->lock); return handle; } @@ -1424,23 +1443,25 @@ void zs_free(struct zs_pool *pool, unsigned long handle) return; /* - * The pool->lock protects the race with zpage's migration + * The pool->migrate_lock protects the race with zpage's migration * so it's safe to get the page from handle. */ - spin_lock(&pool->lock); + read_lock(&pool->migrate_lock); obj = handle_to_obj(handle); obj_to_page(obj, &f_page); zspage = get_zspage(f_page); class = zspage_class(pool, zspage); + spin_lock(&class->lock); + read_unlock(&pool->migrate_lock); - class_stat_dec(class, ZS_OBJS_INUSE, 1); + class_stat_sub(class, ZS_OBJS_INUSE, 1); obj_free(class->size, obj); fullness = fix_fullness_group(class, zspage); if (fullness == ZS_INUSE_RATIO_0) free_zspage(pool, class, zspage); - spin_unlock(&pool->lock); + spin_unlock(&class->lock); cache_free_handle(pool, handle); } EXPORT_SYMBOL_GPL(zs_free); @@ -1568,7 +1589,6 @@ static void migrate_zspage(struct zs_pool *pool, struct zspage *src_zspage, free_obj = obj_malloc(pool, dst_zspage, handle); zs_object_copy(class, free_obj, used_obj); obj_idx++; - record_obj(handle, free_obj); obj_free(class->size, used_obj); /* Stop if there is no more space */ @@ -1752,27 +1772,26 @@ static int zs_page_migrate(struct page *newpage, struct page *page, unsigned long old_obj, new_obj; unsigned int obj_idx; - /* - * We cannot support the _NO_COPY case here, because copy needs to - * happen under the zs lock, which does not work with - * MIGRATE_SYNC_NO_COPY workflow. - */ - if (mode == MIGRATE_SYNC_NO_COPY) - return -EINVAL; - VM_BUG_ON_PAGE(!PageIsolated(page), page); + /* We're committed, tell the world that this is a Zsmalloc page. */ + __SetPageZsmalloc(newpage); + /* The page is locked, so this pointer must remain valid */ zspage = get_zspage(page); pool = zspage->pool; /* - * The pool's lock protects the race between zpage migration + * The pool migrate_lock protects the race between zpage migration * and zs_free. */ - spin_lock(&pool->lock); + write_lock(&pool->migrate_lock); class = zspage_class(pool, zspage); + /* + * the class lock protects zpage alloc/free in the zspage. + */ + spin_lock(&class->lock); /* the migrate_write_lock protects zpage access via zs_map_object */ migrate_write_lock(zspage); @@ -1802,9 +1821,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page, replace_sub_page(class, zspage, newpage, page); /* * Since we complete the data copy and set up new zspage structure, - * it's okay to release the pool's lock. + * it's okay to release migration_lock. */ - spin_unlock(&pool->lock); + write_unlock(&pool->migrate_lock); + spin_unlock(&class->lock); migrate_write_unlock(zspage); get_page(newpage); @@ -1848,20 +1868,21 @@ static void async_free_zspage(struct work_struct *work) if (class->index != i) continue; - spin_lock(&pool->lock); + spin_lock(&class->lock); list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0], &free_pages); - spin_unlock(&pool->lock); + spin_unlock(&class->lock); } list_for_each_entry_safe(zspage, tmp, &free_pages, list) { list_del(&zspage->list); lock_zspage(zspage); - spin_lock(&pool->lock); class = zspage_class(pool, zspage); + spin_lock(&class->lock); + class_stat_sub(class, ZS_INUSE_RATIO_0, 1); __free_zspage(pool, class, zspage); - spin_unlock(&pool->lock); + spin_unlock(&class->lock); } }; @@ -1902,8 +1923,8 @@ static inline void zs_flush_migration(struct zs_pool *pool) { } static unsigned long zs_can_compact(struct size_class *class) { unsigned long obj_wasted; - unsigned long obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED); - unsigned long obj_used = zs_stat_get(class, ZS_OBJS_INUSE); + unsigned long obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED); + unsigned long obj_used = class_stat_read(class, ZS_OBJS_INUSE); if (obj_allocated <= obj_used) return 0; @@ -1925,7 +1946,8 @@ static unsigned long __zs_compact(struct zs_pool *pool, * protect the race between zpage migration and zs_free * as well as zpage allocation/free */ - spin_lock(&pool->lock); + write_lock(&pool->migrate_lock); + spin_lock(&class->lock); while (zs_can_compact(class)) { int fg; @@ -1951,13 +1973,15 @@ static unsigned long __zs_compact(struct zs_pool *pool, src_zspage = NULL; if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100 - || spin_is_contended(&pool->lock)) { + || rwlock_is_contended(&pool->migrate_lock)) { putback_zspage(class, dst_zspage); dst_zspage = NULL; - spin_unlock(&pool->lock); + spin_unlock(&class->lock); + write_unlock(&pool->migrate_lock); cond_resched(); - spin_lock(&pool->lock); + write_lock(&pool->migrate_lock); + spin_lock(&class->lock); } } @@ -1967,7 +1991,8 @@ static unsigned long __zs_compact(struct zs_pool *pool, if (dst_zspage) putback_zspage(class, dst_zspage); - spin_unlock(&pool->lock); + spin_unlock(&class->lock); + write_unlock(&pool->migrate_lock); return pages_freed; } @@ -1979,10 +2004,10 @@ unsigned long zs_compact(struct zs_pool *pool) unsigned long pages_freed = 0; /* - * Pool compaction is performed under pool->lock so it is basically + * Pool compaction is performed under pool->migrate_lock so it is basically * single-threaded. Having more than one thread in __zs_compact() - * will increase pool->lock contention, which will impact other - * zsmalloc operations that need pool->lock. + * will increase pool->migrate_lock contention, which will impact other + * zsmalloc operations that need pool->migrate_lock. */ if (atomic_xchg(&pool->compaction_in_progress, 1)) return 0; @@ -2104,7 +2129,7 @@ struct zs_pool *zs_create_pool(const char *name) return NULL; init_deferred_free(pool); - spin_lock_init(&pool->lock); + rwlock_init(&pool->migrate_lock); atomic_set(&pool->compaction_in_progress, 0); pool->name = kstrdup(name, GFP_KERNEL); @@ -2176,6 +2201,7 @@ struct zs_pool *zs_create_pool(const char *name) class->index = i; class->pages_per_zspage = pages_per_zspage; class->objs_per_zspage = objs_per_zspage; + spin_lock_init(&class->lock); pool->size_class[i] = class; fullness = ZS_INUSE_RATIO_0; @@ -2276,3 +2302,4 @@ module_exit(zs_exit); MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Nitin Gupta "); +MODULE_DESCRIPTION("zsmalloc memory allocator"); diff --git a/mm/zswap.c b/mm/zswap.c index a50e2986cd2f..adeaf9c97fde 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -83,6 +83,7 @@ static bool zswap_pool_reached_full; static int zswap_setup(void); /* Enable/disable zswap */ +static DEFINE_STATIC_KEY_MAYBE(CONFIG_ZSWAP_DEFAULT_ON, zswap_ever_enabled); static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON); static int zswap_enabled_param_set(const char *, const struct kernel_param *); @@ -123,19 +124,21 @@ static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */ module_param_named(accept_threshold_percent, zswap_accept_thr_percent, uint, 0644); -/* Number of zpools in zswap_pool (empirically determined for scalability) */ -#define ZSWAP_NR_ZPOOLS 32 - /* Enable/disable memory pressure-based shrinker. */ static bool zswap_shrinker_enabled = IS_ENABLED( CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); -bool is_zswap_enabled(void) +bool zswap_is_enabled(void) { return zswap_enabled; } +bool zswap_never_enabled(void) +{ + return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON, &zswap_ever_enabled); +} + /********************************* * data structures **********************************/ @@ -156,7 +159,7 @@ struct crypto_acomp_ctx { * needs to be verified that it's still valid in the tree. */ struct zswap_pool { - struct zpool *zpools[ZSWAP_NR_ZPOOLS]; + struct zpool *zpool; struct crypto_acomp_ctx __percpu *acomp_ctx; struct percpu_ref ref; struct list_head list; @@ -238,7 +241,7 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp) #define zswap_pool_debug(msg, p) \ pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \ - zpool_get_type((p)->zpools[0])) + zpool_get_type((p)->zpool)) /********************************* * pool functions @@ -247,7 +250,6 @@ static void __zswap_pool_empty(struct percpu_ref *ref); static struct zswap_pool *zswap_pool_create(char *type, char *compressor) { - int i; struct zswap_pool *pool; char name[38]; /* 'zswap' + 32 char (max) num + \0 */ gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; @@ -268,18 +270,14 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) if (!pool) return NULL; - for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) { - /* unique name for each pool specifically required by zsmalloc */ - snprintf(name, 38, "zswap%x", - atomic_inc_return(&zswap_pools_count)); - - pool->zpools[i] = zpool_create_pool(type, name, gfp); - if (!pool->zpools[i]) { - pr_err("%s zpool not available\n", type); - goto error; - } + /* unique name for each pool specifically required by zsmalloc */ + snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count)); + pool->zpool = zpool_create_pool(type, name, gfp); + if (!pool->zpool) { + pr_err("%s zpool not available\n", type); + goto error; } - pr_debug("using %s zpool\n", zpool_get_type(pool->zpools[0])); + pr_debug("using %s zpool\n", zpool_get_type(pool->zpool)); strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); @@ -312,8 +310,8 @@ ref_fail: error: if (pool->acomp_ctx) free_percpu(pool->acomp_ctx); - while (i--) - zpool_destroy_pool(pool->zpools[i]); + if (pool->zpool) + zpool_destroy_pool(pool->zpool); kfree(pool); return NULL; } @@ -362,15 +360,12 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) static void zswap_pool_destroy(struct zswap_pool *pool) { - int i; - zswap_pool_debug("destroying", pool); cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); free_percpu(pool->acomp_ctx); - for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) - zpool_destroy_pool(pool->zpools[i]); + zpool_destroy_pool(pool->zpool); kfree(pool); } @@ -465,8 +460,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) list_for_each_entry_rcu(pool, &zswap_pools, list) { if (strcmp(pool->tfm_name, compressor)) continue; - /* all zpools share the same type */ - if (strcmp(zpool_get_type(pool->zpools[0]), type)) + if (strcmp(zpool_get_type(pool->zpool), type)) continue; /* if we can't get it, it's about to be destroyed */ if (!zswap_pool_get(pool)) @@ -493,12 +487,8 @@ unsigned long zswap_total_pages(void) unsigned long total = 0; rcu_read_lock(); - list_for_each_entry_rcu(pool, &zswap_pools, list) { - int i; - - for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) - total += zpool_get_total_pages(pool->zpools[i]); - } + list_for_each_entry_rcu(pool, &zswap_pools, list) + total += zpool_get_total_pages(pool->zpool); rcu_read_unlock(); return total; @@ -803,11 +793,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry) kmem_cache_free(zswap_entry_cache, entry); } -static struct zpool *zswap_find_zpool(struct zswap_entry *entry) -{ - return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))]; -} - /* * Carries out the common pattern of freeing and entry's zpool allocation, * freeing the entry itself, and decrementing the number of stored pages. @@ -818,7 +803,7 @@ static void zswap_entry_free(struct zswap_entry *entry) atomic_dec(&zswap_same_filled_pages); else { zswap_lru_del(&zswap_list_lru, entry); - zpool_free(zswap_find_zpool(entry), entry->handle); + zpool_free(entry->pool->zpool, entry->handle); zswap_pool_put(entry->pool); } if (entry->objcg) { @@ -917,7 +902,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) dst = acomp_ctx->buffer; sg_init_table(&input, 1); - sg_set_page(&input, &folio->page, PAGE_SIZE, 0); + sg_set_folio(&input, folio, PAGE_SIZE, 0); /* * We need PAGE_SIZE * 2 here since there maybe over-compression case, @@ -944,7 +929,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) if (comp_ret) goto unlock; - zpool = zswap_find_zpool(entry); + zpool = entry->pool->zpool; gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; if (zpool_malloc_support_movable(zpool)) gfp |= __GFP_HIGHMEM | __GFP_MOVABLE; @@ -971,9 +956,9 @@ unlock: return comp_ret == 0 && alloc_ret == 0; } -static void zswap_decompress(struct zswap_entry *entry, struct page *page) +static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) { - struct zpool *zpool = zswap_find_zpool(entry); + struct zpool *zpool = entry->pool->zpool; struct scatterlist input, output; struct crypto_acomp_ctx *acomp_ctx; u8 *src; @@ -1000,7 +985,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page) sg_init_one(&input, src, entry->length); sg_init_table(&output, 1); - sg_set_page(&output, page, PAGE_SIZE, 0); + sg_set_folio(&output, folio, PAGE_SIZE, 0); acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); @@ -1073,7 +1058,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, return -ENOMEM; } - zswap_decompress(entry, &folio->page); + zswap_decompress(entry, folio); count_vm_event(ZSWPWB); if (entry->objcg) @@ -1375,35 +1360,35 @@ resched: **********************************/ static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value) { - unsigned long *page; + unsigned long *data; unsigned long val; - unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1; + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; bool ret = false; - page = kmap_local_folio(folio, 0); - val = page[0]; + data = kmap_local_folio(folio, 0); + val = data[0]; - if (val != page[last_pos]) + if (val != data[last_pos]) goto out; for (pos = 1; pos < last_pos; pos++) { - if (val != page[pos]) + if (val != data[pos]) goto out; } *value = val; ret = true; out: - kunmap_local(page); + kunmap_local(data); return ret; } -static void zswap_fill_page(void *ptr, unsigned long value) +static void zswap_fill_folio(struct folio *folio, unsigned long value) { - unsigned long *page; + unsigned long *data = kmap_local_folio(folio, 0); - page = (unsigned long *)ptr; - memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); + memset_l(data, value, PAGE_SIZE / sizeof(unsigned long)); + kunmap_local(data); } /********************************* @@ -1525,7 +1510,7 @@ store_failed: if (!entry->length) atomic_dec(&zswap_same_filled_pages); else { - zpool_free(zswap_find_zpool(entry), entry->handle); + zpool_free(entry->pool->zpool, entry->handle); put_pool: zswap_pool_put(entry->pool); } @@ -1551,14 +1536,26 @@ bool zswap_load(struct folio *folio) { swp_entry_t swp = folio->swap; pgoff_t offset = swp_offset(swp); - struct page *page = &folio->page; bool swapcache = folio_test_swapcache(folio); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; - u8 *dst; VM_WARN_ON_ONCE(!folio_test_locked(folio)); + if (zswap_never_enabled()) + return false; + + /* + * Large folios should not be swapped in while zswap is being used, as + * they are not properly handled. Zswap does not properly load large + * folios, and a large folio may only be partially in zswap. + * + * Return true without marking the folio uptodate so that an IO error is + * emitted (e.g. do_swap_page() will sigbus). + */ + if (WARN_ON_ONCE(folio_test_large(folio))) + return true; + /* * When reading into the swapcache, invalidate our entry. The * swapcache can be the authoritative owner of the page and @@ -1580,12 +1577,9 @@ bool zswap_load(struct folio *folio) return false; if (entry->length) - zswap_decompress(entry, page); - else { - dst = kmap_local_page(page); - zswap_fill_page(dst, entry->value); - kunmap_local(dst); - } + zswap_decompress(entry, folio); + else + zswap_fill_folio(folio, entry->value); count_vm_event(ZSWPIN); if (entry->objcg) @@ -1596,6 +1590,7 @@ bool zswap_load(struct folio *folio) folio_mark_dirty(folio); } + folio_mark_uptodate(folio); return true; } @@ -1737,9 +1732,10 @@ static int zswap_setup(void) pool = __zswap_pool_create_fallback(); if (pool) { pr_info("loaded using pool %s/%s\n", pool->tfm_name, - zpool_get_type(pool->zpools[0])); + zpool_get_type(pool->zpool)); list_add(&pool->list, &zswap_pools); zswap_has_pool = true; + static_branch_enable(&zswap_ever_enabled); } else { pr_err("pool creation failed\n"); zswap_enabled = false; diff --git a/samples/kmemleak/kmemleak-test.c b/samples/kmemleak/kmemleak-test.c index 6ced5ddd99d4..f7470ed85a79 100644 --- a/samples/kmemleak/kmemleak-test.c +++ b/samples/kmemleak/kmemleak-test.c @@ -96,4 +96,5 @@ static void __exit kmemleak_test_exit(void) } module_exit(kmemleak_test_exit); +MODULE_DESCRIPTION("Sample module to leak memory for kmemleak testing"); MODULE_LICENSE("GPL"); diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h new file mode 100644 index 000000000000..8a27bc5c7a7f --- /dev/null +++ b/tools/include/uapi/linux/fs.h @@ -0,0 +1,552 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_FS_H +#define _UAPI_LINUX_FS_H + +/* + * This file has definitions for some important file table structures + * and constants and structures used by various generic file system + * ioctl's. Please do not make any changes in this file before + * sending patches for review to linux-fsdevel@vger.kernel.org and + * linux-api@vger.kernel.org. + */ + +#include +#include +#include +#ifndef __KERNEL__ +#include +#endif + +/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */ +#if !defined(__KERNEL__) +#include +#endif + +/* + * It's silly to have NR_OPEN bigger than NR_FILE, but you can change + * the file limit at runtime and only root can increase the per-process + * nr_file rlimit, so it's safe to set up a ridiculously high absolute + * upper limit on files-per-process. + * + * Some programs (notably those using select()) may have to be + * recompiled to take full advantage of the new limits.. + */ + +/* Fixed constants first: */ +#undef NR_OPEN +#define INR_OPEN_CUR 1024 /* Initial setting for nfile rlimits */ +#define INR_OPEN_MAX 4096 /* Hard limit for nfile rlimits */ + +#define BLOCK_SIZE_BITS 10 +#define BLOCK_SIZE (1</maps ioctl */ +#define PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query) + +enum procmap_query_flags { + /* + * VMA permission flags. + * + * Can be used as part of procmap_query.query_flags field to look up + * only VMAs satisfying specified subset of permissions. E.g., specifying + * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs, + * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only + * return read/write VMAs, though both executable/non-executable and + * private/shared will be ignored. + * + * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags + * field to specify actual VMA permissions. + */ + PROCMAP_QUERY_VMA_READABLE = 0x01, + PROCMAP_QUERY_VMA_WRITABLE = 0x02, + PROCMAP_QUERY_VMA_EXECUTABLE = 0x04, + PROCMAP_QUERY_VMA_SHARED = 0x08, + /* + * Query modifier flags. + * + * By default VMA that covers provided address is returned, or -ENOENT + * is returned. With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest + * VMA with vma_start > addr will be returned if no covering VMA is + * found. + * + * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that + * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA + * to iterate all VMAs with file backing. + */ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10, + PROCMAP_QUERY_FILE_BACKED_VMA = 0x20, +}; + +/* + * Input/output argument structured passed into ioctl() call. It can be used + * to query a set of VMAs (Virtual Memory Areas) of a process. + * + * Each field can be one of three kinds, marked in a short comment to the + * right of the field: + * - "in", input argument, user has to provide this value, kernel doesn't modify it; + * - "out", output argument, kernel sets this field with VMA data; + * - "in/out", input and output argument; user provides initial value (used + * to specify maximum allowable buffer size), and kernel sets it to actual + * amount of data written (or zero, if there is no data). + * + * If matching VMA is found (according to criterias specified by + * query_addr/query_flags, all the out fields are filled out, and ioctl() + * returns 0. If there is no matching VMA, -ENOENT will be returned. + * In case of any other error, negative error code other than -ENOENT is + * returned. + * + * Most of the data is similar to the one returned as text in /proc//maps + * file, but procmap_query provides more querying flexibility. There are no + * consistency guarantees between subsequent ioctl() calls, but data returned + * for matched VMA is self-consistent. + */ +struct procmap_query { + /* Query struct size, for backwards/forward compatibility */ + __u64 size; + /* + * Query flags, a combination of enum procmap_query_flags values. + * Defines query filtering and behavior, see enum procmap_query_flags. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_flags; /* in */ + /* + * Query address. By default, VMA that covers this address will + * be looked up. PROCMAP_QUERY_* flags above modify this default + * behavior further. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_addr; /* in */ + /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */ + __u64 vma_start; /* out */ + __u64 vma_end; /* out */ + /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */ + __u64 vma_flags; /* out */ + /* VMA backing page size granularity. */ + __u64 vma_page_size; /* out */ + /* + * VMA file offset. If VMA has file backing, this specifies offset + * within the file that VMA's start address corresponds to. + * Is set to zero if VMA has no backing file. + */ + __u64 vma_offset; /* out */ + /* Backing file's inode number, or zero, if VMA has no backing file. */ + __u64 inode; /* out */ + /* Backing file's device major/minor number, or zero, if VMA has no backing file. */ + __u32 dev_major; /* out */ + __u32 dev_minor; /* out */ + /* + * If set to non-zero value, signals the request to return VMA name + * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix + * appended, if file was unlinked from FS) for matched VMA. VMA name + * can also be some special name (e.g., "[heap]", "[stack]") or could + * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME). + * + * Kernel will set this field to zero, if VMA has no associated name. + * Otherwise kernel will return actual amount of bytes filled in + * user-supplied buffer (see vma_name_addr field below), including the + * terminating zero. + * + * If VMA name is longer that user-supplied maximum buffer size, + * -E2BIG error is returned. + * + * If this field is set to non-zero value, vma_name_addr should point + * to valid user space memory buffer of at least vma_name_size bytes. + * If set to zero, vma_name_addr should be set to zero as well + */ + __u32 vma_name_size; /* in/out */ + /* + * If set to non-zero value, signals the request to extract and return + * VMA's backing file's build ID, if the backing file is an ELF file + * and it contains embedded build ID. + * + * Kernel will set this field to zero, if VMA has no backing file, + * backing file is not an ELF file, or ELF file has no build ID + * embedded. + * + * Build ID is a binary value (not a string). Kernel will set + * build_id_size field to exact number of bytes used for build ID. + * If build ID is requested and present, but needs more bytes than + * user-supplied maximum buffer size (see build_id_addr field below), + * -E2BIG error will be returned. + * + * If this field is set to non-zero value, build_id_addr should point + * to valid user space memory buffer of at least build_id_size bytes. + * If set to zero, build_id_addr should be set to zero as well + */ + __u32 build_id_size; /* in/out */ + /* + * User-supplied address of a buffer of at least vma_name_size bytes + * for kernel to fill with matched VMA's name (see vma_name_size field + * description above for details). + * + * Should be set to zero if VMA name should not be returned. + */ + __u64 vma_name_addr; /* in */ + /* + * User-supplied address of a buffer of at least build_id_size bytes + * for kernel to fill with matched VMA's ELF build ID, if available + * (see build_id_size field description above for details). + * + * Should be set to zero if build ID should not be returned. + */ + __u64 build_id_addr; /* in */ +}; + +#endif /* _UAPI_LINUX_FS_H */ diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h new file mode 100644 index 000000000000..35791791a879 --- /dev/null +++ b/tools/include/uapi/linux/prctl.h @@ -0,0 +1,331 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _LINUX_PRCTL_H +#define _LINUX_PRCTL_H + +#include + +/* Values to pass as first argument to prctl() */ + +#define PR_SET_PDEATHSIG 1 /* Second arg is a signal */ +#define PR_GET_PDEATHSIG 2 /* Second arg is a ptr to return the signal */ + +/* Get/set current->mm->dumpable */ +#define PR_GET_DUMPABLE 3 +#define PR_SET_DUMPABLE 4 + +/* Get/set unaligned access control bits (if meaningful) */ +#define PR_GET_UNALIGN 5 +#define PR_SET_UNALIGN 6 +# define PR_UNALIGN_NOPRINT 1 /* silently fix up unaligned user accesses */ +# define PR_UNALIGN_SIGBUS 2 /* generate SIGBUS on unaligned user access */ + +/* Get/set whether or not to drop capabilities on setuid() away from + * uid 0 (as per security/commoncap.c) */ +#define PR_GET_KEEPCAPS 7 +#define PR_SET_KEEPCAPS 8 + +/* Get/set floating-point emulation control bits (if meaningful) */ +#define PR_GET_FPEMU 9 +#define PR_SET_FPEMU 10 +# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */ +# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */ + +/* Get/set floating-point exception mode (if meaningful) */ +#define PR_GET_FPEXC 11 +#define PR_SET_FPEXC 12 +# define PR_FP_EXC_SW_ENABLE 0x80 /* Use FPEXC for FP exception enables */ +# define PR_FP_EXC_DIV 0x010000 /* floating point divide by zero */ +# define PR_FP_EXC_OVF 0x020000 /* floating point overflow */ +# define PR_FP_EXC_UND 0x040000 /* floating point underflow */ +# define PR_FP_EXC_RES 0x080000 /* floating point inexact result */ +# define PR_FP_EXC_INV 0x100000 /* floating point invalid operation */ +# define PR_FP_EXC_DISABLED 0 /* FP exceptions disabled */ +# define PR_FP_EXC_NONRECOV 1 /* async non-recoverable exc. mode */ +# define PR_FP_EXC_ASYNC 2 /* async recoverable exception mode */ +# define PR_FP_EXC_PRECISE 3 /* precise exception mode */ + +/* Get/set whether we use statistical process timing or accurate timestamp + * based process timing */ +#define PR_GET_TIMING 13 +#define PR_SET_TIMING 14 +# define PR_TIMING_STATISTICAL 0 /* Normal, traditional, + statistical process timing */ +# define PR_TIMING_TIMESTAMP 1 /* Accurate timestamp based + process timing */ + +#define PR_SET_NAME 15 /* Set process name */ +#define PR_GET_NAME 16 /* Get process name */ + +/* Get/set process endian */ +#define PR_GET_ENDIAN 19 +#define PR_SET_ENDIAN 20 +# define PR_ENDIAN_BIG 0 +# define PR_ENDIAN_LITTLE 1 /* True little endian mode */ +# define PR_ENDIAN_PPC_LITTLE 2 /* "PowerPC" pseudo little endian */ + +/* Get/set process seccomp mode */ +#define PR_GET_SECCOMP 21 +#define PR_SET_SECCOMP 22 + +/* Get/set the capability bounding set (as per security/commoncap.c) */ +#define PR_CAPBSET_READ 23 +#define PR_CAPBSET_DROP 24 + +/* Get/set the process' ability to use the timestamp counter instruction */ +#define PR_GET_TSC 25 +#define PR_SET_TSC 26 +# define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */ +# define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */ + +/* Get/set securebits (as per security/commoncap.c) */ +#define PR_GET_SECUREBITS 27 +#define PR_SET_SECUREBITS 28 + +/* + * Get/set the timerslack as used by poll/select/nanosleep + * A value of 0 means "use default" + */ +#define PR_SET_TIMERSLACK 29 +#define PR_GET_TIMERSLACK 30 + +#define PR_TASK_PERF_EVENTS_DISABLE 31 +#define PR_TASK_PERF_EVENTS_ENABLE 32 + +/* + * Set early/late kill mode for hwpoison memory corruption. + * This influences when the process gets killed on a memory corruption. + */ +#define PR_MCE_KILL 33 +# define PR_MCE_KILL_CLEAR 0 +# define PR_MCE_KILL_SET 1 + +# define PR_MCE_KILL_LATE 0 +# define PR_MCE_KILL_EARLY 1 +# define PR_MCE_KILL_DEFAULT 2 + +#define PR_MCE_KILL_GET 34 + +/* + * Tune up process memory map specifics. + */ +#define PR_SET_MM 35 +# define PR_SET_MM_START_CODE 1 +# define PR_SET_MM_END_CODE 2 +# define PR_SET_MM_START_DATA 3 +# define PR_SET_MM_END_DATA 4 +# define PR_SET_MM_START_STACK 5 +# define PR_SET_MM_START_BRK 6 +# define PR_SET_MM_BRK 7 +# define PR_SET_MM_ARG_START 8 +# define PR_SET_MM_ARG_END 9 +# define PR_SET_MM_ENV_START 10 +# define PR_SET_MM_ENV_END 11 +# define PR_SET_MM_AUXV 12 +# define PR_SET_MM_EXE_FILE 13 +# define PR_SET_MM_MAP 14 +# define PR_SET_MM_MAP_SIZE 15 + +/* + * This structure provides new memory descriptor + * map which mostly modifies /proc/pid/stat[m] + * output for a task. This mostly done in a + * sake of checkpoint/restore functionality. + */ +struct prctl_mm_map { + __u64 start_code; /* code section bounds */ + __u64 end_code; + __u64 start_data; /* data section bounds */ + __u64 end_data; + __u64 start_brk; /* heap for brk() syscall */ + __u64 brk; + __u64 start_stack; /* stack starts at */ + __u64 arg_start; /* command line arguments bounds */ + __u64 arg_end; + __u64 env_start; /* environment variables bounds */ + __u64 env_end; + __u64 *auxv; /* auxiliary vector */ + __u32 auxv_size; /* vector size */ + __u32 exe_fd; /* /proc/$pid/exe link file */ +}; + +/* + * Set specific pid that is allowed to ptrace the current task. + * A value of 0 mean "no process". + */ +#define PR_SET_PTRACER 0x59616d61 +# define PR_SET_PTRACER_ANY ((unsigned long)-1) + +#define PR_SET_CHILD_SUBREAPER 36 +#define PR_GET_CHILD_SUBREAPER 37 + +/* + * If no_new_privs is set, then operations that grant new privileges (i.e. + * execve) will either fail or not grant them. This affects suid/sgid, + * file capabilities, and LSMs. + * + * Operations that merely manipulate or drop existing privileges (setresuid, + * capset, etc.) will still work. Drop those privileges if you want them gone. + * + * Changing LSM security domain is considered a new privilege. So, for example, + * asking selinux for a specific new context (e.g. with runcon) will result + * in execve returning -EPERM. + * + * See Documentation/userspace-api/no_new_privs.rst for more details. + */ +#define PR_SET_NO_NEW_PRIVS 38 +#define PR_GET_NO_NEW_PRIVS 39 + +#define PR_GET_TID_ADDRESS 40 + +#define PR_SET_THP_DISABLE 41 +#define PR_GET_THP_DISABLE 42 + +/* + * No longer implemented, but left here to ensure the numbers stay reserved: + */ +#define PR_MPX_ENABLE_MANAGEMENT 43 +#define PR_MPX_DISABLE_MANAGEMENT 44 + +#define PR_SET_FP_MODE 45 +#define PR_GET_FP_MODE 46 +# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ +# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ + +/* Control the ambient capability set */ +#define PR_CAP_AMBIENT 47 +# define PR_CAP_AMBIENT_IS_SET 1 +# define PR_CAP_AMBIENT_RAISE 2 +# define PR_CAP_AMBIENT_LOWER 3 +# define PR_CAP_AMBIENT_CLEAR_ALL 4 + +/* arm64 Scalable Vector Extension controls */ +/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */ +#define PR_SVE_SET_VL 50 /* set task vector length */ +# define PR_SVE_SET_VL_ONEXEC (1 << 18) /* defer effect until exec */ +#define PR_SVE_GET_VL 51 /* get task vector length */ +/* Bits common to PR_SVE_SET_VL and PR_SVE_GET_VL */ +# define PR_SVE_VL_LEN_MASK 0xffff +# define PR_SVE_VL_INHERIT (1 << 17) /* inherit across exec */ + +/* Per task speculation control */ +#define PR_GET_SPECULATION_CTRL 52 +#define PR_SET_SPECULATION_CTRL 53 +/* Speculation control variants */ +# define PR_SPEC_STORE_BYPASS 0 +# define PR_SPEC_INDIRECT_BRANCH 1 +# define PR_SPEC_L1D_FLUSH 2 +/* Return and control values for PR_SET/GET_SPECULATION_CTRL */ +# define PR_SPEC_NOT_AFFECTED 0 +# define PR_SPEC_PRCTL (1UL << 0) +# define PR_SPEC_ENABLE (1UL << 1) +# define PR_SPEC_DISABLE (1UL << 2) +# define PR_SPEC_FORCE_DISABLE (1UL << 3) +# define PR_SPEC_DISABLE_NOEXEC (1UL << 4) + +/* Reset arm64 pointer authentication keys */ +#define PR_PAC_RESET_KEYS 54 +# define PR_PAC_APIAKEY (1UL << 0) +# define PR_PAC_APIBKEY (1UL << 1) +# define PR_PAC_APDAKEY (1UL << 2) +# define PR_PAC_APDBKEY (1UL << 3) +# define PR_PAC_APGAKEY (1UL << 4) + +/* Tagged user address controls for arm64 */ +#define PR_SET_TAGGED_ADDR_CTRL 55 +#define PR_GET_TAGGED_ADDR_CTRL 56 +# define PR_TAGGED_ADDR_ENABLE (1UL << 0) +/* MTE tag check fault modes */ +# define PR_MTE_TCF_NONE 0UL +# define PR_MTE_TCF_SYNC (1UL << 1) +# define PR_MTE_TCF_ASYNC (1UL << 2) +# define PR_MTE_TCF_MASK (PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC) +/* MTE tag inclusion mask */ +# define PR_MTE_TAG_SHIFT 3 +# define PR_MTE_TAG_MASK (0xffffUL << PR_MTE_TAG_SHIFT) +/* Unused; kept only for source compatibility */ +# define PR_MTE_TCF_SHIFT 1 + +/* Control reclaim behavior when allocating memory */ +#define PR_SET_IO_FLUSHER 57 +#define PR_GET_IO_FLUSHER 58 + +/* Dispatch syscalls to a userspace handler */ +#define PR_SET_SYSCALL_USER_DISPATCH 59 +# define PR_SYS_DISPATCH_OFF 0 +# define PR_SYS_DISPATCH_ON 1 +/* The control values for the user space selector when dispatch is enabled */ +# define SYSCALL_DISPATCH_FILTER_ALLOW 0 +# define SYSCALL_DISPATCH_FILTER_BLOCK 1 + +/* Set/get enabled arm64 pointer authentication keys */ +#define PR_PAC_SET_ENABLED_KEYS 60 +#define PR_PAC_GET_ENABLED_KEYS 61 + +/* Request the scheduler to share a core */ +#define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# define PR_SCHED_CORE_MAX 4 +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 + +/* arm64 Scalable Matrix Extension controls */ +/* Flag values must be in sync with SVE versions */ +#define PR_SME_SET_VL 63 /* set task vector length */ +# define PR_SME_SET_VL_ONEXEC (1 << 18) /* defer effect until exec */ +#define PR_SME_GET_VL 64 /* get task vector length */ +/* Bits common to PR_SME_SET_VL and PR_SME_GET_VL */ +# define PR_SME_VL_LEN_MASK 0xffff +# define PR_SME_VL_INHERIT (1 << 17) /* inherit across exec */ + +/* Memory deny write / execute */ +#define PR_SET_MDWE 65 +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0) +# define PR_MDWE_NO_INHERIT (1UL << 1) + +#define PR_GET_MDWE 66 + +#define PR_SET_VMA 0x53564d41 +# define PR_SET_VMA_ANON_NAME 0 + +#define PR_GET_AUXV 0x41555856 + +#define PR_SET_MEMORY_MERGE 67 +#define PR_GET_MEMORY_MERGE 68 + +#define PR_RISCV_V_SET_CONTROL 69 +#define PR_RISCV_V_GET_CONTROL 70 +# define PR_RISCV_V_VSTATE_CTRL_DEFAULT 0 +# define PR_RISCV_V_VSTATE_CTRL_OFF 1 +# define PR_RISCV_V_VSTATE_CTRL_ON 2 +# define PR_RISCV_V_VSTATE_CTRL_INHERIT (1 << 4) +# define PR_RISCV_V_VSTATE_CTRL_CUR_MASK 0x3 +# define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK 0xc +# define PR_RISCV_V_VSTATE_CTRL_MASK 0x1f + +#define PR_RISCV_SET_ICACHE_FLUSH_CTX 71 +# define PR_RISCV_CTX_SW_FENCEI_ON 0 +# define PR_RISCV_CTX_SW_FENCEI_OFF 1 +# define PR_RISCV_SCOPE_PER_PROCESS 0 +# define PR_RISCV_SCOPE_PER_THREAD 1 + +/* PowerPC Dynamic Execution Control Register (DEXCR) controls */ +#define PR_PPC_GET_DEXCR 72 +#define PR_PPC_SET_DEXCR 73 +/* DEXCR aspect to act on */ +# define PR_PPC_DEXCR_SBHE 0 /* Speculative branch hint enable */ +# define PR_PPC_DEXCR_IBRTPD 1 /* Indirect branch recurrent target prediction disable */ +# define PR_PPC_DEXCR_SRAPD 2 /* Subroutine return address prediction disable */ +# define PR_PPC_DEXCR_NPHIE 3 /* Non-privileged hash instruction enable */ +/* Action to apply / return */ +# define PR_PPC_DEXCR_CTRL_EDITABLE 0x1 /* Aspect can be modified with PR_PPC_SET_DEXCR */ +# define PR_PPC_DEXCR_CTRL_SET 0x2 /* Set the aspect for this process */ +# define PR_PPC_DEXCR_CTRL_CLEAR 0x4 /* Clear the aspect for this process */ +# define PR_PPC_DEXCR_CTRL_SET_ONEXEC 0x8 /* Set the aspect on exec */ +# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ +# define PR_PPC_DEXCR_CTRL_MASK 0x1f + +#endif /* _LINUX_PRCTL_H */ diff --git a/tools/mm/Makefile b/tools/mm/Makefile index 7bb03606b9ea..15791c1c5b28 100644 --- a/tools/mm/Makefile +++ b/tools/mm/Makefile @@ -3,7 +3,7 @@ # include ../scripts/Makefile.include -BUILD_TARGETS=page-types slabinfo page_owner_sort +BUILD_TARGETS=page-types slabinfo page_owner_sort thp_swap_allocator_test INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps LIB_DIR = ../lib/api diff --git a/tools/mm/thp_swap_allocator_test.c b/tools/mm/thp_swap_allocator_test.c new file mode 100644 index 000000000000..83afc52275a5 --- /dev/null +++ b/tools/mm/thp_swap_allocator_test.c @@ -0,0 +1,234 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * thp_swap_allocator_test + * + * The purpose of this test program is helping check if THP swpout + * can correctly get swap slots to swap out as a whole instead of + * being split. It randomly releases swap entries through madvise + * DONTNEED and swapin/out on two memory areas: a memory area for + * 64KB THP and the other area for small folios. The second memory + * can be enabled by "-s". + * Before running the program, we need to setup a zRAM or similar + * swap device by: + * echo lzo > /sys/block/zram0/comp_algorithm + * echo 64M > /sys/block/zram0/disksize + * echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled + * echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled + * mkswap /dev/zram0 + * swapon /dev/zram0 + * The expected result should be 0% anon swpout fallback ratio w/ or + * w/o "-s". + * + * Author(s): Barry Song + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#define MEMSIZE_MTHP (60 * 1024 * 1024) +#define MEMSIZE_SMALLFOLIO (4 * 1024 * 1024) +#define ALIGNMENT_MTHP (64 * 1024) +#define ALIGNMENT_SMALLFOLIO (4 * 1024) +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024) +#define TOTAL_DONTNEED_SMALLFOLIO (1 * 1024 * 1024) +#define MTHP_FOLIO_SIZE (64 * 1024) + +#define SWPOUT_PATH \ + "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout" +#define SWPOUT_FALLBACK_PATH \ + "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback" + +static void *aligned_alloc_mem(size_t size, size_t alignment) +{ + void *mem = NULL; + + if (posix_memalign(&mem, alignment, size) != 0) { + perror("posix_memalign"); + return NULL; + } + return mem; +} + +/* + * This emulates the behavior of native libc and Java heap, + * as well as process exit and munmap. It helps generate mTHP + * and ensures that iterations can proceed with mTHP, as we + * currently don't support large folios swap-in. + */ +static void random_madvise_dontneed(void *mem, size_t mem_size, + size_t align_size, size_t total_dontneed_size) +{ + size_t num_pages = total_dontneed_size / align_size; + size_t i; + size_t offset; + void *addr; + + for (i = 0; i < num_pages; ++i) { + offset = (rand() % (mem_size / align_size)) * align_size; + addr = (char *)mem + offset; + if (madvise(addr, align_size, MADV_DONTNEED) != 0) + perror("madvise dontneed"); + + memset(addr, 0x11, align_size); + } +} + +static void random_swapin(void *mem, size_t mem_size, + size_t align_size, size_t total_swapin_size) +{ + size_t num_pages = total_swapin_size / align_size; + size_t i; + size_t offset; + void *addr; + + for (i = 0; i < num_pages; ++i) { + offset = (rand() % (mem_size / align_size)) * align_size; + addr = (char *)mem + offset; + memset(addr, 0x11, align_size); + } +} + +static unsigned long read_stat(const char *path) +{ + FILE *file; + unsigned long value; + + file = fopen(path, "r"); + if (!file) { + perror("fopen"); + return 0; + } + + if (fscanf(file, "%lu", &value) != 1) { + perror("fscanf"); + fclose(file); + return 0; + } + + fclose(file); + return value; +} + +int main(int argc, char *argv[]) +{ + int use_small_folio = 0, aligned_swapin = 0; + void *mem1 = NULL, *mem2 = NULL; + int i; + + for (i = 1; i < argc; ++i) { + if (strcmp(argv[i], "-s") == 0) + use_small_folio = 1; + else if (strcmp(argv[i], "-a") == 0) + aligned_swapin = 1; + } + + mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP); + if (mem1 == NULL) { + fprintf(stderr, "Failed to allocate large folios memory\n"); + return EXIT_FAILURE; + } + + if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) { + perror("madvise hugepage for mem1"); + free(mem1); + return EXIT_FAILURE; + } + + if (use_small_folio) { + mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP); + if (mem2 == NULL) { + fprintf(stderr, "Failed to allocate small folios memory\n"); + free(mem1); + return EXIT_FAILURE; + } + + if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) { + perror("madvise nohugepage for mem2"); + free(mem1); + free(mem2); + return EXIT_FAILURE; + } + } + + /* warm-up phase to occupy the swapfile */ + memset(mem1, 0x11, MEMSIZE_MTHP); + madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT); + if (use_small_folio) { + memset(mem2, 0x11, MEMSIZE_SMALLFOLIO); + madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT); + } + + /* iterations with newly created mTHP, swap-in, and swap-out */ + for (i = 0; i < 100; ++i) { + unsigned long initial_swpout; + unsigned long initial_swpout_fallback; + unsigned long final_swpout; + unsigned long final_swpout_fallback; + unsigned long swpout_inc; + unsigned long swpout_fallback_inc; + double fallback_percentage; + + initial_swpout = read_stat(SWPOUT_PATH); + initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH); + + /* + * The following setup creates a 1:1 ratio of mTHP to small folios + * since large folio swap-in isn't supported yet. Once we support + * mTHP swap-in, we'll likely need to reduce MEMSIZE_MTHP and + * increase MEMSIZE_SMALLFOLIO to maintain the ratio. + */ + random_swapin(mem1, MEMSIZE_MTHP, + aligned_swapin ? ALIGNMENT_MTHP : ALIGNMENT_SMALLFOLIO, + TOTAL_DONTNEED_MTHP); + random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP, + TOTAL_DONTNEED_MTHP); + + if (use_small_folio) { + random_swapin(mem2, MEMSIZE_SMALLFOLIO, + ALIGNMENT_SMALLFOLIO, + TOTAL_DONTNEED_SMALLFOLIO); + } + + if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) { + perror("madvise pageout for mem1"); + free(mem1); + if (mem2 != NULL) + free(mem2); + return EXIT_FAILURE; + } + + if (use_small_folio) { + if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) { + perror("madvise pageout for mem2"); + free(mem1); + free(mem2); + return EXIT_FAILURE; + } + } + + final_swpout = read_stat(SWPOUT_PATH); + final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH); + + swpout_inc = final_swpout - initial_swpout; + swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback; + + fallback_percentage = (double)swpout_fallback_inc / + (swpout_fallback_inc + swpout_inc) * 100; + + printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n", + i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage); + } + + free(mem1); + if (mem2 != NULL) + free(mem2); + + return EXIT_SUCCESS; +} diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 0a33d9195b7a..01237d167223 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1202,6 +1202,8 @@ static const char *uaccess_safe_builtin[] = { "__sanitizer_cov_trace_switch", /* KMSAN */ "kmsan_copy_to_user", + "kmsan_disable_current", + "kmsan_enable_current", "kmsan_report", "kmsan_unpoison_entry_regs", "kmsan_unpoison_memory", diff --git a/tools/testing/selftests/cgroup/config b/tools/testing/selftests/cgroup/config index 97d549ee894f..39f979690dd3 100644 --- a/tools/testing/selftests/cgroup/config +++ b/tools/testing/selftests/cgroup/config @@ -3,5 +3,4 @@ CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_SCHED=y CONFIG_MEMCG=y -CONFIG_MEMCG_KMEM=y CONFIG_PAGE_COUNTER=y diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile index 29a22f50e762..1e2e98cc809d 100644 --- a/tools/testing/selftests/damon/Makefile +++ b/tools/testing/selftests/damon/Makefile @@ -4,7 +4,7 @@ TEST_GEN_FILES += huge_count_read_write TEST_GEN_FILES += debugfs_target_ids_read_before_terminate_race TEST_GEN_FILES += debugfs_target_ids_pid_leak -TEST_GEN_FILES += access_memory +TEST_GEN_FILES += access_memory access_memory_even TEST_FILES = _chk_dependency.sh _debugfs_common.sh @@ -13,6 +13,7 @@ TEST_PROGS = debugfs_attrs.sh debugfs_schemes.sh debugfs_target_ids.sh TEST_PROGS += sysfs.sh TEST_PROGS += sysfs_update_schemes_tried_regions_wss_estimation.py TEST_PROGS += damos_quota.py damos_quota_goal.py damos_apply_interval.py +TEST_PROGS += damos_tried_regions.py damon_nr_regions.py TEST_PROGS += reclaim.sh lru_sort.sh # regression tests (reproducers of previously found bugs) diff --git a/tools/testing/selftests/damon/_damon_sysfs.py b/tools/testing/selftests/damon/_damon_sysfs.py index 2bd44c32be1b..6e136dc3df19 100644 --- a/tools/testing/selftests/damon/_damon_sysfs.py +++ b/tools/testing/selftests/damon/_damon_sysfs.py @@ -175,16 +175,24 @@ class DamosStats: self.sz_applied = sz_applied self.qt_exceeds = qt_exceeds +class DamosTriedRegion: + def __init__(self, start, end, nr_accesses, age): + self.start = start + self.end = end + self.nr_accesses = nr_accesses + self.age = age + class Damos: action = None access_pattern = None quota = None apply_interval_us = None - # todo: Support watermarks, stats, tried_regions + # todo: Support watermarks, stats idx = None context = None tried_bytes = None stats = None + tried_regions = None def __init__(self, action='stat', access_pattern=DamosAccessPattern(), quota=DamosQuota(), apply_interval_us=0): @@ -398,6 +406,35 @@ class Kdamond: err = write_file(os.path.join(self.sysfs_dir(), 'state'), 'on') return err + def stop(self): + err = write_file(os.path.join(self.sysfs_dir(), 'state'), 'off') + return err + + def update_schemes_tried_regions(self): + err = write_file(os.path.join(self.sysfs_dir(), 'state'), + 'update_schemes_tried_regions') + if err is not None: + return err + for context in self.contexts: + for scheme in context.schemes: + tried_regions = [] + tried_regions_dir = os.path.join( + scheme.sysfs_dir(), 'tried_regions') + for filename in os.listdir( + os.path.join(scheme.sysfs_dir(), 'tried_regions')): + tried_region_dir = os.path.join(tried_regions_dir, filename) + if not os.path.isdir(tried_region_dir): + continue + region_values = [] + for f in ['start', 'end', 'nr_accesses', 'age']: + content, err = read_file( + os.path.join(tried_region_dir, f)) + if err is not None: + return err + region_values.append(int(content)) + tried_regions.append(DamosTriedRegion(*region_values)) + scheme.tried_regions = tried_regions + def update_schemes_tried_bytes(self): err = write_file(os.path.join(self.sysfs_dir(), 'state'), 'update_schemes_tried_bytes') @@ -444,6 +481,25 @@ class Kdamond: goal.effective_bytes = int(content) return None + def commit(self): + nr_contexts_file = os.path.join(self.sysfs_dir(), + 'contexts', 'nr_contexts') + content, err = read_file(nr_contexts_file) + if err is not None: + return err + if int(content) != len(self.contexts): + err = write_file(nr_contexts_file, '%d' % len(self.contexts)) + if err is not None: + return err + + for context in self.contexts: + err = context.stage() + if err is not None: + return err + err = write_file(os.path.join(self.sysfs_dir(), 'state'), 'commit') + return err + + def commit_schemes_quota_goals(self): for context in self.contexts: for scheme in context.schemes: @@ -478,3 +534,10 @@ class Kdamonds: if err is not None: return err return None + + def stop(self): + for kdamond in self.kdamonds: + err = kdamond.stop() + if err is not None: + return err + return None diff --git a/tools/testing/selftests/damon/access_memory.c b/tools/testing/selftests/damon/access_memory.c index 585a2fa54329..56b17e8fe1be 100644 --- a/tools/testing/selftests/damon/access_memory.c +++ b/tools/testing/selftests/damon/access_memory.c @@ -35,7 +35,7 @@ int main(int argc, char *argv[]) start_clock = clock(); while ((clock() - start_clock) * 1000 / CLOCKS_PER_SEC < access_time_ms) - memset(regions[i], i, 1024 * 1024 * 10); + memset(regions[i], i, sz_region); } return 0; } diff --git a/tools/testing/selftests/damon/access_memory_even.c b/tools/testing/selftests/damon/access_memory_even.c new file mode 100644 index 000000000000..3be121487432 --- /dev/null +++ b/tools/testing/selftests/damon/access_memory_even.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Artificial memory access program for testing DAMON. + * + * Receives number of regions and size of each region from user. Allocate the + * regions and repeatedly access even numbered (starting from zero) regions. + */ + +#include +#include +#include +#include + +int main(int argc, char *argv[]) +{ + char **regions; + clock_t start_clock; + int nr_regions; + int sz_region; + int access_time_ms; + int i; + + if (argc != 3) { + printf("Usage: %s \n", argv[0]); + return -1; + } + + nr_regions = atoi(argv[1]); + sz_region = atoi(argv[2]); + + regions = malloc(sizeof(*regions) * nr_regions); + for (i = 0; i < nr_regions; i++) + regions[i] = malloc(sz_region); + + while (1) { + for (i = 0; i < nr_regions; i++) { + if (i % 2 == 0) + memset(regions[i], i, sz_region); + } + } + return 0; +} diff --git a/tools/testing/selftests/damon/damon_nr_regions.py b/tools/testing/selftests/damon/damon_nr_regions.py new file mode 100644 index 000000000000..2e8a74aff543 --- /dev/null +++ b/tools/testing/selftests/damon/damon_nr_regions.py @@ -0,0 +1,145 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +import subprocess +import time + +import _damon_sysfs + +def test_nr_regions(real_nr_regions, min_nr_regions, max_nr_regions): + ''' + Create process of the given 'real_nr_regions' regions, monitor it using + DAMON with given '{min,max}_nr_regions' monitoring parameter. + + Exit with non-zero return code if the given {min,max}_nr_regions is not + kept. + ''' + sz_region = 10 * 1024 * 1024 + proc = subprocess.Popen(['./access_memory_even', '%d' % real_nr_regions, + '%d' % sz_region]) + + # stat every monitored regions + kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond( + contexts=[_damon_sysfs.DamonCtx( + monitoring_attrs=_damon_sysfs.DamonAttrs( + min_nr_regions=min_nr_regions, + max_nr_regions=max_nr_regions), + ops='vaddr', + targets=[_damon_sysfs.DamonTarget(pid=proc.pid)], + schemes=[_damon_sysfs.Damos(action='stat', + )] # schemes + )] # contexts + )]) # kdamonds + + err = kdamonds.start() + if err is not None: + proc.terminate() + print('kdamond start failed: %s' % err) + exit(1) + + collected_nr_regions = [] + while proc.poll() is None: + time.sleep(0.1) + err = kdamonds.kdamonds[0].update_schemes_tried_regions() + if err is not None: + proc.terminate() + print('tried regions update failed: %s' % err) + exit(1) + + scheme = kdamonds.kdamonds[0].contexts[0].schemes[0] + if scheme.tried_regions is None: + proc.terminate() + print('tried regions is not collected') + exit(1) + + nr_tried_regions = len(scheme.tried_regions) + if nr_tried_regions <= 0: + proc.terminate() + print('tried regions is not created') + exit(1) + collected_nr_regions.append(nr_tried_regions) + if len(collected_nr_regions) > 10: + break + proc.terminate() + kdamonds.stop() + + test_name = 'nr_regions test with %d/%d/%d real/min/max nr_regions' % ( + real_nr_regions, min_nr_regions, max_nr_regions) + if (collected_nr_regions[0] < min_nr_regions or + collected_nr_regions[-1] > max_nr_regions): + print('fail %s' % test_name) + print('number of regions that collected are:') + for nr in collected_nr_regions: + print(nr) + exit(1) + print('pass %s ' % test_name) + +def main(): + # test min_nr_regions larger than real nr regions + test_nr_regions(10, 20, 100) + + # test max_nr_regions smaller than real nr regions + test_nr_regions(15, 3, 10) + + # test online-tuned max_nr_regions that smaller than real nr regions + sz_region = 10 * 1024 * 1024 + proc = subprocess.Popen(['./access_memory_even', '14', '%d' % sz_region]) + + # stat every monitored regions + kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond( + contexts=[_damon_sysfs.DamonCtx( + monitoring_attrs=_damon_sysfs.DamonAttrs( + min_nr_regions=10, max_nr_regions=1000), + ops='vaddr', + targets=[_damon_sysfs.DamonTarget(pid=proc.pid)], + schemes=[_damon_sysfs.Damos(action='stat', + )] # schemes + )] # contexts + )]) # kdamonds + + err = kdamonds.start() + if err is not None: + proc.terminate() + print('kdamond start failed: %s' % err) + exit(1) + + # wait until the real regions are found + time.sleep(3) + + attrs = kdamonds.kdamonds[0].contexts[0].monitoring_attrs + attrs.min_nr_regions = 3 + attrs.max_nr_regions = 7 + err = kdamonds.kdamonds[0].commit() + if err is not None: + proc.terminate() + print('commit failed: %s' % err) + exit(1) + # wait for next merge operation is executed + time.sleep(0.3) + + err = kdamonds.kdamonds[0].update_schemes_tried_regions() + if err is not None: + proc.terminate() + print('tried regions update failed: %s' % err) + exit(1) + + scheme = kdamonds.kdamonds[0].contexts[0].schemes[0] + if scheme.tried_regions is None: + proc.terminate() + print('tried regions is not collected') + exit(1) + + nr_tried_regions = len(scheme.tried_regions) + if nr_tried_regions <= 0: + proc.terminate() + print('tried regions is not created') + exit(1) + proc.terminate() + + if nr_tried_regions > 7: + print('fail online-tuned max_nr_regions: %d > 7' % nr_tried_regions) + exit(1) + print('pass online-tuned max_nr_regions') + +if __name__ == '__main__': + main() diff --git a/tools/testing/selftests/damon/damos_tried_regions.py b/tools/testing/selftests/damon/damos_tried_regions.py new file mode 100644 index 000000000000..3b347eb28bd2 --- /dev/null +++ b/tools/testing/selftests/damon/damos_tried_regions.py @@ -0,0 +1,65 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +import subprocess +import time + +import _damon_sysfs + +def main(): + # repeatedly access even-numbered ones in 14 regions of 10 MiB size + sz_region = 10 * 1024 * 1024 + proc = subprocess.Popen(['./access_memory_even', '14', '%d' % sz_region]) + + # stat every monitored regions + kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond( + contexts=[_damon_sysfs.DamonCtx( + ops='vaddr', + targets=[_damon_sysfs.DamonTarget(pid=proc.pid)], + schemes=[_damon_sysfs.Damos(action='stat', + )] # schemes + )] # contexts + )]) # kdamonds + + err = kdamonds.start() + if err is not None: + proc.terminate() + print('kdamond start failed: %s' % err) + exit(1) + + collected_nr_regions = [] + while proc.poll() is None: + time.sleep(0.1) + err = kdamonds.kdamonds[0].update_schemes_tried_regions() + if err is not None: + proc.terminate() + print('tried regions update failed: %s' % err) + exit(1) + + scheme = kdamonds.kdamonds[0].contexts[0].schemes[0] + if scheme.tried_regions is None: + proc.terminate() + print('tried regions is not collected') + exit(1) + + nr_tried_regions = len(scheme.tried_regions) + if nr_tried_regions <= 0: + proc.terminate() + print('tried regions is not created') + exit(1) + collected_nr_regions.append(nr_tried_regions) + if len(collected_nr_regions) > 10: + break + proc.terminate() + + collected_nr_regions.sort() + sample = collected_nr_regions[4] + print('50-th percentile nr_regions: %d' % sample) + print('expectation (>= 14) is %s' % 'met' if sample >= 14 else 'not met') + if collected_nr_regions[4] < 14: + print('full nr_regions:') + print('\n'.join(collected_nr_regions)) + exit(1) + +if __name__ == '__main__': + main() diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c index c812080e304e..6062723a172e 100644 --- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c +++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c @@ -9,52 +9,162 @@ #include #include #include +#include #include #include +#include #include #include +#include "../../kselftest.h" #define TEST_PREFIX "drivers/dma-buf/udmabuf" #define NUM_PAGES 4 +#define NUM_ENTRIES 4 +#define MEMFD_SIZE 1024 /* in pages */ -static int memfd_create(const char *name, unsigned int flags) +static unsigned int page_size; + +static int create_memfd_with_seals(off64_t size, bool hpage) { - return syscall(__NR_memfd_create, name, flags); + int memfd, ret; + unsigned int flags = MFD_ALLOW_SEALING; + + if (hpage) + flags |= MFD_HUGETLB; + + memfd = memfd_create("udmabuf-test", flags); + if (memfd < 0) { + ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(KSFT_SKIP); + } + + ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK); + if (ret < 0) { + ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(KSFT_SKIP); + } + + ret = ftruncate(memfd, size); + if (ret == -1) { + ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return memfd; +} + +static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size) +{ + struct udmabuf_create_list *list; + int ubuf_fd, i; + + list = malloc(sizeof(struct udmabuf_create_list) + + sizeof(struct udmabuf_create_item) * NUM_ENTRIES); + if (!list) { + ksft_print_msg("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + for (i = 0; i < NUM_ENTRIES; i++) { + list->list[i].memfd = memfd; + list->list[i].offset = i * (memfd_size / NUM_ENTRIES); + list->list[i].size = getpagesize() * NUM_PAGES; + } + + list->count = NUM_ENTRIES; + list->flags = UDMABUF_FLAGS_CLOEXEC; + ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list); + free(list); + if (ubuf_fd < 0) { + ksft_print_msg("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return ubuf_fd; +} + +static void write_to_memfd(void *addr, off64_t size, char chr) +{ + int i; + + for (i = 0; i < size / page_size; i++) { + *((char *)addr + (i * page_size)) = chr; + } +} + +static void *mmap_fd(int fd, off64_t size) +{ + void *addr; + + addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + if (addr == MAP_FAILED) { + ksft_print_msg("%s: ubuf_fd mmap fail\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return addr; +} + +static int compare_chunks(void *addr1, void *addr2, off64_t memfd_size) +{ + off64_t off; + int i = 0, j, k = 0, ret = 0; + char char1, char2; + + while (i < NUM_ENTRIES) { + off = i * (memfd_size / NUM_ENTRIES); + for (j = 0; j < NUM_PAGES; j++, k++) { + char1 = *((char *)addr1 + off + (j * getpagesize())); + char2 = *((char *)addr2 + (k * getpagesize())); + if (char1 != char2) { + ret = -1; + goto err; + } + } + i++; + } +err: + munmap(addr1, memfd_size); + munmap(addr2, NUM_ENTRIES * NUM_PAGES * getpagesize()); + return ret; } int main(int argc, char *argv[]) { struct udmabuf_create create; int devfd, memfd, buf, ret; - off_t size; - void *mem; + off64_t size; + void *addr1, *addr2; + + ksft_print_header(); + ksft_set_plan(6); devfd = open("/dev/udmabuf", O_RDWR); if (devfd < 0) { - printf("%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n", - TEST_PREFIX); - exit(77); + ksft_print_msg( + "%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n", + TEST_PREFIX); + exit(KSFT_SKIP); } memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); if (memfd < 0) { - printf("%s: [skip,no-memfd]\n", TEST_PREFIX); - exit(77); + ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(KSFT_SKIP); } ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK); if (ret < 0) { - printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); - exit(77); + ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(KSFT_SKIP); } - size = getpagesize() * NUM_PAGES; ret = ftruncate(memfd, size); if (ret == -1) { - printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); - exit(1); + ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(KSFT_FAIL); } memset(&create, 0, sizeof(create)); @@ -64,44 +174,86 @@ int main(int argc, char *argv[]) create.offset = getpagesize()/2; create.size = getpagesize(); buf = ioctl(devfd, UDMABUF_CREATE, &create); - if (buf >= 0) { - printf("%s: [FAIL,test-1]\n", TEST_PREFIX); - exit(1); - } + if (buf >= 0) + ksft_test_result_fail("%s: [FAIL,test-1]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-1]\n", TEST_PREFIX); /* should fail (size not multiple of page) */ create.memfd = memfd; create.offset = 0; create.size = getpagesize()/2; buf = ioctl(devfd, UDMABUF_CREATE, &create); - if (buf >= 0) { - printf("%s: [FAIL,test-2]\n", TEST_PREFIX); - exit(1); - } + if (buf >= 0) + ksft_test_result_fail("%s: [FAIL,test-2]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-2]\n", TEST_PREFIX); /* should fail (not memfd) */ create.memfd = 0; /* stdin */ create.offset = 0; create.size = size; buf = ioctl(devfd, UDMABUF_CREATE, &create); - if (buf >= 0) { - printf("%s: [FAIL,test-3]\n", TEST_PREFIX); - exit(1); - } + if (buf >= 0) + ksft_test_result_fail("%s: [FAIL,test-3]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-3]\n", TEST_PREFIX); /* should work */ + page_size = getpagesize(); + addr1 = mmap_fd(memfd, size); + write_to_memfd(addr1, size, 'a'); create.memfd = memfd; create.offset = 0; create.size = size; buf = ioctl(devfd, UDMABUF_CREATE, &create); - if (buf < 0) { - printf("%s: [FAIL,test-4]\n", TEST_PREFIX); - exit(1); - } + if (buf < 0) + ksft_test_result_fail("%s: [FAIL,test-4]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-4]\n", TEST_PREFIX); + + munmap(addr1, size); + close(buf); + close(memfd); + + /* should work (migration of 4k size pages)*/ + size = MEMFD_SIZE * page_size; + memfd = create_memfd_with_seals(size, false); + addr1 = mmap_fd(memfd, size); + write_to_memfd(addr1, size, 'a'); + buf = create_udmabuf_list(devfd, memfd, size); + addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize()); + write_to_memfd(addr1, size, 'b'); + ret = compare_chunks(addr1, addr2, size); + if (ret < 0) + ksft_test_result_fail("%s: [FAIL,test-5]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-5]\n", TEST_PREFIX); + + close(buf); + close(memfd); + + /* should work (migration of 2MB size huge pages)*/ + page_size = getpagesize() * 512; /* 2 MB */ + size = MEMFD_SIZE * page_size; + memfd = create_memfd_with_seals(size, true); + addr1 = mmap_fd(memfd, size); + write_to_memfd(addr1, size, 'a'); + buf = create_udmabuf_list(devfd, memfd, size); + addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize()); + write_to_memfd(addr1, size, 'b'); + ret = compare_chunks(addr1, addr2, size); + if (ret < 0) + ksft_test_result_fail("%s: [FAIL,test-6]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-6]\n", TEST_PREFIX); - fprintf(stderr, "%s: ok\n", TEST_PREFIX); close(buf); close(memfd); close(devfd); + + ksft_print_msg("%s: ok\n", TEST_PREFIX); + ksft_print_cnts(); + return 0; } diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile index ab67d58cfab7..ba012bc5aab9 100644 --- a/tools/testing/selftests/exec/Makefile +++ b/tools/testing/selftests/exec/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS = -Wall CFLAGS += -Wno-nonnull -CFLAGS += -D_GNU_SOURCE ALIGNS := 0x1000 0x200000 0x1000000 ALIGN_PIES := $(patsubst %,load_address.%,$(ALIGNS)) diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile index 994fa3468f17..f79f9bac7918 100644 --- a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 INCLUDES := -I../include -I../../ $(KHDR_INCLUDES) -CFLAGS := $(CFLAGS) -g -O2 -Wall -D_GNU_SOURCE= -pthread $(INCLUDES) $(KHDR_INCLUDES) +CFLAGS := $(CFLAGS) -g -O2 -Wall -pthread $(INCLUDES) $(KHDR_INCLUDES) LDLIBS := -lpthread -lrt LOCAL_HDRS := \ diff --git a/tools/testing/selftests/intel_pstate/Makefile b/tools/testing/selftests/intel_pstate/Makefile index 05d66ef50c97..f45372cb00fe 100644 --- a/tools/testing/selftests/intel_pstate/Makefile +++ b/tools/testing/selftests/intel_pstate/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 -CFLAGS := $(CFLAGS) -Wall -D_GNU_SOURCE +CFLAGS := $(CFLAGS) -Wall LDLIBS += -lm ARCH ?= $(shell uname -m 2>/dev/null || echo not) diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile index 32c5fdfd0eef..fd6477911f24 100644 --- a/tools/testing/selftests/iommu/Makefile +++ b/tools/testing/selftests/iommu/Makefile @@ -2,8 +2,6 @@ CFLAGS += -Wall -O2 -Wno-unused-function CFLAGS += $(KHDR_INCLUDES) -CFLAGS += -D_GNU_SOURCE - TEST_GEN_PROGS := TEST_GEN_PROGS += iommufd TEST_GEN_PROGS += iommufd_fail_nth diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index b084ba2262a0..48d32c5aa3eb 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -233,7 +233,7 @@ LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include endif CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT \ - -D_GNU_SOURCE -fno-builtin-memcmp -fno-builtin-memcpy \ + -fno-builtin-memcmp -fno-builtin-memcpy \ -fno-builtin-memset -fno-builtin-strnlen \ -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk index 7b299ed5ff45..d6edcfcb5be8 100644 --- a/tools/testing/selftests/lib.mk +++ b/tools/testing/selftests/lib.mk @@ -196,6 +196,9 @@ endef clean: $(if $(TEST_GEN_MODS_DIR),clean_mods_dir) $(CLEAN) +# Build with _GNU_SOURCE by default +CFLAGS += -D_GNU_SOURCE= + # Enables to extend CFLAGS and LDFLAGS from command line, e.g. # make USERCFLAGS=-Werror USERLDFLAGS=-static CFLAGS += $(USERCFLAGS) diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 0b9ab987601c..064e7b125643 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -6,6 +6,7 @@ hugepage-shm hugepage-vmemmap hugetlb-madvise hugetlb-read-hwpoison +hugetlb-soft-offline khugepaged map_hugetlb map_populate diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 3b49bc3d0a3b..e1aa09ddaa3d 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -2,6 +2,7 @@ # Makefile for mm selftests LOCAL_HDRS += $(selfdir)/mm/local_config.h $(top_srcdir)/mm/gup_test.h +LOCAL_HDRS += $(selfdir)/mm/mseal_helpers.h include local_config.mk @@ -42,6 +43,7 @@ TEST_GEN_FILES += gup_test TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugetlb-madvise TEST_GEN_FILES += hugetlb-read-hwpoison +TEST_GEN_FILES += hugetlb-soft-offline TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-mremap TEST_GEN_FILES += hugepage-shm @@ -73,6 +75,7 @@ TEST_GEN_FILES += ksm_functional_tests TEST_GEN_FILES += mdwe_test TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map +TEST_GEN_FILES += hugetlb_dio ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/hugepage-mremap.c b/tools/testing/selftests/mm/hugepage-mremap.c index c463d1c09c9b..ada9156cc497 100644 --- a/tools/testing/selftests/mm/hugepage-mremap.c +++ b/tools/testing/selftests/mm/hugepage-mremap.c @@ -15,7 +15,7 @@ #define _GNU_SOURCE #include #include -#include +#include #include #include #include /* Definition of O_* constants */ diff --git a/tools/testing/selftests/mm/hugetlb-soft-offline.c b/tools/testing/selftests/mm/hugetlb-soft-offline.c new file mode 100644 index 000000000000..f086f0e04756 --- /dev/null +++ b/tools/testing/selftests/mm/hugetlb-soft-offline.c @@ -0,0 +1,228 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test soft offline behavior for HugeTLB pages: + * - if enable_soft_offline = 0, hugepages should stay intact and soft + * offlining failed with EOPNOTSUPP. + * - if enable_soft_offline = 1, a hugepage should be dissolved and + * nr_hugepages/free_hugepages should be reduced by 1. + * + * Before running, make sure more than 2 hugepages of default_hugepagesz + * are allocated. For example, if /proc/meminfo/Hugepagesize is 2048kB: + * echo 8 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#ifndef MADV_SOFT_OFFLINE +#define MADV_SOFT_OFFLINE 101 +#endif + +#define EPREFIX " !!! " + +static int do_soft_offline(int fd, size_t len, int expect_errno) +{ + char *filemap = NULL; + char *hwp_addr = NULL; + const unsigned long pagesize = getpagesize(); + int ret = 0; + + if (ftruncate(fd, len) < 0) { + ksft_perror(EPREFIX "ftruncate to len failed"); + return -1; + } + + filemap = mmap(NULL, len, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, fd, 0); + if (filemap == MAP_FAILED) { + ksft_perror(EPREFIX "mmap failed"); + ret = -1; + goto untruncate; + } + + memset(filemap, 0xab, len); + ksft_print_msg("Allocated %#lx bytes of hugetlb pages\n", len); + + hwp_addr = filemap + len / 2; + ret = madvise(hwp_addr, pagesize, MADV_SOFT_OFFLINE); + ksft_print_msg("MADV_SOFT_OFFLINE %p ret=%d, errno=%d\n", + hwp_addr, ret, errno); + if (ret != 0) + ksft_perror(EPREFIX "madvise failed"); + + if (errno == expect_errno) + ret = 0; + else { + ksft_print_msg("MADV_SOFT_OFFLINE should ret %d\n", + expect_errno); + ret = -1; + } + + munmap(filemap, len); +untruncate: + if (ftruncate(fd, 0) < 0) + ksft_perror(EPREFIX "ftruncate back to 0 failed"); + + return ret; +} + +static int set_enable_soft_offline(int value) +{ + char cmd[256] = {0}; + FILE *cmdfile = NULL; + + if (value != 0 && value != 1) + return -EINVAL; + + sprintf(cmd, "echo %d > /proc/sys/vm/enable_soft_offline", value); + cmdfile = popen(cmd, "r"); + + if (cmdfile) + ksft_print_msg("enable_soft_offline => %d\n", value); + else { + ksft_perror(EPREFIX "failed to set enable_soft_offline"); + return errno; + } + + pclose(cmdfile); + return 0; +} + +static int read_nr_hugepages(unsigned long hugepage_size, + unsigned long *nr_hugepages) +{ + char buffer[256] = {0}; + char cmd[256] = {0}; + + sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages", + hugepage_size); + FILE *cmdfile = popen(cmd, "r"); + + if (cmdfile == NULL) { + ksft_perror(EPREFIX "failed to popen nr_hugepages"); + return -1; + } + + if (!fgets(buffer, sizeof(buffer), cmdfile)) { + ksft_perror(EPREFIX "failed to read nr_hugepages"); + pclose(cmdfile); + return -1; + } + + *nr_hugepages = atoll(buffer); + pclose(cmdfile); + return 0; +} + +static int create_hugetlbfs_file(struct statfs *file_stat) +{ + int fd; + + fd = memfd_create("hugetlb_tmp", MFD_HUGETLB); + if (fd < 0) { + ksft_perror(EPREFIX "could not open hugetlbfs file"); + return -1; + } + + memset(file_stat, 0, sizeof(*file_stat)); + if (fstatfs(fd, file_stat)) { + ksft_perror(EPREFIX "fstatfs failed"); + goto close; + } + if (file_stat->f_type != HUGETLBFS_MAGIC) { + ksft_print_msg(EPREFIX "not hugetlbfs file\n"); + goto close; + } + + return fd; +close: + close(fd); + return -1; +} + +static void test_soft_offline_common(int enable_soft_offline) +{ + int fd; + int expect_errno = enable_soft_offline ? 0 : EOPNOTSUPP; + struct statfs file_stat; + unsigned long hugepagesize_kb = 0; + unsigned long nr_hugepages_before = 0; + unsigned long nr_hugepages_after = 0; + int ret; + + ksft_print_msg("Test soft-offline when enabled_soft_offline=%d\n", + enable_soft_offline); + + fd = create_hugetlbfs_file(&file_stat); + if (fd < 0) + ksft_exit_fail_msg("Failed to create hugetlbfs file\n"); + + hugepagesize_kb = file_stat.f_bsize / 1024; + ksft_print_msg("Hugepagesize is %ldkB\n", hugepagesize_kb); + + if (set_enable_soft_offline(enable_soft_offline) != 0) { + close(fd); + ksft_exit_fail_msg("Failed to set enable_soft_offline\n"); + } + + if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) { + close(fd); + ksft_exit_fail_msg("Failed to read nr_hugepages\n"); + } + + ksft_print_msg("Before MADV_SOFT_OFFLINE nr_hugepages=%ld\n", + nr_hugepages_before); + + ret = do_soft_offline(fd, 2 * file_stat.f_bsize, expect_errno); + + if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) { + close(fd); + ksft_exit_fail_msg("Failed to read nr_hugepages\n"); + } + + ksft_print_msg("After MADV_SOFT_OFFLINE nr_hugepages=%ld\n", + nr_hugepages_after); + + // No need for the hugetlbfs file from now on. + close(fd); + + if (enable_soft_offline) { + if (nr_hugepages_before != nr_hugepages_after + 1) { + ksft_test_result_fail("MADV_SOFT_OFFLINE should reduced 1 hugepage\n"); + return; + } + } else { + if (nr_hugepages_before != nr_hugepages_after) { + ksft_test_result_fail("MADV_SOFT_OFFLINE reduced %lu hugepages\n", + nr_hugepages_before - nr_hugepages_after); + return; + } + } + + ksft_test_result(ret == 0, + "Test soft-offline when enabled_soft_offline=%d\n", + enable_soft_offline); +} + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(2); + + test_soft_offline_common(1); + test_soft_offline_common(0); + + ksft_finished(); +} diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c new file mode 100644 index 000000000000..f9ac20c657ec --- /dev/null +++ b/tools/testing/selftests/mm/hugetlb_dio.c @@ -0,0 +1,117 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program tests for hugepage leaks after DIO writes to a file using a + * hugepage as the user buffer. During DIO, the user buffer is pinned and + * should be properly unpinned upon completion. This patch verifies that the + * kernel correctly unpins the buffer at DIO completion for both aligned and + * unaligned user buffer offsets (w.r.t page boundary), ensuring the hugepage + * is freed upon unmapping. + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include "vm_util.h" +#include "../kselftest.h" + +void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off) +{ + int fd; + char *buffer = NULL; + char *orig_buffer = NULL; + size_t h_pagesize = 0; + size_t writesize; + int free_hpage_b = 0; + int free_hpage_a = 0; + const int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB; + const int mmap_prot = PROT_READ | PROT_WRITE; + + writesize = end_off - start_off; + + /* Get the default huge page size */ + h_pagesize = default_huge_page_size(); + if (!h_pagesize) + ksft_exit_fail_msg("Unable to determine huge page size\n"); + + /* Open the file to DIO */ + fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664); + if (fd < 0) + ksft_exit_fail_perror("Error opening file\n"); + + /* Get the free huge pages before allocation */ + free_hpage_b = get_free_hugepages(); + if (free_hpage_b == 0) { + close(fd); + ksft_exit_skip("No free hugepage, exiting!\n"); + } + + /* Allocate a hugetlb page */ + orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0); + if (orig_buffer == MAP_FAILED) { + close(fd); + ksft_exit_fail_perror("Error mapping memory\n"); + } + buffer = orig_buffer; + buffer += start_off; + + memset(buffer, 'A', writesize); + + /* Write the buffer to the file */ + if (write(fd, buffer, writesize) != (writesize)) { + munmap(orig_buffer, h_pagesize); + close(fd); + ksft_exit_fail_perror("Error writing to file\n"); + } + + /* unmap the huge page */ + munmap(orig_buffer, h_pagesize); + close(fd); + + /* Get the free huge pages after unmap*/ + free_hpage_a = get_free_hugepages(); + + /* + * If the no. of free hugepages before allocation and after unmap does + * not match - that means there could still be a page which is pinned. + */ + if (free_hpage_a != free_hpage_b) { + ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b); + ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a); + ksft_test_result_fail(": Huge pages not freed!\n"); + } else { + ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b); + ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a); + ksft_test_result_pass(": Huge pages freed successfully !\n"); + } +} + +int main(void) +{ + size_t pagesize = 0; + + ksft_print_header(); + ksft_set_plan(4); + + /* Get base page size */ + pagesize = psize(); + + /* start and end is aligned to pagesize */ + run_dio_using_hugetlb(0, (pagesize * 3)); + + /* start is aligned but end is not aligned */ + run_dio_using_hugetlb(0, (pagesize * 3) - (pagesize / 2)); + + /* start is unaligned and end is aligned */ + run_dio_using_hugetlb(pagesize / 2, (pagesize * 3)); + + /* both start and end are unaligned */ + run_dio_using_hugetlb(pagesize / 2, (pagesize * 3) + (pagesize / 2)); + + ksft_finished(); +} diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c index b61803e36d1c..66b4e111b5a2 100644 --- a/tools/testing/selftests/mm/ksm_functional_tests.c +++ b/tools/testing/selftests/mm/ksm_functional_tests.c @@ -11,7 +11,7 @@ #include #include #include -#include +#include #include #include #include @@ -369,7 +369,6 @@ unmap: munmap(map, size); } -#ifdef __NR_userfaultfd static void test_unmerge_uffd_wp(void) { struct uffdio_writeprotect uffd_writeprotect; @@ -430,7 +429,6 @@ close_uffd: unmap: munmap(map, size); } -#endif /* Verify that KSM can be enabled / queried with prctl. */ static void test_prctl(void) @@ -686,9 +684,7 @@ int main(int argc, char **argv) exit(test_child_ksm()); } -#ifdef __NR_userfaultfd tests++; -#endif ksft_print_header(); ksft_set_plan(tests); @@ -700,9 +696,7 @@ int main(int argc, char **argv) test_unmerge(); test_unmerge_zero_pages(); test_unmerge_discarded(); -#ifdef __NR_userfaultfd test_unmerge_uffd_wp(); -#endif test_prot_none(); diff --git a/tools/testing/selftests/mm/memfd_secret.c b/tools/testing/selftests/mm/memfd_secret.c index 9a0597310a76..74c911aa3aea 100644 --- a/tools/testing/selftests/mm/memfd_secret.c +++ b/tools/testing/selftests/mm/memfd_secret.c @@ -17,7 +17,7 @@ #include #include -#include +#include #include #include #include @@ -28,8 +28,6 @@ #define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__) #define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__) -#ifdef __NR_memfd_secret - #define PATTERN 0x55 static const int prot = PROT_READ | PROT_WRITE; @@ -334,13 +332,3 @@ int main(int argc, char *argv[]) ksft_finished(); } - -#else /* __NR_memfd_secret */ - -int main(int argc, char *argv[]) -{ - printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n"); - return KSFT_SKIP; -} - -#endif /* __NR_memfd_secret */ diff --git a/tools/testing/selftests/mm/mkdirty.c b/tools/testing/selftests/mm/mkdirty.c index b8a7efe9204e..1db134063c38 100644 --- a/tools/testing/selftests/mm/mkdirty.c +++ b/tools/testing/selftests/mm/mkdirty.c @@ -9,7 +9,7 @@ */ #include #include -#include +#include #include #include #include @@ -265,7 +265,6 @@ munmap: munmap(mmap_mem, mmap_size); } -#ifdef __NR_userfaultfd static void test_uffdio_copy(void) { struct uffdio_register uffdio_register; @@ -322,7 +321,6 @@ munmap: munmap(dst, pagesize); free(src); } -#endif /* __NR_userfaultfd */ int main(void) { @@ -335,9 +333,7 @@ int main(void) thpsize / 1024); tests += 3; } -#ifdef __NR_userfaultfd tests += 1; -#endif /* __NR_userfaultfd */ ksft_print_header(); ksft_set_plan(tests); @@ -367,9 +363,7 @@ int main(void) if (thpsize) test_pte_mapped_thp(); /* Placing a fresh page via userfaultfd may set the PTE dirty. */ -#ifdef __NR_userfaultfd test_uffdio_copy(); -#endif /* __NR_userfaultfd */ err = ksft_get_fail_cnt(); if (err) diff --git a/tools/testing/selftests/mm/mlock2.h b/tools/testing/selftests/mm/mlock2.h index 4417eaa5cfb7..1e5731bab499 100644 --- a/tools/testing/selftests/mm/mlock2.h +++ b/tools/testing/selftests/mm/mlock2.h @@ -3,6 +3,7 @@ #include #include #include +#include static int mlock2_(void *start, size_t len, int flags) { diff --git a/tools/testing/selftests/mm/mseal_helpers.h b/tools/testing/selftests/mm/mseal_helpers.h new file mode 100644 index 000000000000..0cfce31c76d2 --- /dev/null +++ b/tools/testing/selftests/mm/mseal_helpers.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#define FAIL_TEST_IF_FALSE(test_passed) \ + do { \ + if (!(test_passed)) { \ + ksft_test_result_fail("%s: line:%d\n", \ + __func__, __LINE__); \ + return; \ + } \ + } while (0) + +#define SKIP_TEST_IF_FALSE(test_passed) \ + do { \ + if (!(test_passed)) { \ + ksft_test_result_skip("%s: line:%d\n", \ + __func__, __LINE__); \ + return; \ + } \ + } while (0) + +#define REPORT_TEST_PASS() ksft_test_result_pass("%s\n", __func__) + +#ifndef PKEY_DISABLE_ACCESS +#define PKEY_DISABLE_ACCESS 0x1 +#endif + +#ifndef PKEY_DISABLE_WRITE +#define PKEY_DISABLE_WRITE 0x2 +#endif + +#ifndef PKEY_BITS_PER_PKEY +#define PKEY_BITS_PER_PKEY 2 +#endif + +#ifndef PKEY_MASK +#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif + +#ifndef u64 +#define u64 unsigned long long +#endif diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c index 41998cf1dcf5..a818f010de47 100644 --- a/tools/testing/selftests/mm/mseal_test.c +++ b/tools/testing/selftests/mm/mseal_test.c @@ -3,7 +3,7 @@ #include #include #include -#include +#include #include #include #include @@ -17,54 +17,7 @@ #include #include #include - -/* - * need those definition for manually build using gcc. - * gcc -I ../../../../usr/include -DDEBUG -O3 -DDEBUG -O3 mseal_test.c -o mseal_test - */ -#ifndef PKEY_DISABLE_ACCESS -# define PKEY_DISABLE_ACCESS 0x1 -#endif - -#ifndef PKEY_DISABLE_WRITE -# define PKEY_DISABLE_WRITE 0x2 -#endif - -#ifndef PKEY_BITS_PER_PKEY -#define PKEY_BITS_PER_PKEY 2 -#endif - -#ifndef PKEY_MASK -#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) -#endif - -#define FAIL_TEST_IF_FALSE(c) do {\ - if (!(c)) {\ - ksft_test_result_fail("%s, line:%d\n", __func__, __LINE__);\ - goto test_end;\ - } \ - } \ - while (0) - -#define SKIP_TEST_IF_FALSE(c) do {\ - if (!(c)) {\ - ksft_test_result_skip("%s, line:%d\n", __func__, __LINE__);\ - goto test_end;\ - } \ - } \ - while (0) - - -#define TEST_END_CHECK() {\ - ksft_test_result_pass("%s\n", __func__);\ - return;\ -test_end:\ - return;\ -} - -#ifndef u64 -#define u64 unsigned long long -#endif +#include "mseal_helpers.h" static unsigned long get_vma_size(void *addr, int *prot) { @@ -287,7 +240,7 @@ static void test_seal_addseal(void) ret = sys_mseal(ptr, size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_unmapped_start(void) @@ -315,7 +268,7 @@ static void test_seal_unmapped_start(void) ret = sys_mseal(ptr + 2 * page_size, 2 * page_size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_unmapped_middle(void) @@ -347,7 +300,7 @@ static void test_seal_unmapped_middle(void) ret = sys_mseal(ptr + 3 * page_size, page_size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_unmapped_end(void) @@ -376,7 +329,7 @@ static void test_seal_unmapped_end(void) ret = sys_mseal(ptr, 2 * page_size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_multiple_vmas(void) @@ -407,7 +360,7 @@ static void test_seal_multiple_vmas(void) ret = sys_mseal(ptr, size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_split_start(void) @@ -432,7 +385,7 @@ static void test_seal_split_start(void) ret = sys_mseal(ptr + page_size, 3 * page_size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_split_end(void) @@ -457,7 +410,7 @@ static void test_seal_split_end(void) ret = sys_mseal(ptr, 3 * page_size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_invalid_input(void) @@ -492,7 +445,7 @@ static void test_seal_invalid_input(void) ret = sys_mseal(ptr - page_size, 5 * page_size); FAIL_TEST_IF_FALSE(ret < 0); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_zero_length(void) @@ -516,7 +469,7 @@ static void test_seal_zero_length(void) ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_zero_address(void) @@ -542,7 +495,7 @@ static void test_seal_zero_address(void) ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_twice(void) @@ -562,7 +515,7 @@ static void test_seal_twice(void) ret = sys_mseal(ptr, size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect(bool seal) @@ -586,7 +539,7 @@ static void test_seal_mprotect(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_start_mprotect(bool seal) @@ -616,7 +569,7 @@ static void test_seal_start_mprotect(bool seal) PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_end_mprotect(bool seal) @@ -646,7 +599,7 @@ static void test_seal_end_mprotect(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_unalign_len(bool seal) @@ -675,7 +628,7 @@ static void test_seal_mprotect_unalign_len(bool seal) PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_unalign_len_variant_2(bool seal) @@ -703,7 +656,7 @@ static void test_seal_mprotect_unalign_len_variant_2(bool seal) PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_two_vma(bool seal) @@ -738,7 +691,7 @@ static void test_seal_mprotect_two_vma(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_two_vma_with_split(bool seal) @@ -785,7 +738,7 @@ static void test_seal_mprotect_two_vma_with_split(bool seal) PROT_READ | PROT_WRITE); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_partial_mprotect(bool seal) @@ -811,7 +764,7 @@ static void test_seal_mprotect_partial_mprotect(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_two_vma_with_gap(bool seal) @@ -854,7 +807,7 @@ static void test_seal_mprotect_two_vma_with_gap(bool seal) ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ); FAIL_TEST_IF_FALSE(ret == 0); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_split(bool seal) @@ -891,7 +844,7 @@ static void test_seal_mprotect_split(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mprotect_merge(bool seal) @@ -925,7 +878,7 @@ static void test_seal_mprotect_merge(bool seal) ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); FAIL_TEST_IF_FALSE(ret == 0); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_munmap(bool seal) @@ -950,7 +903,7 @@ static void test_seal_munmap(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } /* @@ -990,7 +943,7 @@ static void test_seal_munmap_two_vma(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } /* @@ -1028,7 +981,7 @@ static void test_seal_munmap_vma_with_gap(bool seal) ret = sys_munmap(ptr, size); FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_munmap_start_freed(bool seal) @@ -1068,7 +1021,7 @@ static void test_munmap_start_freed(bool seal) FAIL_TEST_IF_FALSE(size == 0); } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_munmap_end_freed(bool seal) @@ -1098,7 +1051,7 @@ static void test_munmap_end_freed(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_munmap_middle_freed(bool seal) @@ -1142,7 +1095,7 @@ static void test_munmap_middle_freed(bool seal) FAIL_TEST_IF_FALSE(size == 0); } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_shrink(bool seal) @@ -1171,7 +1124,7 @@ static void test_seal_mremap_shrink(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_expand(bool seal) @@ -1203,7 +1156,7 @@ static void test_seal_mremap_expand(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_move(bool seal) @@ -1236,7 +1189,7 @@ static void test_seal_mremap_move(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mmap_overwrite_prot(bool seal) @@ -1264,7 +1217,7 @@ static void test_seal_mmap_overwrite_prot(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == ptr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mmap_expand(bool seal) @@ -1295,7 +1248,7 @@ static void test_seal_mmap_expand(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == ptr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mmap_shrink(bool seal) @@ -1323,7 +1276,7 @@ static void test_seal_mmap_shrink(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == ptr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_shrink_fixed(bool seal) @@ -1354,7 +1307,7 @@ static void test_seal_mremap_shrink_fixed(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == newAddr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_expand_fixed(bool seal) @@ -1385,7 +1338,7 @@ static void test_seal_mremap_expand_fixed(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == newAddr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_move_fixed(bool seal) @@ -1415,7 +1368,7 @@ static void test_seal_mremap_move_fixed(bool seal) } else FAIL_TEST_IF_FALSE(ret2 == newAddr); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_move_fixed_zero(bool seal) @@ -1447,7 +1400,7 @@ static void test_seal_mremap_move_fixed_zero(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_move_dontunmap(bool seal) @@ -1476,7 +1429,7 @@ static void test_seal_mremap_move_dontunmap(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_mremap_move_dontunmap_anyaddr(bool seal) @@ -1510,7 +1463,7 @@ static void test_seal_mremap_move_dontunmap_anyaddr(bool seal) } - TEST_END_CHECK(); + REPORT_TEST_PASS(); } @@ -1603,7 +1556,7 @@ static void test_seal_merge_and_split(void) FAIL_TEST_IF_FALSE(size == 22 * page_size); FAIL_TEST_IF_FALSE(prot == 0x4); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_discard_ro_anon_on_rw(bool seal) @@ -1632,7 +1585,7 @@ static void test_seal_discard_ro_anon_on_rw(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_discard_ro_anon_on_pkey(bool seal) @@ -1679,7 +1632,7 @@ static void test_seal_discard_ro_anon_on_pkey(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_discard_ro_anon_on_filebacked(bool seal) @@ -1716,7 +1669,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal) FAIL_TEST_IF_FALSE(!ret); close(fd); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_discard_ro_anon_on_shared(bool seal) @@ -1745,7 +1698,7 @@ static void test_seal_discard_ro_anon_on_shared(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } static void test_seal_discard_ro_anon(bool seal) @@ -1775,7 +1728,7 @@ static void test_seal_discard_ro_anon(bool seal) else FAIL_TEST_IF_FALSE(!ret); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } int main(int argc, char **argv) diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index 2d785aca72a5..fc90af2a97b8 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -15,7 +15,7 @@ #include #include #include -#include +#include #include #include #include @@ -1567,8 +1567,10 @@ int main(int argc, char *argv[]) /* 7. File Hugetlb testing */ mem_size = 2*1024*1024; fd = memfd_create("uffd-test", MFD_HUGETLB | MFD_NOEXEC_SEAL); + if (fd < 0) + ksft_exit_fail_msg("uffd-test creation failed %d %s\n", errno, strerror(errno)); mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); - if (mem) { + if (mem != MAP_FAILED) { wp_init(mem, mem_size); wp_addr_range(mem, mem_size); diff --git a/tools/testing/selftests/mm/protection_keys.c b/tools/testing/selftests/mm/protection_keys.c index 48dc151f8fca..eaa6d1fc5328 100644 --- a/tools/testing/selftests/mm/protection_keys.c +++ b/tools/testing/selftests/mm/protection_keys.c @@ -42,7 +42,7 @@ #include #include #include -#include +#include #include #include diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 3157204b9047..03ac4f2e1cce 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -265,6 +265,7 @@ CATEGORY="hugetlb" run_test ./map_hugetlb CATEGORY="hugetlb" run_test ./hugepage-mremap CATEGORY="hugetlb" run_test ./hugepage-vmemmap CATEGORY="hugetlb" run_test ./hugetlb-madvise +CATEGORY="hugetlb" run_test ./hugetlb_dio nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages) # For this test, we need one and just one huge page @@ -331,6 +332,12 @@ CATEGORY="hugetlb" run_test ./thuge-gen CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh -cgroup-v2 CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh -cgroup-v2 if $RUN_DESTRUCTIVE; then +nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages) +enable_soft_offline=$(cat /proc/sys/vm/enable_soft_offline) +echo 8 > /proc/sys/vm/nr_hugepages +CATEGORY="hugetlb" run_test ./hugetlb-soft-offline +echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages +echo "$enable_soft_offline" > /proc/sys/vm/enable_soft_offline CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison fi diff --git a/tools/testing/selftests/mm/seal_elf.c b/tools/testing/selftests/mm/seal_elf.c index f2babec79bb6..7aa1366063e4 100644 --- a/tools/testing/selftests/mm/seal_elf.c +++ b/tools/testing/selftests/mm/seal_elf.c @@ -2,7 +2,7 @@ #define _GNU_SOURCE #include #include -#include +#include #include #include #include @@ -16,38 +16,7 @@ #include #include #include - -/* - * need those definition for manually build using gcc. - * gcc -I ../../../../usr/include -DDEBUG -O3 -DDEBUG -O3 seal_elf.c -o seal_elf - */ -#define FAIL_TEST_IF_FALSE(c) do {\ - if (!(c)) {\ - ksft_test_result_fail("%s, line:%d\n", __func__, __LINE__);\ - goto test_end;\ - } \ - } \ - while (0) - -#define SKIP_TEST_IF_FALSE(c) do {\ - if (!(c)) {\ - ksft_test_result_skip("%s, line:%d\n", __func__, __LINE__);\ - goto test_end;\ - } \ - } \ - while (0) - - -#define TEST_END_CHECK() {\ - ksft_test_result_pass("%s\n", __func__);\ - return;\ -test_end:\ - return;\ -} - -#ifndef u64 -#define u64 unsigned long long -#endif +#include "mseal_helpers.h" /* * define sys_xyx to call syscall directly. @@ -158,7 +127,7 @@ static void test_seal_elf(void) FAIL_TEST_IF_FALSE(ret < 0); ksft_print_msg("somestr is sealed, mprotect is rejected\n"); - TEST_END_CHECK(); + REPORT_TEST_PASS(); } int main(int argc, char **argv) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index d3c7f5fb3e7b..e5e8dafc9d94 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -300,7 +300,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, char **addr) { size_t i; - int __attribute__((unused)) dummy = 0; + int dummy = 0; srand(time(NULL)); @@ -341,6 +341,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, for (size_t i = 0; i < fd_size; i++) dummy += *(*addr + i); + asm volatile("" : "+r" (dummy)); if (!check_huge_file(*addr, fd_size / pmd_pagesize, pmd_pagesize)) { ksft_print_msg("No large pagecache folio generated, please provide a filesystem supporting large folio\n"); diff --git a/tools/testing/selftests/mm/thuge-gen.c b/tools/testing/selftests/mm/thuge-gen.c index ea7fd8fe2876..e4370b79b62f 100644 --- a/tools/testing/selftests/mm/thuge-gen.c +++ b/tools/testing/selftests/mm/thuge-gen.c @@ -13,8 +13,9 @@ sudo ipcs | awk '$1 == "0x00000000" {print $2}' | xargs -n1 sudo ipcrm -m (warning this will remove all if someone else uses them) */ -#define _GNU_SOURCE 1 +#define _GNU_SOURCE #include +#include #include #include #include @@ -28,19 +29,23 @@ #include "vm_util.h" #include "../kselftest.h" -#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) -#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT) -#define MAP_HUGE_SHIFT 26 -#define MAP_HUGE_MASK 0x3f #if !defined(MAP_HUGETLB) #define MAP_HUGETLB 0x40000 #endif #define SHM_HUGETLB 04000 /* segment will use huge TLB pages */ +#ifndef SHM_HUGE_SHIFT #define SHM_HUGE_SHIFT 26 +#endif +#ifndef SHM_HUGE_MASK #define SHM_HUGE_MASK 0x3f +#endif +#ifndef SHM_HUGE_2MB #define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT) +#endif +#ifndef SHM_HUGE_1GB #define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT) +#endif #define NUM_PAGESIZES 5 #define NUM_PAGES 4 diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c index 7ad6ba660c7d..717539eddf98 100644 --- a/tools/testing/selftests/mm/uffd-common.c +++ b/tools/testing/selftests/mm/uffd-common.c @@ -673,11 +673,7 @@ int uffd_open_dev(unsigned int flags) int uffd_open_sys(unsigned int flags) { -#ifdef __NR_userfaultfd return syscall(__NR_userfaultfd, flags); -#else - return -1; -#endif } int uffd_open(unsigned int flags) diff --git a/tools/testing/selftests/mm/uffd-stress.c b/tools/testing/selftests/mm/uffd-stress.c index f78bab0f3d45..a4b83280998a 100644 --- a/tools/testing/selftests/mm/uffd-stress.c +++ b/tools/testing/selftests/mm/uffd-stress.c @@ -33,10 +33,10 @@ * pthread_mutex_lock will also verify the atomicity of the memory * transfer (UFFDIO_COPY). */ - +#include #include "uffd-common.h" -#ifdef __NR_userfaultfd +uint64_t features; #define BOUNCE_RANDOM (1<<0) #define BOUNCE_RACINGFAULTS (1<<1) @@ -247,10 +247,14 @@ static int userfaultfd_stress(void) unsigned long nr; struct uffd_args args[nr_cpus]; uint64_t mem_size = nr_pages * page_size; + int flags = 0; memset(args, 0, sizeof(struct uffd_args) * nr_cpus); - if (uffd_test_ctx_init(UFFD_FEATURE_WP_UNPOPULATED, NULL)) + if (features & UFFD_FEATURE_WP_UNPOPULATED && test_type == TEST_ANON) + flags = UFFD_FEATURE_WP_UNPOPULATED; + + if (uffd_test_ctx_init(flags, NULL)) err("context init failed"); if (posix_memalign(&area, page_size, page_size)) @@ -385,8 +389,6 @@ static void set_test_type(const char *type) static void parse_test_type_arg(const char *raw_type) { - uint64_t features = UFFD_API_FEATURES; - set_test_type(raw_type); if (!test_type) @@ -409,12 +411,15 @@ static void parse_test_type_arg(const char *raw_type) * feature. */ - if (userfaultfd_open(&features)) - err("Userfaultfd open failed"); + if (uffd_get_features(&features)) + err("failed to get available features"); test_uffdio_wp = test_uffdio_wp && (features & UFFD_FEATURE_PAGEFAULT_FLAG_WP); + if (test_type != TEST_ANON && !(features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM)) + test_uffdio_wp = false; + close(uffd); uffd = -1; } @@ -466,15 +471,3 @@ int main(int argc, char **argv) nr_pages, nr_pages_per_cpu); return userfaultfd_stress(); } - -#else /* __NR_userfaultfd */ - -#warning "missing __NR_userfaultfd definition" - -int main(void) -{ - printf("skip: Skipping userfaultfd test (missing __NR_userfaultfd)\n"); - return KSFT_SKIP; -} - -#endif /* __NR_userfaultfd */ diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index 21ec23206ab4..b3d21eed203d 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -5,12 +5,11 @@ * Copyright (C) 2015-2023 Red Hat, Inc. */ +#include #include "uffd-common.h" #include "../../../../mm/gup_test.h" -#ifdef __NR_userfaultfd - /* The unit test doesn't need a large or random size, make it 32MB for now */ #define UFFD_TEST_MEM_SIZE (32UL << 20) @@ -1554,14 +1553,3 @@ int main(int argc, char *argv[]) return ksft_get_fail_cnt() ? KSFT_FAIL : KSFT_PASS; } -#else /* __NR_userfaultfd */ - -#warning "missing __NR_userfaultfd definition" - -int main(void) -{ - printf("Skipping %s (missing __NR_userfaultfd)\n", __file__); - return KSFT_SKIP; -} - -#endif /* __NR_userfaultfd */ diff --git a/tools/testing/selftests/mm/va_high_addr_switch.c b/tools/testing/selftests/mm/va_high_addr_switch.c index cfbc501290d3..fa7eabfaf841 100644 --- a/tools/testing/selftests/mm/va_high_addr_switch.c +++ b/tools/testing/selftests/mm/va_high_addr_switch.c @@ -9,26 +9,9 @@ #include #include +#include "vm_util.h" #include "../kselftest.h" -#ifdef __powerpc64__ -#define PAGE_SIZE (64 << 10) -/* - * This will work with 16M and 2M hugepage size - */ -#define HUGETLB_SIZE (16 << 20) -#elif __aarch64__ -/* - * The default hugepage size for 64k base pagesize - * is 512MB. - */ -#define PAGE_SIZE (64 << 10) -#define HUGETLB_SIZE (512 << 20) -#else -#define PAGE_SIZE (4 << 10) -#define HUGETLB_SIZE (2 << 20) -#endif - /* * The hint addr value is used to allocate addresses * beyond the high address switch boundary. @@ -37,18 +20,8 @@ #define ADDR_MARK_128TB (1UL << 47) #define ADDR_MARK_256TB (1UL << 48) -#define HIGH_ADDR_128TB ((void *) (1UL << 48)) -#define HIGH_ADDR_256TB ((void *) (1UL << 49)) - -#define LOW_ADDR ((void *) (1UL << 30)) - -#ifdef __aarch64__ -#define ADDR_SWITCH_HINT ADDR_MARK_256TB -#define HIGH_ADDR HIGH_ADDR_256TB -#else -#define ADDR_SWITCH_HINT ADDR_MARK_128TB -#define HIGH_ADDR HIGH_ADDR_128TB -#endif +#define HIGH_ADDR_128TB (1UL << 48) +#define HIGH_ADDR_256TB (1UL << 49) struct testcase { void *addr; @@ -59,195 +32,230 @@ struct testcase { unsigned int keep_mapped:1; }; -static struct testcase testcases[] = { - { - /* - * If stack is moved, we could possibly allocate - * this at the requested address. - */ - .addr = ((void *)(ADDR_SWITCH_HINT - PAGE_SIZE)), - .size = PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE, PAGE_SIZE)", - .low_addr_required = 1, - }, - { - /* - * Unless MAP_FIXED is specified, allocation based on hint - * addr is never at requested address or above it, which is - * beyond high address switch boundary in this case. Instead, - * a suitable allocation is found in lower address space. - */ - .addr = ((void *)(ADDR_SWITCH_HINT - PAGE_SIZE)), - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE, (2 * PAGE_SIZE))", - .low_addr_required = 1, - }, - { - /* - * Exact mapping at high address switch boundary, should - * be obtained even without MAP_FIXED as area is free. - */ - .addr = ((void *)(ADDR_SWITCH_HINT)), - .size = PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT, PAGE_SIZE)", - .keep_mapped = 1, - }, - { - .addr = (void *)(ADDR_SWITCH_HINT), - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, - .msg = "mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED)", - }, - { - .addr = NULL, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(NULL)", - .low_addr_required = 1, - }, - { - .addr = LOW_ADDR, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(LOW_ADDR)", - .low_addr_required = 1, - }, - { - .addr = HIGH_ADDR, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(HIGH_ADDR)", - .keep_mapped = 1, - }, - { - .addr = HIGH_ADDR, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(HIGH_ADDR) again", - .keep_mapped = 1, - }, - { - .addr = HIGH_ADDR, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, - .msg = "mmap(HIGH_ADDR, MAP_FIXED)", - }, - { - .addr = (void *) -1, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(-1)", - .keep_mapped = 1, - }, - { - .addr = (void *) -1, - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(-1) again", - }, - { - .addr = ((void *)(ADDR_SWITCH_HINT - PAGE_SIZE)), - .size = PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE, PAGE_SIZE)", - .low_addr_required = 1, - }, - { - .addr = (void *)(ADDR_SWITCH_HINT - PAGE_SIZE), - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE, 2 * PAGE_SIZE)", - .low_addr_required = 1, - .keep_mapped = 1, - }, - { - .addr = (void *)(ADDR_SWITCH_HINT - PAGE_SIZE / 2), - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE/2 , 2 * PAGE_SIZE)", - .low_addr_required = 1, - .keep_mapped = 1, - }, - { - .addr = ((void *)(ADDR_SWITCH_HINT)), - .size = PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT, PAGE_SIZE)", - }, - { - .addr = (void *)(ADDR_SWITCH_HINT), - .size = 2 * PAGE_SIZE, - .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, - .msg = "mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED)", - }, -}; +static struct testcase *testcases; +static struct testcase *hugetlb_testcases; +static int sz_testcases, sz_hugetlb_testcases; +static unsigned long switch_hint; -static struct testcase hugetlb_testcases[] = { - { - .addr = NULL, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(NULL, MAP_HUGETLB)", - .low_addr_required = 1, - }, - { - .addr = LOW_ADDR, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(LOW_ADDR, MAP_HUGETLB)", - .low_addr_required = 1, - }, - { - .addr = HIGH_ADDR, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(HIGH_ADDR, MAP_HUGETLB)", - .keep_mapped = 1, - }, - { - .addr = HIGH_ADDR, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(HIGH_ADDR, MAP_HUGETLB) again", - .keep_mapped = 1, - }, - { - .addr = HIGH_ADDR, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, - .msg = "mmap(HIGH_ADDR, MAP_FIXED | MAP_HUGETLB)", - }, - { - .addr = (void *) -1, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(-1, MAP_HUGETLB)", - .keep_mapped = 1, - }, - { - .addr = (void *) -1, - .size = HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(-1, MAP_HUGETLB) again", - }, - { - .addr = (void *)(ADDR_SWITCH_HINT - PAGE_SIZE), - .size = 2 * HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, - .msg = "mmap(ADDR_SWITCH_HINT - PAGE_SIZE, 2*HUGETLB_SIZE, MAP_HUGETLB)", - .low_addr_required = 1, - .keep_mapped = 1, - }, - { - .addr = (void *)(ADDR_SWITCH_HINT), - .size = 2 * HUGETLB_SIZE, - .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, - .msg = "mmap(ADDR_SWITCH_HINT , 2*HUGETLB_SIZE, MAP_FIXED | MAP_HUGETLB)", - }, -}; +/* Initialize testcases inside a function to compute parameters at runtime */ +void testcases_init(void) +{ + unsigned long pagesize = getpagesize(); + unsigned long hugepagesize = default_huge_page_size(); + unsigned long low_addr = (1UL << 30); + unsigned long addr_switch_hint = ADDR_MARK_128TB; + unsigned long high_addr = HIGH_ADDR_128TB; + +#ifdef __aarch64__ + + /* Post LPA2, the lower userspace VA on a 16K pagesize is 47 bits. */ + if (pagesize != (16UL << 10)) { + addr_switch_hint = ADDR_MARK_256TB; + high_addr = HIGH_ADDR_256TB; + } +#endif + + struct testcase t[] = { + { + /* + * If stack is moved, we could possibly allocate + * this at the requested address. + */ + .addr = ((void *)(addr_switch_hint - pagesize)), + .size = pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize, pagesize)", + .low_addr_required = 1, + }, + { + /* + * Unless MAP_FIXED is specified, allocation based on hint + * addr is never at requested address or above it, which is + * beyond high address switch boundary in this case. Instead, + * a suitable allocation is found in lower address space. + */ + .addr = ((void *)(addr_switch_hint - pagesize)), + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize, (2 * pagesize))", + .low_addr_required = 1, + }, + { + /* + * Exact mapping at high address switch boundary, should + * be obtained even without MAP_FIXED as area is free. + */ + .addr = ((void *)(addr_switch_hint)), + .size = pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint, pagesize)", + .keep_mapped = 1, + }, + { + .addr = (void *)(addr_switch_hint), + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, + .msg = "mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED)", + }, + { + .addr = NULL, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(NULL)", + .low_addr_required = 1, + }, + { + .addr = (void *)low_addr, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(low_addr)", + .low_addr_required = 1, + }, + { + .addr = (void *)high_addr, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(high_addr)", + .keep_mapped = 1, + }, + { + .addr = (void *)high_addr, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(high_addr) again", + .keep_mapped = 1, + }, + { + .addr = (void *)high_addr, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, + .msg = "mmap(high_addr, MAP_FIXED)", + }, + { + .addr = (void *) -1, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(-1)", + .keep_mapped = 1, + }, + { + .addr = (void *) -1, + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(-1) again", + }, + { + .addr = ((void *)(addr_switch_hint - pagesize)), + .size = pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize, pagesize)", + .low_addr_required = 1, + }, + { + .addr = (void *)(addr_switch_hint - pagesize), + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize, 2 * pagesize)", + .low_addr_required = 1, + .keep_mapped = 1, + }, + { + .addr = (void *)(addr_switch_hint - pagesize / 2), + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize/2 , 2 * pagesize)", + .low_addr_required = 1, + .keep_mapped = 1, + }, + { + .addr = ((void *)(addr_switch_hint)), + .size = pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint, pagesize)", + }, + { + .addr = (void *)(addr_switch_hint), + .size = 2 * pagesize, + .flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, + .msg = "mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED)", + }, + }; + + struct testcase ht[] = { + { + .addr = NULL, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(NULL, MAP_HUGETLB)", + .low_addr_required = 1, + }, + { + .addr = (void *)low_addr, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(low_addr, MAP_HUGETLB)", + .low_addr_required = 1, + }, + { + .addr = (void *)high_addr, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(high_addr, MAP_HUGETLB)", + .keep_mapped = 1, + }, + { + .addr = (void *)high_addr, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(high_addr, MAP_HUGETLB) again", + .keep_mapped = 1, + }, + { + .addr = (void *)high_addr, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, + .msg = "mmap(high_addr, MAP_FIXED | MAP_HUGETLB)", + }, + { + .addr = (void *) -1, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(-1, MAP_HUGETLB)", + .keep_mapped = 1, + }, + { + .addr = (void *) -1, + .size = hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(-1, MAP_HUGETLB) again", + }, + { + .addr = (void *)(addr_switch_hint - pagesize), + .size = 2 * hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, + .msg = "mmap(addr_switch_hint - pagesize, 2*hugepagesize, MAP_HUGETLB)", + .low_addr_required = 1, + .keep_mapped = 1, + }, + { + .addr = (void *)(addr_switch_hint), + .size = 2 * hugepagesize, + .flags = MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, + .msg = "mmap(addr_switch_hint , 2*hugepagesize, MAP_FIXED | MAP_HUGETLB)", + }, + }; + + testcases = malloc(sizeof(t)); + hugetlb_testcases = malloc(sizeof(ht)); + + /* Copy into global arrays */ + memcpy(testcases, t, sizeof(t)); + memcpy(hugetlb_testcases, ht, sizeof(ht)); + + sz_testcases = ARRAY_SIZE(t); + sz_hugetlb_testcases = ARRAY_SIZE(ht); + switch_hint = addr_switch_hint; +} static int run_test(struct testcase *test, int count) { @@ -267,7 +275,7 @@ static int run_test(struct testcase *test, int count) continue; } - if (t->low_addr_required && p >= (void *)(ADDR_SWITCH_HINT)) { + if (t->low_addr_required && p >= (void *)(switch_hint)) { printf("FAILED\n"); ret = KSFT_FAIL; } else { @@ -292,7 +300,7 @@ static int supported_arch(void) #elif defined(__x86_64__) return 1; #elif defined(__aarch64__) - return getpagesize() == PAGE_SIZE; + return 1; #else return 0; #endif @@ -305,8 +313,10 @@ int main(int argc, char **argv) if (!supported_arch()) return KSFT_SKIP; - ret = run_test(testcases, ARRAY_SIZE(testcases)); + testcases_init(); + + ret = run_test(testcases, sz_testcases); if (argc == 2 && !strcmp(argv[1], "--run-hugetlb")) - ret = run_test(hugetlb_testcases, ARRAY_SIZE(hugetlb_testcases)); + ret = run_test(hugetlb_testcases, sz_hugetlb_testcases); return ret; } diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh b/tools/testing/selftests/mm/va_high_addr_switch.sh index a0a75f302904..2c725773cd79 100755 --- a/tools/testing/selftests/mm/va_high_addr_switch.sh +++ b/tools/testing/selftests/mm/va_high_addr_switch.sh @@ -57,8 +57,4 @@ check_test_requirements() } check_test_requirements -./va_high_addr_switch - -# In order to run hugetlb testcases, "--run-hugetlb" must be appended -# to the binary. ./va_high_addr_switch --run-hugetlb diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index bc3925200637..8eaffd7a641c 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 # Makefile for net selftests -CFLAGS = -Wall -Wl,--no-as-needed -O2 -g +CFLAGS += -Wall -Wl,--no-as-needed -O2 -g CFLAGS += -I../../../../usr/include/ $(KHDR_INCLUDES) # Additional include paths needed by kselftest.h CFLAGS += -I../ diff --git a/tools/testing/selftests/net/tcp_ao/Makefile b/tools/testing/selftests/net/tcp_ao/Makefile index 522d991e310e..bd88b90b902b 100644 --- a/tools/testing/selftests/net/tcp_ao/Makefile +++ b/tools/testing/selftests/net/tcp_ao/Makefile @@ -26,7 +26,7 @@ LIB := $(LIBDIR)/libaotst.a LDLIBS += $(LIB) -pthread LIBDEPS := lib/aolib.h Makefile -CFLAGS := -Wall -O2 -g -D_GNU_SOURCE -fno-strict-aliasing +CFLAGS += -Wall -O2 -g -fno-strict-aliasing CFLAGS += $(KHDR_INCLUDES) CFLAGS += -iquote ./lib/ -I ../../../../include/ diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile index cd95369254c0..4d57f2a87510 100644 --- a/tools/testing/selftests/proc/Makefile +++ b/tools/testing/selftests/proc/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only CFLAGS += -Wall -O2 -Wno-unused-function -CFLAGS += -D_GNU_SOURCE +CFLAGS += $(TOOLS_INCLUDES) LDFLAGS += -pthread TEST_GEN_PROGS := diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index cacbd2a4aec9..d04685771952 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -45,6 +45,7 @@ #include #include #include +#include #include "../kselftest.h" @@ -492,6 +493,91 @@ int main(void) assert(buf[13] == '\n'); } + /* Test PROCMAP_QUERY ioctl() for /proc/$PID/maps */ + { + char path_buf[256], exp_path_buf[256]; + struct procmap_query q; + int fd, err; + + snprintf(path_buf, sizeof(path_buf), "/proc/%u/maps", pid); + fd = open(path_buf, O_RDONLY); + if (fd == -1) + return 1; + + /* CASE 1: exact MATCH at VADDR */ + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_addr = VADDR; + q.query_flags = 0; + q.vma_name_addr = (__u64)(unsigned long)path_buf; + q.vma_name_size = sizeof(path_buf); + + err = ioctl(fd, PROCMAP_QUERY, &q); + assert(err == 0); + + assert(q.query_addr == VADDR); + assert(q.query_flags == 0); + + assert(q.vma_flags == (PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_EXECUTABLE)); + assert(q.vma_start == VADDR); + assert(q.vma_end == VADDR + PAGE_SIZE); + assert(q.vma_page_size == PAGE_SIZE); + + assert(q.vma_offset == 0); + assert(q.inode == st.st_ino); + assert(q.dev_major == MAJOR(st.st_dev)); + assert(q.dev_minor == MINOR(st.st_dev)); + + snprintf(exp_path_buf, sizeof(exp_path_buf), + "/tmp/#%llu (deleted)", (unsigned long long)st.st_ino); + assert(q.vma_name_size == strlen(exp_path_buf) + 1); + assert(strcmp(path_buf, exp_path_buf) == 0); + + /* CASE 2: NO MATCH at VADDR-1 */ + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_addr = VADDR - 1; + q.query_flags = 0; /* exact match */ + + err = ioctl(fd, PROCMAP_QUERY, &q); + err = err < 0 ? -errno : 0; + assert(err == -ENOENT); + + /* CASE 3: MATCH COVERING_OR_NEXT_VMA at VADDR - 1 */ + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_addr = VADDR - 1; + q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA; + + err = ioctl(fd, PROCMAP_QUERY, &q); + assert(err == 0); + + assert(q.query_addr == VADDR - 1); + assert(q.query_flags == PROCMAP_QUERY_COVERING_OR_NEXT_VMA); + assert(q.vma_start == VADDR); + assert(q.vma_end == VADDR + PAGE_SIZE); + + /* CASE 4: NO MATCH at VADDR + PAGE_SIZE */ + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_addr = VADDR + PAGE_SIZE; /* point right after the VMA */ + q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA; + + err = ioctl(fd, PROCMAP_QUERY, &q); + err = err < 0 ? -errno : 0; + assert(err == -ENOENT); + + /* CASE 5: NO MATCH WRITABLE at VADDR */ + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + q.query_addr = VADDR; + q.query_flags = PROCMAP_QUERY_VMA_WRITABLE; + + err = ioctl(fd, PROCMAP_QUERY, &q); + err = err < 0 ? -errno : 0; + assert(err == -ENOENT); + } + return 0; } #else diff --git a/tools/testing/selftests/resctrl/Makefile b/tools/testing/selftests/resctrl/Makefile index 021863f86053..f408bd6bfc3d 100644 --- a/tools/testing/selftests/resctrl/Makefile +++ b/tools/testing/selftests/resctrl/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 -CFLAGS = -g -Wall -O2 -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE +CFLAGS = -g -Wall -O2 -D_FORTIFY_SOURCE=2 CFLAGS += $(KHDR_INCLUDES) TEST_GEN_PROGS := resctrl_tests diff --git a/tools/testing/selftests/ring-buffer/Makefile b/tools/testing/selftests/ring-buffer/Makefile index 627c5fa6d1ab..23605782639e 100644 --- a/tools/testing/selftests/ring-buffer/Makefile +++ b/tools/testing/selftests/ring-buffer/Makefile @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS += -Wl,-no-as-needed -Wall CFLAGS += $(KHDR_INCLUDES) -CFLAGS += -D_GNU_SOURCE TEST_GEN_PROGS = map_test diff --git a/tools/testing/selftests/riscv/mm/Makefile b/tools/testing/selftests/riscv/mm/Makefile index c333263f2b27..4664ed79e20b 100644 --- a/tools/testing/selftests/riscv/mm/Makefile +++ b/tools/testing/selftests/riscv/mm/Makefile @@ -3,7 +3,7 @@ # Originally tools/testing/arm64/abi/Makefile # Additional include paths needed by kselftest.h and local headers -CFLAGS += -D_GNU_SOURCE -std=gnu99 -I. +CFLAGS += -std=gnu99 -I. TEST_GEN_FILES := mmap_default mmap_bottomup diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile index 867f88ce2570..03b5e13b872b 100644 --- a/tools/testing/selftests/sgx/Makefile +++ b/tools/testing/selftests/sgx/Makefile @@ -12,7 +12,7 @@ OBJCOPY := $(CROSS_COMPILE)objcopy endif INCLUDES := -I$(top_srcdir)/tools/include -HOST_CFLAGS := -Wall -Werror -g $(INCLUDES) -fPIC +HOST_CFLAGS := -Wall -Werror -g $(INCLUDES) -fPIC $(CFLAGS) HOST_LDFLAGS := -z noexecstack -lcrypto ENCL_CFLAGS += -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE \ -fno-stack-protector -mrdrnd $(INCLUDES) diff --git a/tools/testing/selftests/tmpfs/Makefile b/tools/testing/selftests/tmpfs/Makefile index aa11ccc92e5b..3be931e1193f 100644 --- a/tools/testing/selftests/tmpfs/Makefile +++ b/tools/testing/selftests/tmpfs/Makefile @@ -1,6 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only CFLAGS += -Wall -O2 -CFLAGS += -D_GNU_SOURCE TEST_GEN_PROGS := TEST_GEN_PROGS += bug-link-o-tmpfile