mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-15 00:04:15 +08:00
1ec6d09789
1304717 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Barry Song
|
9d57090e73 |
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
Patch series "mm: enable large folios swap-in support", v9.
Currently, we support mTHP swapout but not swapin. This means that once
mTHP is swapped out, it will come back as small folios when swapped in.
This is particularly detrimental for devices like Android, where more than
half of the memory is in swap.
The lack of mTHP swapin functionality makes mTHP a showstopper in
scenarios that heavily rely on swap. This patchset introduces mTHP
swap-in support. It starts with synchronous devices similar to zRAM,
aiming to benefit as many users as possible with minimal changes.
This patch (of 3):
There could be a corner case where the first entry is non-zeromap, but a
subsequent entry is zeromap. In this case, we should not let
swap_read_folio_zeromap() return false since we will still read corrupted
data.
Additionally, the iteration of test_bit() is unnecessary and can be
replaced with bitmap operations, which are more efficient.
We can adopt the style of swap_pte_batch() and folio_pte_batch() to
introduce swap_zeromap_batch() which seems to provide the greatest
flexibility for the caller. This approach allows the caller to either
check if the zeromap status of all entries is consistent or determine the
number of contiguous entries with the same status.
Since swap_read_folio() can't handle reading a large folio that's
partially zeromap and partially non-zeromap, we've moved the code to
mm/swap.h so that others, like those working on swap-in, can access it.
Link: https://lkml.kernel.org/r/20240908232119.2157-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240908232119.2157-2-21cnbao@gmail.com
Fixes:
|
||
Anshuman Khandual
|
a0c9fd22e3 |
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
This replaces all the existing READ_ONCE() based page table accesses with respective pxdp_get() helpers. Although these helpers might also fallback to READ_ONCE() as default, but they do provide an opportunity for various platforms to override when required. This change is a step in direction to replace all page table entry accesses with respective pxdp_get() helpers. Link: https://lkml.kernel.org/r/20240910115746.514454-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Christophe Leroy
|
82ce8e2f31 |
set_memory: add __must_check to generic stubs
Following query shows that architectures that don't provide asm/set_memory.h don't use set_memory_...() functions. $ git grep set_memory_ alpha arc csky hexagon loongarch m68k microblaze mips nios2 openrisc parisc sh sparc um xtensa Following query shows that all core users of set_memory_...() functions always take returned value into account: $ git grep -w -e set_memory_ro -e set_memory_rw -e set_memory_x -e set_memory_nx -e set_memory_rox `find . -maxdepth 1 -type d | grep -v arch | grep /` set_memory_...() functions can fail, leaving the memory attributes unchanged. Make sure all callers check the returned code. Link: https://github.com/KSPP/linux/issues/7 Link: https://lkml.kernel.org/r/6a89ffc69666de84721216947c6b6c7dcca39d7d.1725723347.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Xiao Yang
|
659c55ef98 |
mm/vma: return the exact errno in vms_gather_munmap_vmas()
__split_vma() and mas_store_gfp() returns several types of errno on
failure so don't ignore them in vms_gather_munmap_vmas(). For example,
__split_vma() returns -EINVAL when an unaligned huge page is unmapped.
This issue is reproduced by ltp memfd_create03 test.
Don't initialise the error variable and assign it when a failure actually
occurs.
[akpm@linux-foundation.org: fix whitespace, per Liam]
Link: https://lkml.kernel.org/r/20240909125621.1994-1-ice_yangxiao@163.com
Fixes:
|
||
Michal Koutný
|
f2c5101be4 |
memcg: cleanup with !CONFIG_MEMCG_V1
Extern declarations have no definitions with !CONFIG_MEMCG_V1 and no users, drop them altogether. Link: https://lkml.kernel.org/r/20240909163223.3693529-1-mkoutny@suse.com Link: https://lkml.kernel.org/r/20240909163223.3693529-2-mkoutny@suse.com Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Tejun Heo <tj@kernel.org> Cc: Chen Ridong <chenridong@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kent Overstreet
|
fd00be9afa |
mm/show_mem.c: report alloc tags in human readable units
We already do this when reporting slab info - more consistent and more readable. Link: https://lkml.kernel.org/r/20240906005337.1220091-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
658be46520 |
mm: support poison recovery from copy_present_page()
Similar to other poison recovery, use copy_mc_user_highpage() to avoid potentially kernel panic during copy page in copy_present_page() from fork, once copy failed due to hwpoison in source page, we need to break out of copy in copy_pte_range() and release prealloc folio, so copy_mc_user_highpage() is moved ahead before set *prealloc to NULL. Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
aa549f923f |
mm: support poison recovery from do_cow_fault()
Patch series "mm: hwpoison: two more poison recovery".
One more CoW path to support poison recorvery in do_cow_fault(), and the
last copy_user_highpage() user is replaced to copy_mc_user_highpage() from
copy_present_page() during fork to support poison recorvery too.
This patch (of 2):
Like commit
|
||
Huang Ying
|
99185c10d5 |
resource, kunit: add test case for region_intersects()
Patch series "resource: Fix region_intersects() vs add_memory_driver_managed()", v3. The patchset fixes a bug of region_intersects() for systems with CXL memory. The details of the bug can be found in [1/3]. To avoid similar bugs in the future. A kunit test case for region_intersects() is added in [3/3]. [2/3] is a preparation patch for [3/3]. This patch (of 3): region_intersects() is important because it's used for /dev/mem permission checking. To avoid possible bug of region_intersects() in the future, a kunit test case for region_intersects() is added. Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
bacf9c3cbb |
resource: make alloc_free_mem_region() works for iomem_resource
During developing a kunit test case for region_intersects(), some fake resources need to be inserted into iomem_resource. To do that, a resource hole needs to be found first in iomem_resource. However, alloc_free_mem_region() cannot work for iomem_resource now. Because the start address to check cannot be 0 to detect address wrapping 0 in gfr_continue(), while iomem_resource.start == 0. To make alloc_free_mem_region() works for iomem_resource, gfr_start() is changed to avoid to return 0 even if base->start == 0. We don't need to check 0 as start address. Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yosry Ahmed
|
7a2369b74a |
mm: z3fold: deprecate CONFIG_Z3FOLD
The z3fold compressed pages allocator is rarely used, most users use zsmalloc. The only disadvantage of zsmalloc in comparison is the dependency on MMU, and zbud is a more common option for !MMU as it was the default zswap allocator for a long time. Historically, zsmalloc had worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. That analysis showed that z3fold does not have any advantage over zsmalloc or zbud considering both performance and memory usage. In a kernel build test on tmpfs in a limited cgroup, z3fold took 3% more time and used 1.8% more memory. The latency of zswap_load() was 7% higher, and that of zswap_store() was 10% higher. Zsmalloc is better in all metrics. Moreover, z3fold apparently has latent bugs, which was made noticeable by a recent soft lockup bug report with z3fold [2]. Switching to zsmalloc not only fixed the problem, but also reduced the swap usage from 6~8G to 1~2G. Other users have also reported being bitten by mistakenly enabling z3fold. Other than hurting users, z3fold is repeatedly causing wasted engineering effort. Apart from investigating the above bug, it came up in multiple development discussions (e.g. [3]) as something we need to handle, when there aren't any legit users (at least not intentionally). The natural course of action is to deprecate z3fold, and remove in a few cycles if no objections are raised from active users. Next on the list should be zbud, as it offers marginal latency gains at the cost of huge memory waste when compared to zsmalloc. That one will need to wait until zsmalloc does not depend on MMU. Rename the user-visible config option from CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new prompt with explanation during make oldconfig. Also, remove CONFIG_Z3FOLD=y from defconfigs. [1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/EF0ABD3E-A239-4111-A8AB-5C442E759CF3@gmail.com/ [3]https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/ [arnd@arndb.de: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well] Link: https://lkml.kernel.org/r/20240909202625.1054880-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240904233343.933462-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vitaly Wool <vitaly.wool@konsulko.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Alex Williamson
|
f9e54c3a2f |
vfio/pci: implement huge_fault support
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
take advantage of PMD and PUD faults to PCI BAR mmaps and create more
efficient mappings. PCI BARs are always a power of two and will typically
get at least PMD alignment without userspace even trying. Userspace
alignment for PUD mappings is also not too difficult.
Consolidate faults through a single handler with a new wrapper for
standard single page faults. The pre-faulting behavior of commit
|
||
Peter Xu
|
3e509c9b03 |
mm/arm64: support large pfn mappings
Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on pmds/puds. Provide the pmd/pud helpers to set/get special bit. There's one more thing missing for arm64 which is the pxx_pgprot() for pmd/pud. Add them too, which is mostly the same as the pte version by dropping the pfn field. These helpers are essential to be used in the new follow_pfnmap*() API to report valid pgprot_t results. Note that arm64 doesn't yet support huge PUD yet, but it's still straightforward to provide the pud helpers that we need altogether. Only PMD helpers will make an immediate benefit until arm64 will support huge PUDs first in general (e.g. in THPs). Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
75182022a0 |
mm/x86: support large pfn mappings
Helpers to install and detect special pmd/pud entries. In short, bit 9 on x86 is not used for pmd/pud, so we can directly define them the same as the pte level. One note is that it's also used in _PAGE_BIT_CPA_TEST but that is only used in the debug test, and shouldn't conflict in this case. One note is that pxx_set|clear_flags() for pmd/pud will need to be moved upper so that they can be referenced by the new special bit helpers. There's no change in the code that was moved. Link: https://lkml.kernel.org/r/20240826204353.2228736-18-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
b0a1c0d0ed |
mm: remove follow_pte()
follow_pte() users have been converted to follow_pfnmap*(). Remove the API. Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
b17269a51c |
mm/access_process_vm: use the new follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
e6bc784c24 |
acrn: use the new follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-15-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
a77f9489f1 |
vfio: use the new follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
cbea8536d9 |
mm/x86/pat: use the new follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
bd8c2d18bf |
s390/pci_mmio: use follow_pfnmap API
Use the new API that can understand huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
5731aacd54 |
KVM: use follow_pfnmap API
Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest work is done perfectly on the other side (host_pfn_mapping_level()). Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
6da8e9634b |
mm: new follow_pfnmap API
Introduce a pair of APIs to follow pfn mappings to get entry information. It's very similar to what follow_pte() does before, but different in that it recognizes huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
0515e022e1 |
mm: always define pxx_pgprot()
There're: - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that support pte_pgprot(). - 2 archs (x86, sparc) that support pmd_pgprot(). - 1 arch (x86) that support pud_pgprot(). Always define them to be used in generic code, and then we don't need to fiddle with "#ifdef"s when doing so. Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
bc02afbd4d |
mm/fork: accept huge pfnmap entries
Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is much easier, the write bit needs to be persisted though for writable and shared pud mappings like PFNMAP ones, otherwise a follow up write in either parent or child process will trigger a write fault. Do the same for pmd level. Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
10d83d7781 |
mm/pagewalk: check pfnmap for folio_walk_start()
Teach folio_walk_start() to recognize special pmd/pud mappings, and fail them properly as it means there's no folio backing them. [peterx@redhat.com: remove some stale comments, per David] Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
ae3c99e650 |
mm/gup: detect huge pfnmap entries in gup-fast
Since gup-fast doesn't have the vma reference, teach it to detect such huge pfnmaps by checking the special bit for pmd/pud too, just like ptes. Link: https://lkml.kernel.org/r/20240826204353.2228736-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
5dd40721f1 |
mm: allow THP orders for PFNMAPs
This enables PFNMAPs to be mapped at either pmd/pud layers. Generalize the dax case into vma_is_special_huge() so as to cover both. Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL. Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
3c8e44c9b3 |
mm: mark special bits for huge pfn mappings when inject
We need these special bits to be around on pfnmaps. Mark properly for !devmap case, reflecting that there's no page struct backing the entry. Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
ef713ec3a5 |
mm: drop is_huge_zero_pud()
It constantly returns false since 2017. One assertion is added in 2019 but it should never have triggered, IOW it means what is checked should be asserted instead. If it didn't exist for 7 years maybe it's good idea to remove it and only add it when it comes. Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
6857be5fec |
mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
Patch series "mm: Support huge pfnmaps", v2. Overview ======== This series implements huge pfnmaps support for mm in general. Huge pfnmap allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow as large as 8GB or even bigger. Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last patch (from Alex Williamson) will be the first user of huge pfnmap, so as to enable vfio-pci driver to fault in huge pfn mappings. Implementation ============== In reality, it's relatively simple to add such support comparing to many other types of mappings, because of PFNMAP's specialties when there's no vmemmap backing it, so that most of the kernel routines on huge mappings should simply already fail for them, like GUPs or old-school follow_page() (which is recently rewritten to be folio_walk* APIs by David). One trick here is that we're still unmature on PUDs in generic paths here and there, as DAX is so far the only user. This patchset will add the 2nd user of it. Hugetlb can be a 3rd user if the hugetlb unification work can go on smoothly, but to be discussed later. The other trick is how to allow gup-fast working for such huge mappings even if there's no direct sign of knowing whether it's a normal page or MMIO mapping. This series chose to keep the pte_special solution, so that it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that gup-fast will be able to identify them and fail properly. Along the way, we'll also notice that the major pgtable pfn walker, aka, follow_pte(), will need to retire soon due to the fact that it only works with ptes. A new set of simple API is introduced (follow_pfnmap* API) to be able to do whatever follow_pte() can already do, plus that it can also process huge pfnmaps now. Half of this series is about that and converting all existing pfnmap walkers to use the new API properly. Hopefully the new API also looks better to avoid exposing e.g. pgtable lock details into the callers, so that it can be used in an even more straightforward way. Here, three more options will be introduced and involved in huge pfnmap: - ARCH_SUPPORTS_HUGE_PFNMAP Arch developers will need to select this option when huge pfnmap is supported in arch's Kconfig. After this patchset applied, both x86_64 and arm64 will start to enable it by default. - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP These options are for driver developers to identify whether current arch / config supports huge pfnmaps, making decision on whether it can use the huge pfnmap APIs to inject them. One can refer to the last vfio-pci patch from Alex on the use of them properly in a device driver. So after the whole set applied, and if one would enable some dynamic debug lines in vfio-pci core files, we should observe things like: vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100 vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100 vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100 In this specific case, it says that vfio-pci faults in PMDs properly for a few BAR0 offsets. Patch Layout ============ Patch 1: Introduce the new options mentioned above for huge PFNMAPs Patch 2: A tiny cleanup Patch 3-8: Preparation patches for huge pfnmap (include introduce special bit for pmd/pud) Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and then drop follow_pte() API Patch 17: Add huge pfnmap support for x86_64 Patch 18: Add huge pfnmap support for arm64 Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex) TODO ==== More architectures / More page sizes ------------------------------------ Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems to have plan to support arm64 1G later on top of this series [2]. Any arch will need to first support THP / THP_1G, then provide a special bit in pmds/puds to support huge pfnmaps. remap_pfn_range() support ------------------------- Currently, remap_pfn_range() still only maps PTEs. With the new option, remap_pfn_range() can logically start to inject either PMDs or PUDs when the alignment requirements match on the VAs. When the support is there, it should be able to silently benefit all drivers that is using remap_pfn_range() in its mmap() handler on better TLB hit rate and overall faster MMIO accesses similar to processor on hugepages. More driver support ------------------- VFIO is so far the only consumer for the huge pfnmaps after this series applied. Besides above remap_pfn_range() generic optimization, device driver can also try to optimize its mmap() on a better VA alignment for either PMD/PUD sizes. This may, iiuc, normally require userspace changes, as the driver doesn't normally decide the VA to map a bar. But I don't think I know all the drivers to know the full picture. Credits all go to Alex on help testing the GPU/NIC use cases above. [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com This patch (of 19): This patch introduces the option to introduce special pte bit into pmd/puds. Archs can start to define pmd_special / pud_special when supported by selecting the new option. Per-arch support will be added later. Before that, create fallbacks for these helpers so that they are always available. Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
22af8caff7 |
mm/madvise: process_madvise() drop capability check if same mm
In commit |
||
Miaohe Lin
|
2a1b8648d9 |
mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set
Ensure huge_zero_folio won't have large_rmappable flag set. So it can be
reported as thp,zero correctly through stable_page_flags().
Link: https://lkml.kernel.org/r/20240914015306.3656791-1-linmiaohe@huawei.com
Fixes:
|
||
Vishal Moola (Oracle)
|
98b74bb4d7 |
mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway
Syzbot reports a UAF in hugetlb_fault(). This happens because
vmf_anon_prepare() could drop the per-VMA lock and allow the current VMA
to be freed before hugetlb_vma_unlock_read() is called.
We can fix this by using a modified version of vmf_anon_prepare() that
doesn't release the VMA lock on failure, and then release it ourselves
after hugetlb_vma_unlock_read().
Link: https://lkml.kernel.org/r/20240914194243.245-2-vishal.moola@gmail.com
Fixes:
|
||
Vishal Moola (Oracle)
|
2a058ab328 |
mm: change vmf_anon_prepare() to __vmf_anon_prepare()
Some callers of vmf_anon_prepare() may not want us to release the per-VMA
lock ourselves. Rename vmf_anon_prepare() to __vmf_anon_prepare() and let
the callers drop the lock when desired.
Also, make vmf_anon_prepare() a wrapper that releases the per-VMA lock
itself for any callers that don't care.
This is in preparation to fix this bug reported by syzbot:
https://lore.kernel.org/linux-mm/00000000000067c20b06219fbc26@google.com/
Link: https://lkml.kernel.org/r/20240914194243.245-1-vishal.moola@gmail.com
Fixes:
|
||
Huang Ying
|
b4afe4183e |
resource: fix region_intersects() vs add_memory_driver_managed()
On a system with CXL memory, the resource tree (/proc/iomem) related to
CXL memory may look like something as follows.
490000000-50fffffff : CXL Window 0
490000000-50fffffff : region0
490000000-50fffffff : dax0.0
490000000-50fffffff : System RAM (kmem)
Because drivers/dax/kmem.c calls add_memory_driver_managed() during
onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
Window X". This confuses region_intersects(), which expects all "System
RAM" resources to be at the top level of iomem_resource. This can lead to
bugs.
For example, when the following command line is executed to write some
memory in CXL memory range via /dev/mem,
$ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
dd: error writing '/dev/mem': Bad address
1+0 records in
0+0 records out
0 bytes copied, 0.0283507 s, 0.0 kB/s
the command fails as expected. However, the error code is wrong. It
should be "Operation not permitted" instead of "Bad address". More
seriously, the /dev/mem permission checking in devmem_is_allowed() passes
incorrectly. Although the accessing is prevented later because ioremap()
isn't allowed to map system RAM, it is a potential security issue. During
command executing, the following warning is reported in the kernel log for
calling ioremap() on system RAM.
ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
Call Trace:
memremap+0xcb/0x184
xlate_dev_mem_ptr+0x25/0x2f
write_mem+0x94/0xfb
vfs_write+0x128/0x26d
ksys_write+0xac/0xfe
do_syscall_64+0x9a/0xfd
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The details of command execution process are as follows. In the above
resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
top level resource. So, region_intersects() will report no System RAM
resources in the CXL memory region incorrectly, because it only checks the
top level resources. Consequently, devmem_is_allowed() will return 1
(allow access via /dev/mem) for CXL memory region incorrectly.
Fortunately, ioremap() doesn't allow to map System RAM and reject the
access.
So, region_intersects() needs to be fixed to work correctly with the
resource tree with "System RAM" not at top level as above. To fix it, if
we found a unmatched resource in the top level, we will continue to search
matched resources in its descendant resources. So, we will not miss any
matched resources in resource tree anymore.
In the new implementation, an example resource tree
|------------- "CXL Window 0" ------------|
|-- "System RAM" --|
will behave similar as the following fake resource tree for
region_intersects(, IORESOURCE_SYSTEM_RAM, ),
|-- "System RAM" --||-- "CXL Window 0a" --|
Where "CXL Window 0a" is part of the original "CXL Window 0" that
isn't covered by "System RAM".
Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com
Fixes:
|
||
Sergey Senozhatsky
|
6040f650c5 |
zsmalloc: use unique zsmalloc caches names
Each zsmalloc pool maintains several named kmem-caches for zs_handle-s and
zspage-s. On a system with multiple zsmalloc pools and CONFIG_DEBUG_VM
this triggers kmem_cache_sanity_check():
kmem_cache of name 'zspage' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
kmem_cache of name 'zs_handle' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
We provide zram device name when init its zsmalloc pool, so we can use
that same name for zsmalloc caches and, hence, create unique names that
can easily be linked to zram device that has created them.
So instead of having this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage 46 46 ...
zs_handle 128 128 ...
zspage 34270 34270 ...
zs_handle 34816 34816 ...
zspage 0 0 ...
zs_handle 0 0 ...
We now have this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage-zram2 46 46 ...
zs_handle-zram2 128 128 ...
zspage-zram0 34270 34270 ...
zs_handle-zram0 34816 34816 ...
zspage-zram1 0 0 ...
zs_handle-zram1 0 0 ...
Link: https://lkml.kernel.org/r/20240906035103.2435557-1-senozhatsky@chromium.org
Fixes:
|
||
Juergen Gross
|
c3dea3d54f |
xen/swiotlb: fix allocated size
The allocated size in xen_swiotlb_alloc_coherent() and
xen_swiotlb_free_coherent() is calculated wrong for the case of
XEN_PAGE_SIZE not matching PAGE_SIZE. Fix that.
Fixes:
|
||
Juergen Gross
|
9f40ec84a7 |
xen/swiotlb: add alignment check for dma buffers
When checking a memory buffer to be consecutive in machine memory,
the alignment needs to be checked, too. Failing to do so might result
in DMA memory not being aligned according to its requested size,
leading to error messages like:
4xxx 0000:2b:00.0: enabling device (0140 -> 0142)
4xxx 0000:2b:00.0: Ring address not aligned
4xxx 0000:2b:00.0: Failed to initialise service qat_crypto
4xxx 0000:2b:00.0: Resetting device qat_dev0
4xxx: probe of 0000:2b:00.0 failed with error -14
Fixes:
|
||
Linus Torvalds
|
c903327d32 |
printk changes for 6.12
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbi598ACgkQUqAMR0iA lPL3lw//WaRDKJ1Cb/bKAn3nRpjdqiNBI//K1gRJp0LgLE7qEudE25t4j3F9tvvP pc9AB81g1Au8Br6iOd+NiGXXW5KWJHaZ3rUAdeo6co4NQCbrY6qTA78ItZSQImBH A9fhWWr1TGRX8L/N/gR2eYBnpbDGIbRahUOQraUpBn4kEPyR47KEx7Njjo48GcmR Ye8dIYwUOWEgQeIuIxIAwNf6KyNjo5tQpgve+M8HGwy8mZqP9XV6UjXUACVwQNx6 +CK+IGM+94tCq5KalOaJ5BtsXGKlabHIs7y9QpLS45M2QoHIlDIvpaxzLf0FTsPI CppqedAGN2jU0NyjfbFk1c+SNQfDZEAZVyF6vKFelP7t2jzAx301RyB2S+Cm7Hh+ PajFty41UT0/y17V4sZawfMqpFyp7Wr6RKQYYKMBRdSQQkToh/dmebBvqPAHW9cJ LInQQf+XdzbonKa+CTmT/Tg+eM2R124FWeMVnEMdtyXpKUV9qdKWfngtzyRMQiYI q54ZwKd3VJ9kRIfb7Fp0TBr2NErdnEQE5hh9QhI8SAWENskw5+GmYaQit734U9wA SU7t9rir7NS4Rc1jHP9SQ9oWWI9HT4hthRGkLh2Knx0O2c6AwOuEI4wkjzSWI3GX /eeofnbZiUpi7fESf9qmTGtQZ4/9ogQ7fNaroWCSfQzq3+wl+2o= =28sV -----END PGP SIGNATURE----- Merge tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: "This is the "last" part of the support for the new nbcon consoles. Where "nbcon" stays for "No Big console lock CONsoles" aka not under the console_lock. New callbacks are added to struct console: - write_thread() for flushing nbcon consoles in task context. - write_atomic() for flushing nbcon consoles in atomic context, including NMI. - con->device_lock() and device_unlock() for taking the driver specific lock, for example, port->lock. New printk-specific kthreads are created: - per-console kthreads which get responsible for flushing normal priority messages on nbcon consoles. - thread which gets responsible for flushing normal priority messages on all consoles when CONFIG_RT enabled. The new callbacks are called under a special per-console lock which has already been added back in v6.7. It allows to distinguish three severities: normal, emergency, and panic. A context with a higher priority could take over the ownership when it is safe even in the middle of handling a record. The panic context could do it even when it is not safe. But it is allowed only for the final desperate flush before entering the infinite loop. The new lock helps to flush the messages directly in emergency and panic contexts. But it is not enough in all situations: - console_lock() is still need for synchronization against boot consoles. - con->device_lock() is need for synchronization against other operations on the same HW, e.g. serial port speed setting, non-printk related read/write. The dependency on con->device_lock() is mutual. Any code taking the driver specific lock has to acquire the related nbcon console context as well. For example, see the new uart_port_lock() API. It provides the necessary synchronization against emergency and panic contexts where the messages are flushed only under the new per-console lock. Maybe surprisingly, a quite tricky part is the decision how to flush the consoles in various situations. It has to take into account: - message priority: normal, emergency, panic - scheduling context: task, atomic, deferred_legacy - registered consoles: boot, legacy, nbcon - threads are running: early boot, suspend, shutdown, panic - caller: printk(), pr_flush(), printk_flush_in_panic(), console_unlock(), console_start(), ... The primary decision is made in printk_get_console_flush_type(). It creates a hint what the caller should do: - flush nbcon consoles directly or via the kthread - call the legacy loop (console_unlock()) directly or via irq_work The existing behavior is preserved for the legacy consoles. The only exception is that they are not longer flushed directly from printk() in panic() before CPUs are stopped. But this blocking happens only when at least one nbcon console is registered. The motivation is to increase a chance to produce the crash dump. They legacy consoles might create a deadlock in compare with nbcon consoles. The nbcon console should allow to see the messages even when the crash dump fails. There are three possible ways how nbcon consoles are flushed: - The per-nbcon-console kthread is responsible for flushing messages added with the normal priority. This is the default mode. - The legacy loop, aka console_unlock(), is used when there is still a boot console registered. There is no easy way how to match an early console driver with a nbcon console driver. And the console_lock() provides the only reliable serialization at the moment. The legacy loop uses either con->write_atomic() or con->write_thread() callbacks depending on whether it is allowed to schedule. The atomic variant has to be used from printk(). - In other situations, the messages are flushed directly using write_atomic() which can be called in any context, including NMI. It is primary needed during early boot or shutdown, in emergency situations, and panic. The emergency priority is used by a code called within nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four situations: WARN(), Oops, lockdep, and RCU stall reports. Finally, there is no nbcon console at the moment. It means that the changes should _not_ modify the existing behavior. The only exception is CONFIG_RT which would force offloading the legacy loop, for normal priority context, into the dedicated kthread" * tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits) printk: Avoid false positive lockdep report for legacy printing printk: nbcon: Assign nice -20 for printing threads printk: Implement legacy printer kthread for PREEMPT_RT tty: sysfs: Add nbcon support for 'active' proc: Add nbcon support for /proc/consoles proc: consoles: Add notation to c_start/c_stop printk: nbcon: Show replay message on takeover printk: Provide helper for message prepending printk: nbcon: Rely on kthreads for normal operation printk: nbcon: Use thread callback if in task context for legacy printk: nbcon: Relocate nbcon_atomic_emit_one() printk: nbcon: Introduce printer kthreads printk: nbcon: Init @nbcon_seq to highest possible printk: nbcon: Add context to usable() and emit() printk: Flush console on unregister_console() printk: Fail pr_flush() if before SYSTEM_SCHEDULING printk: nbcon: Add function for printers to reacquire ownership printk: nbcon: Use raw_cpu_ptr() instead of open coding printk: Use the BITS_PER_LONG macro lockdep: Mark emergency sections in lockdep splats ... |
||
Linus Torvalds
|
daa394f0f9 |
A set of updates for debugobjects:
- Use the threshold to check for the pool refill condition and not the run time recorded all time low fill value, which is lower than the threshold and therefore causes refills to be delayed. - KCSAN annotation updates and simplification of the fill_pool() code. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn480THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVB1D/0UE1n86SLFrR7plXudttXJnbyJ/OjK uOjLSHx66TyMkN1z6xF6K4bZTyQRpIUifPLz4evyd9CdDvITvnrvkboby/15rsGW 8sEBqAFVMkENkPzDA1Qmn3fxJs9XvHoER7WcMjaEl9yQbSi4gjO5Y+B0BNp4XKHZ P1YSmRJqUBX5F0BvmeeDlHCCpyUxeRGiyzxZ/WSl70e6RSGis10R+B/aqsMxf3Zz 6WboQJqMxnDT3ICtDxTicH9VJ6Lh9iJxppeLVxAtZ+acfhcRmpwKFmsfJJOVy1eg zkJuDh3ieb8hH7vr6bqzMEoP8qclUY7JgcJCK0dIwcASIvr7ZFVLCDLDx6Ta9UrG D+L7sjGs+h/wz7NOoKTaGJS0XHwijVtLhc5/O64p1POUiQVTfjCVW6E3RAs3IGBI uXTxuVzpK7XXvbg7+iEwYVcE5fp5vctnlLyepkbXvei9r/ccgIndj3rVGZz1qyOc 41LVhTx1Uu9MSqnsWTGbr+kzIze/g1rj8OlSH+692nbLL0mxWsOuojljvDgILC1Q rcvZLJrf8e4FDFyGZiX8kG3eHbyYQPdf3fqUCI7B05n0o7utXLf4Mgw+/LdIvpKY JTx4/lhwZ4TXFMvf+LiW/rhRlP72QsVkljjsyJTHI6a5LukdNL9dNXKTqSCypcjm hAsMzee52FiZoQ== =B0II -----END PGP SIGNATURE----- Merge tag 'core-debugobjects-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull debugobjects updates from Thomas Gleixner: - Use the threshold to check for the pool refill condition and not the run time recorded all time low fill value, which is lower than the threshold and therefore causes refills to be delayed. - KCSAN annotation updates and simplification of the fill_pool() code. * tag 'core-debugobjects-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: debugobjects: Remove redundant checks in fill_pool() debugobjects: Fix conditions in fill_pool() debugobjects: Fix the compilation attributes of some global variables |
||
Linus Torvalds
|
9ea925c806 |
Updates for timers and timekeeping:
- Core: - Overhaul of posix-timers in preparation of removing the workaround for periodic timers which have signal delivery ignored. - Remove the historical extra jiffie in msleep() msleep() adds an extra jiffie to the timeout value to ensure minimal sleep time. The timer wheel ensures minimal sleep time since the large rewrite to a non-cascading wheel, but the extra jiffie in msleep() remained unnoticed. Remove it. - Make the timer slack handling correct for realtime tasks. The procfs interface is inconsistent and does neither reflect reality nor conforms to the man page. Show the correct 0 slack for real time tasks and enforce it at the core level instead of having inconsistent individual checks in various timer setup functions. - The usual set of updates and enhancements all over the place. - Drivers: - Allow the ACPI PM timer to be turned off during suspend - No new drivers - The usual updates and enhancements in various drivers -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G F6p9cgoDNP9KFg== =jE0N -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Core: - Overhaul of posix-timers in preparation of removing the workaround for periodic timers which have signal delivery ignored. - Remove the historical extra jiffie in msleep() msleep() adds an extra jiffie to the timeout value to ensure minimal sleep time. The timer wheel ensures minimal sleep time since the large rewrite to a non-cascading wheel, but the extra jiffie in msleep() remained unnoticed. Remove it. - Make the timer slack handling correct for realtime tasks. The procfs interface is inconsistent and does neither reflect reality nor conforms to the man page. Show the correct 0 slack for real time tasks and enforce it at the core level instead of having inconsistent individual checks in various timer setup functions. - The usual set of updates and enhancements all over the place. Drivers: - Allow the ACPI PM timer to be turned off during suspend - No new drivers - The usual updates and enhancements in various drivers" * tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) ntp: Make sure RTC is synchronized when time goes backwards treewide: Fix wrong singular form of jiffies in comments cpu: Use already existing usleep_range() timers: Rename next_expiry_recalc() to be unique platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function clocksource/drivers/jcore: Use request_percpu_irq() clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init() clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended clocksource: acpi_pm: Add external callback for suspend/resume clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped() dt-bindings: timer: rockchip: Add rk3576 compatible timers: Annotate possible non critical data race of next_expiry timers: Remove historical extra jiffie for timeout in msleep() hrtimer: Use and report correct timerslack values for realtime tasks hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse. timers: Add sparse annotation for timer_sync_wait_running(). signal: Replace BUG_ON()s ... |
||
Linus Torvalds
|
cb69d86550 |
Updates for the interrupt subsystem:
- Core: - Remove a global lock in the affinity setting code The lock protects a cpumask for intermediate results and the lock causes a bottleneck on simultaneous start of multiple virtual machines. Replace the lock and the static cpumask with a per CPU cpumask which is nicely serialized by raw spinlock held when executing this code. - Provide support for giving a suffix to interrupt domain names. That's required to support devices with subfunctions so that the domain names are distinct even if they originate from the same device node. - The usual set of cleanups and enhancements all over the place - Drivers: - Support for longarch AVEC interrupt chip - Refurbishment of the Armada driver so it can be extended for new variants. - The usual set of cleanups and enhancements all over the place -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ 1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2 YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699 RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+ rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3 4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj w6vvU0i+NIyVMA== =JlJ2 -----END PGP SIGNATURE----- Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Core: - Remove a global lock in the affinity setting code The lock protects a cpumask for intermediate results and the lock causes a bottleneck on simultaneous start of multiple virtual machines. Replace the lock and the static cpumask with a per CPU cpumask which is nicely serialized by raw spinlock held when executing this code. - Provide support for giving a suffix to interrupt domain names. That's required to support devices with subfunctions so that the domain names are distinct even if they originate from the same device node. - The usual set of cleanups and enhancements all over the place Drivers: - Support for longarch AVEC interrupt chip - Refurbishment of the Armada driver so it can be extended for new variants. - The usual set of cleanups and enhancements all over the place" * tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) genirq: Use cpumask_intersects() genirq/cpuhotplug: Use cpumask_intersects() irqchip/apple-aic: Only access system registers on SoCs which provide them irqchip/apple-aic: Add a new "Global fast IPIs only" feature level irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi dt-bindings: apple,aic: Document A7-A11 compatibles irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy() genirq/msi: Use kmemdup_array() instead of kmemdup() genirq/proc: Change the return value for set affinity permission error genirq/proc: Use irq_move_pending() in show_irq_affinity() genirq/proc: Correctly set file permissions for affinity control files genirq: Get rid of global lock in irq_do_set_affinity() genirq: Fix typo in struct comment irqchip/loongarch-avec: Add AVEC irqchip support irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING LoongArch: Architectural preparation for AVEC irqchip LoongArch: Move irqchip function prototypes to irq-loongson.h irqchip/loongson-pch-msi: Switch to MSI parent domains softirq: Remove unused 'action' parameter from action callback ... |
||
Linus Torvalds
|
a64405b78b |
Updates for the clocksource watchdog:
- Make the uncertainty margin handling more robust to prevent false positives - Clarify comments -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn6xETHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoafvEAC30/0ePYUhwnJuhm1DboYLWWVGJlSI WpdwmL1YgBUTAjJC0wuYA+Fc8XV9grvxY86bPAsU2Lep85znCSIK/YEO4tsN5Tnq +aWGR8Q+2j80MxGCP2mFtG1P0K35P4dopRJgvGmE/xtqKxJ3IPryhfjegHVREDoe PNQt59lW1WZhydLXB60S9FCAJeLFRWAoc2tOIr9T2CnsRkc8aXdWnpEU40bN+nFH rX09ynPPOlxo5f6z5YWBOuv7LgjvNjmJP3qf8d4Sp4stj7+kI3ipZh6mlolek4l+ sMlbBAelQk+eiPYPIbthZvMvhM3J+YFXy4nh1LPYSRqWIMsgtobxwNUvWV0CqZaQ ZEImJqh+QJA27RAD13Uiv3N68prgW7yj/65b3EbjiK9tka2+67JVWTnlxuuyYu+S JxetHfxLKqohGhKAgAeO11efcf4HBh82afqDvMlWvTwDxyqMVLXM2NeT2iOgeYZk eH85R7ophcVdQAHZlYagsYA1oQMyT+UD40kiZzrX6WVHqX5ohVpnV/X8cJUVcUGa phOdX03ARzTSYdbE+J61bfAMCS2Tme34LkBMgZmooVfKwmnJOCGcskljtLOssm7N pTU7dI5dJLt6P+HZ6dbxf0xd0z6mNLXbVDlibEfR1ISfhLOHyH+2ywM0OJFDI0sP Ydt3U0n8sNFiXw== =Y38F -----END PGP SIGNATURE----- Merge tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource watchdog updates from Thomas Gleixner: - Make the uncertainty margin handling more robust to prevent false positives - Clarify comments * tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW clocksource: Improve comments for watchdog skew bounds |
||
Linus Torvalds
|
97e17c08a4 |
A set of updates for CPU hotplug:
- Prepare the core for supporting parallel hotplug on loongarch - A small set of cleanups and enhancements -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn6qETHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoZutD/94s3G8D3xrgQ6DwMHVtMtIAbzLtlBt SeKpIiSCnSy8bnQ+sAqOw4VjmbpB0dOlcJRii701D6hY+48TEgsL3dLn1/ws4ECc /5PapLihQgIquiAqk9iQH2BOrsFVqOGp4jbU95+ppBQtiDIB+3KeQPxxws57xb7E EUXzgCMTSsqlHCt40UCbsn7atbj0AfkV12uPKsNZT7WxPjxGK3OLuttMA6a+4xHm nBxxy/Vp9ll3J+uRGQobLFgZiIEiUsHI/+pGwltYxXC7jdN3joGqD3LuwypqLuly Ir8yXP+NhpOeNMn3iSVE8sm39bp8Sm1UslrbXvlQHGuP1JsUnIZzyevdneUp5Je7 zDKHzfn04Ls3uK1XiuQGUTvLYuiHPQ/UHP8ZeWFlkapFFDtl3fu2FU9r+LlkwKZK /0zQF6R5eBaGl3F1YKn7nPcfNf1jTLQlYq+eZT2DnSSeOb7ammjxVGgIMzWRWidG ZFNZhkjusRi3MH4aYLF8mQl7nyepy4+XQF4K0PusQ8B/NQxYRoI66mFsKhtufn5e 7T9vpYTazmhazl9SO1wQ9NNXYub+bjVj3fyRl5WSsTdS5d9pz9yqgC+xBIAJXTaq 9kN+NlP/nJ6HAzTgO074znUYR/tjlki22hNHVa6JyEh0/h0AVG53CFH5/hSONJ3P jvnhjxM/X/kLwg== =0FVL -----END PGP SIGNATURE----- Merge tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: - Prepare the core for supporting parallel hotplug on loongarch - A small set of cleanups and enhancements * tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Mark smp_prepare_boot_cpu() __init cpu: Fix W=1 build kernel-doc warning cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup() cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT |
||
Dave Airlie
|
ae2c6d8b3b |
Driver Changes:
- Fix usefafter-free when provisioning VF (Matthew Auld) - Suppress rpm warning on false positive (Rodrigo) - Fix memleak on ioctl error path (Dafna) - Fix use-after-free while inserting ggtt (Michal Wajdeczko) - Add Wa_15016589081 workaround (Tejas) - Fix error path on suspend (Maarten) -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE6rM8lpABPHM5FqyDm6KlpjDL6lMFAmbjddoZHGx1Y2FzLmRl bWFyY2hpQGludGVsLmNvbQAKCRCboqWmMMvqUy/DD/9ghjhqAFAnLQyXJ5OdYwSo KGQ4EBD52KGMZhBZwSwXXnhgwdz3s9OG6WZXZ9R3ZZOo22OyXkGZlScnTbteNU67 AARQEtveTmxPPe3Ggxrx87HqrbiveCBmdom5eggLcb2Hh6T1Z3bgbCXYnjpHEedy SnKYsxC6ReJJfB0r1Cl2MSiJ8vCQ0vFgKz7HH0RZeXomZ3lpsR9XtDLAktFYpU7H zJZ1om8hMLl7W7XXK6piPyB4VOkzeNjhxd8Hol9BZ9FiOcGlVQoG/icXISL2Kb83 1Bx2HHc7wu0IcFqsCRrkKzZudcgylGyaZkO6aPus3BNRY3bDUh9kcVM0jUkd6mvp rpTMOBS/pjS2SJl+4VMzgEYsyxb0LbJLjxPMbnsm0rL3FAeuCgch1jImKVbnKzge Xp9v5k+UtlLJjtCUwBjfFQe1nioduMp1OixNuI3TaXQz6T6aknfyjalH+vQVa4b8 uBMVdqSg2LJp3RgCFYMxIUFjLK3nlxK62tN/Ivge2aWU74hmGl/IDpozEvjr6bxQ oIwTJUIKU/SHBqd7RSP2npI2yACnXUP888Ydgrnf9BH3Ff3y7ZPqcThl8AVulYLp VCXlCzmbFoXsdde3CQfq6GtlLd5/FCBqZiZLoGigGLpOSrH4ajqpENbkg8cPZexA g7yoONJwYVWoU75rmv3B0w== =fkHk -----END PGP SIGNATURE----- Merge tag 'drm-xe-next-fixes-2024-09-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next Driver Changes: - Fix usefafter-free when provisioning VF (Matthew Auld) - Suppress rpm warning on false positive (Rodrigo) - Fix memleak on ioctl error path (Dafna) - Fix use-after-free while inserting ggtt (Michal Wajdeczko) - Add Wa_15016589081 workaround (Tejas) - Fix error path on suspend (Maarten) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/az6xs2z6zj3brq2h5wgaaoxwnqktrwbvxoyckrz7gbywsso734@a6v7gytqbcd6 |
||
Enzo Matsumiya
|
5ac1f99fdd |
smb: client: fix compression heuristic functions
Change is_compressible() return type to bool, use WARN_ON_ONCE(1) for internal errors and return false for those. Renames: check_repeated_data -> has_repeated_data check_ascii_bytes -> is_mostly_ascii (also refactor into a single loop) calc_shannon_entropy -> has_low_entropy Also wraps "wreq->Length" in le32_to_cpu() in should_compress() (caught by sparse). Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
Pali Rohár
|
37408843f2 |
cifs: Update SFU comments about fifos and sockets
In SFU mode, activated by -o sfu mount option is now also support for creating new fifos and sockets. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
Pali Rohár
|
41d3f256c6 |
cifs: Add support for creating SFU symlinks
Linux cifs client can already detect SFU symlinks and reads it content (target location). But currently is not able to create new symlink. So implement this missing support. When 'sfu' mount option is specified and 'mfsymlinks' is not specified then create new symlinks in SFU-style. This will provide full SFU compatibility of symlinks when mounting cifs share with 'sfu' option. 'mfsymlinks' option override SFU for better Apple compatibility as explained in fs_context.c file in smb3_update_mnt_flags() function. Extend __cifs_sfu_make_node() function, which now can handle also S_IFLNK type and refactor structures passed to sync_write() in this function, by splitting SFU type and SFU data from original combined struct win_dev as combined fixed-length struct cannot be used for variable-length symlinks. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> |
||
Helge Deller
|
5d698966fa |
parisc: Allow mmap(MAP_STACK) memory to automatically expand upwards
When userspace allocates memory with mmap() in order to be used for stack, allow this memory region to automatically expand upwards up until the current maximum process stack size. The fault handler checks if the VM_GROWSUP bit is set in the vm_flags field of a memory area before it allows it to expand. This patch modifies the parisc specific code only. A RFC for a generic patch to modify mmap() for all architectures was sent to the mailing list but did not get enough Acks. Reported-by: Camm Maguire <camm@maguirefamily.org> Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@vger.kernel.org # v5.10+ |
||
Helge Deller
|
75f653f0c6 |
parisc: Use PRIV_USER instead of hardcoded value
Signed-off-by: Helge Deller <deller@gmx.de> |