commit d2cf88c27f ("hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles")
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   2023-10-25 16:47:07 -07:00
Patch series "Batch hugetlb vmemmap modification operations", v8.

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d [1].  The
summary states that allocating a hugetlb page should be ~2x slower with
optimization and freeing a hugetlb page should be ~2-3x slower.  Such
overhead was deemed an acceptable trade-off for the memory savings
obtained by freeing vmemmap pages.

It was recently reported that the overhead associated with enabling
vmemmap optimization could be as high as 190x for hugetlb page
allocations.  Yes, 190x!  Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    2m29.924s

In the VM environment, enabling hugetlb vmemmap optimization resulted in
allocation times being 61x slower.

A quick profile showed that the vast majority of this overhead was due to
TLB flushing.  Each time we modify the kernel pagetable we need to flush
the TLB.  For each hugetlb page that is optimized, up to two TLB flushes
may be performed: one for the vmemmap pages associated with the hugetlb
page, and another if the vmemmap pages are mapped at the PMD level and
must be split.  The TLB flushes required for the kernel pagetable result
in a broadcast IPI, with each CPU having to flush a range of pages or do
a global flush if a threshold is exceeded.  So, the flush time increases
with the number of CPUs.  In addition, in virtual environments the
broadcast IPI can't be accelerated by hypervisor hardware and leads to
traps that need to wakeup/IPI all vCPUs, which is very expensive.
Because of this, the slowdown in virtual environments is even worse than
on bare metal as the number of vCPUs/CPUs increases.
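
To make the cost concrete, below is a minimal sketch of how such a
kernel-range flush typically turns into a broadcast IPI.  It is loosely
modeled on the x86 flush_tlb_kernel_range() path; FLUSH_ALL_THRESHOLD and
the example_* names are illustrative stand-ins, not real kernel symbols.

/*
 * Illustrative sketch only: loosely modeled on the x86 handling of
 * flush_tlb_kernel_range().  FLUSH_ALL_THRESHOLD and the example_*
 * names are stand-ins, not the real kernel symbols.
 */
#include <linux/mm.h>		/* PAGE_SIZE, PAGE_SHIFT */
#include <linux/smp.h>		/* on_each_cpu() */
#include <asm/tlbflush.h>	/* flush_tlb_one_kernel(), __flush_tlb_all() */

#define FLUSH_ALL_THRESHOLD	33	/* illustrative per-page ceiling */

struct example_flush_range {
	unsigned long start;
	unsigned long end;
};

/* Runs on every CPU via IPI: invalidate the range one page at a time. */
static void example_flush_range_fn(void *info)
{
	struct example_flush_range *r = info;
	unsigned long addr;

	for (addr = r->start; addr < r->end; addr += PAGE_SIZE)
		flush_tlb_one_kernel(addr);
}

/* Runs on every CPU via IPI: drop the entire TLB. */
static void example_flush_all_fn(void *info)
{
	__flush_tlb_all();
}

static void example_flush_tlb_kernel_range(unsigned long start,
					   unsigned long end)
{
	struct example_flush_range r = { .start = start, .end = end };

	/*
	 * Every kernel pagetable change pays for a broadcast IPI to all
	 * online CPUs; past a threshold a global flush is cheaper than
	 * invalidating page by page.  Either way, cost grows with CPU count.
	 */
	if ((end - start) >> PAGE_SHIFT > FLUSH_ALL_THRESHOLD)
		on_each_cpu(example_flush_all_fn, NULL, 1);
	else
		on_each_cpu(example_flush_range_fn, &r, 1);
}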

The following series attempts to reduce the amount of time spent in TLB
flushing.  The idea is to batch the vmemmap modification operations for
multiple hugetlb pages.  Instead of doing one or two TLB flushes for each
page, we do two TLB flushes for each batch of pages: one flush after
splitting pages mapped at the PMD level, and another after remapping the
vmemmap associated with all hugetlb pages.  A sketch of this batched flow
follows the results below.  Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real    0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real    0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real    0m5.273s

With batching, results are back in the 2-3x slowdown range.
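
The batched flow referenced above looks roughly like the sketch below.
hugetlb_vmemmap_split_one() and hugetlb_vmemmap_remap_one() are
hypothetical per-page helpers that skip their internal TLB flush (the
series adds bulk variants of the existing vmemmap helpers); the point is
that each pass over the batch ends in a single flush_tlb_all().

/*
 * Sketch of the batching idea, not the literal series code.  The per-page
 * helpers below are hypothetical variants that skip their internal TLB
 * flush so the flush can be deferred to the end of each pass.
 */
#include <linux/hugetlb.h>	/* struct hstate */
#include <linux/list.h>
#include <asm/tlbflush.h>	/* flush_tlb_all() */

static void hugetlb_vmemmap_optimize_batch(struct hstate *h,
					   struct list_head *page_list)
{
	struct page *page;

	/* Pass 1: split any PMD-mapped vmemmap for every page on the list. */
	list_for_each_entry(page, page_list, lru)
		hugetlb_vmemmap_split_one(h, page);	/* hypothetical helper */
	/* One flush covers all of the splits. */
	flush_tlb_all();

	/* Pass 2: remap and free the vmemmap for every page on the list. */
	list_for_each_entry(page, page_list, lru)
		hugetlb_vmemmap_remap_one(h, page);	/* hypothetical helper */
	/* One more flush covers all of the remappings. */
	flush_tlb_all();
}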


This patch (of 8):

update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators.  This may require
allocating vmemmap pages associated with each hugetlb page.  The hugetlb
page destructor must be changed before pages are freed to lower level
allocators.  However, the destructor must be changed under the hugetlb
lock.  This means there is potentially one lock cycle per page.

Minimize the number of lock cycles in update_and_free_pages_bulk by (see
the sketch after this list):
1) allocating the necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to the low level allocators
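
The sketch below assumes this three-phase flow; hugetlb_lock,
hugetlb_vmemmap_restore() and the list primitives are existing kernel
symbols, clear_page_destructor() and free_to_lower_allocators() are
stand-in names for the corresponding per-page helpers, and the error
path for a failed vmemmap allocation is omitted.

/*
 * Simplified sketch (in the style of mm/hugetlb.c) of the restructured
 * update_and_free_pages_bulk().  clear_page_destructor() and
 * free_to_lower_allocators() are stand-ins for the existing per-page
 * helpers; handling of a failed vmemmap allocation is omitted.
 */
static void update_and_free_pages_bulk(struct hstate *h,
				       struct list_head *list)
{
	struct page *page, *t_page;

	/* 1) Allocate vmemmap for every page, before taking the lock. */
	list_for_each_entry(page, list, lru)
		hugetlb_vmemmap_restore(h, page);

	/* 2) A single lock cycle to clear the destructor on every page. */
	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry(page, list, lru)
		clear_page_destructor(h, page);		/* stand-in helper */
	spin_unlock_irq(&hugetlb_lock);

	/* 3) Free every page back to the lower level allocators. */
	list_for_each_entry_safe(page, t_page, list, lru) {
		list_del(&page->lru);
		free_to_lower_allocators(h, page);	/* stand-in helper */
	}
}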

Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: James Houghton <jthoughton@google.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Usama Arif <usama.arif@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.