linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-28 13:34:38 +08:00

Gregory Price fa3bea4e1f mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple NUMA nodes, preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time.

The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Tracks the node previously allocated from.

current->il_weight:
    The active weight of the current node (current->il_prev). When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first.

    Operates only on the task->mempolicy.

One piece of complexity is the interaction between a recent refactor, which split the logic to acquire the "ilx" (interleave index) of an allocation, and the actual application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`.

This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively).

Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-02-22 10:24:46 -08:00
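
The three allocation paths described in the commit message can be modelled in userspace. The following is a minimal Python sketch, not the kernel implementation: the function names only echo those in the commit message, bulk_page_counts is a hypothetical helper, node weights are assumed to be positive integers, and the bulk helper ignores the carry-over of a partially consumed weight that the commit message mentions.

```python
from itertools import islice


def weighted_interleave_nodes(nodes, weights):
    """Task-style path: yield the node for each successive page.

    Models the il_prev/il_weight description: stay on a node until its
    weight is used up, then move on to the next node in the mask.
    Assumes every weight is a positive integer.
    """
    while True:
        for node in nodes:
            for _ in range(weights[node]):
                yield node


def weighted_interleave_nid(nodes, weights, ilx):
    """VMA-style path: map an interleave index to a node, statelessly."""
    total = sum(weights[n] for n in nodes)
    target = ilx % total
    for node in nodes:
        if target < weights[node]:
            return node
        target -= weights[node]
    raise ValueError("nodes must be non-empty with positive weights")


def bulk_page_counts(nodes, weights, nr_pages):
    """Bulk path: whole interleave rounds plus one partial round (delta)."""
    total = sum(weights[n] for n in nodes)
    rounds, delta = divmod(nr_pages, total)
    counts = {}
    for node in nodes:
        take = min(delta, weights[node])  # this node's share of the partial round
        counts[node] = rounds * weights[node] + take
        delta -= take
    return counts


if __name__ == "__main__":
    weights = {0: 2, 1: 1}  # the (2:1) example from the commit message
    print(list(islice(weighted_interleave_nodes([0, 1], weights), 6)))
    print([weighted_interleave_nid([0, 1], weights, i) for i in range(6)])
    print(bulk_page_counts([0, 1], weights, 7))
```

With weights (2,1), both the stateful and the index-based paths place 2 pages on node0 for every 1 page on node1, which is the ratio given in the commit message. The index-based variant needs no per-task state, which is why it suits VMA policies where the interleave index is derived from the page offset.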
..
ABI	mm/mempolicy: implement the sysfs-based weighted_interleave interface	2024-02-22 10:24:46 -08:00
accel	docs/accel: correct links to mailing list archives	2024-01-23 14:45:50 -07:00
accounting
admin-guide	mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving	2024-02-22 10:24:46 -08:00
arch	arm64: Subscribe Microsoft Azure Cobalt 100 to ARM Neoverse N2 errata	2024-02-15 11:47:22 +00:00
block	Documentation: block: ioprio: Update schedulers	2024-01-18 08:21:14 -07:00
bpf	Another moderately busy cycle for documentation, including:	2024-01-11 19:46:52 -08:00
cdrom
core-api	A handful of late-arriving documentation fixes.	2024-01-17 11:49:11 -08:00
cpu-freq
crypto	Another moderately busy cycle for documentation, including:	2024-01-11 19:46:52 -08:00
dev-tools	Documentation: KUnit: Update the instructions on how to test static functions	2024-01-22 07:59:03 -07:00
devicetree	sound fixes for 6.8-rc5	2024-02-16 09:02:19 -08:00
doc-guide	docs: Raise the minimum Sphinx requirement to 2.4.4	2023-12-15 08:36:33 -07:00
driver-api	pci-v6.8-changes	2024-01-17 16:23:17 -08:00
fault-injection
fb	fbdev/intelfb: Remove driver	2024-01-12 12:38:37 +01:00
features	riscv: Add support for BATCHED_UNMAP_TLB_FLUSH	2024-01-11 08:01:53 -08:00
filesystems	ovl: mark xwhiteouts directory with overlay.opaque='x'	2024-01-23 12:39:48 +02:00
firmware_class
firmware-guide
fpga
gpu	amd-drm-next-6.8-2024-01-05:	2024-01-09 09:07:50 +10:00
hid
hwmon	hwmon: (lm75) Add AMS AS6200 temperature sensor	2024-01-02 08:44:57 -08:00
i2c	Documentation/i2c: fix spelling error in i2c-address-translators	2023-12-27 20:05:44 +01:00
iio
images
infiniband
input
isdn
kbuild	docs: kconfig: Fix grammar and formatting	2024-02-15 06:55:47 +09:00
kernel-hacking
leds
litmus-tests
livepatch
locking	locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked	2024-01-08 09:55:31 +01:00
maintainer
mhi
misc-devices
mm	mm/rmap: rename COMPOUND_MAPPED to ENTIRELY_MAPPED	2023-12-29 11:58:56 -08:00
netlabel
netlink	dpll: fix possible deadlock during netlink dump operation	2024-02-08 18:29:21 -08:00
networking	net-device: move lstats in net_device_read_txrx	2024-02-12 09:51:26 +00:00
nvdimm
nvme
PCI	docs: PCI: Fix typos	2023-12-28 17:37:36 -06:00
pcmcia
peci
power	Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes	2023-12-19 21:14:32 +01:00
process	Documentation: Document the Linux Kernel CVE process	2024-02-17 14:46:39 +01:00
RAS
RCU
rust	LoongArch changes for v6.8	2024-01-19 13:30:49 -08:00
scheduler	sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)	2023-12-23 15:59:56 +01:00
scsi
security
sound
sphinx	docs: kernel_feat.py: fix build error for missing files	2024-02-08 11:05:35 -07:00
sphinx-static	docs: translations: add translations links when they exist	2023-12-19 14:34:59 -07:00
spi
staging	rpmsg updates for v6.8	2024-01-17 15:05:27 -08:00
target
tee
timers
tools
trace	tracing updates for 6.8:	2024-01-18 14:35:29 -08:00
translations	Docs/translations/damon/usage: update for monitor_on renaming	2024-02-22 10:24:46 -08:00
usb	usb: gadget: ncm: Fix indentations in documentation of NCM section	2024-01-27 16:27:58 -08:00
userspace-api	fbdev fixes and cleanups for 6.8-rc1:	2024-01-12 14:38:08 -08:00
virt	KVM x86 MMU changes for 6.8:	2024-01-08 08:10:32 -05:00
w1
watchdog
wmi
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py	docs: translations: add translations links when they exist	2023-12-19 14:34:59 -07:00
docutils.conf
dontdiff
index.rst
Kconfig
Makefile	doc/netlink: Regenerate netlink .rst files if ynl-gen-rst changes	2023-12-18 14:39:44 -08:00
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst