linux/Documentation
Gregory Price fa3bea4e1f mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such as
socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
interleave policy does not optimally distribute data to make use of their
different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes.  Weighted interleave
allows for proportional distribution of memory across multiple NUMA nodes,
preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight
distribution is (2:1).
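
As a rough illustration of the arithmetic above, the sketch below reduces
per-node bandwidths to the smallest integer weights with the same ratio.
It is a userspace sketch, not kernel code; bandwidth_to_weights() and
gcd_u() are hypothetical helper names, and the result is not checked
against any range the kernel may enforce on weights.

```c
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b != 0) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: reduce per-node bandwidths (e.g. in GB/s) to the
 * smallest integer weights that keep the same ratio.  If every bandwidth
 * is zero, fall back to a weight of 1 for each node.
 */
static void bandwidth_to_weights(const unsigned int *bw,
				 unsigned int *weights, size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		g = gcd_u(g, bw[i]);

	for (i = 0; i < n; i++)
		weights[i] = g ? bw[i] / g : 1;
}
```

With bandwidths of (100, 50) this yields weights of (2, 1), matching the
example above.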

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/
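
A minimal userspace sketch of setting a weight through this directory
follows.  It assumes one file per node named nodeN under the directory,
as described in the accompanying sysfs ABI documentation rather than in
this text; wi_node_path() and wi_set_weight() are hypothetical helper
names.  Writing the file normally requires root.

```c
#include <stdio.h>

#define WI_SYSFS_DIR "/sys/kernel/mm/mempolicy/weighted_interleave"

/* Build the assumed per-node path; returns 0 on success, -1 if truncated. */
static int wi_node_path(char *buf, size_t len, unsigned int node)
{
	int n = snprintf(buf, len, WI_SYSFS_DIR "/node%u", node);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a weight for one node; returns 0 on success, -1 on any failure. */
static int wi_set_weight(unsigned int node, unsigned int weight)
{
	char path[128];
	FILE *f;
	int ok;

	if (wi_node_path(path, sizeof(path), node) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", weight) > 0;
	if (fclose(f) != 0)	/* sysfs write errors can surface at close */
		ok = 0;

	return ok ? 0 : -1;
}
```

For the (2:1) example, a caller would use wi_set_weight(0, 2) followed by
wi_set_weight(1, 1).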

For now, the default value of all nodes will be `1`, which matches the
behavior of standard 1:1 round-robin interleave.  An extension will be
added in the future to allow default values to be registered at kernel and
device bringup time.

The policy allocates a number of pages equal to the set weights.  For
example, if the weights are (2,1), then 2 pages will be allocated on node0
for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
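
The following is a hedged sketch of selecting the policy for the calling
task through the raw set_mempolicy(2) syscall.  The numeric value 6 is
taken from the uapi mempolicy header and should be checked against the
installed kernel headers; WI_MPOL_WEIGHTED_INTERLEAVE, wi_nodemask() and
wi_set_task_policy() are names made up for this example.  Only nodes that
fit in a single unsigned long are handled.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed value of MPOL_WEIGHTED_INTERLEAVE; verify against your headers. */
#define WI_MPOL_WEIGHTED_INTERLEAVE 6

/* Build a one-word nodemask; node ids that do not fit are ignored. */
static unsigned long wi_nodemask(const unsigned int *nodes, unsigned int count)
{
	unsigned long mask = 0;
	unsigned int i;

	for (i = 0; i < count; i++)
		if (nodes[i] < 8 * sizeof(mask))
			mask |= 1UL << nodes[i];

	return mask;
}

/* Apply the policy to the calling task; returns 0 or a negative errno. */
static int wi_set_task_policy(unsigned long mask)
{
	long ret = syscall(SYS_set_mempolicy, WI_MPOL_WEIGHTED_INTERLEAVE,
			   &mask, 8 * sizeof(mask));

	return ret == 0 ? 0 : -errno;
}
```

A caller interleaving over nodes 0 and 1 would pass the result of
wi_nodemask() for {0, 1} to wi_set_task_policy().  On a kernel without
this patch the call is expected to fail with a negative errno.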

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Tracks the node previously allocated from.

current->il_weight:
    The active weight of the current node (current->il_prev).
    When this reaches 0, current->il_prev is set to the next node
    and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node.  Operates only on task->mempolicy.
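
    The counter behavior described for il_prev, il_weight and
    weighted_interleave_nodes can be modeled in userspace roughly as
    below.  This is a simplified sketch, not the kernel implementation:
    nodes are plain array indices 0..n-1, all of them are assumed to be
    in the policy, and struct wi_task_state and wi_next_node() are
    hypothetical names.

```c
#include <stddef.h>

/* Per-task state, modeled on current->il_prev and current->il_weight. */
struct wi_task_state {
	size_t il_prev;		/* node last allocated from */
	unsigned int il_weight;	/* allocations left on that node */
};

/*
 * Return the node for the next allocation.  weights[i] is the weight of
 * node i, n must be greater than 0, and a weight of 0 is treated as 1.
 */
static size_t wi_next_node(struct wi_task_state *st,
			   const unsigned int *weights, size_t n)
{
	if (st->il_weight == 0) {
		/* Current node is used up: move on and reload the weight. */
		st->il_prev = (st->il_prev + 1) % n;
		st->il_weight = weights[st->il_prev] ? weights[st->il_prev] : 1;
	}
	st->il_weight--;
	return st->il_prev;
}
```

    Starting from a state of { il_prev = 1, il_weight = 0 } with weights
    (2,1), successive calls return nodes 0, 0, 1, 0, 0, 1.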

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Operates on VMA policies.
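
    A stateless, index-based lookup of this kind can be sketched as
    follows.  It is a userspace model under the same simplifying
    assumptions (nodes are array indices 0..n-1, a weight of 0 counts as
    1); wi_node_for_index() is a hypothetical name.

```c
#include <stddef.h>

/*
 * Map an interleave index to a node: reduce the index modulo the total
 * weight, then walk the nodes subtracting each weight until the
 * remainder falls inside one node's share.
 */
static size_t wi_node_for_index(unsigned long ilx,
				const unsigned int *weights, size_t n)
{
	unsigned long total = 0, target;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		total += weights[i] ? weights[i] : 1;

	target = ilx % total;
	for (i = 0; i < n; i++) {
		unsigned int w = weights[i] ? weights[i] : 1;

		if (target < w)
			return i;
		target -= w;
	}
	return 0;	/* not reached: target is always below total */
}
```

    With weights (2,1), indices 0 through 5 map to nodes 0, 0, 1, 0, 0, 1,
    the same pattern the task-level counter produces.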

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight will be allocated first.

    Operates only on the task->mempolicy.
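
    The "rounds plus partial round" arithmetic can be sketched as below.
    This userspace model deliberately ignores the carried-over weight
    mentioned above and always starts the partial round at node 0;
    wi_bulk_counts() is a hypothetical name.

```c
#include <stddef.h>

/*
 * Split nr_pages across n nodes.  Each full round gives node i exactly
 * weights[i] pages; the leftover pages ("delta") are handed out in node
 * order, at most weights[i] per node.  A weight of 0 is treated as 1.
 */
static void wi_bulk_counts(unsigned long nr_pages,
			   const unsigned int *weights,
			   unsigned long *pages, size_t n)
{
	unsigned long total = 0, rounds, delta;
	size_t i;

	if (n == 0)
		return;

	for (i = 0; i < n; i++)
		total += weights[i] ? weights[i] : 1;

	rounds = nr_pages / total;	/* complete interleave rounds */
	delta = nr_pages % total;	/* pages in the partial round */

	for (i = 0; i < n; i++) {
		unsigned long w = weights[i] ? weights[i] : 1;
		unsigned long extra = delta < w ? delta : w;

		pages[i] = rounds * w + extra;
		delta -= extra;
	}
}
```

    For 10 pages and weights (2,1) this gives three full rounds plus one
    leftover page, so node 0 receives 7 pages and node 1 receives 3.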

One piece of complexity is the interaction with a recent refactor, which
split the logic that acquires the "ilx" (interleave index) of an allocation
from the actual application of the interleave.  If a call to
alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external
callers set ilx to `0`, an index value, or will call get_vma_policy() to
acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol.  The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces the
`weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
requirements (task/vma respectively).

Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:46 -08:00
ABI mm/mempolicy: implement the sysfs-based weighted_interleave interface 2024-02-22 10:24:46 -08:00
accel docs/accel: correct links to mailing list archives 2024-01-23 14:45:50 -07:00
accounting
admin-guide mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-02-22 10:24:46 -08:00
arch arm64: Subscribe Microsoft Azure Cobalt 100 to ARM Neoverse N2 errata 2024-02-15 11:47:22 +00:00
block Documentation: block: ioprio: Update schedulers 2024-01-18 08:21:14 -07:00
bpf Another moderately busy cycle for documentation, including: 2024-01-11 19:46:52 -08:00
cdrom
core-api A handful of late-arriving documentation fixes. 2024-01-17 11:49:11 -08:00
cpu-freq
crypto Another moderately busy cycle for documentation, including: 2024-01-11 19:46:52 -08:00
dev-tools Documentation: KUnit: Update the instructions on how to test static functions 2024-01-22 07:59:03 -07:00
devicetree sound fixes for 6.8-rc5 2024-02-16 09:02:19 -08:00
doc-guide docs: Raise the minimum Sphinx requirement to 2.4.4 2023-12-15 08:36:33 -07:00
driver-api pci-v6.8-changes 2024-01-17 16:23:17 -08:00
fault-injection
fb fbdev/intelfb: Remove driver 2024-01-12 12:38:37 +01:00
features riscv: Add support for BATCHED_UNMAP_TLB_FLUSH 2024-01-11 08:01:53 -08:00
filesystems ovl: mark xwhiteouts directory with overlay.opaque='x' 2024-01-23 12:39:48 +02:00
firmware_class
firmware-guide
fpga
gpu amd-drm-next-6.8-2024-01-05: 2024-01-09 09:07:50 +10:00
hid
hwmon hwmon: (lm75) Add AMS AS6200 temperature sensor 2024-01-02 08:44:57 -08:00
i2c Documentation/i2c: fix spelling error in i2c-address-translators 2023-12-27 20:05:44 +01:00
iio
images
infiniband
input
isdn
kbuild docs: kconfig: Fix grammar and formatting 2024-02-15 06:55:47 +09:00
kernel-hacking
leds
litmus-tests
livepatch
locking locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked 2024-01-08 09:55:31 +01:00
maintainer Documentation: xfs: consolidate XFS docs into its own subdirectory 2023-12-07 14:49:13 +05:30
mhi
misc-devices
mm mm/rmap: rename COMPOUND_MAPPED to ENTIRELY_MAPPED 2023-12-29 11:58:56 -08:00
netlabel
netlink dpll: fix possible deadlock during netlink dump operation 2024-02-08 18:29:21 -08:00
networking net-device: move lstats in net_device_read_txrx 2024-02-12 09:51:26 +00:00
nvdimm
nvme
PCI docs: PCI: Fix typos 2023-12-28 17:37:36 -06:00
pcmcia
peci
power Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes 2023-12-19 21:14:32 +01:00
process Documentation: Document the Linux Kernel CVE process 2024-02-17 14:46:39 +01:00
RAS Documentation: Begin a RAS section 2023-12-06 21:07:52 +01:00
RCU doc: Mention address and data dependencies in rcu_dereference.rst 2023-12-14 01:16:28 +05:30
rust LoongArch changes for v6.8 2024-01-19 13:30:49 -08:00
scheduler sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true) 2023-12-23 15:59:56 +01:00
scsi
security Documentation: Destage TEE subsystem documentation 2023-12-08 15:45:10 -07:00
sound
sphinx docs: kernel_feat.py: fix build error for missing files 2024-02-08 11:05:35 -07:00
sphinx-static docs: translations: add translations links when they exist 2023-12-19 14:34:59 -07:00
spi spi: pxa2xx: Update DMA mapping and using logic in the documentation 2023-12-08 17:50:00 +00:00
staging rpmsg updates for v6.8 2024-01-17 15:05:27 -08:00
target
tee Documentation: Destage TEE subsystem documentation 2023-12-08 15:45:10 -07:00
timers
tools
trace tracing updates for 6.8: 2024-01-18 14:35:29 -08:00
translations Docs/translations/damon/usage: update for monitor_on renaming 2024-02-22 10:24:46 -08:00
usb usb: gadget: ncm: Fix indentations in documentation of NCM section 2024-01-27 16:27:58 -08:00
userspace-api fbdev fixes and cleanups for 6.8-rc1: 2024-01-12 14:38:08 -08:00
virt KVM x86 MMU changes for 6.8: 2024-01-08 08:10:32 -05:00
w1
watchdog
wmi
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs: translations: add translations links when they exist 2023-12-19 14:34:59 -07:00
docutils.conf
dontdiff
index.rst Documentation: Begin a RAS section 2023-12-06 21:07:52 +01:00
Kconfig
Makefile doc/netlink: Regenerate netlink .rst files if ynl-gen-rst changes 2023-12-18 14:39:44 -08:00
memory-barriers.txt doc: Clarify historical disclaimers in memory-barriers.txt 2023-12-14 01:16:28 +05:30
SubmittingPatches
subsystem-apis.rst Documentation: Destage TEE subsystem documentation 2023-12-08 15:45:10 -07:00