linux/drivers
Nhat Pham 0a97c01cd2 list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform workload-specific shrinking - a memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for a memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e.
memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
Linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improve the overall
performance.  Depending on the amount of cold data generated, we observe
a 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocation, placed on the
correct node/memcg.  While this assumption is valid for existing slab
object LRUs, such as those for dentries and inodes, it is undocumented,
and rather inflexible for certain potential list_lru users (such as the
upcoming zswap shrinker and the THP shrinker).  It has caused us a lot of
issues during our development.

This patch changes the list_lru interface so that the caller must explicitly
specify NUMA node and memcg when adding and removing objects.  The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count).  list_lru_putback also allows for explicit
memcg and NUMA node selection.
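
The counting rule described above can be illustrated with a small
user-space model.  This is only a sketch, not the kernel code: every
toy_* name, the fixed MAX_NODES/MAX_MEMCGS sizes, and the use of a plain
integer for the memcg are hypothetical.  The split between a per-list
count and a per-node count is an assumption of the model; the only
property taken from the text is that isolate and putback both leave the
node count alone, while add and del change it.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *prev, *next; };

static void init_list_head(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void list_del_init(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	init_list_head(item);
}

#define MAX_NODES  2	/* hypothetical: number of NUMA nodes in the model */
#define MAX_MEMCGS 2	/* hypothetical: number of memcgs in the model */

struct toy_lru_one  { struct list_head list; long nr_items; };
struct toy_lru_node { struct toy_lru_one memcg[MAX_MEMCGS]; long nr_items; };
struct toy_lru      { struct toy_lru_node node[MAX_NODES]; };

static void toy_lru_init(struct toy_lru *lru)
{
	for (int n = 0; n < MAX_NODES; n++) {
		lru->node[n].nr_items = 0;
		for (int m = 0; m < MAX_MEMCGS; m++) {
			init_list_head(&lru->node[n].memcg[m].list);
			lru->node[n].memcg[m].nr_items = 0;
		}
	}
}

/* Look up the sub-list for an explicitly given (node, memcg) pair. */
static struct toy_lru_one *toy_lru_one(struct toy_lru *lru, int nid, int memcg)
{
	assert(nid >= 0 && nid < MAX_NODES);
	assert(memcg >= 0 && memcg < MAX_MEMCGS);
	return &lru->node[nid].memcg[memcg];
}

/* Add: the caller names the node and memcg; both counts go up. */
static int toy_lru_add(struct toy_lru *lru, struct list_head *item,
		       int nid, int memcg)
{
	struct toy_lru_one *l = toy_lru_one(lru, nid, memcg);

	if (!list_empty(item))
		return 0;		/* already on a list */
	list_add_tail(item, &l->list);
	l->nr_items++;
	lru->node[nid].nr_items++;
	return 1;
}

/* Del: the caller names the node and memcg; both counts go down. */
static int toy_lru_del(struct toy_lru *lru, struct list_head *item,
		       int nid, int memcg)
{
	struct toy_lru_one *l = toy_lru_one(lru, nid, memcg);

	if (list_empty(item))
		return 0;		/* not on any list */
	list_del_init(item);
	l->nr_items--;
	lru->node[nid].nr_items--;
	return 1;
}

/* Isolate: unlink the item, but leave the node count untouched. */
static void toy_lru_isolate(struct toy_lru *lru, struct list_head *item,
			    int nid, int memcg)
{
	list_del_init(item);
	toy_lru_one(lru, nid, memcg)->nr_items--;
}

/* Putback: undo an isolate; again the node count is left untouched. */
static void toy_lru_putback(struct toy_lru *lru, struct list_head *item,
			    int nid, int memcg)
{
	struct toy_lru_one *l = toy_lru_one(lru, nid, memcg);

	if (list_empty(item)) {
		list_add_tail(item, &l->list);
		l->nr_items++;
	}
}
```

In this model, an add followed by an isolate and a putback on the same
(node, memcg) pair leaves the node count at 1 throughout, which is why a
putback must not go through the add path: doing so would count the
object twice.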

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 10:57:01 -08:00
accel accel/ivpu/37xx: Fix hangs related to MMIO reset 2023-11-21 09:20:25 +01:00
accessibility
acpi ACPI fixes for 6.7-rc4 2023-12-02 08:52:20 +09:00
amba amba: bus: balance firmware node reference counting 2023-10-17 13:37:35 -05:00
android list_lru: allow explicit memcg and NUMA node selection 2023-12-12 10:57:01 -08:00
ata SCSI fixes on 20231130 2023-12-02 06:27:20 +09:00
atm pci-v6.7-changes 2023-11-02 14:05:18 -10:00
auxdisplay
base drivers/base/cpu: crash data showing should depends on KEXEC_CORE 2023-12-06 16:12:48 -08:00
bcma
block zram: use kmap_local_page() 2023-12-10 16:51:55 -08:00
bluetooth Bluetooth: btmtksdio: enable bluetooth wakeup in system suspend 2023-10-23 11:04:51 -07:00
bus SoC driver updates for 6.7 2023-11-01 14:46:51 -10:00
cache riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENT 2023-10-26 09:42:37 +02:00
cdrom
cdx Char/Misc and other driver changes for 6.7-rc1 2023-11-03 14:51:08 -10:00
char Char/Misc and other driver changes for 6.7-rc1 2023-11-03 14:51:08 -10:00
clk SoC driver updates for 6.7 2023-11-01 14:46:51 -10:00
clocksource RISC-V Patches for the 6.7 Merge Window, Part 2 2023-11-10 09:23:17 -08:00
comedi
connector Fix NULL pointer dereference in cn_filter() 2023-10-24 10:53:45 +02:00
counter
cpufreq cpufreq/amd-pstate: Only print supported EPP values for performance governor 2023-11-29 22:04:15 +01:00
cpuidle
crypto crypto: talitos - stop using crypto_ahash::init 2023-10-27 18:04:29 +08:00
cxl cxl/pci: Change CXL AER support check to use native AER 2023-11-02 14:09:01 -07:00
dax dax/kmem: allow kmem to add memory with memmap_on_memory 2023-12-10 16:51:35 -08:00
dca
devfreq PM / devfreq: rockchip-dfi: add support for RK3588 2023-10-19 21:21:16 +09:00
dio
dma dmaengine updates for v6.7 2023-11-03 18:56:51 -10:00
dma-buf dma-buf: fix check in dma_resv_add_fence 2023-11-27 20:00:47 +01:00
dpll dpll: Fix potential msg memleak when genlmsg_put_reply failed 2023-11-21 17:41:20 -08:00
edac hardening updates for v6.7-rc1 2023-10-30 19:09:55 -10:00
eisa
extcon extcon: realtek: add the error handler for nvmem_cell_read 2023-10-17 17:38:57 +09:00
firewire firewire fixes for 6.7-rc4 2023-12-03 09:03:07 +09:00
firmware efi/unaccepted: Fix off-by-one when checking for overlapping ranges 2023-11-28 12:49:21 +01:00
fpga Char/Misc and other driver changes for 6.7-rc1 2023-11-03 14:51:08 -10:00
fsi
gnss
gpio pwm: Changes for v6.7-rc1 2023-11-09 13:47:52 -08:00
gpu amd-drm-fixes-6.7-2023-11-30: 2023-12-01 13:57:11 +10:00
greybus greybus: Add BeaglePlay Linux Driver 2023-10-27 13:19:04 +02:00
hid for-linus-2023112301 2023-11-23 17:31:53 -08:00
hsi
hte hte: Changes for v6.7-rc1 2023-10-31 18:32:51 -10:00
hv TTY/Serial changes for 6.7-rc1 2023-11-03 15:44:25 -10:00
hwmon hwmon updates for v6.7-rc1 2023-10-31 17:44:17 -10:00
hwspinlock
hwtracing
i2c i2c: ocores: Move system PM hooks to the NOIRQ phase 2023-11-13 12:43:42 -05:00
i3c I3C for 6.7 2023-11-04 16:25:36 -10:00
idle
iio Char/Misc and other driver changes for 6.7-rc1 2023-11-03 14:51:08 -10:00
infiniband RDMA for v6.7 2023-11-02 15:20:30 -10:00
input Input updates for 6.7 merge window: 2023-11-09 14:18:42 -08:00
interconnect Merge branch 'icc-platform-remove' into icc-next 2023-10-19 00:50:03 +03:00
iommu iommu: Fix printk arg in of_iommu_get_resv_regions() 2023-12-01 10:13:49 +01:00
ipack
irqchip - Flush the translation service tables to prevent unpredictable behavior 2023-11-19 13:49:32 -08:00
isdn hardening updates for v6.7-rc1 2023-10-30 19:09:55 -10:00
leds leds: class: Don't expose color sysfs entry 2023-11-22 11:46:03 +00:00
macintosh powerpc updates for 6.7 2023-11-03 10:07:39 -10:00
mailbox Moving repo 2023-11-05 18:45:32 -08:00
mcb mcb: fix error handling for different scenarios when parsing 2023-10-21 23:04:02 +02:00
md block-6.7-2023-12-01 2023-12-02 06:39:30 +09:00
media Renesas R-Car VSP1 driver regression fix 2023-11-16 14:28:44 +01:00
memory IOMMU Updates for Linux v6.7 2023-11-09 13:37:28 -08:00
memstick
message scsi: message: fusion: Initialize return value in mptfc_bus_reset() 2023-10-24 22:36:39 -04:00
mfd - Core Frameworks 2023-11-02 14:40:51 -10:00
misc RISC-V Patches for the 6.7 Merge Window, Part 1 2023-11-08 09:21:18 -08:00
mmc mmc: sdhci-sprd: Fix vqmmc not shutting down after the card was pulled 2023-11-23 18:04:17 +01:00
most
mtd - removed AR7 platform support 2023-11-10 09:19:46 -08:00
mux
net net: ravb: Keep reverse order of operations in ravb_remove() 2023-11-30 10:59:07 +01:00
nfc nfc: virtual_ncidev: Add variable to check if ndev is running 2023-11-22 10:55:48 +00:00
ntb
nubus
nvdimm libnvdimm: remove kernel-doc warnings: 2023-10-18 09:48:05 -07:00
nvme nvme-core: check for too small lba shift 2023-12-01 07:49:50 -08:00
nvmem Char/Misc and other driver changes for 6.7-rc1 2023-11-03 14:51:08 -10:00
of RISC-V Patches for the 6.7 Merge Window, Part 2 2023-11-10 09:23:17 -08:00
opp OPP: No need to defer probe from _opp_attach_genpd() 2023-10-17 11:11:28 +05:30
parisc parisc/power: Fix power soft-off when running on qemu 2023-11-18 18:59:30 +01:00
parport parport: gsc: mark init function static 2023-11-10 08:41:23 +01:00
pci cxl for v6.7 2023-11-04 16:20:36 -10:00
pcmcia PCMCIA odd cleanups and fixes for v6.7-rc1 2023-11-07 16:40:42 -08:00
peci
perf arm64 fixes: 2023-11-10 12:22:14 -08:00
phy Revert "phy: realtek: usb: Add driver for the Realtek SoC USB 2.0 PHY" 2023-11-06 14:47:36 +01:00
pinctrl pinctrl: realtek: Fix logical error when finding descriptor 2023-11-24 10:39:20 +01:00
platform platform/x86: intel_telemetry: Fix kernel doc descriptions 2023-11-21 10:09:04 +02:00
pmdomain Power management fixes for 6.7-rc4 2023-12-02 09:01:00 +09:00
pnp PNP: replace deprecated strncpy() with memcpy() 2023-10-20 19:50:40 +02:00
power USB/Thunderbolt changes for 6.7-rc1 2023-11-03 16:00:42 -10:00
powercap powercap: DTPM: Fix unneeded conversions to micro-Watts 2023-11-28 15:15:14 +01:00
pps
ps3
ptp ptp: annotate data-race around q->head and q->tail 2023-11-13 20:51:37 -08:00
pwm pwm: samsung: Fix a bit test in pwm_samsung_resume() 2023-11-10 09:20:48 +01:00
rapidio
ras
regulator regulator: Merge up pending fix 2023-10-30 13:14:27 +00:00
remoteproc remoteproc: st: Fix sometimes uninitialized ret in st_rproc_probe() 2023-10-16 11:24:34 -06:00
reset reset: Annotate struct reset_control_array with __counted_by 2023-10-24 14:10:04 -07:00
rpmsg rpmsg: virtio: Replace deprecated strncpy with strscpy/_pad 2023-10-23 13:11:07 -06:00
rtc RTC for 6.7 2023-11-05 18:49:40 -08:00
s390 block-6.7-2023-11-23 2023-11-23 17:40:15 -08:00
sbus
scsi scsi: sd: Fix system start for ATA devices 2023-11-24 20:44:21 -05:00
sh sh: Remove superhyway bus support 2023-10-25 16:50:11 +02:00
siox
slimbus
soc powerpc updates for 6.7 2023-11-03 10:07:39 -10:00
soundwire soundwire updates for 6.7 2023-11-03 19:10:41 -10:00
spi spi: Fixes for v6.7 2023-11-10 11:44:38 -08:00
spmi spmi: rename spmi device lookup helper 2023-11-01 10:02:18 +00:00
ssb ssb: relax SSB_EMBEDDED dependencies 2023-10-19 10:26:26 +03:00
staging pwm: Changes for v6.7-rc1 2023-11-09 13:47:52 -08:00
target SCSI misc on 20231102 2023-11-02 15:13:50 -10:00
tc
tee tee: make tee_class constant 2023-10-18 10:01:34 +02:00
thermal Thermal control updates for 6.7-rc1 2023-10-31 15:28:37 -10:00
thunderbolt thunderbolt: Only add device router DP IN to the head of the DP resource list 2023-11-17 13:05:57 +02:00
tty - removed AR7 platform support 2023-11-10 09:19:46 -08:00
ufs scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode 2023-11-24 20:35:24 -05:00
uio
usb USB-serial fixes for 6.7-rc3 2023-11-24 16:30:38 +00:00
vdpa vdpa_sim_blk: allocate the buffer zeroed 2023-11-01 09:31:16 -04:00
vfio vfio/pds: Fix possible sleep while in atomic context 2023-11-27 09:29:03 -07:00
vhost vhost,virtio,vdpa,firmware: bugfixes 2023-11-16 07:39:37 -05:00
video fbdev: fsl-diu-fb: mark wr_reg_wa() static 2023-11-10 09:16:02 +01:00
virt configfs-tsm for v6.7 2023-11-04 15:58:13 -10:00
virtio vhost,virtio,vdpa,firmware: bugfixes 2023-11-16 07:39:37 -05:00
w1 nvmem: add explicit config option to read old syntax fixed OF cells 2023-10-21 19:19:06 +02:00
watchdog - removed AR7 platform support 2023-11-10 09:19:46 -08:00
xen xen/events: fix error code in xen_bind_pirq_msi_to_irq() 2023-11-28 12:48:27 +01:00
zorro
Kconfig - removed AR7 platform support 2023-11-10 09:19:46 -08:00
Makefile - removed AR7 platform support 2023-11-10 09:19:46 -08:00