linux/drivers
SeongJae Park cb9369bdbb xen/blkback: Squeeze page pools if a memory pressure is detected
Each `blkif` has a free pages pool for the grant mapping.  The size of
the pool starts from zero and is increased on demand while processing
the I/O requests.  If current I/O requests handling is finished or 100
milliseconds has passed since last I/O requests handling, it checks and
shrinks the pool to not exceed the size limit, `max_buffer_pages`.

Therefore, host administrators can cause memory pressure in blkback by
attaching a large number of block devices and inducing I/O.  Such
problematic situations can be avoided by limiting the maximum number of
devices that can be attached, but finding the optimal limit is not so
easy.  Improper set of the limit can results in memory pressure or a
resource underutilization.  This commit avoids such problematic
situations by squeezing the pools (returns every free page in the pool
to the system) for a while (users can set this duration via a module
parameter) if memory pressure is detected.

Discussions
===========

The `blkback`'s original shrinking mechanism returns only pages in the
pool which are not currently be used by `blkback` to the system.  In
other words, the pages that are not mapped with granted pages.  Because
this commit is changing only the shrink limit but still uses the same
freeing mechanism it does not touch pages which are currently mapping
grants.

Once memory pressure is detected, this commit keeps the squeezing limit
for a user-specified time duration.  The duration should be neither too
long nor too short.  If it is too long, the squeezing incurring overhead
can reduce the I/O performance.  If it is too short, `blkback` will not
free enough pages to reduce the memory pressure.  This commit sets the
value as `10 milliseconds` by default because it is a short time in
terms of I/O while it is a long time in terms of memory operations.
Also, as the original shrinking mechanism works for at least every 100
milliseconds, this could be a somewhat reasonable choice.  I also tested
other durations (refer to the below section for more details) and
confirmed that 10 milliseconds is the one that works best with the test.
That said, the proper duration depends on actual configurations and
workloads.  That's why this commit allows users to set the duration as a
module parameter.

Memory Pressure Test
====================

To show how this commit fixes the memory pressure situation well, I
configured a test environment on a xen-running virtualization system.
On the `blkfront` running guest instances, I attach a large number of
network-backed volume devices and induce I/O to those.  Meanwhile, I
measure the number of pages that swapped in (pswpin) and out (pswpout)
on the `blkback` running guest.  The test ran twice, once for the
`blkback` before this commit and once for that after this commit.  As
shown below, this commit has dramatically reduced the memory pressure:

                pswpin  pswpout
    before      76,672  185,799
    after          867    3,967

Optimal Aggressive Shrinking Duration
-------------------------------------

To find a best squeezing duration, I repeated the test with three
different durations (1ms, 10ms, and 100ms).  The results are as below:

    duration    pswpin  pswpout
    1           707     5,095
    10          867     3,967
    100         362     3,348

As expected, the memory pressure decreases as the duration increases,
but the reduction become slow from the `10ms`.  Based on this results, I
chose the default duration as 10ms.

Performance Overhead Test
=========================

This commit could incur I/O performance degradation under severe memory
pressure because the squeezing will require more page allocations per
I/O.  To show the overhead, I artificially made a worst-case squeezing
situation and measured the I/O performance of a `blkfront` running
guest.

For the artificial squeezing, I set the `blkback.max_buffer_pages` using
the `/sys/module/xen_blkback/parameters/max_buffer_pages` file.  In this
test, I set the value to `1024` and `0`.  The `1024` is the default
value.  Setting the value as `0` is same to a situation doing the
squeezing always (worst-case).

If the underlying block device is slow enough, the squeezing overhead
could be hidden.  For the reason, I use a fast block device, namely the
rbd[1]:

    # xl block-attach guest phy:/dev/ram0 xvdb w

For the I/O performance measurement, I run a simple `dd` command 5 times
directly to the device as below and collect the 'MB/s' results.

    $ for i in {1..5}; do dd if=/dev/zero of=/dev/xvdb \
                             bs=4k count=$((256*512)); sync; done

The results are as below.  'max_pgs' represents the value of the
`blkback.max_buffer_pages` parameter.

    max_pgs   Min       Max       Median     Avg    Stddev
    0         417       423       420        419.4  2.5099801
    1024      414       425       416        417.8  4.4384682
    No difference proven at 95.0% confidence

In short, even worst case squeezing on ramdisk based fast block device
makes no visible performance degradation.  Please note that this is just
a very simple and minimal test.  On systems using super-fast block
devices and a special I/O workload, the results might be different.  If
you have any doubt, test on your machine with your workload to find the
optimal squeezing duration for you.

[1] https://www.kernel.org/doc/html/latest/admin-guide/blockdev/ramdisk.html

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-01-29 07:35:49 -06:00
..
accessibility
acpi ACPI: PM: Avoid attaching ACPI PM domain to certain devices 2019-12-10 00:22:18 +01:00
amba
android binder: fix incorrect calculation for num_valid 2019-12-14 09:10:47 +01:00
ata ata: ahci_brcm: Add missing clock management during recovery 2019-12-25 20:47:24 -07:00
atm atm: eni: fix uninitialized variable warning 2020-01-08 13:11:00 -08:00
auxdisplay auxdisplay: charlcd: deduplicate simple_strtoul() 2019-12-04 19:44:12 -08:00
base Merge branch 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux 2019-12-15 11:36:12 -08:00
bcma
block xen/blkback: Squeeze page pools if a memory pressure is detected 2020-01-29 07:35:49 -06:00
bluetooth Bluetooth: btbcm: Use the BDADDR_PROPERTY quirk 2019-11-22 13:35:20 +01:00
bus bus: ti-sysc: Fix missing reset delay handling 2019-12-12 08:20:10 -08:00
cdrom cdrom: respect device capabilities during opening action 2019-11-26 13:02:24 -07:00
char tpm: Handle negative priv->response_len in tpm_common_read() 2020-01-08 18:11:09 +02:00
clk clk: qcom: Avoid SMMU/cx gdsc corner cases 2019-12-18 22:02:27 -08:00
clocksource clocksource: riscv: add notrace to riscv_sched_clock 2020-01-04 21:48:48 -08:00
connector
counter
cpufreq Merge branch 'cpufreq/arm/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm 2020-01-07 10:41:35 +01:00
cpuidle cpuidle: Drop unnecessary type cast in cpuidle_poll_time() 2019-12-12 17:56:08 +01:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2019-12-02 17:23:21 -08:00
dax libnvdimm for 5.5 2019-12-01 18:43:25 -08:00
dca
devfreq PM / devfreq: tegra: Add COMMON_CLK dependency 2019-12-23 10:42:58 +09:00
dio
dma ioat: ioat_alloc_ring() failure handling. 2019-12-27 12:06:06 +05:30
dma-buf - A fix for a memory leak in the dma-buf support 2019-12-09 17:13:19 +10:00
edac riscv: move sifive_l2_cache.h to include/soc 2020-01-12 10:12:44 -08:00
eisa
extcon Char/Misc driver patches for 5.5-rc1 2019-11-27 10:53:50 -08:00
firewire FireWire (IEEE 1394) subsystem updates: 2019-12-02 14:13:00 -08:00
firmware firmware: tee_bnxt: Fix multiple call to tee_client_close_context 2020-01-06 13:51:37 -08:00
fpga
fsi fsi: aspeed: Fix OPB0 byte order register values 2019-11-08 11:28:21 +01:00
gnss
gpio gpiolib: acpi: Add honor_wakeup module-option + quirk mechanism 2020-01-07 12:58:15 +01:00
gpu Merge tag 'drm-intel-fixes-2020-01-09-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes 2020-01-10 11:43:02 +10:00
greybus
hid HID: hidraw, uhid: Always report EPOLLOUT 2020-01-10 15:34:28 +01:00
hsi
hv Merge branch 'akpm' (patches from Andrew) 2019-12-01 20:36:41 -08:00
hwmon compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
hwspinlock hwspinlock: u8500_hsem: Remove redundant PM runtime implementation 2019-11-08 16:42:26 -08:00
hwtracing intel_th: msu: Fix window switching without windows 2019-12-17 15:45:59 +01:00
i2c i2c: fix bus recovery stop mode timing 2020-01-09 22:21:08 +01:00
i3c
ide compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
idle cpuidle: Drop disabled field from struct cpuidle_state 2019-11-29 11:48:39 +01:00
iio First set of fixes for IIO in the 5.5 cycle. 2019-12-09 09:27:52 +01:00
infiniband i40iw: Remove setting of VMA private data and use rdma_user_mmap_io 2020-01-07 15:07:37 -04:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2020-01-09 15:37:40 -08:00
interconnect interconnect: qcom: msm8974: Walk the list safely on node removal 2019-12-12 10:28:54 +01:00
iommu iommu/dma: fix variable 'cookie' set but not used 2020-01-07 17:08:58 +01:00
ipack
irqchip riscv: prefix IRQ_ macro names with an RV_ namespace 2020-01-04 21:48:59 -08:00
isdn compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
leds Merge tag 'leds-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds 2019-12-01 16:09:28 -08:00
lightnvm
macintosh powerpc updates for 5.5 2019-11-30 14:35:43 -08:00
mailbox mailbox changes for v5.5 2019-12-01 18:42:02 -08:00
mcb
md for-linus-20191212 2019-12-13 14:27:19 -08:00
media media updates for v5.5-rc5 2020-01-04 10:41:08 -08:00
memory memory: tegra: Fixes for v5.5-rc1 2019-12-06 08:28:51 -08:00
memstick pci-v5.5-changes 2019-12-03 13:58:22 -08:00
message
mfd chrome platform changes for v5.5 2019-12-03 14:37:12 -08:00
misc powerpc fixes for 5.5 #4 2019-12-21 06:17:05 -08:00
mmc mmc: sdhci-of-esdhc: re-implement erratum A-009204 workaround 2019-12-19 08:13:43 +01:00
mtd mtd: spi-nor: Fix the writing of the Status Register on micron flashes 2020-01-09 20:11:34 +01:00
mux
net macvlan: do not assume mac_header is set in macvlan_broadcast() 2020-01-08 12:52:33 -08:00
nfc nfc: s3fwrn5: replace the assertion with a WARN_ON 2019-12-19 17:33:23 -08:00
ntb Add Hygon Device ID to the AMD NTB device driver 2019-12-07 18:38:17 -08:00
nubus
nvdimm libnvdimm for 5.5 2019-12-01 18:43:25 -08:00
nvme nvmet: fix per feat data len for get_feature 2020-01-10 08:55:50 -07:00
nvmem ARM: SoC-related driver updates 2019-12-05 11:43:31 -08:00
of Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-12-22 09:54:33 -08:00
opp PM / OPP: Support adjusting OPP voltages at runtime 2019-11-11 10:27:15 +05:30
oprofile Printk changes for 5.5 2019-11-25 19:40:40 -08:00
parisc
parport parport: daisy: use new parport device model 2019-11-13 19:09:49 +08:00
pci PCI: rockchip: Fix IO outbound ATU register number 2019-12-12 15:25:37 -06:00
pcmcia pcmcia: remove unused dprintk definition 2019-11-22 07:03:45 +01:00
perf perf/smmuv3: Remove the leftover put_cpu() in error path 2019-12-18 16:15:36 +00:00
phy phy/rockchip: inno-hdmi: round clock rate down to closest 1000 Hz 2019-12-31 15:46:08 +05:30
pinctrl pinctrl: meson: Fix wrong shift value when get drive-strength 2020-01-07 11:21:07 +01:00
platform A collection of MIPS fixes: 2020-01-04 14:16:57 -08:00
pnp
power Additional power management updates for 5.5-rc1 2019-12-04 10:48:09 -08:00
powercap powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online() 2020-01-07 12:24:34 +01:00
pps
ps3
ptp ptp: fix the race between the release of ptp_clock and cdev 2019-12-30 20:19:27 -08:00
pwm pwm: Changes for v5.5-rc1 2019-12-05 11:28:14 -08:00
rapidio drivers/rapidio/rio-access.c: fix missing include of <linux/rio_drv.h> 2019-12-04 19:44:13 -08:00
ras
regulator regulator: Fixes for v5.5 2020-01-06 12:04:31 -08:00
remoteproc remoteproc: stm32: fix probe error case 2019-11-18 20:35:16 -08:00
reset reset: Do not register resource data for missing resets 2019-12-10 11:43:37 +01:00
rpmsg rpmsg updates for v5.5 2019-12-01 18:39:24 -08:00
rtc rtc: cmos: Revert "rtc: Fix the AltCentury value on AMD/Hygon platform" 2020-01-04 05:31:50 +01:00
s390 s390/qeth: fix initialization on old HW 2019-12-24 22:41:06 -08:00
sbus
scsi SCSI fixes on 20191227 2019-12-27 17:28:41 -08:00
sfi
sh
siox
slimbus
soc riscv: move sifive_l2_cache.h to include/soc 2020-01-12 10:12:44 -08:00
soundwire Merge 5.4-rc7 into char-misc-next 2019-11-11 06:24:30 +01:00
spi spi: Fixes for v5.5 2020-01-06 12:34:44 -08:00
spmi
ssb
staging Staging fixes for 5.5-rc6 2020-01-10 13:22:11 -08:00
target SCSI fixes on 20191227 2019-12-27 17:28:41 -08:00
tc
tee Merge mainline/master into arm/fixes 2019-12-05 13:18:54 -08:00
thermal drivers: thermal: tsens: Work with old DTBs 2020-01-07 08:22:35 +01:00
thunderbolt thunderbolt: Power cycle the router if NVM authentication fails 2019-11-19 17:35:57 +01:00
tty serdev: Don't claim unsupported ACPI serial devices 2020-01-06 20:00:44 +01:00
uio uio: fix irq init with dt support & irq not defined 2019-11-14 11:49:48 +08:00
usb usb: missing parentheses in USE_NEW_SCHEME 2020-01-08 17:44:11 +01:00
vfio VFIO updates for v5.5-rc1 2019-12-07 14:51:04 -08:00
vhost Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-12-08 13:28:11 -08:00
video pci-v5.5-changes 2019-12-03 13:58:22 -08:00
virt compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
virtio virtio_balloon: divide/multiply instead of shifts 2019-12-11 08:14:07 -05:00
visorbus
vlynq
vme
w1 w1: new driver. DS2430 chip 2019-11-14 13:06:33 +08:00
watchdog watchdog: orion: fix platform_get_irq() complaints 2019-12-30 15:58:29 +01:00
xen xenbus/backend: Protect xenbus callback with lock 2020-01-29 07:35:49 -06:00
zorro
Kconfig
Makefile