linux/block
Tejun Heo aa1b46dcdc block: fix rq-qos breakage from skipping rq_qos_done_bio()
a647a524a4 ("block: don't call rq_qos_ops->done_bio if the bio isn't
tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
While this fixed a potential oops, it also broke blk-iocost by skipping the
done_bio callback for merged bios.

Before, whether a bio went through rq_qos_throttle() or rq_qos_merge(),
rq_qos_done_bio() would be called on the bio on completion, with BIO_TRACKED
distinguishing the former from the latter. After that commit,
rq_qos_done_bio() is no longer called for bios which went through
rq_qos_merge(). This royally confuses blk-iocost as the merged bios never
finish and are considered perpetually in-flight.

One reliably reproducible failure mode is an intermediate cgroup getting
stuck active, preventing its children from being activated due to the
leaf-only rule and leading to loss of control. The following is from the
resctl-bench protection scenario, which emulates isolating a web-server-like
workload from a memory bomb, run on an iocost configuration which should
yield a reasonable level of protection.

  # cat /sys/block/nvme2n1/device/model
  Samsung SSD 970 PRO 512GB
  # cat /sys/fs/cgroup/io.cost.model
  259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
  # cat /sys/fs/cgroup/io.cost.qos
  259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
  # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
  ...
  Memory Hog Summary
  ==================

  IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
              W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m

  Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
  isol%       15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
  lat-imp%        0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6

  Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%

The isolation result of 58.12% is close to what this device would show
without any IO control.

Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and
calling rq_qos_done_bio() on them too. For consistency and clarity, rename
BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
rq_qos_done_bio() so that they sit next to the code paths that set the
flags.
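
The accounting problem and the fix can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the kernel code:
the flag names BIO_QOS_THROTTLED and BIO_QOS_MERGED come from this commit
message, while the structs, the model_* functions, and the simple in-flight
counter standing in for a policy such as blk-iocost are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT kernel code. Only the flag names
 * BIO_QOS_THROTTLED and BIO_QOS_MERGED come from the commit message;
 * every struct and model_* function here is hypothetical.
 */
enum {
	BIO_QOS_THROTTLED = 1u << 0,	/* bio went through the throttle path */
	BIO_QOS_MERGED    = 1u << 1,	/* bio was merged into an existing request */
};

struct model_bio { unsigned int flags; };
struct model_qos { int inflight; };	/* stand-in for a policy's in-flight count */

/* Throttle path: mark the bio and start accounting for it. */
static void model_throttle(struct model_qos *qos, struct model_bio *bio)
{
	assert(qos != NULL && bio != NULL);
	bio->flags |= BIO_QOS_THROTTLED;
	qos->inflight++;
}

/* Merge path: the policy also accounts for the bio, so mark it as merged. */
static void model_merge(struct model_qos *qos, struct model_bio *bio)
{
	assert(qos != NULL && bio != NULL);
	bio->flags |= BIO_QOS_MERGED;
	qos->inflight++;
}

/*
 * Completion path with the flag check inside the helper. When honor_merged
 * is false, only throttled bios reach the policy, which models the broken
 * behavior; when true, merged bios are completed as well.
 */
static void model_done_bio(struct model_qos *qos, struct model_bio *bio,
			   bool honor_merged)
{
	unsigned int mask = BIO_QOS_THROTTLED;

	assert(qos != NULL && bio != NULL);
	if (honor_merged)
		mask |= BIO_QOS_MERGED;
	if (bio->flags & mask)
		qos->inflight--;
}

/* Issue one throttled and one merged bio, complete both, report leftovers. */
int model_leftover_inflight(bool honor_merged)
{
	struct model_qos qos = { .inflight = 0 };
	struct model_bio throttled = { .flags = 0 };
	struct model_bio merged = { .flags = 0 };

	model_throttle(&qos, &throttled);
	model_merge(&qos, &merged);
	model_done_bio(&qos, &throttled, honor_merged);
	model_done_bio(&qos, &merged, honor_merged);
	return qos.inflight;
}
```

In this model, model_leftover_inflight(false) leaves one bio counted as
in-flight forever, while model_leftover_inflight(true) returns to zero.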

With the patch applied, the same benchmark shows:

  # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
  ...
  Memory Hog Summary
  ==================

  IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
              W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m

  Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
  isol%       84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
  lat-imp%        0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68

  Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a647a524a4 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
Cc: stable@vger.kernel.org # v5.15+
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14 14:23:13 -06:00
partitions block: remove genhd.h 2022-02-02 07:49:59 -07:00
badblocks.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
bdev.c block: remove redundant semicolon 2022-02-27 14:52:36 -07:00
bfq-cgroup.c block, bfq: don't move oom_bfqq 2022-02-18 06:13:00 -07:00
bfq-iosched.c Revert "Revert "block, bfq: honor already-setup queue merges"" 2022-03-08 17:56:45 -07:00
bfq-iosched.h block, bfq: cleanup bfq_bfqq_to_bfqg() 2022-02-18 06:13:00 -07:00
bfq-wf2q.c block, bfq: cleanup bfq_bfqq_to_bfqg() 2022-02-18 06:13:00 -07:00
bio-integrity.c block: clone crypto and integrity data in __bio_clone_fast 2022-02-04 07:43:18 -07:00
bio.c block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-03-14 14:23:13 -06:00
blk-cgroup-rwstat.c blk-cgroup: Fix the recursive blkg rwstat 2021-03-05 11:32:15 -07:00
blk-cgroup-rwstat.h block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-cgroup.c blk-cgroup: set blkg iostat after percpu stat aggregation 2022-02-15 14:13:12 -07:00
blk-cgroup.h block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-core.c block: move q_usage_counter release into blk_queue_release 2022-03-08 19:40:01 -07:00
blk-crypto-fallback.c block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-crypto-internal.h blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-crypto-profile.c blk-crypto: remove blk_crypto_unregister() 2021-11-29 06:38:51 -07:00
blk-crypto-sysfs.c blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-crypto.c blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-flush.c block: pass a block_device and opf to bio_init 2022-02-02 07:49:59 -07:00
blk-ia-ranges.c block: fix memory leak in disk_register_independent_access_ranges 2022-01-23 09:13:09 -07:00
blk-integrity.c blk-crypto: remove blk_crypto_unregister() 2021-11-29 06:38:51 -07:00
blk-ioc.c block: drop needless assignment in set_task_ioprio() 2021-12-23 07:10:07 -07:00
blk-iocost.c block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-iolatency.c block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-03-14 14:23:13 -06:00
blk-ioprio.c block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-ioprio.h block: Introduce the ioprio rq-qos policy 2021-06-21 15:03:40 -06:00
blk-lib.c blk-lib: don't check bdev_get_queue() NULL check 2022-02-15 07:51:46 -07:00
blk-map.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
blk-merge.c block: ensure plug merging checks the correct queue at least once 2022-03-11 11:08:21 -07:00
blk-mq-cpumap.c blk-mq: remove the calling of local_memory_node() 2020-10-20 07:08:17 -06:00
blk-mq-debugfs-zoned.c block: Cleanup license notice 2019-01-17 21:21:40 -07:00
blk-mq-debugfs.c blk-mq: prepare for implementing hctx table via xarray 2022-03-08 17:57:19 -07:00
blk-mq-debugfs.h blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-mq-pci.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-rdma.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-sched.c blk-mq: prepare for implementing hctx table via xarray 2022-03-08 17:57:19 -07:00
blk-mq-sched.h block: move blk_mq_sched_assign_ioc to blk-ioc.c 2021-11-29 06:41:29 -07:00
blk-mq-sysfs.c blk-mq: prepare for implementing hctx table via xarray 2022-03-08 17:57:19 -07:00
blk-mq-tag.c blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-mq-tag.h blk-mq: Delete busy_iter_fn 2021-12-06 13:18:47 -07:00
blk-mq-virtio.c blk-mq: Fix typo in comment 2020-03-17 20:55:21 +01:00
blk-mq.c block: flush plug based on hardware and software queue order 2022-03-11 11:08:34 -07:00
blk-mq.h blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-pm.c scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume() 2021-12-22 23:38:29 -05:00
blk-pm.h block: Remove unused blk_pm_*() function definitions 2021-02-22 06:33:48 -07:00
blk-rq-qos.c rq-qos: fix missed wake-ups in rq_qos_throttle try two 2021-06-08 15:12:57 -06:00
blk-rq-qos.h block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-03-14 14:23:13 -06:00
blk-settings.c block: Fix partition check for host-aware zoned block devices 2021-10-27 06:58:01 -06:00
blk-stat.c block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-stat.h block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-sysfs.c block: move blk_exit_queue into disk_release 2022-03-08 19:40:01 -07:00
blk-throttle.c block: revert 4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO scenarios") 2022-02-16 19:42:28 -07:00
blk-throttle.h block: revert 4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO scenarios") 2022-02-16 19:42:28 -07:00
blk-timeout.c block: blk-timeout: delete duplicated word 2020-07-31 16:29:47 -06:00
blk-wbt.c blk-wbt: prevent NULL pointer dereference in wb_timer_fn 2021-10-19 06:13:41 -06:00
blk-wbt.h blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() 2021-06-21 15:03:41 -06:00
blk-zoned.c block: pass a block_device and opf to bio_init 2022-02-02 07:49:59 -07:00
blk.h blk-mq: do not include passthrough requests in I/O accounting 2022-03-08 19:39:52 -07:00
bounce.c block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
bsg-lib.c block: remove the gendisk argument to blk_execute_rq 2021-11-29 06:41:29 -07:00
bsg.c scsi: bsg: Fix device unregistration 2021-09-14 00:22:15 -04:00
disk-events.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
elevator.c block: do more work in elevator_exit 2022-03-08 19:40:01 -07:00
elevator.h block: move elevator.h to block/ 2021-10-18 06:17:01 -06:00
fops.c block: pass a block_device and opf to bio_init 2022-02-02 07:49:59 -07:00
genhd.c block: move rq_qos_exit() into disk_release() 2022-03-08 19:40:01 -07:00
holder.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
ioctl.c block: merge disk_scan_partitions and blkdev_reread_part 2021-11-29 06:35:21 -07:00
ioprio.c for-5.17/block-2022-01-11 2022-01-12 10:26:52 -08:00
Kconfig block: default BLOCK_LEGACY_AUTOLOAD to y 2022-02-27 14:49:23 -07:00
Kconfig.iosched block: only build the icq tracking code when needed 2021-12-16 10:59:02 -07:00
kyber-iosched.c block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
Makefile blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
mq-deadline.c block: fix async_depth sysfs interface for mq-deadline 2022-01-20 10:54:02 -07:00
opal_proto.h block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
sed-opal.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
t10-pi.c block: move integrity handling out of <linux/blkdev.h> 2021-10-18 06:17:02 -06:00