Commit Graph

6733 Commits

Author SHA1 Message Date
Chen Zhongjin
fa81cbafbf block: Fix possible memory leak for rq_wb on add_disk failure
kmemleak reported memory leaks in device_add_disk():

kmemleak: 3 new suspected memory leaks

unreferenced object 0xffff88800f420800 (size 512):
  comm "modprobe", pid 4275, jiffies 4295639067 (age 223.512s)
  hex dump (first 32 bytes):
    04 00 00 00 08 00 00 00 01 00 00 00 00 00 00 00  ................
    00 e1 f5 05 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000d3662699>] kmalloc_trace+0x26/0x60
    [<00000000edc7aadc>] wbt_init+0x50/0x6f0
    [<0000000069601d16>] wbt_enable_default+0x157/0x1c0
    [<0000000028fc393f>] blk_register_queue+0x2a4/0x420
    [<000000007345a042>] device_add_disk+0x6fd/0xe40
    [<0000000060e6aab0>] nbd_dev_add+0x828/0xbf0 [nbd]
    ...

It is because the memory allocated in wbt_enable_default() is not
released in device_add_disk() error path.
Normally, these memory are freed in:

del_gendisk()
  rq_qos_exit()
    rqos->ops->exit(rqos);
      wbt_exit()

So rq_qos_exit() is called to free the rq_wb memory for wbt_init().
However in the error path of device_add_disk(), only
blk_unregister_queue() is called and make rq_wb memory leaked.

Add rq_qos_exit() to the error path to fix it.

Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221029071355.35462-1-chenzhongjin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:29:53 -06:00
Jinlong Chen
219cf43c55 blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync
The only caller that needs queue_is_mq check is del_gendisk, so move the
check into it.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:27:54 -06:00
David Jeffery
82c229476b blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.

So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.

But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:24:22 -06:00
Linus Torvalds
c6e0e874a8 block-6.1-2022-10-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNcRTYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkc2D/4yxjQ+lAmXLrqZOGAZc+r2GpiCFsgFcpKP
 i3ezeeo3zmdaoUH778DcbPo0oeWY/iIvV2RDo3/0PBHIlGL43W9e7zsnauRxUwtw
 /Aj140Tsm5/lKnBy8n0nT+DO4LE22JnBHi5XjlFELBwM+deBxS+izinFtcOQC8sj
 58XWSmKag/Lv5JLvcYMj+PprtGOKzfNAacXvTjouy0IlEyb9E/yPMELS8lWFv8i2
 QELtvuEDxODpQtA+Ph0O/o00A8Fg/lC4EH5uvExFMr8k74CGFm32Bar1UuaJ9QAs
 5b8wateTra51yOGW3NEl1ph+4qVe9e4mutrLOFrChYylk5LePOVAki3wYb3lREiU
 rTOEKzUj3P/LHLpl4els0yIQ0gHXs60/M/Vn3TC50+2DnV00qEfvaocZ8vtXOux4
 YR+2cKUxk2CRNyj2BB3WRlrIkCIVk+ehl17E2cdrg0m8SMqk0GAYbpXD753L9uiy
 I7IQEqYB+op501pmTcVskFUfW9ozT96YD53fwSOTR/pEK+esHN0GfqxI6lcA6Q0O
 M2AWEiu8t1PbSONVH/p895gfgGHdRHl6zgvR+ADJMDEmc7dpEoAxsoTj4HIirXbe
 sGHi7ycrQR6aLdHahjCukjUVkZkuhXJkAQmq2XURJgmEcz7iJme23WqtWWUUoQvi
 pk6e1RSqSA==
 =Zfnu
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - make the multipath dma alignment match the non-multipath one
        (Keith Busch)
      - fix a bogus use of sg_init_marker() (Nam Cao)
      - fix circulr locking in nvme-tcp (Sagi Grimberg)

 - Initialization fix for requests allocated via the special hw queue
   allocator (John)

 - Fix for a regression added in this release with the batched
   completions of end_io backed requests (Ming)

 - Error handling leak fix for rbd (Yang)

 - Error handling leak fix for add_disk() failure (Yu)

* tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux:
  blk-mq: Properly init requests from blk_mq_alloc_request_hctx()
  blk-mq: don't add non-pt request with ->end_io to batch
  rbd: fix possible memory leak in rbd_sysfs_init()
  nvme-multipath: set queue dma alignment to 3
  nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
  nvme-tcp: replace sg_init_marker() with sg_init_table()
  block: fix memory leak for elevator on add_disk failure
2022-10-29 18:06:52 -07:00
John Garry
e3c5a78cdb blk-mq: Properly init requests from blk_mq_alloc_request_hctx()
Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
biotail, __sector, and __data_len members, which blk_mq_alloc_request()
has, so duplicate what we do in blk_mq_alloc_request().

Fixes: 1f5bd336b9 ("blk-mq: add blk_mq_alloc_request_hctx")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1666780513-121650-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-28 07:54:47 -06:00
Bart Van Assche
9546531884 block: Micro-optimize get_max_segment_size()
This patch removes a conditional jump from get_max_segment_size(). The
x86-64 assembler code for this function without this patch is as follows:

206             return min_not_zero(mask - offset + 1,
   0x0000000000000118 <+72>:    not    %rax
   0x000000000000011b <+75>:    and    0x8(%r10),%rax
   0x000000000000011f <+79>:    add    $0x1,%rax
   0x0000000000000123 <+83>:    je     0x138 <bvec_split_segs+104>
   0x0000000000000125 <+85>:    cmp    %rdx,%rax
   0x0000000000000128 <+88>:    mov    %rdx,%r12
   0x000000000000012b <+91>:    cmovbe %rax,%r12
   0x000000000000012f <+95>:    test   %rdx,%rdx
   0x0000000000000132 <+98>:    mov    %eax,%edx
   0x0000000000000134 <+100>:   cmovne %r12d,%edx

With this patch applied:

206             return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
   0x000000000000003f <+63>:    mov    0x28(%rdi),%ebp
   0x0000000000000042 <+66>:    not    %rax
   0x0000000000000045 <+69>:    and    0x8(%rdi),%rax
   0x0000000000000049 <+73>:    sub    $0x1,%rbp
   0x000000000000004d <+77>:    cmp    %rbp,%rax
   0x0000000000000050 <+80>:    cmova  %rbp,%rax
   0x0000000000000054 <+84>:    add    $0x1,%eax

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-4-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 13:41:20 -06:00
Bart Van Assche
aa261f2058 block: Constify most queue limits pointers
Document which functions do not modify the queue limits.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-3-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 13:41:17 -06:00
Christoph Hellwig
a55b70f127 block: remove bio_start_io_acct_time
bio_start_io_acct_time is not actually used anywhere, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221025155916.270303-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 12:08:31 -06:00
Christoph Hellwig
2b3f056f72 blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue
The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference.  Move the call to
blk_put_queue into the callers to allow removing the extra references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:10 -06:00
Jinlong Chen
8ed40ee35d block: fix up elevator_type refcounting
The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).

As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.

The same problem also exists in io schedulers' init_sched implementations.

We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.

Instead, We make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.

The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git

[hch: split out a few bits into separate patches, use a non-try
      module_get in elevator_alloc]

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Jinlong Chen
b54c2ad9b7 block: check for an unchanged elevator earlier in __elevator_change
No need to find the actual elevator_type struct for this comparism,
the name is all that is needed.

Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Christoph Hellwig
58367c8a5f block: sanitize the elevator name before passing it to __elevator_change
The stripped name should also be used for the none check.  To do so
strip it in the caller and pass in the sanitized name.  Drop the pointless
__ prefix in the function name while we're at it.

Based on a patch from Jinlong Chen <nickyc975@zju.edu.cn>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Christoph Hellwig
dd6f7f17bf block: add proper helpers for elevator_type module refcount management
Make sure we have helpers for all relevant module refcount operations on
the elevator_type in elevator.h, and use them.  Move the call to the get
helper in blk_mq_elv_switch_none a bit so that it is obvious with a less
verbose comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
671fae5e51 blk-wbt: don't enable throttling if default elevator is bfq
Commit b5dc5d4d1f ("block,bfq: Disable writeback throttling") tries to
disable wbt for bfq, it's done by calling wbt_disable_default() in
bfq_init_queue(). However, wbt is still enabled if default elevator is
bfq:

device_add_disk
 elevator_init_mq
  bfq_init_queue
   wbt_disable_default -> done nothing

 blk_register_queue
  wbt_enable_default -> wbt is enabled

Fix the problem by adding a new flag ELEVATOR_FLAG_DISBALE_WBT, bfq
will set the flag in bfq_init_queue, and following wbt_enable_default()
won't enable wbt while the flag is set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
181d066374 elevator: add new field flags in struct elevator_queue
There are only one flag to indicate that elevator is registered currently,
prepare to add a flag to disable wbt if default elevator is bfq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
3642ef4d95 blk-wbt: don't show valid wbt_lat_usec in sysfs while wbt is disabled
Currently, if wbt is initialized and then disabled by
wbt_disable_default(), sysfs will still show valid wbt_lat_usec, which
will confuse users that wbt is still enabled.

This patch shows wbt_lat_usec as zero if it's disabled.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reported-and-tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
a9a236d238 blk-wbt: make enable_state more accurate
Currently, if user disable wbt through sysfs, 'enable_state' will be
'WBT_STATE_ON_MANUAL', which will be confusing. Add a new state
'WBT_STATE_OFF_MANUAL' to cover that case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
b11d31ae01 blk-wbt: remove unnecessary check in wbt_enable_default()
If CONFIG_BLK_WBT_MQ is disabled, wbt_init() won't do anything.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
6d9f4cf125 elevator: remove redundant code in elv_unregister_queue()
"elevator_queue *e" is already declared and initialized in the beginning
of elv_unregister_queue().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221019121518.3865235-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
074501bce3 blk-iocost: read 'ioc->params' inside 'ioc->lock' in ioc_timer_fn()
'ioc->params' is updated in ioc_refresh_params(), which is proteced by
'ioc->lock', however, ioc_timer_fn() read params outside the lock.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:17 -06:00
Yu Kuai
2b2da2f6dc blk-iocost: prevent configuration update concurrent with io throttling
This won't cause any severe problem currently, however, this doesn't
seems appropriate:

1) 'ioc->params' is read from multiple places without holding
'ioc->lock', unexpected value might be read if writing it concurrently.

2) If configuration is changed while io is throttling, the functionality
might be affected. For example, if module params is updated and cost
becomes smaller, waiting for timer that is caculated under old
configuration is not appropriate.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:16 -06:00
Yu Kuai
2c06479884 blk-iocost: don't release 'ioc->lock' while updating params
ioc_qos_write() and ioc_cost_model_write() are the same:

1) hold lock to read 'ioc->params' to local variable;
2) update params to local variable without lock;
3) hold lock to write local variable to 'ioc->params';

In theroy, if user updates params concurrenty, the params might be lost:

t1: update params a		t2: update params b
spin_lock_irq(&ioc->lock);
memcpy(qos, ioc->params.qos, sizeof(qos))
spin_unlock_irq(&ioc->lock);

qos[a] = xxx;

				spin_lock_irq(&ioc->lock);
				memcpy(qos, ioc->params.qos, sizeof(qos))
				spin_unlock_irq(&ioc->lock);

				qos[b] = xxx;

spin_lock_irq(&ioc->lock);
memcpy(ioc->params.qos, qos, sizeof(qos));
ioc_refresh_params(ioc, true);
spin_unlock_irq(&ioc->lock);

				spin_lock_irq(&ioc->lock);
				// updates of a will be lost
				memcpy(ioc->params.qos, qos, sizeof(qos));
				ioc_refresh_params(ioc, true);
				spin_unlock_irq(&ioc->lock);

Althrough this is not common case, the problem can by fixed easily by
holding the lock through the read, update, write process.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:16 -06:00
Yu Kuai
8796acbc9a blk-iocost: disable writeback throttling
Commit b5dc5d4d1f ("block,bfq: Disable writeback throttling") disable
wbt for bfq, because different write-throttling heuristics should not
work together.

For the same reason, wbt and iocost should not work together as well,
unless admin really want to do that, dispite that performance is
affected.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221012094035.390056-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23 18:59:16 -06:00
Yu Kuai
02341a08c9 block: fix memory leak for elevator on add_disk failure
The default elevator is allocated in the beginning of device_add_disk(),
however, it's not freed in the following error path.

Fixes: 50e34d7881 ("block: disable the elevator int del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20221022021615.2756171-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-22 15:14:38 -06:00
Linus Torvalds
d4b7332eef block-6.1-2022-10-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNSA3oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoTfEACwFAyQsAGMVVsTmU/XggFsNRczTKH+J7dO
 4tBhOhfrVEWyDbB++3kUepN0GenSiu3dvcAzWQbX0lPiLjkqFWvpj2GIrqHwXJCt
 Li8SPoPsQARBvJYUERmdlQDUeEKCK/IwOREQSH4LZaNVvU2l0MSAqjihUjmcXHBE
 OeCWwYLJ27Bo0j5py9xSo7Z1C84YtXHDKAbx1nVhX6ByyCvpSGLLunrwT7oeEjT7
 FAbVvL4p5xwZe5e3XiiC5J7g85BkIZGwhqViYrVyHPWrUVWycSvXQdqXGGHiSqlj
 g3jcT6nLNBscAPuWnnP4jobaFVjm1Lr3eXtVX1k66kRwrxVcB+tZGszvjh4cuneO
 N23OjrRjtGIXenSyOslCxMAIxgMjsxeuuoQb66Js9g8bA9zJeKPcpKWbXplkaXoH
 VsT23W24Xp72MkBAz9nP9sHXbQEAkfNGK+6qZLBpkb8bJsymc+4Vn6lpV6ybl8WF
 LJWouLss2/I+G/D/WLvVcygvefQxVSLDFjWcU8BcJjDkdXgQWUjZfcZEsRE2v0cr
 tLZOh1WhHgVwTBCZxEus+QoAE1tXn2C+xhMu0uqRQ7zlngy5AycM6eSJtQ5Lwm2H
 91MSYoUvGkOx5Hqp265AnZEBayOg+xnsp7qMZghoQqLUc5yU4zyUbvxPi0vNuduJ
 mbys/sxVsg==
 =YfkM
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - fix nvme-hwmon for DMA non-cohehrent architectures (Serge Semin)
      - add a nvme-hwmong maintainer (Christoph Hellwig)
      - fix error pointer dereference in error handling (Dan Carpenter)
      - fix invalid memory reference in nvmet_subsys_attr_qid_max_show
        (Daniel Wagner)
      - don't limit the DMA segment size in nvme-apple (Russell King)
      - fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg)
      - disable write zeroes on various Kingston SSDs (Xander Li)

 - fix a memory leak with block device tracing (Ye)

 - flexible-array fix for ublk (Yushan)

 - document the ublk recovery feature from this merge window
   (ZiyangZhang)

 - remove dead bfq variable in struct (Yuwei)

 - error handling rq clearing fix (Yu)

 - add an IRQ safety check for the cached bio freeing (Pavel)

 - drbd bio cloning fix (Christoph)

* tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux:
  blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'
  blktrace: fix possible memleak in '__blk_trace_remove'
  blktrace: introduce 'blk_trace_{start,stop}' helper
  bio: safeguard REQ_ALLOC_CACHE bio put
  block, bfq: remove unused variable for bfq_queue
  drbd: only clone bio if we have a backing device
  ublk_drv: use flexible-array member instead of zero-length array
  nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
  nvmet: fix workqueue MEM_RECLAIM flushing dependency
  nvme-hwmon: kmalloc the NVME SMART log buffer
  nvme-hwmon: consistently ignore errors from nvme_hwmon_init
  nvme: add Guenther as nvme-hwmon maintainer
  nvme-apple: don't limit DMA segement size
  nvme-pci: disable write zeroes on various Kingston SSD
  nvme: fix error pointer dereference in error handling
  Documentation: document ublk user recovery feature
  blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
2022-10-21 15:14:14 -07:00
Pavel Begunkov
d4347d5040 bio: safeguard REQ_ALLOC_CACHE bio put
bio_put() with REQ_ALLOC_CACHE assumes that it's executed not from
an irq context. Let's add a warning if the invariant is not respected,
especially since there is a couple of places removing REQ_POLLED by hand
without also clearing REQ_ALLOC_CACHE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/558d78313476c4e9c233902efa0092644c3d420a.1666122465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 05:50:29 -07:00
Yuwei Guan
33566f92cd block, bfq: remove unused variable for bfq_queue
it defined in d0edc2473b, but there's nowhere to use it,
so remove it.

Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20221018030139.159-1-Yuwei.Guan@zeekrlife.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 05:46:49 -07:00
Yu Kuai
76dd298094 blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
Our syzkaller report a null pointer dereference, root cause is
following:

__blk_mq_alloc_map_and_rqs
 set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
  blk_mq_alloc_map_and_rqs
   blk_mq_alloc_rqs
    // failed due to oom
    alloc_pages_node
    // set->tags[hctx_idx] is still NULL
    blk_mq_free_rqs
     drv_tags = set->tags[hctx_idx];
     // null pointer dereference is triggered
     blk_mq_clear_rq_mapping(drv_tags, ...)

This is because commit 63064be150 ("blk-mq:
Add blk_mq_alloc_map_and_rqs()") merged the two steps:

1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

into one step:

set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

Since tags is not initialized yet in this case, fix the problem by
checking if tags is NULL pointer in blk_mq_clear_rq_mapping().

Fixes: 63064be150 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:22:51 -06:00
Linus Torvalds
f1947d7c8a Random number generator fixes for Linux 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe
 A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et
 qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM
 R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM
 lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg
 MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M
 8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS
 XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr
 Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv
 vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR
 4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE
 lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc=
 =M+mV
 -----END PGP SIGNATURE-----

Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull more random number generator updates from Jason Donenfeld:
 "This time with some large scale treewide cleanups.

  The intent of this pull is to clean up the way callers fetch random
  integers. The current rules for doing this right are:

   - If you want a secure or an insecure random u64, use get_random_u64()

   - If you want a secure or an insecure random u32, use get_random_u32()

     The old function prandom_u32() has been deprecated for a while
     now and is just a wrapper around get_random_u32(). Same for
     get_random_int().

   - If you want a secure or an insecure random u16, use get_random_u16()

   - If you want a secure or an insecure random u8, use get_random_u8()

   - If you want secure or insecure random bytes, use get_random_bytes().

     The old function prandom_bytes() has been deprecated for a while
     now and has long been a wrapper around get_random_bytes()

   - If you want a non-uniform random u32, u16, or u8 bounded by a
     certain open interval maximum, use prandom_u32_max()

     I say "non-uniform", because it doesn't do any rejection sampling
     or divisions. Hence, it stays within the prandom_*() namespace, not
     the get_random_*() namespace.

     I'm currently investigating a "uniform" function for 6.2. We'll see
     what comes of that.

  By applying these rules uniformly, we get several benefits:

   - By using prandom_u32_max() with an upper-bound that the compiler
     can prove at compile-time is ≤65536 or ≤256, internally
     get_random_u16() or get_random_u8() is used, which wastes fewer
     batched random bytes, and hence has higher throughput.

   - By using prandom_u32_max() instead of %, when the upper-bound is
     not a constant, division is still avoided, because
     prandom_u32_max() uses a faster multiplication-based trick instead.

   - By using get_random_u16() or get_random_u8() in cases where the
     return value is intended to indeed be a u16 or a u8, we waste fewer
     batched random bytes, and hence have higher throughput.

  This series was originally done by hand while I was on an airplane
  without Internet. Later, Kees and I worked on retroactively figuring
  out what could be done with Coccinelle and what had to be done
  manually, and then we split things up based on that.

  So while this touches a lot of files, the actual amount of code that's
  hand fiddled is comfortably small"

* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: remove unused functions
  treewide: use get_random_bytes() when possible
  treewide: use get_random_u32() when possible
  treewide: use get_random_{u8,u16}() when possible, part 2
  treewide: use get_random_{u8,u16}() when possible, part 1
  treewide: use prandom_u32_max() when possible, part 2
  treewide: use prandom_u32_max() when possible, part 1
2022-10-16 15:27:07 -07:00
Linus Torvalds
a521fc3cfb block-6.1-2022-10-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNIXQAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnplD/wP1Kcbn0wc6UktiOaGHy6/DjRmpi2a857g
 ioxTioycD7/ZchIGsfXLca9xBuOufLK7//irG1CpxJT7QXXQTprHspXOZLpw2nUb
 t77hGTgCHkI5QOPuVnZ4+agjvxefUQ0/TOEPLYlb1fh+Q81ZAR7NecV9XwXnGy4V
 75gS1t3cP4WcmeFi6MIkvg8BZhusS2qnPvmviig6eDMDvlJ5Ct/x3zJGnxPH/jt/
 fJtCvZUIUVSfuUirCh/Xv3gmXptPmwrQPKGW66dCZeZaZpUAcFkmvQu/4739q+x2
 5XGhr02FlHxm6zX165rF4jvXNyn1K9nPj/gUQcPLKK+JesA56OPGsUCjkNE7azbm
 ZWf710Tyzx1B2EAIk4keRNIe5YBcUQenwIbGd2czOf+oAuSmNX+tRLozIsVdwD7Z
 NPvLFYfH+evPqMzIhESjt4q476eCl1Zu8pGfBvYmPJ9NNnjcG3wjFZLmmsfA4KLG
 ffkJqBGLWc9LD4sUjsVhYxV79jcQDr7Wh+fkiQ1u9ML6qjNsp9dQijnvIb7dC6Vj
 UKvMbB45oDPffI71dHx08xX+G7Kf5cLWm/Wf0DLRjJaL5uxUA/OGITMdjmvFzSLZ
 FNdlAkwfIomLO3mFK1LrSUYMpwZ84ZBH7iZfO6omOk0MrAhMA3M1U1OnXUS+GgxK
 R5A4tOD4Ow==
 =kxra
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:
 "Fixes that ended up landing later than the initial block pull request.
  Nothing really major in here:

   - NVMe pull request via Christoph:
        - add NVME_QUIRK_BOGUS_NID for Lexar NM760 (Abhijit)
        - add NVME_QUIRK_NO_DEEPEST_PS to avoid the deepest sleep state
          on ZHITAI TiPro5000 SSDs (Xi Ruoyao)
        - fix possible hang caused during ctrl deletion (Sagi Grimberg)
        - fix possible hang in live ns resize with ANA access (Sagi
          Grimberg)

   - Proactively avoid a sign extension issue with the queue flags
     (Brian)

   - Regression fix for hidden disks (Christoph)

   - Update OPAL maintainers entry (Jonathan)

   - blk-wbt regression initialization fix (Yu)"

* tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux:
  nvme-multipath: fix possible hang in live ns resize with ANA access
  nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760
  nvme-tcp: fix possible hang caused during ctrl deletion
  nvme-rdma: fix possible hang caused during ctrl deletion
  block: fix leaking minors of hidden disks
  block: avoid sign extend problem with default queue flags mask
  blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
  block: Remove the repeat word 'can'
  MAINTAINERS: Update SED-Opal Maintainers
2022-10-13 21:25:57 -07:00
Jason A. Donenfeld
197173db99 treewide: use get_random_bytes() when possible
The prandom_bytes() function has been a deprecated inline wrapper around
get_random_bytes() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. This was done as a basic find and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:58 -06:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Linus Torvalds
adf4bfc4a9 cgroup changes for v6.1-rc1.
* cpuset now support isolated cpus.partition type, which will enable dynamic
   CPU isolation.
 * pids.peak added to remember the max number of pids used.
 * Holes in cgroup namespace plugged.
 * Internal cleanups.
 
 Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
 follow-up cleanups and each merge commit has details.
 
 Also, 8a693f7766 ("cgroup: Remove CFTYPE_PRESSURE") removes the flag used
 by PSI changes in the tip tree and the merged result won't compile due to
 the missing flag. Simply removing the struct init lines specifying the flag
 is the correct resolution. linux-next already contains the correct fix:
 
  https://lkml.kernel.org/r/20220912161812.072aaa3b@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYzsl7w4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGYsxAP4kad4YPw+CueLyyEMiYgBHouqDt8cG0+FJWK3X
 svTC7wD/eCLfxZM8TjjSrMmvaMrml586mr3NoQaFeW0x3twptQQ=
 =LERu
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset now support isolated cpus.partition type, which will enable
   dynamic CPU isolation

 - pids.peak added to remember the max number of pids used

 - holes in cgroup namespace plugged

 - internal cleanups

* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
  cgroup: use strscpy() is more robust and safer
  iocost_monitor: reorder BlkgIterator
  cgroup: simplify code in cgroup_apply_control
  cgroup: Make cgroup_get_from_id() prettier
  cgroup/cpuset: remove unreachable code
  cgroup: Remove CFTYPE_PRESSURE
  cgroup: Improve cftype add/rm error handling
  kselftest/cgroup: Add cpuset v2 partition root state test
  cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
  cgroup/cpuset: Relocate a code block in validate_change()
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Relax constraints to partition & cpus changes
  cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  cgroup/cpuset: Miscellaneous cleanups & add helper functions
  cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
  cgroup: add pids.peak interface for pids controller
  cgroup: Remove data-race around cgrp_dfl_visible
  cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
  ...
2022-10-10 11:12:25 -07:00
Jens Axboe
24a403340d Merge branch 'for-6.1/block' into block-6.1
Merge in later fixes.

* for-6.1/block:
  block: fix leaking minors of hidden disks
  block: avoid sign extend problem with default queue flags mask
  blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
  block: Remove the repeat word 'can'
  MAINTAINERS: Update SED-Opal Maintainers
2022-10-10 11:26:40 -06:00
Christoph Hellwig
a0a6314ae7 block: fix leaking minors of hidden disks
The major/minor of a hidden gendisk is not propagated to the block
device because it is never registered using bdev_add.  But the lack of
bd_dev also causes the dynamic major minor number not to be freed.
Assign bd_dev manually to ensure the dynamic major minor gets freed.

Based on a patch by Keith Busch.

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221010131857.748129-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-10 08:48:59 -06:00
Yu Kuai
285febabac blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
commit 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is
initialized") moves wbt_set_write_cache() before rq_qos_add(), which
is wrong because wbt_rq_qos() is still NULL.

Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc'
directly. Noted that this patch also remove the redundant setting of
'rab->wc'.

Fixes: 8c5035dfbb ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-09 07:48:16 -06:00
Linus Torvalds
7c989b1da3 for-6.1/passthrough-2022-10-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98
 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe
 STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H
 WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM
 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM
 bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A
 PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU
 lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY
 R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ
 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc
 V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg
 3ag39nZQVQ==
 =nwLk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux

Pull passthrough updates from Jens Axboe:
 "With these changes, passthrough NVMe support over io_uring now
  performs at the same level as block device O_DIRECT, and in many cases
  6-8% better.

  This contains:

   - Add support for fixed buffers for passthrough (Anuj, Kanchan)

   - Enable batched allocations and freeing on passthrough, similarly to
     what we support on the normal storage path (me)

   - Fix from Geert fixing an issue with !CONFIG_IO_URING"

* tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux:
  io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy
  nvme: wire up fixed buffer support for nvme passthrough
  nvme: pass ubuffer as an integer
  block: extend functionality to map bvec iterator
  block: factor out blk_rq_map_bio_alloc helper
  block: rename bio_map_put to blk_mq_map_bio_put
  nvme: refactor nvme_alloc_request
  nvme: refactor nvme_add_user_metadata
  nvme: Use blk_rq_map_user_io helper
  scsi: Use blk_rq_map_user_io helper
  block: add blk_rq_map_user_io
  io_uring: introduce fixed buffer support for io_uring_cmd
  io_uring: add io_uring_cmd_import_fixed
  nvme: enable batched completions of passthrough IO
  nvme: split out metadata vs non metadata end_io uring_cmd completions
  block: allow end_io based requests in the completion batch handling
  block: change request end_io handler to pass back a return value
  block: enable batched allocation for blk_mq_alloc_request()
  block: kill deprecated BUG_ON() in the flush handling
2022-10-07 09:35:50 -07:00
Linus Torvalds
513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Linus Torvalds
0a78a376ef for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
 +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
 A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
 hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
 PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
 fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
 OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
 oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
 W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
 9bWG6BM7Yw==
 =m9kX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Add supported for more directly managed task_work running.

   This is beneficial for real world applications that end up issuing
   lots of system calls as part of handling work. Normal task_work will
   always execute as we transition in and out of the kernel, even for
   "unrelated" system calls. It's more efficient to defer the handling
   of io_uring's deferred work until the application wants it to be run,
   generally in batches.

   As part of ongoing work to write an io_uring network backend for
   Thrift, this has been shown to greatly improve performance. (Dylan)

 - Add IOPOLL support for passthrough (Kanchan)

 - Improvements and fixes to the send zero-copy support (Pavel)

 - Partial IO handling fixes (Pavel)

 - CQE ordering fixes around CQ ring overflow (Pavel)

 - Support sendto() for non-zc as well (Pavel)

 - Support sendmsg for zerocopy (Pavel)

 - Networking iov_iter fix (Stefan)

 - Misc fixes and cleanups (Pavel, me)

* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-10-07 08:52:43 -07:00
Deming Wang
340e134727 block: Remove the repeat word 'can'
Remove the repeat word 'can' from the comments of bio_kmalloc.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Link: https://lore.kernel.org/r/20221006084450.1513-1-wangdeming@inspur.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-06 07:23:13 -06:00
Linus Torvalds
725737e7c2 STATX_DIOALIGN for 6.1
Make statx() support reporting direct I/O (DIO) alignment information.
 This provides a generic interface for userspace programs to determine
 whether a file supports DIO, and if so with what alignment restrictions.
 Specifically, STATX_DIOALIGN works on block devices, and on regular
 files when their containing filesystem has implemented support.
 
 An interface like this has been requested for years, since the
 conditions for when DIO is supported in Linux have gotten increasingly
 complex over time.  Today, DIO support and alignment requirements can be
 affected by various filesystem features such as multi-device support,
 data journalling, inline data, encryption, verity, compression,
 checkpoint disabling, log-structured mode, etc.  Further complicating
 things, Linux v6.0 relaxed the traditional rule of DIO needing to be
 aligned to the block device's logical block size; now user buffers (but
 not file offsets) only need to be aligned to the DMA alignment.
 
 The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
 discarded in favor of creating a clean new interface with statx().
 
 For more information, see the individual commits and the man page update
 https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpV2xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKwF1AQDetPX5hyuq0/mwikOywLTTJsoHgGY5
 euO+dISqjH/InwD9HAQqfPRkdM1j4ml82BjjkAfrhzZXOOWPKJm0zOhMIQg=
 =0Oav
 -----END PGP SIGNATURE-----

Merge tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull STATX_DIOALIGN support from Eric Biggers:
 "Make statx() support reporting direct I/O (DIO) alignment information.

  This provides a generic interface for userspace programs to determine
  whether a file supports DIO, and if so with what alignment
  restrictions. Specifically, STATX_DIOALIGN works on block devices, and
  on regular files when their containing filesystem has implemented
  support.

  An interface like this has been requested for years, since the
  conditions for when DIO is supported in Linux have gotten increasingly
  complex over time. Today, DIO support and alignment requirements can
  be affected by various filesystem features such as multi-device
  support, data journalling, inline data, encryption, verity,
  compression, checkpoint disabling, log-structured mode, etc.

  Further complicating things, Linux v6.0 relaxed the traditional rule
  of DIO needing to be aligned to the block device's logical block size;
  now user buffers (but not file offsets) only need to be aligned to the
  DMA alignment.

  The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
  discarded in favor of creating a clean new interface with statx().

  For more information, see the individual commits and the man page
  update[1]"

Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1]

* tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  xfs: support STATX_DIOALIGN
  f2fs: support STATX_DIOALIGN
  f2fs: simplify f2fs_force_buffered_io()
  f2fs: move f2fs_force_buffered_io() into file.c
  ext4: support STATX_DIOALIGN
  fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
  vfs: support STATX_DIOALIGN on block devices
  statx: add direct I/O alignment information
2022-10-03 20:33:41 -07:00
Alexander Potapenko
11b331f857 block: kmsan: skip bio block merging logic for KMSAN
KMSAN doesn't allow treating adjacent memory pages as such, if they were
allocated by different alloc_pages() calls.  The block layer however does
so: adjacent pages end up being used together.  To prevent this, make
page_is_mergeable() return false under KMSAN.

Link: https://lkml.kernel.org/r/20220915150417.722975-29-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:23 -07:00
Alexander Potapenko
f630a5d0ca kmsan: disable physical page merging in biovec
KMSAN metadata for adjacent physical pages may not be adjacent, therefore
accessing such pages together may lead to metadata corruption.  We disable
merging pages in biovec to prevent such corruptions.

Link: https://lkml.kernel.org/r/20220915150417.722975-28-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:23 -07:00
Kanchan Joshi
3798754793 block: extend functionality to map bvec iterator
Extend blk_rq_map_user_iov so that it can handle bvec iterator, using
the new blk_rq_map_user_bvec function. It maps the pages from bvec
iterator into a bio and place the bio into request.

This helper will be used by nvme for uring-passthrough path when IO is
done using pre-mapped buffers.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-11-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi
ab89e8e7ca block: factor out blk_rq_map_bio_alloc helper
Move bio allocation logic from bio_map_user_iov to a new helper
blk_rq_map_bio_alloc. It is named so because functionality is opposite
of what is done inside blk_mq_map_bio_put. This is a prep patch.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-10-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Anuj Gupta
32f1c71b15 block: rename bio_map_put to blk_mq_map_bio_put
This patch renames existing bio_map_put function to blk_mq_map_bio_put.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-9-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Anuj Gupta
557654025d block: add blk_rq_map_user_io
Create a helper blk_rq_map_user_io for mapping of vectored as well as
non-vectored requests. This will help in saving dupilcation of code at few
places in scsi and nvme.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-4-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Jens Axboe
ab3e1d3bba block: allow end_io based requests in the completion batch handling
With end_io handlers now being able to potentially pass ownership of
the request upon completion, we can allow requests with end_io handlers
in the batch completion handling.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe
de671d6116 block: change request end_io handler to pass back a return value
Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.

In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:09 -06:00
Jens Axboe
4b6a5d9cea block: enable batched allocation for blk_mq_alloc_request()
The filesystem IO path can take advantage of allocating batches of
requests, if the underlying submitter tells the block layer about it
through the blk_plug. For passthrough IO, the exported API is the
blk_mq_alloc_request() helper, and that one does not allow for
request caching.

Wire up request caching for blk_mq_alloc_request(), which is generally
done without having a bio available upfront.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:48:00 -06:00
Jens Axboe
e73a625bc2 block: kill deprecated BUG_ON() in the flush handling
We've never had any useful reports from this BUG_ON(), and in fact a
number of the BUG_ON()'s in the flush handling need to be turned into
more graceful handling.

In preparation for allowing batched completions of the end_io handling,
where we can enter the flush completion with queuelist having been reused
for the batch, get rid of this BUG_ON().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:48:00 -06:00
Jens Axboe
5853a7b551 Merge branch 'for-6.1/io_uring' into for-6.1/passthrough
* for-6.1/io_uring: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-09-30 07:47:42 -06:00
Jens Axboe
736feaa3a0 Merge branch 'for-6.1/block' into for-6.1/passthrough
* for-6.1/block: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:47:38 -06:00
Pankaj Raghav
110fdb4486 block: add rationale for not using blk_mq_plug() when applicable
There are two places in the block layer at the moment where
blk_mq_plug() helper could be used instead of directly accessing the
plug from struct current. In both these cases, directly accessing the plug
should not have any consequences for zoned devices.

Make the intent explicit by adding comments instead of introducing unwanted
checks with blk_mq_plug() helper.[1]

[1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
[axboe: fixup multi-line comment style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 09:02:49 -06:00
Pankaj Raghav
8cafdb5ab9 block: adapt blk_mq_plug() to not plug for writes that require a zone lock
The current implementation of blk_mq_plug() disables plugging for all
operations that involves a transfer to the device as we just check if
the last bit in op_is_write() function.

Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and
REQ_OP_WRITE_ZEROS as they might require a zone lock.

Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 07:45:47 -06:00
Christoph Hellwig
5765033cf7 blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
blkg_conf_prep just creates a new blkg structure, there is no real
need to update the lookup hint which should only be done on a
successful lookup in the I/O path.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220927065425.257876-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 11:50:05 -06:00
Keith Busch
8237c01f16 blk-mq: use quiesced elevator switch when reinitializing queues
The hctx's run_work may be racing with the elevator switch when
reinitializing hardware queues. The queue is merely frozen in this
context, but that only prevents requests from allocating and doesn't
stop the hctx work from running. The work may get an elevator pointer
that's being torn down, and can result in use-after-free errors and
kernel panics (example below). Use the quiesced elevator switch instead,
and make the previous one static since it is now only used locally.

  nvme nvme0: resetting controller
  nvme nvme0: 32/0/0 default/read/poll queues
  BUG: kernel NULL pointer dereference, address: 0000000000000008
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
  Oops: 0000 [#1] SMP PTI
  Workqueue: kblockd blk_mq_run_work_fn
  RIP: 0010:kyber_has_work+0x29/0x70

...

  Call Trace:
   __blk_mq_do_dispatch_sched+0x83/0x2b0
   __blk_mq_sched_dispatch_requests+0x12e/0x170
   blk_mq_sched_dispatch_requests+0x30/0x60
   __blk_mq_run_hw_queue+0x2b/0x50
   process_one_work+0x1ef/0x380
   worker_thread+0x2d/0x3e0

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:58:56 -06:00
Christoph Hellwig
568ec936bf block: replace blk_queue_nowait with bdev_nowait
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:57:58 -06:00
Christoph Hellwig
99e6038743 blk-cgroup: pass a gendisk to the blkg allocation helpers
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
de185b56e8 blk-cgroup: pass a gendisk to blkcg_schedule_throttle
Pass the gendisk to blkcg_schedule_throttle as part of moving the
blk-cgroup infrastructure to be gendisk based.  Remove the unused
!BLK_CGROUP stub while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
00ad6991bb blk-cgroup: pass a gendisk to blkg_destroy_all
Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
cad9266abc blk-throttle: pass a gendisk to blk_throtl_cancel_bios
Pass the gendisk to blk_throtl_cancel_bios as part of moving the
blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:28 -06:00
Christoph Hellwig
5f6dc7522a blk-throttle: pass a gendisk to blk_throtl_register_queue
Pass the gendisk to blk_throtl_register_queue as part of moving the
blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
e13793bae6 blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exit
Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
3657647e33 blk-iocost: cleanup ioc_qos_write
Use a local disk variable instead of retrieving the disk and
request_queue over and over by various means.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
57b6455497 blk-iocost: pass a gendisk to blk_iocost_init
Pass the gendisk to blk_iocost_init as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
9df3e65139 blk-iocost: simplify ioc_name
Just directly dereference the disk name instead of going through multiple
hoops to find the same value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:27 -06:00
Christoph Hellwig
16fac1b591 blk-iolatency: pass a gendisk to blk_iolatency_init
Pass the gendisk to blk_iolatency_init as part of moving the blk-cgroup
infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-9-hch@lst.de
[axboe: missed inline for blk_iolatency_init() and !CONFIG_BLK_CGROUP_IOLATENCY]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:17:24 -06:00
Christoph Hellwig
b0dde3f5d6 blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit
Pass the gendisk to blk_ioprio_init and blk_ioprio_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
9823538fb7 blk-cgroup: pass a gendisk to blkcg_init_queue and blkcg_exit_queue
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based.  Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
f753526e32 blk-cgroup: remove blkg_lookup_check
The combinations of an error check with an ERR_PTR return and a lookup
with a NULL return leads to ugly handling of the return values in the
callers.  Just open coding the check and the lookup is much simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
4a69f325aa blk-cgroup: cleanup the blkg_lookup family of functions
Add a fully inlined blkg_lookup as the extra two checks aren't going
to generated a lot more code vs the call to the slowpath routine, and
open code the hint update in the two callers that care.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
79fcc5be93 blk-cgroup: remove open coded blkg_lookup instances
Use blkg_lookup instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
928f6f00a9 blk-cgroup: remove blk_queue_root_blkg
Just open code it in the only caller and drop the unused !BLK_CGROUP
stub.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Christoph Hellwig
33dc62796c blk-cgroup: fix error unwinding in blkcg_init_queue
When blk_throtl_init fails, we need to call blk_ioprio_exit.  Switch to
proper goto based unwinding to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 19:09:31 -06:00
Linus Torvalds
0be27f7be2 block-6.0-2022-09-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMs/PgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpml9D/wOZVaE/eofmZBWcnl+ZmYtdWQ+N8nv8a2D
 dqLR+OBe/w/6oBMnf2VzgzXpCrI+ZgW7z6uMCKpJa8FvcvB4KnB+UcBS5FQgaAdm
 Zbs7LJGe7p7lZahwCgjFfeN1L2gYya9vM19QgrSxAUsy3mZCjJpOi644YsNC397C
 9+NwzMS6LS2KGHUlZTYL+M8iVBEehuR9oexrHbqhWAm7NaMtMKJGNvPvbt7G3bkl
 1RnHmOtLJh1vyD1TxVj0OxCmJ719igFUJHgtAgFLkMVRVli224tAu5lMRZ7kR7IU
 l2IJzSM/4af1UjVeMAcXoZp1kht+ks/SEvpT8auh3y44teXsTBCxKQpoUE5n+U7v
 737UJN2iZSFi9ahUbg8FgxDu+rmW/Et4G3vptWnVcmAGEd+OOSfdKNhj/oskLdLS
 /YioBL9uDvUQDETxbPcnfCN5ySziKXPwRvtobGyjrzHyuCaP9eutFXnwpdW+FAED
 9cMyPKFbnnrDlvKfHOybxwWsvGNlNMMeueNglVJ/XUkZhGzizlHhD1+CAgxXa0u5
 q8eXpe/y2TtCrl9q3lPKJvoBZ1KVE2cxyK/w17venwcMNb8quy6Yf3FdedlkS04N
 rXzEJCSELnkT4F7siii4R9VrcoVU7qUJHA3kLFsNxvNXhwdM2koLa+4gZXfpIn4A
 dVHjIuo54Q==
 =Z0yq
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Fix a regression that's been plaguing us by reverting the offending
  commit, as attempts to both reproduce the issue and fix it in a saner
  fashion have failed.

  Fix for a potential oops condition in the s390 dasd block driver"

* tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux:
  Revert "block: freeze the queue earlier in del_gendisk"
  s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup
2022-09-24 08:22:53 -07:00
Liu Song
f168420c62 blk-mq: don't redirect completion for hctx withs only one ctx mapping
High-performance NVMe devices usually support a large hw queues, which
ensures a 1:1 mapping of hctx and ctx. In this case there will be no
remote request, so we don't need to care about it.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 09:02:30 -06:00
Yu Kuai
81c7a63abc blk-throttle: improve bypassing bios checkings
"tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that
don't need to be throttled can be checked accurately.

With this patch, bio will be throttled if:

1) Bio is read/write, and corresponding read/write iops limit exist.
2) If corresponding doesn't exist, corresponding bps limit exist and
bio is not throttled before.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 08:59:43 -06:00
Yu Kuai
8549674990 blk-throttle: remove THROTL_TG_HAS_IOPS_LIMIT
Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT"
both try to bypass bios that don't need to be throttled, however, they are
a little redundant and both not perfect:

1) "tg->has_rules" only distinguish read and write, but not iops and bps
   limit.
2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit
   exist, read and write is not distinguished, and bps limit is not
   checked.

tg->has_rules will extended to distinguish bps and iops in the following
patch. There is no need to keep the flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24 08:59:43 -06:00
Tejun Heo
026e14a276 Merge branch 'for-6.0-fixes' into for-6.1
for-6.0 has the following fix for cgroup_get_from_id().

  836ac87d ("cgroup: fix cgroup_get_from_id")

which conflicts with the following two commits in for-6.1.

  4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id")
  fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value")

While the resolution is straightforward, the code ends up pretty ugly
afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be
fixed up there.

Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23 07:19:38 -10:00
Li Jinlin
9713a67067 block/blk-rq-qos: delete useless enmu RQ_QOS_IOPRIO
Since blk-ioprio handing was converted from a rqos policy to a direct call,
RQ_QOS_IOPRIO is not used anymore, just delete it.

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220916023241.32926-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:50:53 -06:00
Kanchan Joshi
c6e99ea482 block: export blk_rq_is_poll
This is in preparation to support iopoll for nvme passthrough.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Yu Kuai
8c5035dfbb blk-wbt: call rq_qos_add() after wb_normal is initialized
Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):

t1: device create	t2: issue io
add_disk
 blk_register_queue
  wbt_enable_default
   wbt_init
    rq_qos_add
    // wb_normal is still 0
			/*
			 * in mainline, disk can't be opened before
			 * bdev_add(), however, in old kernels, disk
			 * can be opened before blk_register_queue().
			 */
			blkdev_issue_flush
                        // disk size is 0, however, it's not checked
                         submit_bio_wait
                          submit_bio
                           blk_mq_submit_bio
                            rq_qos_throttle
                             wbt_wait
			      bio_to_wbt_flags
                               rwb_enabled
			       // wb_normal is 0, inflight is not increased

    wbt_queue_depth_changed(&rwb->rqos);
     wbt_update_limits
     // wb_normal is initialized
                            rq_qos_track
                             wbt_track
                              rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
			      // wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
 rq_qos_done
  wbt_done
   wbt_is_tracked
   // return true
   __wbt_done
    wbt_rqw_done
     atomic_dec_return(&rqw->inflight);
     // inflight is decreased

commit 8235b5c1e8 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:

1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.

Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:36:13 -06:00
Li zeming
a7609c68f7 blk-iocost: Remove unnecessary (void*) conversions
The key pointer is void and hence does not need an explicit cast.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20220919012825.2936-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:28:30 -06:00
Christoph Hellwig
118f3663fb block: remove PSI accounting from the bio layer
PSI accounting is now done by the VM code, where it should have been
since the beginning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Ping-Xiang Chen
e88480871b block: fix comment typo in submit_bio of block-core.c.
This patch fix a comment typo in block-core.c.

Signed-off-by: Ping-Xiang Chen <p.x.chen@uci.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220914074237.31621-1-p.x.chen@uci.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:22:47 -06:00
Christoph Hellwig
4c66a326b5 Revert "block: freeze the queue earlier in del_gendisk"
This reverts commit a09b314005.

Dusty Mabe reported consistent hang during CoreOS shutdown with a MD
RAID1 setup.  Although apparently similar hangs happened before,
and this patch most likely is not the root cause it made it much
more severe.  Revert it until we can figure out what is going on
with the md driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220919144049.978907-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:15:44 -06:00
Linus Torvalds
68e777e44c block-6.0-2022-09-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMkPEIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiFOEADWRiCqDvhcSp+ChZybJTeUIHr7XIeMCDig
 S6822sgaEvi0jg7xOXe0BcGJ77bDCzjB0Jn0OjD/XTeT6a3692yszbG0KDHX0RCZ
 mnpIzMvCdWwkDgkhbmKPTZvWTA93abyJN/53N5YBlOO1/qZciJYL5rBnK/8tBKLF
 z8VpszzTmJp5TFt9mBzfjwYFQ6NNTRfml+LOKPiGuJwc2+Q3IVy1ZLTVtvL1L9Tl
 tHB2dJakvGdr5UzR/FgVUlOw36IODqIOfOdyfwTpJmk3Lx0PbtYLHNbv/sGUgMe7
 5VXuSgum0ZOpKXXReiwXLj1u68ys/1eQC2K0LtR5lIXbHhC4FX4S0RZ9IYHat5qF
 IV7kcDALDg6TCpHDVXU11G+S6RkvaBhfQHNMJAvkUt90zbD5RrRbe/4BGNgfpVIJ
 t95w80v5o7vpYM6X0pr+wp2l3bx62U77ZH1vPLlANsaaXc9aL+BMiwIx3vISTDtx
 4NGAoDH0O8MGn+WF9dwVt6UixNmIoqdlkxCADrM/3gn9Ebo8kTuqtAo5ZpezqSgi
 eDcKa1BXxOcHHrNt3udWkiyGLAM/XtyztrlAULRGDyqkv+KfF8V4Q4UM4lNywICc
 Q3t8nMWtLkjMqxZXttlpOrsAgHwuuuSeyRjcSkeKg1Ldu6/ecvgtsl7YmfOrFSWd
 UQ9ODxburw==
 =jBr+
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-16' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Two fixes for -rc6:

   - Fix a mixup of sectors and bytes in the secure erase ioctl
     (Mikulas)

   - Fix for a bad return value for a non-blocking bio/blk queue enter
     call (me)"

* tag 'block-6.0-2022-09-16' of git://git.kernel.dk/linux-block:
  blk-lib: fix blkdev_issue_secure_erase
  block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
2022-09-16 06:58:04 -07:00
Mikulas Patocka
c4fa368466 blk-lib: fix blkdev_issue_secure_erase
There's a bug in blkdev_issue_secure_erase. The statement
"unsigned int len = min_t(sector_t, nr_sects, max_sectors);"
sets the variable "len" to the length in sectors, but the statement
"bio->bi_iter.bi_size = len" treats it as if it were in bytes.
The statements "sector += len << SECTOR_SHIFT" and "nr_sects -= len <<
SECTOR_SHIFT" are thinko.

This patch fixes it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# v5.19
Fixes: 44abff2c0b ("block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD")
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2209141549480.28100@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-15 00:25:17 -06:00
Stefan Roesch
56f99b8d06 block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the
nowait code path. This is not correct: they should return -EAGAIN
instead.

This problem was detected by fio. The following command exposed the
above problem:

t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1

By applying the patch, the retry case is handled correctly in the slow
path.

Signed-off-by: Stefan Roesch <shr@fb.com>
Fixes: bfd343aa17 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-13 15:06:39 -06:00
Yu Kuai
c013710e1a blk-throttle: cleanup tg_update_disptime()
tg_update_disptime() only need to adjust postion for 'tg' in
'parent_sq', there is no need to call throtl_enqueue/dequeue_tg(),
since they will set/clear flag THROTL_TG_PENDING and increase/decrease
nr_pending, which is useless. By the way, clear the flag/decrease
nr_pending while there are still throttled bios is not good for debugging.

There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:08 -06:00
Yu Kuai
8c25ed0cb9 blk-throttle: calling throtl_dequeue/enqueue_tg in pairs
It's a little weird to call throtl_dequeue_tg() unconditionally in
throtl_select_dispatch(), since it will be called in tg_update_disptime()
again if some bio is still throttled. Thus call it later if there are no
throttled bio. There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:07 -06:00
Yu Kuai
7e9c5c54d4 blk-throttle: use 'READ/WRITE' instead of '0/1'
Make the code easier to read, like everywhere else.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220827101637.1775111-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:20:07 -06:00
Yu Kuai
a880ae93e5 blk-throttle: fix io hung due to configuration updates
If new configuration is submitted while a bio is throttled, then new
waiting time is recalculated regardless that the bio might already wait
for some time:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

Then io hung can be triggered by always submmiting new configuration
before the throttled bio is dispatched.

Fix the problem by respecting the time that throttled bio already waited.
In order to do that, add new fields to record how many bytes/io are
waited, and use it to calculate wait time for throttled bio under new
configuration.

Some simple test:
1)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 2048" > blkio.throttle.write_bps_device
{
        sleep 2
        echo "8:0 1024" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

2)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 1024" > blkio.throttle.write_bps_device
{
        sleep 4
        echo "8:0 2048" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

test results: io finish time
	before this patch	with this patch
1)	10s			6s
2)	8s			6s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
681cd46fff blk-throttle: factor out code to calculate ios/bytes_allowed
No functional changes, new apis will be used in later patches to
calculate wait time for throttled bios when new configuration is
submitted.

Noted this patch also rename tg_with_in_iops/bps_limit() to
tg_within_iops/bps_limit().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
8d6bbaada2 blk-throttle: prevent overflow while calculating wait time
There is a problem found by code review in tg_with_in_bps_limit() that
'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by
calling mul_u64_u64_div_u64() instead.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Yu Kuai
320fb0f91e blk-throttle: fix that io throttle can only work for single bio
Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &

Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s

The problem is that the second bio is finished after 10s instead of 20s.

Root cause:
1) second bio will be flagged:

__blk_throtl_bio
 while (true) {
  ...
  if (sq->nr_queued[rw]) -> some bio is throttled already
   break
 };
 bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) flagged bio will be dispatched without waiting:

throtl_dispatch_tg
 tg_may_dispatch
  tg_with_in_bps_limit
   if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
    *wait = 0; -> wait time is zero
    return true;

commit 9f5ede3c01 ("block: throttle split bio in case of iops limit")
support to count split bios for iops limit, thus it adds flagged bio
checking in tg_with_in_bps_limit() so that split bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.

In order to fix the problem, handle iops/bps limit in different ways:

1) for iops limit, there is no flag to record if the bio is throttled,
   and iops is always applied.
2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED,
   and io throttle will ignore bio with the flag.

Noted this patch also remove the code to set flag in __bio_clone(), it's
introduced in commit 111be88398 ("block-throttle: avoid double
charge"), and author thinks split bio can be resubmited and throttled
again, which is wrong because split bio will continue to dispatch from
caller.

Fixes: 9f5ede3c01 ("block: throttle split bio in case of iops limit")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:19:48 -06:00
Keith Busch
4acb83417c sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12 00:10:34 -06:00
Eric Biggers
2d985f8c6b vfs: support STATX_DIOALIGN on block devices
Add support for STATX_DIOALIGN to block devices, so that direct I/O
alignment restrictions are exposed to userspace in a generic way.

Note that this breaks the tradition of stat operating only on the block
device node, not the block device itself.  However, it was felt that
doing this is preferable, in order to make the interface useful and
avoid needing separate interfaces for regular files and block devices.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220827065851.135710-3-ebiggers@kernel.org
2022-09-11 19:47:12 -05:00
Linus Torvalds
9ebc0ecb21 block-6.0-2022-09-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMbdksQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqm8EACPqE5eFwFtZZnYV8eFjtTQCA5aZ+u3HyF8
 Xby5FEZxwpRaaGPwUZ6T7A45ZVH5ka+JPvSQYeSTKxdjBVeqmAUVnlhqLjxeTvJt
 WhfSCrPq553eYIKoeUVGxLfQ5GhK4f/2noiEMUIEOMSsDyh4txm46MRNdfi5doj1
 2JlbJyoTIK0ZRoraAuZp07bZPFnbxtjSB9YJlUqyKpiSMbM+xocPRaB3Tpr42So8
 vUdgt2/tDZ2xNOf4gd1KqUlosIlaG0RWeZeXTU9knrQjPYtB77iM6MVDUGhgiYT/
 w25XYCbZiGt1YwRTOYFX9bE6jGuPDc54isRqCmPSjYwTAbeEd7sWytn5D1W0IkaF
 dH93mDaW+8/GrjdGVCJ4jLPRCfQvZZ8i5XlwXaXQfnTQ8okUelx5BJ6kEfKRUkhV
 nZD0r5MskH4dQv4ESluaeblrDWphMaznZLNjp+UXfiycsbE9l2TX1FOjyK9E7f7o
 4zGYiertjc7wj0OnnoDTf0wdpXI13xDqQGFhe1Lnz4tOrpTfzZqwyv4IkFo3VghR
 OocJFk4GSuKLwDPASDY/B+nynhQw5fft/fp458Q5pFY/2zItvmg2rxloFetUAz9i
 0kcrO4NuO1l6E20KoBED2HMrQOow68dqcCJTrRcXhDfUIRIk0XSg4Kd/nGJBhZZS
 5mv/hhkjjQ==
 =3RZd
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull via Christoph:
      - fix a use after free in nvmet (Bart Van Assche)
      - fix a use after free when detecting digest errors
        (Sagi Grimberg)
      - fix regression that causes sporadic TCP requests to time out
        (Sagi Grimberg)
      - fix two off by ones errors in the nvmet ZNS support
        (Dennis Maisenbacher)
      - requeue aen after firmware activation (Keith Busch)

 - Fix missing request flags in debugfs code (me)

 - Partition scan fix (Ming)

* tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block:
  block: add missing request flags to debugfs code
  nvme: requeue aen after firmware activation
  nvmet: fix mar and mor off-by-one errors
  nvme-tcp: fix regression that causes sporadic requests to time out
  nvme-tcp: fix UAF when detecting digest errors
  nvmet: fix a use-after-free
  block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
2022-09-09 15:03:08 -04:00
Jens Axboe
745ed37277 block: add missing request flags to debugfs code
We're missing TIMED_OUT and RESV. Particularly the former is handy
for debugging, let's get them added.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-09 05:57:52 -06:00
Miaohe Lin
bdb7d420c6 block: remove unneeded return value of bio_check_ro()
bio_check_ro() always return false now. Remove this unneeded return value
and cleanup the sole caller. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905102754.1942-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:45:35 -06:00
Miaohe Lin
6d5e8d21e8 blk-mq: remove unneeded needs_restart check
If code reaches here, needs_restart must be true. Remove this unneeded
needs_restart check. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:45:24 -06:00
Jiapeng Chong
91e5adda5c block/blk-map: Remove set but unused variable 'added'
The variable added is not effectively used in the function, so delete
it.

block/blk-map.c:273:16: warning: variable 'added' set but not used.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2049
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220905063253.120082-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-05 11:44:37 -06:00
Yu Kuai
2d8f7a3b9f blk-throttle: clean up codes that can't be reached
While doing code coverage testing while CONFIG_BLK_DEV_THROTTLING_LOW is
disabled, we found that there are many codes can never be reached.

This patch move such codes inside "#ifdef CONFIG_BLK_DEV_THROTTLING_LOW".

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220903062826.1099085-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-04 14:38:18 -06:00
Jens Axboe
bce1b56c73 Revert "sbitmap: fix batched wait_cnt accounting"
This reverts commit 16ede66973.

This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.

Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-04 06:39:25 -06:00
Ming Lei
748008e1da block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
Commit b9684a71fc ("block, loop: support partitions without scanning")
adds GD_SUPPRESS_PART_SCAN for replacing part function of
GENHD_FL_NO_PART. But looks blk_add_partitions() is missed, since
loop doesn't want to add partitions if GENHD_FL_NO_PART was set.
And it causes regression on libblockdev (as called from udisks) which
operates with the LO_FLAGS_PARTSCAN.

Fixes the issue by not adding partitions if GD_SUPPRESS_PART_SCAN is
set.

Fixes: b9684a71fc ("block, loop: support partitions without scanning")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220823103819.395776-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-03 11:29:03 -06:00
Jens Axboe
12c5b70c18 block: enable per-cpu bio caching for the fs bio set
This is useful for polled IO on a file, or for polled IO with the
io_uring passthrough mechanism. If bio allocations are done with
REQ_POLLED for those cases, then initializing the bio set with
BIOSET_PERCPU_CACHE enables the local per-cpu cache which eliminates
allocations (and frees) of bio structs when possible.

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-02 13:03:33 -06:00
Keith Busch
16ede66973 sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing
the wait_cnt by one each time. This can cause waiters to never be woken,
stalling IO. Use the batched count instead.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 10:42:41 -06:00
Michal Koutný
fa7e439cf9 cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to
include possible error reason (similar to cgroup_get_from_fd()).

This change also fixes commit d4ccaf58a8 ("bpf: Introduce cgroup
iter") that would use NULL instead of proper error handling in
d4ccaf58a8 ("bpf: Introduce cgroup iter").

Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup,
mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without
CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive
CONFIG_MEMCG respectively) transitive, so drop the singular definition
not needed with !CONFIG_CGROUPS.

Fixes: d4ccaf58a8 ("bpf: Introduce cgroup iter")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26 10:57:41 -10:00
Linus Torvalds
3e5c673f0d block-6.0-2022-08-26
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMI9VgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgY+D/0axaG4MQODNSCyzTkKdFJqeL1Ab6tgz+SP
 w4hqGhBA50P18nqQo14v/l2wBnc3DnvX6OcoAj2v+27WRw4VAC6pYu9cHiiz0sK0
 WyfYE/3qdB5PTe3149UcMd0BlPzMiadYKYhDbS5zLq5VGnun4oOT6uJugdz5toSl
 GgPJ4zQTjUE/w7pJ40fSTZ82FPCqVndBaSMycLvzzAkzfKdMxMIPBqFTDdW/gFqG
 1Xc7NI4vASw8jsrG+TkPESNf4j0WcEs8+zCV9d9WFSQjWgktV1NpdCZ9CA+jD0cV
 h0i31SB43pttZZTARPraPgQdC23L22CpxxNTuzwz5ZKAuSulML87S/W2lN7HnagQ
 GAl6irQdB9RXN5o6vmOXkZIfmfuTctrXay7g13tNk/WUrGb/3euk4CJMYJWA4fM7
 VkUS3yCjvP5z8auqOqti7zEy1QSN8mvdtsafU1n+tsf0WdIIqH+sufSnmJExbOs9
 gzL2xcabOFCD5WYDCpl/SkbnY4G7Qrk4oHIB9S/r0aOKVHFdQ8JMp3GBVjSlHulN
 Nz3oScNQBG3CSvJ2QbFi6UQxt6YSs+tx/oa1plmeCe4w9tModooL0CdfmKMBwPXT
 AGlHvUIGyl4ZT/XamQ19AdFOE7WLHPG7wYtZUaFiNtkjAsgMoQr2bwYeTpb29Cfu
 881BssmWCQ==
 =qR5N
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - MD pull request via Song:
      - Fix for clustered raid (Guoqing Jiang)
      - req_op fix (Bart Van Assche)
      - Fix race condition in raid recreate (David Sloan)

 - loop configuration overflow fix (Siddh)

 - Fix missing commit_rqs call for certain conditions (Yu)

* tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block:
  md: call __md_stop_writes in md_stop
  Revert "md-raid: destroy the bitmap after destroying the thread"
  md: Flush workqueue md_rdev_misc_wq in md_alloc()
  md/raid10: Fix the data type of an r10_sync_page_io() argument
  loop: Check for overflow while configuring loop
  blk-mq: fix io hung due to missing commit_rqs
2022-08-26 11:05:54 -07:00
Jens Axboe
e88811bc43 block: use on-stack page vec for <= UIO_FASTIOV
Avoid a kmalloc+kfree for each page array, if we only have a few pages
that are mapped. An alloc+free for each IO is quite expensive, and
it's pretty pointless if we're only dealing with 1 or a few vecs.

Use UIO_FASTIOV like we do in other spots to set a sane limit for how
big of an IO we want to avoid allocations for.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Jens Axboe
8af870aa5b block: enable bio caching use for passthru IO
bdev based polled O_DIRECT is currently quite a bit faster than
passthru on the same device, and one of the reaons is that we're not
able to use the bio caching for passthru IO.

If REQ_POLLED is set on the request, use the fs bio set for grabbing a
bio from the caches, if available. This saves 5-6% of CPU over head
for polled passthru IO.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Jens Axboe
f5d632d15e block: shrink rq_map_data a bit
We don't need full ints for several of these members. Change the
page_order and nr_entries to unsigned shorts, and the true/false from_user
and null_mapped to booleans.

This shrinks the struct from 32 to 24 bytes on 64-bit archs.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Yu Kuai
d322f355e9 block, bfq: remove useless parameter for bfq_add/del_bfqq_busy()
'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass
it as a parameter separately.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Yu Kuai
1e3cc2125d block, bfq: remove useless checking in bfq_put_queue()
'bfqq->bfqd' is ensured to set in bfq_init_queue(), and it will never
change afterwards.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Yu Kuai
c2090eabac block, bfq: remove unused functions
While doing code coverage testing(CONFIG_BFQ_CGROUP_DEBUG is disabled), we
found that some functions doesn't have caller, thus remove them.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:56 -06:00
Bart Van Assche
a4e1d0b76e block: Change the return type of blk_mq_map_queues() into void
Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:53 -06:00
dougmill@linux.vnet.ibm.com
c6ea706042 block: sed-opal: Add ioctl to return device status
Provide a mechanism to retrieve basic status information about
the device, including the "supported" flag indicating whether
SED-OPAL is supported. The information returned is from the various
feature descriptors received during the discovery0 step, and so
this ioctl does nothing more than perform the discovery0 step
and then save the information received. See "struct opal_status"
and OPAL_FL_* bits for the status information currently returned.

This is necessary to be able to check whether a device is OPAL
enabled, set up, locked or unlocked from userspace programs
like systemd-cryptsetup and libcryptsetup. Right now we just
have to assume the user 'knows' or blindly attempt setup/lock/unlock
operations.

Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220816140713.84893-1-luca.boccassi@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 07:52:51 -06:00
Linus Torvalds
b9bce6e553 block-6.0-2022-08-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL/xOgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgenD/4kaXa2Q2GdrCUZxSSwKCc1u8FemSunFyao
 Q1jbpRPhS2of8JGOdQzbZ/1ioer73rjKAVCpiZ8pVbFw5j/PpjsCUY2H4pF4Pm5V
 oeaq29yp5TLT9mlETGHO8bFAWs3wmErqa9/Tp+P4ut7Jbxw2fjv9oDqbYg7dc8T9
 F769MuojyVQ2D8CAn0o1Vpw3BSqIPk/MJKMU8MWWtErRHidljT6RqZT3ow8qGroD
 0QMfZl7rzfuJ9hokyO3ixFkLErpZbZdA7MdMciXvuvPafz7onjrBf5dKJxp1qMDK
 CADw4uWQBndc+337YVY5uJSPHFWApsRiCadkLgsAnRIn4QcEyYCEBJcYXXs0p05z
 2wuyMlOynVjzSJiyWgq2lJF9CNIUWxkfnBDNNvj1rw6McKX0eJCCnLIUWE90GVn3
 hDU6TTT6dTdb4QyhpbjdS9RVcGOxB8yaVUy4JvXBqZ0GDfVxqTozR8Qx8Gh3XRfi
 5LeUSsHFyzD81GMYtTtovllJZdBhNue3hpLFMy6rFMTpwFiF3bKAPeihGmkMhnWX
 hG340uO44PM8iXQZAoSlEUplY/fbRX2WAfTNSsbmKxey1BHEqfmLvdv9DxaTGZFy
 3xse9L5s867uhFQh8ezYjK2WdIumN67spT1xszYc0pJqhHN6LmRIncVSyzTyJeii
 fUKpxfj15g==
 =y2HE
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes that should go into this release:

   - Small series of patches for ublk (ZiyangZhang)

   - Remove dead function (Yu)

   - Fix for running a block queue in case of resource starvation
     (Yufen)"

* tag 'block-6.0-2022-08-19' of git://git.kernel.dk/linux-block:
  blk-mq: run queue no matter whether the request is the last request
  blk-mq: remove unused function blk_mq_queue_stopped()
  ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work
  ublk_drv: update comment for __ublk_fail_req()
  ublk_drv: check ubq_daemon_is_dying() in __ublk_rq_task_work()
  ublk_drv: update iod->addr for UBLK_IO_NEED_GET_DATA
2022-08-20 10:17:05 -07:00
Yu Kuai
65fac0d54f blk-mq: fix io hung due to missing commit_rqs
Currently, in virtio_scsi, if 'bd->last' is not set to true while
dispatching request, such io will stay in driver's queue, and driver
will wait for block layer to dispatch more rqs. However, if block
layer failed to dispatch more rq, it should trigger commit_rqs to
inform driver.

There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
won't be called:

// assume that queue_depth is set to 1, list contains two rq
blk_mq_try_issue_list_directly
 blk_mq_request_issue_directly
 // dispatch first rq
 // last is false
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // succeed to get first budget
   __blk_mq_issue_directly
    scsi_queue_rq
     cmd->flags |= SCMD_LAST
      virtscsi_queuecommand
       kick = (sc->flags & SCMD_LAST) != 0
       // kick is false, first rq won't issue to disk
 queued++

 blk_mq_request_issue_directly
 // dispatch second rq
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // failed to get second budget
 ret == BLK_STS_RESOURCE
  blk_mq_request_bypass_insert
 // errors is still 0

 if (!list_empty(list) || errors && ...)
  // won't pass, commit_rqs won't be called

In this situation, first rq relied on second rq to dispatch, while
second rq relied on first rq to complete, thus they will both hung.

Fix the problem by also treat 'BLK_STS_*RESOURCE' as 'errors' since
it means that request is not queued successfully.

Same problem exists in blk_mq_dispatch_rq_list(), 'BLK_STS_*RESOURCE'
can't be treated as 'errors' here, fix the problem by calling
commit_rqs if queue_rq return 'BLK_STS_*RESOURCE'.

Fixes: d666ba98f8 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-19 20:45:58 +00:00
Yufen Yu
d3b3859687 blk-mq: run queue no matter whether the request is the last request
We do test on a virtio scsi device (/dev/sda) and the default mq
scheduler is 'none'. We found a IO hung as following:

blk_finish_plug
  blk_mq_plug_issue_direct
      scsi_mq_get_budget
      //get budget_token fail and sdev->restarts=1

			     	 scsi_end_request
				   scsi_run_queue_async
                                   //sdev->restart=0 and run queue

     blk_mq_request_bypass_insert
        //add request to hctx->dispatch list

  //continue to dispath plug list
  blk_mq_dispatch_plug_list
      blk_mq_try_issue_list_directly
        //success issue all requests from plug list

After .get_budget fail, scsi_mq_get_budget will increase 'restarts'.
Normally, it will run hw queue when io complete and set 'restarts'
as 0. But if we run queue before adding request to the dispatch list
and blk_mq_dispatch_plug_list also success issue all requests, then
on one will run queue, and the request will be stall in the dispatch
list and cannot complete forever.

It is wrong to use last request of plug list to decide if run queue is
needed since all the remained requests in plug list may be from other
hctxs. To fix the bug, pass run_queue as true always to
blk_mq_request_bypass_insert().

Fix-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: dc5fc361d8 ("block: attempt direct issue of plug list")
Link: https://lore.kernel.org/r/20220803023355.3687360-1-yuyufen@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-18 07:39:01 -06:00
Yu Kuai
a8239f0342 blk-mq: remove unused function blk_mq_queue_stopped()
blk_mq_queue_stopped() doesn't have any caller, which was found by
code coverage test, thus remove it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220818063555.3741222-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-18 07:38:10 -06:00
Linus Torvalds
abe7a481aa block-6.0-2022-08-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL2SxQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl24D/4nxPDuhvDxLHpMUR/C2lCtAnqVhHtGLhIX
 QSw+GVA5LuJvuB0L5Zr7ODDILPY5ZyWI/F1FPVJUOJE1NJ3tFiH4WzFIkqtFtVCE
 2jFTXH63A/o/fyo9nscsZ1g6eEswSAbvenHEa9HNpjgFxz0lnXjrniP5VFPo5HNl
 F8/MO1CBkKmhsGazZn7o1J3Ws6RvApq59YzxHmVz1hFHPgJFN2KwIAQjY2+GGoOD
 ifpRBbZBCTzj2dEEFZHeK1aCYhTNP4VqbNnBDQPZHwEB3qkml5R9GhTlUe7Ej17t
 7o4A05efcm/24TXcODMHP5YaGA14otPUr8wQiJjuOIFLw8sMC61OyS9qDdu1IvyW
 JCnTtDkzwZZEkhXlraU1HmiLSaBjMEvd/2puxbcS9kISdO7baCLd4Oj7+8ThVhIf
 JHIt2x3vzKaCzWI93IMrw5iJFK0+NS+SLAD6eyuEgC71Rj5ooemxrBYxKBQ7jb3o
 GCC3SaU8lFmB1Z/zKo63gGS1b7eaCNGauNm9/gSe1jM8Sor4hlT0yeNRpHf7egAu
 1dUQdSwgon6EH6JOGX3CFSC9lnIEAew733QZLaYBqar2WHcq5Wpq6LcWUHhhujgB
 dSTeLY1Tnhs3GWvMFe4JH/YZilFpbMKzxuFPCV7sQScxzlaGusliX8kVkha/VPLK
 9rtT8uyJXg==
 =1/w+
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request
     - print nvme connect Linux error codes properly (Amit Engel)
     - fix the fc_appid_store return value (Christoph Hellwig)
     - fix a typo in an error message (Christophe JAILLET)
     - add another non-unique identifier quirk (Dennis P. Kliem)
     - check if the queue is allocated before stopping it in nvme-tcp
       (Maurizio Lombardi)
     - restart admin queue if the caller needs to restart queue in
       nvme-fc (Ming Lei)
     - use kmemdup instead of kmalloc + memcpy in nvme-auth (Zhang
       Xiaoxu)

 - __alloc_disk_node() error handling fix (Rafael)

* tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block:
  block: Do not call blk_put_queue() if gendisk allocation fails
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
  nvme-tcp: check if the queue is allocated before stopping it
  nvme-fabrics: Fix a typo in an error message
  nvme-fabrics: parse nvme connect Linux error codes
  nvmet-auth: use kmemdup instead of kmalloc + memcpy
  nvme-fc: fix the fc_appid_store return value
  nvme-fc: restart admin queue if the caller needs to restart queue
2022-08-13 13:37:36 -07:00
Rafael Mendonca
aa0c680c3a block: Do not call blk_put_queue() if gendisk allocation fails
Commit 6f8191fdf4 ("block: simplify disk shutdown") removed the call
to blk_get_queue() during gendisk allocation but missed to remove the
corresponding cleanup code blk_put_queue() for it. Thus, if the gendisk
allocation fails, the request_queue refcount gets decremented and
reaches 0, causing blk_mq_release() to be called with a hctx still
alive. That triggers a WARNING report, as found by syzkaller:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 23016 at block/blk-mq.c:3881
blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881
[...] stripped
RIP: 0010:blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881
[...] stripped
Call Trace:
 <TASK>
 blk_release_queue+0x153/0x270 block/blk-sysfs.c:780
 kobject_cleanup lib/kobject.c:673 [inline]
 kobject_release lib/kobject.c:704 [inline]
 kref_put include/linux/kref.h:65 [inline]
 kobject_put+0x1c8/0x540 lib/kobject.c:721
 __alloc_disk_node+0x4f7/0x610 block/genhd.c:1388
 __blk_mq_alloc_disk+0x13b/0x1f0 block/blk-mq.c:3961
 loop_add+0x3e2/0xaf0 drivers/block/loop.c:1978
 loop_control_ioctl+0x133/0x620 drivers/block/loop.c:2150
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:870 [inline]
 __se_sys_ioctl fs/ioctl.c:856 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...] stripped

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Reported-by: syzbot+31c9594f6e43b9289b25@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220811232338.254673-1-rafaelmendsr@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-12 06:42:06 -06:00
Al Viro
480cb846c2 block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
... doing revert if we end up not using some pages

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:22 -04:00
Al Viro
fcb14cb1bd new iov_iter flavour - ITER_UBUF
Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:15 -04:00
Linus Torvalds
fa9db655d0 for-5.20/block-2022-08-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLsRfkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj43EADBydQhe7nQHH65gecqvttnio2GqEmcbozt
 lKFQlPPd3SHGMAJjSdR1dIwqtPsJ8q6xZXH+TjHhLXb2kgVu+TQ31krNHIqBwE14
 s7SsgGRgvopA46lSf/ls18/8sh6Yz1NgI39YcMVPjvkbLaVFK7zRkL9OSp4RQCwH
 u/IIHJmV415EeF6QNTgABBel/gEIPBLsvwOxTBIkzDOyUohtExZPYj83MDm7jdr3
 jsTUd2MiumNMh7ziMJIp1iN32nQOtIKtwWZaMHDCzfU/IUnBSmh2nj9oXr3+vcwo
 IsBMDUfUj9Eig5QQ/XcVIrFezi0GnunpBhScXPqL+dxPN812lzxNjkx6PsC+rPn8
 mWmXoaeK1ayoyotdHJlmINNmWUSCkOMwVnA2r1c4Hp4cQS5vRUtkKcpNLTpMhk4I
 OwQ3bjt9mA//WlH+apbhJqXqxjcoBwCwMoveJ4mHVtku9lo+JJAKVGdUs17QjZkC
 NxACP1MtBcXy1hurNQf14oH5C0Hyg4TBJShPauKmrqGtOFnbOAdX2qIhldvyNfH1
 l9cOvGNSgbQ6FLD6MVto6dC/KYOEM3LelVxgNB/80GbSmGwj88Kd/nzQLYFP89JJ
 0Wkt14mSkm82gabOvNqXGG8P8hLb/+v6sp4qZv0mf+op0xmb4FB5eaZvoceptVzM
 3Z+hmT7MfA==
 =pgNf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - add support for In-Band authentication (Hannes Reinecke)
      - handle the persistent internal error AER (Michael Kelley)
      - use in-capsule data for TCP I/O queue connect (Caleb Sander)
      - remove timeout for getting RDMA-CM established event (Israel
        Rukshin)
      - misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni,
        Guixin Liu, Xiang wangx)
      - use command_id instead of req->tag in trace_nvme_complete_rq()
        (Bean Huo)
      - various fixes for the new authentication code (Lukas Bulwahn,
        Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes
        Reinecke)
      - small cleanups (Liu Song, Christoph Hellwig)
      - restore compat_ioctl support (Nick Bowler)
      - make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg)
      - enable generic interface (/dev/ngXnY) for unknown command sets
        (Joel Granados, Christoph Hellwig)
      - don't always build constants.o (Christoph Hellwig)
      - print the command name of aborted commands (Christoph Hellwig)

 - MD pull requests via Song:
      - Improve raid5 lock contention, by Logan Gunthorpe.
      - Misc fixes to raid5, by Logan Gunthorpe.
      - Fix race condition with md_reap_sync_thread(), by Guoqing Jiang.
      - Fix potential deadlock with raid5_quiesce and
        raid5_get_active_stripe, by Logan Gunthorpe.
      - Refactoring md_alloc(), by Christoph"
      - Fix md disk_name lifetime problems, by Christoph Hellwig
      - Convert prepare_to_wait() to wait_woken() api, by Logan
        Gunthorpe;
      - Fix sectors_to_do bitmap issue, by Logan Gunthorpe.

 - Work on unifying the null_blk module parameters and configfs API
   (Vincent)

 - drbd bitmap IO error fix (Lars)

 - Set of rnbd fixes (Guoqing, Md Haris)

 - Remove experimental marker on bcache async device registration (Coly)

 - Series from cleaning up the bio splitting (Christoph)

 - Removal of the sx8 block driver. This hardware never really
   widespread, and it didn't receive a lot of attention after the
   initial merge of it back in 2005 (Christoph)

 - A few fixes for s390 dasd (Eric, Jiang)

 - Followup set of fixes for ublk (Ming)

 - Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang)

 - Fixes for the dio dma alignment (Keith)

 - Misc fixes and cleanups (Ming, Yu, Dan, Christophe

* tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits)
  s390/dasd: Establish DMA alignment
  s390/dasd: drop unexpected word 'for' in comments
  ublk_drv: add support for UBLK_IO_NEED_GET_DATA
  ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA
  ublk_drv: cleanup ublksrv_ctrl_dev_info
  ublk_drv: add SET_PARAMS/GET_PARAMS control command
  ublk_drv: fix ublk device leak in case that add_disk fails
  ublk_drv: cancel device even though disk isn't up
  block: fix leaking page ref on truncated direct io
  block: ensure bio_iov_add_page can't fail
  block: ensure iov_iter advances for added pages
  drivers:md:fix a potential use-after-free bug
  md/raid5: Ensure batch_last is released before sleeping for quiesce
  md/raid5: Move stripe_request_ctx up
  md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()
  md/raid5: Make is_inactive_blocked() helper
  md/raid5: Refactor raid5_get_active_stripe()
  block: pass struct queue_limits to the bio splitting helpers
  block: move bio_allowed_max_sectors to blk-merge.c
  block: move the call to get_max_io_size out of blk_bio_segment_split
  ...
2022-08-04 20:00:14 -07:00
Linus Torvalds
746fc76b82 SCSI misc on 20220804
Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi,
 mpi3mr).  The main driver change that might cause issues on down the
 road is the conversion of some of our oldest surviving drivers to the
 DMA API (should only affect m68k).  The only major core change is the
 rework of async resume; the rest are either completely trivial or for
 updating deprecated APIs.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYuvakyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfvOAP4m0N6b
 e3JwoBtB1c0JMKv6G4gka8suEG8p5f4khDu8wwD+LfGUCzG49Y5Ts7rByXfEiGgO
 krSdwsAZiV6yKg/HuPw=
 =Ak9L
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi,
  mpi3mr).

  The main driver change that might cause issues on down the road is the
  conversion of some of our oldest surviving drivers to the DMA API
  (should only affect m68k).

  The only major core change is the rework of async resume; the rest are
  either completely trivial or for updating deprecated APIs"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (195 commits)
  scsi: target: Remove XDWRITEREAD emulated support
  scsi: megaraid: Remove the static variable initialisation
  scsi: ch: Do not initialise statics to 0
  scsi: ufs: core: Fix spelling mistake "Cannnot" -> "Cannot"
  scsi: target: iscsi: Do not require target authentication
  scsi: target: iscsi: Allow AuthMethod=None
  scsi: target: iscsi: Support base64 in CHAP
  scsi: target: iscsi: Add support for extended CDB AHS
  scsi: ufs: dt-bindings: Add SC8280XP binding
  scsi: target: iscsi: Fix clang -Wformat warnings
  scsi: ufs: core: Read device property for ref clock
  scsi: libsas: Resume SAS host for phy reset or enable via sysfs
  scsi: hisi_sas: Modify v3 HW SATA completion error processing
  scsi: hisi_sas: Relocate DMA unmap of SMP task
  scsi: hisi_sas: Remove unnecessary variable to hold DMA map elements
  scsi: hisi_sas: Call hisi_sas_slave_configure() from slave_configure_v3_hw()
  scsi: mpi3mr: Delete a stray tab
  scsi: mpi3mr: Unlock on error path
  scsi: mpi3mr: Reduce VD queue depth on detecting throttling
  scsi: mpi3mr: Resource Based Metering
  ...
2022-08-04 19:47:37 -07:00
Linus Torvalds
5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00
Linus Torvalds
f00654007f Folio changes for 6.0
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
    when running xfstests
 
  - Convert more of mpage to use folios
 
  - Remove add_to_page_cache() and add_to_page_cache_locked()
 
  - Convert find_get_pages_range() to filemap_get_folios()
 
  - Improvements to the read_cache_page() family of functions
 
  - Remove a few unnecessary checks of PageError
 
  - Some straightforward filesystem conversions to use folios
 
  - Split PageMovable users out from address_space_operations into their
    own movable_operations
 
  - Convert aops->migratepage to aops->migrate_folio
 
  - Remove nobh support (Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
 gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
 mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
 /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
 C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
 Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
 =DgUi
 -----END PGP SIGNATURE-----

Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
   when running xfstests

 - Convert more of mpage to use folios

 - Remove add_to_page_cache() and add_to_page_cache_locked()

 - Convert find_get_pages_range() to filemap_get_folios()

 - Improvements to the read_cache_page() family of functions

 - Remove a few unnecessary checks of PageError

 - Some straightforward filesystem conversions to use folios

 - Split PageMovable users out from address_space_operations into
   their own movable_operations

 - Convert aops->migratepage to aops->migrate_folio

 - Remove nobh support (Christoph Hellwig)

* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
  fs: remove the NULL get_block case in mpage_writepages
  fs: don't call ->writepage from __mpage_writepage
  fs: remove the nobh helpers
  jfs: stop using the nobh helper
  ext2: remove nobh support
  ntfs3: refactor ntfs_writepages
  mm/folio-compat: Remove migration compatibility functions
  fs: Remove aops->migratepage()
  secretmem: Convert to migrate_folio
  hugetlb: Convert to migrate_folio
  aio: Convert to migrate_folio
  f2fs: Convert to filemap_migrate_folio()
  ubifs: Convert to filemap_migrate_folio()
  btrfs: Convert btrfs_migratepage to migrate_folio
  mm/migrate: Add filemap_migrate_folio()
  mm/migrate: Convert migrate_page() to migrate_folio()
  nfs: Convert to migrate_folio
  btrfs: Convert btree_migratepage to migrate_folio
  mm/migrate: Convert expected_page_refs() to folio_expected_refs()
  mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
  ...
2022-08-03 10:35:43 -07:00
Keith Busch
e97424fd44 block: fix leaking page ref on truncated direct io
The size being added to a bio from an iov is aligned to a block size
after the pages were gotten. If the new aligned size truncates the last
page, its reference was being leaked. Ensure all pages that were not
added to the bio have their reference released.

Since this essentially requires doing the same that bio_put_pages(), and
there was only one caller for that function, this patch makes the
put_page() loop common for everyone.

Fixes: b1a000d3b8 ("block: relax direct io memory alignment")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220712153256.2202024-3-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:54 -06:00
Keith Busch
34cdb8c825 block: ensure bio_iov_add_page can't fail
Adding the page could fail on the bio_full() condition, which checks for
either exceeding the bio's max segments or total size exceeding
UINT_MAX. We already ensure the max segments can't be exceeded, so just
ensure the total size won't reach the limit. This simplifies error
handling and removes unnecessary repeated bio_full() checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220712153256.2202024-2-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:54 -06:00
Keith Busch
325347d965 block: ensure iov_iter advances for added pages
There are cases where a bio may not accept additional pages, and the iov
needs to advance to the last data length that was accepted. The zone
append used to handle this correctly, but was inadvertently broken when
the setup was made common with the normal r/w case.

Fixes: 576ed91354 ("block: use bio_add_page in bio_iov_iter_get_pages")
Fixes: c58c0074c5 ("block/bio: remove duplicate append pages code")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:54 -06:00
Christoph Hellwig
c55ddd9082 block: pass struct queue_limits to the bio splitting helpers
Allow using the splitting helpers on just a queue_limits instead of
a full request_queue structure.  This will eventually allow file systems
or remapping drivers to split REQ_OP_ZONE_APPEND bios based on limits
calculated as the minimum common capabilities over multiple devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:53 -06:00
Christoph Hellwig
b6dc6198eb block: move bio_allowed_max_sectors to blk-merge.c
Move this helper into the only file where it is used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:53 -06:00
Christoph Hellwig
a85b36375b block: move the call to get_max_io_size out of blk_bio_segment_split
Prepare for reusing blk_bio_segment_split for (file system controlled)
splits of REQ_OP_ZONE_APPEND bios by letting the caller control the
maximum size of the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:53 -06:00
Christoph Hellwig
46754bd056 block: move ->bio_split to the gendisk
Only non-passthrough requests are split by the block layer and use the
->bio_split bio_set.  Move it from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:08:49 -06:00
Christoph Hellwig
51d798cdb5 block: change the blk_queue_bounce calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220727162300.3089193-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:54 -06:00
Christoph Hellwig
5a97806f7d block: change the blk_queue_split calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Also give it and the helpers used to implement it more descriptive names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Linus Torvalds
c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Matthew Wilcox (Oracle)
67235182a4 mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
4fdc08d418 block: Convert read_part_sector() to use a folio
This relatively straightforward converion saves a call to compound_head()
hidden inside put_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
069fc464f1 block: Use PAGE_SECTORS_SHIFT
The bare use of '9' confuses some people.  We also don't need this cast,
since the compiler does exactly that cast for us.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
98d8ba69ff block: Handle partition read errors more consistently
Set p->v to NULL if we try to read beyond the end of the disk, just like
we do if we get an error returned from trying to read the disk.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
8b5d143c95 block: Simplify read_part_sector()
That rather complicated expression is just trying to find the offset
of this sector within a page, and there are easier ways to express that.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Christoph Hellwig
828b5f017d block: remove __blk_get_queue
__blk_get_queue is only called by blk_get_queue, so merge the two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220721063432.1714609-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 11:03:17 -06:00
Christoph Hellwig
c5db2cfc62 block: call blk_mq_exit_queue from disk_release for never added disks
To undo the all initialization from blk_mq_init_allocated_queue in case
of a probe failure where add_disk is never called we have to call
blk_mq_exit_queue from put_disk.

This relies on the fact that drivers always call blk_mq_free_tag_set
after calling put_disk in the probe error path if they have a gendisk
at all.

We should be doing this in general, but can't do it for the normal
teardown case (yet) as the tagset can be gone by the time the disk is
released once it was added.  I hope to sort this out properly eventually
but for now this isolated hack will do it.

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220720130541.1323531-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:59:02 -06:00
Christoph Hellwig
0a3e5cc7bb blk-mq: fix error handling in __blk_mq_alloc_disk
To fully clean up the queue if the disk allocation fails we need to
call blk_mq_destroy_queue and not just blk_put_queue.

Fixes: 6f8191fdf4 ("block: simplify disk shutdown")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220720130541.1323531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:59:02 -06:00
Jinke Han
14a6e2eb7d block: don't allow the same type rq_qos add more than once
In our test of iocost, we encountered some list add/del corruptions of
inner_walk list in ioc_timer_fn.

The reason can be described as follows:

cpu 0					cpu 1
ioc_qos_write				ioc_qos_write

ioc = q_to_ioc(queue);
if (!ioc) {
        ioc = kzalloc();
					ioc = q_to_ioc(queue);
					if (!ioc) {
						ioc = kzalloc();
						...
						rq_qos_add(q, rqos);
					}
        ...
        rq_qos_add(q, rqos);
        ...
}

When the io.cost.qos file is written by two cpus concurrently, rq_qos may
be added to one disk twice. In that case, there will be two iocs enabled
and running on one disk. They own different iocgs on their active list. In
the ioc_timer_fn function, because of the iocgs from two iocs have the
same root iocg, the root iocg's walk_list may be overwritten by each other
and this leads to list add/del corruptions in building or destroying the
inner_walk list.

And so far, the blk-rq-qos framework works in case that one instance for
one type rq_qos per queue by default. This patch make this explicit and
also fix the crash above.

Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-20 06:44:14 -06:00
Bart Van Assche
d625fecd8a block/kyber: Use the new blk_opf_t type
Use the new blk_opf_t type for arguments that represent a bitwise
combination of a request operation and request flags. Rename those
arguments from 'op' into 'opf'.

This patch does not change any functionality.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-10-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
f8359efe47 block/mq-deadline: Use the new blk_opf_t type
Use the new blk_opf_t type for an argument that represents a bitwise
combination of a request operation and request flags. Rename that
argument from 'op' into 'opf'.

This patch does not change any functionality.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-9-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
dc469ba2e7 block/bfq: Use the new blk_opf_t type
Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename those variables from 'op' into 'opf'.

This patch does not change any functionality.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
16458cf3bd block: Use the new blk_opf_t type
Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename the function arguments and also a structure member
that hold a request operation and flags from 'rw' into 'opf'.

This patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-7-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
2d9b02be73 block: Change the type of req_op() and bio_op() into enum req_op
Improve static type checking by changing the type of the value returned by
req_op() and bio_op() from unsigned int into enum req_op. Insert
'default: break;' in switch statements on the enum req_op type to prevent
that the compiler warns about these switch statements.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
77e7ffd7ad block: Use enum req_op where appropriate
Change the type of the arguments that are used to pass a REQ_OP_* value
from int or unsigned int into enum req_op to improve static type
checking.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Bart Van Assche
ff07a02e9e treewide: Rename enum req_opf into enum req_op
The type name enum req_opf is misleading since it suggests that values of
this type include both an operation type and flags. Since values of this
type represent an operation only, change the type name into enum req_op.

Convert the enum req_op documentation into kernel-doc format. Move a few
definitions such that the enum req_op documentation occurs just above
the enum req_op definition.

The name "req_opf" was introduced by commit ef295ecf09 ("block: better op
and flags encoding").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:30 -06:00
Muchun Song
957a2b345c block: fix missing blkcg_bio_issue_init
The commit 513616843d ("block: remove superfluous calls to
blkcg_bio_issue_init") has removed blkcg_bio_issue_init from
__bio_clone since submit_bio will override ->bi_issue.
However, __blk_queue_split is called after blkcg_bio_issue_init
(see blk_mq_submit_bio) in submit_bio. In this case, the
->bi_issue is 0. Fix it.

Fixes: 513616843d ("block: remove superfluous calls to blkcg_bio_issue_init")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Link: https://lore.kernel.org/r/20220713140226.68135-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:54:49 -06:00
Christoph Hellwig
900d156bac block: remove bdevname
Replace the remaining calls of bdevname with snprintf using the %pg
format specifier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:27:56 -06:00
Christoph Hellwig
02ff3dd20f block: stop using bdevname in __blkdev_issue_discard
Just use the %pg format specifier instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:27:56 -06:00
Christoph Hellwig
5bf83e9a14 block: stop using bdevname in bdev_write_inode
Just use the %pg format specifier instead.  Also reformat the
printk statement to be more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:27:56 -06:00
Uros Bizjak
96388f57d2 blk-cgroup: Use atomic{,64}_try_cmpxchg
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old
in blkcg_unuse_delay, blkcg_set_delay and blkcg_clear_delay and
atomic64_try_cmpxchg in blkcg_scale_delay.  x86 CMPXCHG instruction
returns success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails, enabling further code simplifications.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220712154455.66868-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-12 15:46:03 -06:00
Uros Bizjak
aee8960c2e blk-iolatency: Use atomic{,64}_try_cmpxchg
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old
in check_scale_change and atomic64_try_cmpxchg in blkcg_iolatency_done_bio.
x86 CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after cmpxchg (and related move instruction in front of cmpxchg).

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220712151947.6783-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-12 15:45:31 -06:00
Uros Bizjak
939f9dd040 block: Use try_cmpxchg in update_io_ticks
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
update_io_ticks. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related
move instruction in front of cmpxchg).

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220712152741.7324-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-12 14:39:06 -06:00
Uros Bizjak
f4b1e27db4 block/rq_qos: Use atomic_try_cmpxchg in atomic_inc_below
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old in
atomic_inc_below. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg).

Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails, enabling further code simplifications.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220712150547.5786-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-12 14:38:52 -06:00
Ming Lei
f3ec5d1155 blk-mq: don't create hctx debugfs dir until q->debugfs_dir is created
blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues
when gendisk isn't added yet, such as nvme tcp.

Fixes the warning of 'debugfs: Directory 'hctx0' with parent '/' already present!'
which can be observed reliably when running blktests nvme/005.

Fixes: 6cfc0081b0 ("blk-mq: no need to check return value of debugfs_create functions")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-12 06:52:27 -06:00
Christoph Hellwig
d86e716aa4 block: move zone related fields to struct gendisk
Move the zone related fields that are currently stored in
struct request_queue to struct gendisk as these are part of the highlevel
block layer API and are only used for non-passthrough I/O that requires
the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
375c140c19 block: use bdev based helpers in blkdev_zone_mgmt{,all}
Use the bdev based helpers instead of the queue based ones to clean up
the code a bit and prepare for storing all zone related fields in
struct gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
b623e34732 block: replace blkdev_nr_zones with bdev_nr_zones
Pass a block_device instead of a request_queue as that is what most
callers have at hand.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
1dc0172027 block: remove queue_max_open_zones and queue_max_active_zones
Always use the bdev based helpers instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
5d40066567 block: pass a gendisk to blk_queue_free_zone_bitmaps
Switch to a gendisk based API in preparation for moving all zone related
fields from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
b3c72f8138 block: pass a gendisk to blk_queue_clear_zone_settings
Switch to a gendisk based API in preparation for moving all zone related
fields from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
6b2bd27474 block: pass a gendisk to blk_queue_set_zoned
Prepare for storing the zone related field in struct gendisk instead
of struct request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig
052e545c92 block: simplify blk_check_zone_append
Use the bdev based helpers instead of open coding them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:25 -06:00
Christoph Hellwig
6deacb3bfa block: simplify blk_mq_plug
Drop the unused q argument, and invert the check to move the exception
into a branch and the regular path as the normal return.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:25 -06:00
Christoph Hellwig
edd1dbc83b block: use bdev_is_zoned instead of open coding it
Use bdev_is_zoned in all places where a block_device is available instead
of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:25 -06:00
Christoph Hellwig
6cc37a672a block: call blk_queue_free_zone_bitmaps from disk_release
The zone bitmaps are only used for non-passthrough I/O, so free them as
soon as the disk is released.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:25 -06:00
John Garry
4cf6e6c010 blk-mq: Drop local variable for reserved tag
The local variable is now only referenced once so drop it.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-7-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
John Garry
2dd6532e95 blk-mq: Drop 'reserved' arg of busy_tag_iter_fn
We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
John Garry
9bdb4833dd blk-mq: Drop blk_mq_ops.timeout 'reserved' arg
With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.

This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
John Garry
99e48cd685 blk-mq: Add a flag for reserved requests
Add a flag for reserved requests so that drivers may know this for any
special handling.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
Jason Yan
e55cf79814 blk-cgroup: factor out blkcg_free_all_cpd()
To reduce some duplicated code, factor out blkcg_free_all_cpd(). No
functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220629070917.3113016-3-yanaijie@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 11:09:53 -06:00
Jason Yan
362b8c16f8 blk-cgroup: factor out blkcg_iostat_update()
To reduce some duplicated code, factor out blkcg_iostat_update(). No
functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220629070917.3113016-2-yanaijie@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 11:09:53 -06:00
Christoph Hellwig
22d0c4080f block: simplify disk_set_independent_access_ranges
Lift setting disk->ia_ranges from disk_register_independent_access_ranges
into disk_set_independent_access_ranges, and make the behavior the same
for the registered vs non-registered queue cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220629062013.1331068-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:36:46 -06:00
Christoph Hellwig
6a27d28c81 block: move ->ia_ranges from the request_queue to the gendisk
Independent access ranges only matter for file system I/O and are only
valid with a registered gendisk, so move them there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220629062013.1331068-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:36:46 -06:00
Ying Sun
b9a1c179bd block: remove "select BLK_RQ_IO_DATA_LEN" from BLK_CGROUP_IOCOST dependency
The configuration item BLK_RQ_IO_DATA_LEN is not declared in the kernel.
Select BLK_RQ_IO_DATA_LEN is meaningless which could be removed.

Signed-off-by: Ying Sun <sunying@nj.iscas.ac.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220629062409.19458-1-sunying@nj.iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:35:57 -06:00
Matthew Wilcox (Oracle)
0e00fa5f83 block: Remove check of PageError
If read_mapping_page() sees a page with PageError set, it returns a
PTR_ERR().  Checking PageError again is simply dead code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29 08:51:06 -04:00
Christoph Hellwig
8682b92e5a blk-mq: cleanup disk sysfs registration
Pass a gendisk to the sysfs register/unregister functions and give
them descriptive names.  Also move the unregistration helper next
to the one doing the registration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
eaa870f975 blk-mq: rename blk_mq_sysfs_{,un}register
Add a _hctx postfix to better describe what the functions do, match
the debugfs equivalents and release the old names for functions that
should be using them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
81f0c2ef41 block: remove the extra gendisk reference in __blk_mq_register_dev
kobject_add already grabs a reference to the parent, no need to have
another one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
4a8d14bba4 block: use default groups to register the queue attributes
Set up the default_groups for blk_queue_ktype instead of manually calling
sysfs_create_group.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
060f131e9c block: remove a superflous queue kobject reference
kobject_add already adds a reference to the parent that is dropped
on deletion, so don't bother grabbing another one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
cc5c516df0 block: simplify blktrace sysfs attribute creation
Add the trace attributes to the default gendisk attributes, just like
we already do for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
8b9ab62662 block: remove blk_cleanup_disk
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:33:15 -06:00
Christoph Hellwig
6f8191fdf4 block: simplify disk shutdown
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.

Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.

This saves an extra queue freeze for devices without a separately
allocated queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:30:26 -06:00
Christoph Hellwig
0e3534022f block: stop setting the nomerges flags in blk_cleanup_queue
These flags only apply to file system I/O, and all file system I/O is
already drained by del_gendisk and thus can't be in progress when
blk_cleanup_queue is called.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:30:26 -06:00
Christoph Hellwig
1f90307e5f block: remove QUEUE_FLAG_DEAD
Disallow setting the blk-mq state on any queue that is already dying as
setting the state even then is a bad idea, and remove the now unused
QUEUE_FLAG_DEAD flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:30:26 -06:00
Liu Song
ee78ec1077 blk-mq: blk_mq_tag_busy is no need to return a value
Currently "blk_mq_tag_busy" return value has no effect, so adjust it.
Some code implementations have also been adjusted to enhance
readability.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1656170121-1619-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
a78418e6a0 block: Always initialize bio IO priority on submit
Currently, IO priority set in task's IO context is not reflected in the
bio->bi_ioprio for most IO (only io_uring and direct IO set it). This
results in odd results where process is submitting some bios with one
priority and other bios with a different (unset) priority and due to
differing priorities bios cannot be merged. Make sure bio->bi_ioprio is
always set on bio submission.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
9c6227e043 block: Initialize bio priority earlier
Bio's IO priority needs to be initialized before we try to merge the bio
with other bios. Otherwise we could merge bios which would otherwise
receive different IO priorities leading to possible QoS issues.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
82b74cac28 blk-ioprio: Convert from rqos policy to direct call
Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
f258654472 blk-ioprio: Remove unneeded field
blkcg->ioprio_set field is not really useful except for avoiding
possibly more expensive checks inside blkcg_ioprio_track(). The check
for blkcg->prio_policy being equal to POLICY_NO_CHANGE does the same
service so just remove the ioprio_set field and replace the check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
4b838d9ee9 block: Fix handling of tasks without ioprio in ioprio_get(2)
ioprio_get(2) can be asked to return the best IO priority from several
tasks (IOPRIO_WHO_PGRP, IOPRIO_WHO_USER). Currently the call treats
tasks without set IO priority as having priority
IOPRIO_CLASS_BE/IOPRIO_BE_NORM however this does not really reflect the
IO priority the task will get (which depends on task's nice value).

Fix the code to use the real IO priority task's IO will use. We have to
modify code for ioprio_get(IOPRIO_WHO_PROCESS) to keep returning
IOPRIO_CLASS_NONE priority for tasks without set IO priority as a
special case to maintain userspace visible API.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220623074840.5960-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
fc25545e17 block: Make ioprio_best() static
Nobody outside of block/ioprio.c uses it.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
893e5d32d5 block: Generalize get_current_ioprio() for any task
get_current_ioprio() operates only on current task. We will need the
same functionality for other tasks as well. Generalize
get_current_ioprio() for that and also move the bulk out of the header
file because it is large enough.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Jan Kara
e589f46445 block: fix default IO priority handling again
Commit e70344c059 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c059 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)).  In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c059. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.

CC: stable@vger.kernel.org
Fixes: e70344c059 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Sebastian Andrzej Siewior
3c8f9da41e blk-mq: Don't disable preemption around __blk_mq_run_hw_queue().
__blk_mq_delay_run_hw_queue() disables preemption to get a stable
current CPU number and then invokes __blk_mq_run_hw_queue() if the CPU
number is part the mask.

__blk_mq_run_hw_queue() acquires a spin_lock_t which is a sleeping lock
on PREEMPT_RT and can't be acquired with disabled preemption.

It is not required for correctness to invoke __blk_mq_run_hw_queue() on
a CPU matching hctx->cpumask. Both (async and direct requests) can run
on a CPU not matching hctx->cpumask.

The CPU mask without disabling preemption and invoking
__blk_mq_run_hw_queue().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YrLSEiNvagKJaDs5@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Bart Van Assche
1d87be8212 block: bfq: Fix kernel-doc headers
Fix the following warnings:

block/bfq-cgroup.c:721: warning: Function parameter or member 'bfqg' not described in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:721: warning: Excess function parameter 'blkcg' description in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:870: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_leaf_entity'
block/bfq-cgroup.c:900: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_active_queues'

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617210859.106623-1-bvanassche@acm.org
2022-06-27 06:29:12 -06:00
Bart Van Assche
c28c49b09e block: bfq: Remove an unused function definition
This patch is the result of the analysis of a sparse report.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617204433.102022-1-bvanassche@acm.org
2022-06-27 06:29:12 -06:00
GuoYong Zheng
6c77b152f5 bfq: Remove useless code in bfq_lookup_next_entity
It is no need to judge entity is null or not here,
directly return entity is ok, so remove it.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1655461684-19075-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:12 -06:00
Christoph Hellwig
2a9336c42a block: move blk_queue_get_max_sectors to blk.h
blk_queue_get_max_sectors is private to the block layer, so move it out
of blkdev.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Christoph Hellwig
efef739d5f block: fold blk_max_size_offset into get_max_io_size
Now that blk_max_size_offset has a single caller left, fold it into that
and clean up the naming convention for the local variables there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Christoph Hellwig
84613beda4 block: cleanup variable naming in get_max_io_size
get_max_io_size has a very odd choice of variables names and
initialization patterns.  Switch to more descriptive names and more
clear initialization of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614090934.570632-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Christoph Hellwig
c887519074 block: open code blk_max_size_offset in blk_rq_get_max_sectors
blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the
chunk_sectors argument, and already checks for max_sectors through the
call to blk_queue_get_max_sectors.  That means much of
blk_max_size_offset is not needed and open coding it simplifies the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Bart Van Assche
51ab80f0aa block: Make blk_mq_get_sq_hctx() select the proper hardware queue type
Since the introduction of blk_mq_get_hctx_type() the operation type in
the second argument of blk_mq_get_hctx_type() matters. The introduction
of blk_mq_get_hctx_type() caused blk_mq_get_sq_hctx() to select a
hardware queue of type HCTX_TYPE_READ instead of HCTX_TYPE_DEFAULT.
Switch to hardware queue type HCTX_TYPE_DEFAULT since HCTX_TYPE_READ
should only be used for read requests.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220615225549.1054905-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Bart Van Assche
7e923f40a4 block: Rename a blk_mq_map_queue() argument
Before the introduction of blk_mq_get_hctx_type(), blk_mq_map_queue()
only used the flags from its second argument. Since the introduction of
blk_mq_get_hctx_type(), blk_mq_map_queue() uses both the operation and
the flags encoded in that argument. Rename the second argument of
blk_mq_map_queue() to make this clear.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Bart Van Assche
62c159a03d blk-iocost: Simplify ioc_rqos_done()
Leave out the superfluous "& REQ_OP_MASK" code. The definition of req_op()
shows that that code is superfluous:

 #define req_op(req) ((req)->cmd_flags & REQ_OP_MASK)

Compile-tested only.

Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Bo Liu
798f2a6f73 block: Directly use ida_alloc()/free()
Use ida_alloc()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20220615081816.4342-1-liubo03@inspur.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
b1a000d3b8 block: relax direct io memory alignment
Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size. User
space can discover the alignment requirements from the dma_alignment
queue attribute.

User space can specify any hardware compatible DMA offset for each
segment, but every segment length is still required to be a multiple of
the block size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-11-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
9cfe3ddecd block/bounce: count bytes instead of sectors
Individual bv_len's may not be a sector size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-8-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
67927d2201 block/merge: count bytes instead of sectors
Individual bv_len's may not be a sector size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-7-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
37fee2e42e block: add a helper function for dio alignment
This will make it easier to add more complex acceptable alignment
criteria in the future.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-6-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
3850e13f28 block: export dma_alignment attribute
User space may want to know how to align their buffers to avoid
bouncing. Export the queue attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-4-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
c58c0074c5 block/bio: remove duplicate append pages code
The getting pages setup for zone append and normal IO are identical. Use
common code for each.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-3-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Keith Busch
b82d9fa257 block: fix infinite loop for invalid zone append
Returning 0 early from __bio_iov_append_get_pages() for the
max_append_sectors warning just creates an infinite loop since 0 means
success, and the bio will never fill from the unadvancing iov_iter. We
could turn the return into an error value, but it will already be turned
into an error value later on, so just remove the warning. Clearly no one
ever hit it anyway.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-2-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Li Nan
ca2a3343d6 block: remove WARN_ON() from bd_link_disk_holder
Since commit 83cbce957446("block: add error handling for device_add_disk /
add_disk"), bdev->bd_holder_dir can not be empty now, so remove WARN_ON()
from bd_link_disk_holder.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074100.2251301-1-linan122@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-23 07:48:05 -06:00
Bo Liu
873cdda193 scsi: core: bsg: Remove usage of the deprecated ida_simple_xxx() API
Use ida_alloc_xxx()/ida_free() instead of
ida_simple_get()/ida_simple_remove().

The latter is deprecated and more verbose.

Link: https://lore.kernel.org/r/20220617020430.2300-1-liubo03@inspur.com
Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-06-21 21:22:51 -04:00
Jens Axboe
2645672ffe block: pop cached rq before potentially blocking rq_qos_throttle()
If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.

Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.

Fixes: 0a5aa8d161 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 10:59:58 -06:00
Damien Le Moal
9243fc4cd2 block: remove queue from struct blk_independent_access_range
The request queue pointer in struct blk_independent_access_range is
unused. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Fixes: 41e46b3c2a ("block: Fix potential deadlock in blk_ia_range_sysfs_show()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220603053529.76405-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-19 18:40:11 -06:00
Christoph Hellwig
a09b314005 block: freeze the queue earlier in del_gendisk
Freeze the queue earlier in del_gendisk so that the state does not
change while we remove debugfs and sysfs files.

Ming mentioned that being able to observer request in debugfs might
be useful while the queue is being frozen in del_gendisk, which is
made possible by this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig
99d055b4fd block: remove per-disk debugfs files in blk_unregister_queue
The block debugfs files are created in blk_register_queue, which is
called by add_disk and use a naming scheme based on the disk_name.
After del_gendisk returns that name can be reused and thus we must not
leave these debugfs files around, otherwise the kernel is unhappy
and spews messages like:

	Directory XXXXX with parent 'block' already present!

and the newly created devices will not have working debugfs files.

Move the unregistration to blk_unregister_queue instead (which matches
the sysfs unregistration) to make sure the debugfs life time rules match
those of the disk name.

As part of the move also make sure the whole debugfs unregistration is
inside a single debugfs_mutex critical section.

Note that this breaks blktests block/002, which checks that the debugfs
directory has not been removed while blktests is running, but that
particular check should simply be removed from the test case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig
5cf9c91ba9 block: serialize all debugfs operations using q->debugfs_mutex
Various places like I/O schedulers or the QOS infrastructure try to
register debugfs files on demans, which can race with creating and
removing the main queue debugfs directory.  Use the existing
debugfs_mutex to serialize all debugfs operations that rely on
q->debugfs_dir or the directories hanging off it.

To make the teardown code a little simpler declare all debugfs dentry
pointers and not just the main one uncoditionally in blkdev.h.

Move debugfs_mutex next to the dentries that it protects and document
what it is used for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614074827.458955-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Christoph Hellwig
50e34d7881 block: disable the elevator int del_gendisk
The elevator is only used for file system requests, which are stopped in
del_gendisk.  Move disabling the elevator and freeing the scheduler tags
to the end of del_gendisk instead of doing that work in disk_release and
blk_cleanup_queue to avoid a use after free on q->tag_set from
disk_release as the tag_set might not be alive at that point.

Move the blk_qos_exit call as well, as it just depends on the elevator
exit and would be the only reason to keep the not exactly cheap queue
freeze in disk_release.

Fixes: e155b0c238 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20220614074827.458955-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 07:31:05 -06:00
Bart Van Assche
b96f3cab59 block/bfq: Enable I/O statistics
BFQ uses io_start_time_ns. That member variable is only set if I/O
statistics are enabled. Hence this patch that enables I/O statistics
at the time BFQ is associated with a request queue.

Compile-tested only.

Reported-by: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 16:59:28 -06:00
Ming Lei
6cfeadbff3 blk-mq: don't clear flush_rq from tags->rqs[]
commit 364b61818f ("blk-mq: clearing flush request reference in
tags->rqs[]") is added to clear the to-be-free flush request from
tags->rqs[] for avoiding use-after-free on the flush rq.

Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time
by ~8s because running scsi probe which may create and remove lots of
unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and
each request queue has lots of hw queues.

Improve the situation by not running blk_mq_clear_flush_rq_mapping if
disk isn't added when there can't be any flush request issued.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Ming Lei
4d337cebcb blk-mq: avoid to touch q->elevator without any protection
q->elevator is referred in blk_mq_has_sqsched() without any protection,
no .q_usage_counter is held, no queue srcu and rcu read lock is held,
so potential use-after-free may be triggered.

Fix the issue by adding one queue flag for checking if the elevator
uses single queue style dispatch. Meantime the elevator feature flag
of ELEVATOR_F_MQ_AWARE isn't needed any more.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Ming Lei
5fd7a84a09 blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none
elevator can be tore down by sysfs switch interface or disk release, so
hold ->sysfs_lock before referring to q->elevator, then potential
use-after-free can be avoided.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:45:15 -06:00
Bart Van Assche
14dc7a18ab block: Fix handling of offline queues in blk_mq_alloc_request_hctx()
This patch prevents that test nvme/004 triggers the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 dump_stack+0x10/0x12
 ubsan_epilogue+0x9/0x3b
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 blk_mq_alloc_request_hctx+0x304/0x310
 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
 nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
 nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
 nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
 nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
 nvmf_dev_write+0xae/0x111 [nvme_fabrics]
 vfs_write+0x144/0x560
 ksys_write+0xb7/0x140
 __x64_sys_write+0x42/0x50
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 14:43:31 -06:00
Al Viro
91b94c5d6a iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb).  Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-10 16:05:15 -04:00
Christoph Hellwig
d5a37b1998 block: remove bioset_init_from_src
Unused now, and the interface never really made a whole lot of sense to
start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-08 14:04:14 -04:00
Linus Torvalds
78c6499c92 for-5.19/drivers-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmkoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqyrD/4iyg2ULBPyljLoM3Ed8AONbrFBApenKDFN
 FjiFRZNll8zvJLTtP0GQqJIrljPuRlqb0IkbgqXl+yvPZ+wpB9HgIe2aohkOaqqJ
 KS49UNR/aIHwC2y7lwlcsdgVqqoPdc4wnZeaQvCsWPCBhCca/k0kR7s3uEhHMK92
 OWpV9osl/thLfOBwwt4IEaO1Koz8PM/fCR4XA2KLbs8E4P8EcSFglqi0ap7foLNr
 pQAJIlPjkmF6nw4xg5fdBjVBo//kVcuf9IBMi5/XinmUL1taFAcn5WyeOvBbi0Fs
 Sqp/pKkveM8xWZKrDyA/wf8nzRNpBBl6TOQEMFtV6FkZrij3pbKHlCiyL2gBR13e
 5gkbVvXgtdqDVlnqlvIV/Swfh5YQFtn7+vlHFOUjP+iObsBRo4fTvDhPTLoO/VCf
 tIAA7xnq/pquRKS/QGGC7ZxRVc3T1r+EpvbBP5Dc4CDGbbbyZLCSrOh5HWIb0I3k
 95GSiipTtf54KqZ9HiG/u+xNAFIdapXgU4Xm+JyDRWLdxSs5Nmy8VgpdflvxBfuo
 hCvJJw3vtDusjyHc7IafxaZlJVQT9tPgPshs8GfrCMCP19RCALD/5irspFh6dD35
 BQTIkhC68XNa0iNn/NTP3uxir/JwRoovxQkA9eD+r1NHsAbL8GTypfr5kKJxDaIK
 UhawfyZE3Q==
 =YqhC
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:
 "A collection of stragglers that were late on sending in their changes
  and just followup fixes.

   - NVMe fixes pull request via Christoph:
       - set controller enable bit in a separate write (Niklas Cassel)
       - disable namespace identifiers for the MAXIO MAP1001 (Christoph)
       - fix a comment typo (Julia Lawall)"

   - MD fixes pull request via Song:
       - Remove uses of bdevname (Christoph Hellwig)
       - Bug fixes (Guoqing Jiang, and Xiao Ni)

   - bcache fixes series (Coly)

   - null_blk zoned write fix (Damien)

   - nbd fixes (Yu, Zhang)

   - Fix for loop partition scanning (Christoph)"

* tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits)
  block: null_blk: Fix null_zone_write()
  nvmet: fix typo in comment
  nvme: set controller enable bit in a separate write
  nvme-pci: disable namespace identifiers for the MAXIO MAP1001
  bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()
  nbd: use pr_err to output error message
  nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
  nbd: fix io hung while disconnecting device
  nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
  nbd: fix race between nbd_alloc_config() and module removal
  nbd: call genl_unregister_family() first in nbd_cleanup()
  md: bcache: check the return value of kzalloc() in detached_dev_do_request()
  bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
  block, loop: support partitions without scanning
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: improve multithreaded bch_btree_check()
  md: fix double free of io_acct_set bioset
  md: Don't set mddev private to NULL in raid0 pers->free
  ...
2022-06-03 10:25:56 -07:00
Linus Torvalds
72fbbc3d0e for-5.19/block-exec-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmh0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqg6EACCbkwoH7rrr38iU++xP3c9oqFJCfYR95ho
 qn9/3FiPulua0Dwg8Fbp12ubqqBy/iNj+4Mk7XTo28P7ahjGtKJec2DZguDuHC5X
 G3kucgQcDOLs1IMWoil+KrnjGC8qeT9ZPFNaUF0IY084NPxnj1wAOjo1J00QVieN
 WFgHX1sxzBje8abebf3UxAyXImzfyY2uXbp1F3thzf0ZwHXkSDsbWI3fvpdYF4QC
 p3z6CX0sR+5v7ZLWF3X6H8MBSO+eRlprYji3O/0jVslLBAS8FlTdizQtzx7C6Hsv
 JZVY4ZsUswYxtsHCBR0McglDeu/iXZRZ9HX4iiOobYJNaXfycMltvS+4Tb/TsFTB
 GaG6tbL4JS+NT063ctl5h355vUVhIbw6qEhsiF47+0hgawvRr/xxP0aSq1MPmfjw
 OgG4Jn0htXF47tpKnszfaj3BmgvgV56mV0IGwF5Sh5NXDnHF+MHFmrhRsP1NenjL
 12FTnvWGYyTRGDVIVFDkuwaI9o9iNKdFw0JkKIZa/G5RVmmjukvMGTvvSnKmSxJg
 dgbYLSBA2ZxCcIPjJDvZroe3QxvGNyqUYtxwyWl4a1HK/qljZfwwyRE1rfjJ47hK
 F8jNEkOThcjr1anoV2nvSLE1mM3SyA/UqDsntIdwnUlG/ObYsByTgtKc+ETs1RzS
 8Ovp6lv0Mw==
 =lBeb
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block

Pull block request execute cleanups from Jens Axboe:
 "This change was advertised in the initial core block pull request, but
  didn't actually make that branch as we deferred it to a post-merge
  pull request to avoid a bunch of cross branch issues.

  This series cleans up the block execute path quite nicely"

* tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block:
  blk-mq: remove the done argument to blk_execute_rq_nowait
  blk-mq: avoid a mess of casts for blk_end_sync_rq
  blk-mq: remove __blk_execute_rq_nowait
2022-06-03 10:21:43 -07:00
Linus Torvalds
34845d92bc for-5.19/block-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmfQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsLlEACPbK/ms8dMDwKjfEF/RMoc7uL/j6oC0cpf
 0D2sfMka8D41QdrUfMiUismXZ61dyKdsiX/U/Q0gcjIomnlco8ZeLcLa6DlafjwY
 DtvO2aCb+eBAkII5sX2WM4ANNgFTy08Y4wBmgEy5En5u4nPlIGZ8DsulQQodqygx
 1lJh31OXQKw+2kIyUdAeC0GMiD9nddYDsH0CTFDSZsAijCcOBDOHbDPk27wHapzM
 GR1UAK5/SA7RfZgIMRHHclF6Ea49/uPJ45crD1T+8p6jLW+ldbxpiRD3ux9BnK2v
 U7EWS5MLMFAvb/nTLc8T37srJuEhBAT0r2bn614rjOiJofalPeD0eDeHfz4vRpPe
 +qTQREtpBUtJizYN+8rpcxP8f9S/hmPOBvIKD3XC0TlOo1NCf35fqWLWMli2hkTQ
 AfcY1auKjC/UYcnR0TQ91aHo1puM4fK5Pdc6lDGznrcxy9t1g1NvKAEL9Y3xK3No
 paglrliBCUbAN8vogKr4jc7jRkh/GLEqkxV2LIpOVp3lyT9GepvYM1xLQ8X/rszn
 /Il3fAwf5AyP+1RoVcmmOy1XW0ptUbKXWn03NlxN55Ya8x3tKCwWWDSmL2CP8SwV
 Vo5Qt+rKUkqA/TmHW8HOd7i+44Sa8oD/6WpSSPkwXN2cgRQmvmtaGmpXCKNTn5tk
 PMgFJOq3uw==
 =7JDU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Just a collection of fixes that have been queued up since the initial
  merge window pull request, the majority of which are targeted for
  stable as well.

  One bio_set fix that fixes an issue with the dm adoption of cached bio
  structs that got introduced in this merge window"

* tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block:
  block: Fix potential deadlock in blk_ia_range_sysfs_show()
  block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
  block: remove useless BUG_ON() in blk_mq_put_tag()
  blk-mq: do not update io_ticks with passthrough requests
  block: make bioset_exit() fully resilient against being called twice
  block: use bio_queue_enter instead of blk_queue_enter in bio_poll
  block: document BLK_STS_AGAIN usage
  block: take destination bvec offsets into account in bio_copy_data_iter
  blk-iolatency: Fix inflight count imbalances and IO hangs on offline
  blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx
2022-06-03 10:14:48 -07:00
Damien Le Moal
41e46b3c2a block: Fix potential deadlock in blk_ia_range_sysfs_show()
When being read, a sysfs attribute is already protected against removal
with the kobject node active reference counter. As a result, in
blk_ia_range_sysfs_show(), there is no need to take the queue sysfs
lock when reading the value of a range attribute. Using the queue sysfs
lock in this function creates a potential deadlock situation with the
disk removal, something that a lockdep signals with a splat when the
device is removed:

[  760.703551]  Possible unsafe locking scenario:
[  760.703551]
[  760.703554]        CPU0                    CPU1
[  760.703556]        ----                    ----
[  760.703558]   lock(&q->sysfs_lock);
[  760.703565]                                lock(kn->active#385);
[  760.703573]                                lock(&q->sysfs_lock);
[  760.703579]   lock(kn->active#385);
[  760.703587]
[  760.703587]  *** DEADLOCK ***

Solve this by removing the mutex_lock()/mutex_unlock() calls from
blk_ia_range_sysfs_show().

Fixes: a2247f19ee ("block: Add independent access ranges support")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220603021905.1441419-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 23:02:37 -06:00
Jan Kara
22b106e535 block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
Commit d92c370a16 ("block: really clone the block cgroup in
bio_clone_blkg_association") changed bio_clone_blkg_association() to
just clone bio->bi_blkg reference from source to destination bio. This
is however wrong if the source and destination bios are against
different block devices because struct blkcg_gq is different for each
bdev-blkcg pair. This will result in IOs being accounted (and throttled
as a result) multiple times against the same device (src bdev) while
throttling of the other device (dst bdev) is ignored. In case of BFQ the
inconsistency can even result in crashes in bfq_bic_update_cgroup().
Fix the problem by looking up correct blkcg_gq for the cloned bio.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d92c370a16 ("block: really clone the block cgroup in bio_clone_blkg_association")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 02:15:05 -06:00
Damien Le Moal
ff47dbd18b block: remove useless BUG_ON() in blk_mq_put_tag()
Since the if condition in blk_mq_put_tag() checks that the tag to put is
not a reserved one, the BUG_ON() check in the else branch checking if
the tag is indeed a reserved one is useless. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-02 02:05:56 -06:00
Haisu Wang
b81c14ca14 blk-mq: do not update io_ticks with passthrough requests
Flush or passthrough requests are not accounted as normal IO in completion.
To reflect iostat for slow IO, io_ticks is updated when stat show called
based on inflight numbers.
It may cause inconsistent io_ticks calculation result.

So do not account non-passthrough request when check inflight.

Fixes: 86d7331299 ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-30 07:13:12 -06:00
Jens Axboe
605f7415ec block: make bioset_exit() fully resilient against being called twice
Most of bioset_exit() is fine being called twice, as it clears the
various allocations etc when they are freed. The exception is
bio_alloc_cache_destroy(), which does not clear ->cache when it has
freed it.

This isn't necessarily a bug, but can be if buggy users does call the
exit path more then once, or with just a memset() bioset which has
never been initialized. dm appears to be one such user.

Fixes: be4d234d7a ("bio: add allocation cache abstraction")
Link: https://lore.kernel.org/linux-block/YpK7m+14A+pZKs5k@casper.infradead.org/
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-29 07:36:31 -06:00
Christoph Hellwig
e2e5308672 blk-mq: remove the done argument to blk_execute_rq_nowait
Let the caller set it together with the end_io_data instead of passing
a pointless argument.  Note the the target code did in fact already
set it and then just overrode it again by calling blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig
32ac5a9b8b blk-mq: avoid a mess of casts for blk_end_sync_rq
Instead of trying to cast a __bitwise 32-bit integer to a larger integer
and then a pointer, just allow a struct with the blk_status_t and the
completion on stack and set the end_io_data to that.  Use the
opportunity to move the code to where it belongs and drop rather
confusing comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig
ae948fd6d0 blk-mq: remove __blk_execute_rq_nowait
We don't want to plug for synchronous execution that where we immediately
wait for the request.  Once that is done not a whole lot of code is
shared, so just remove __blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:15:27 -06:00
Christoph Hellwig
ebd076bf7d block: use bio_queue_enter instead of blk_queue_enter in bio_poll
We want to have a valid live gendisk to call ->poll and not just a
request_queue, so call the right helper.

Fixes: 3e08773c38 ("block: switch polling to be bio based")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220523124302.526186-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:14:35 -06:00
Christoph Hellwig
403d50341c block: take destination bvec offsets into account in bio_copy_data_iter
Appartly bcache can copy into bios that do not just contain fresh
pages but can have offsets into the bio_vecs.  Restore support for tht
in bio_copy_data_iter.

Fixes: f8b679a070 ("block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220524143919.1155501-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:35:55 -06:00
Christoph Hellwig
b9684a71fc block, loop: support partitions without scanning
Historically we did distinguish between a flag that surpressed partition
scanning, and a combinations of the minors variable and another flag if
any partitions were supported.  This was generally confusing and doesn't
make much sense, but some corner case uses of the loop driver actually
do want to support manually added partitions on a device that does not
actively scan for partitions.  To make things worsee the loop driver
also wants to dynamically toggle the scanning for partitions on a live
gendisk, which makes the disk->flags updates non-atomic.

Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables
just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART
in the loop driver.

Fixes: 1ebe2e5f9d ("block: remove GENHD_FL_EXT_DEVT")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 06:39:00 -06:00
Tejun Heo
8a177a36da blk-iolatency: Fix inflight count imbalances and IO hangs on offline
iolatency needs to track the number of inflight IOs per cgroup. As this
tracking can be expensive, it is disabled when no cgroup has iolatency
configured for the device. To ensure that the inflight counters stay
balanced, iolatency_set_limit() freezes the request_queue while manipulating
the enabled counter, which ensures that no IO is in flight and thus all
counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled
counter is manipulated. iolatency_pd_offline() can also dec the counter and
trigger disabling. As this disabling happens without freezing the q, this
can easily happen while some IOs are in flight and thus leak the counts.

This can be easily demonstrated by turning on iolatency on an one empty
cgroup while IOs are in flight in other cgroups and then removing the
cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
system to ensure that removing the cgroup disables iolatency for the whole
device.

The following keeps flipping on and off iolatency on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

  while True:
    for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
        for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
            blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
            pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
            if pd.value_() == 0:
                continue
            iolat = container_of(pd, 'struct iolatency_grp', 'pd')
            inflight = iolat.rq_wait.inflight.counter.value_()
            if inflight:
                print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                      f'{cgroup_path(css.cgroup).decode("utf-8")}')
    time.sleep(1)

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
will never get issued as there's no completion event to wake it up leading
to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item which
is automatically kicked off from iolatency_set_min_lat_nsec() which is
called from both iolatency_set_limit() and iolatency_pd_offline() paths.
Punting to a work item is necessary as iolatency_pd_offline() is called
under spinlocks while freezing a request_queue requires a sleepable context.

This also simplifies the code reducing LOC sans the comments and avoids the
unnecessary freezes which were happening whenever a cgroup's latency target
is newly set or cleared.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Cc: stable@vger.kernel.org # v5.0+
Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-26 11:43:00 -06:00
Linus Torvalds
fdaf9a5840 Page cache changes for 5.19
- Appoint myself page cache maintainer
 
  - Fix how scsicam uses the page cache
 
  - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
 
  - Remove the AOP flags entirely
 
  - Remove pagecache_write_begin() and pagecache_write_end()
 
  - Documentation updates
 
  - Convert several address_space operations to use folios:
    - is_dirty_writeback
    - readpage becomes read_folio
    - releasepage becomes release_folio
    - freepage becomes free_folio
 
  - Change filler_t to require a struct file pointer be the first argument
    like ->read_folio
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
 gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
 xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
 nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
 WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
 =djLv
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

 - Appoint myself page cache maintainer

 - Fix how scsicam uses the page cache

 - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

 - Remove the AOP flags entirely

 - Remove pagecache_write_begin() and pagecache_write_end()

 - Documentation updates

 - Convert several address_space operations to use folios:
     - is_dirty_writeback
     - readpage becomes read_folio
     - releasepage becomes release_folio
     - freepage becomes free_folio

 - Change filler_t to require a struct file pointer be the first
   argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
  nilfs2: Fix some kernel-doc comments
  Appoint myself page cache maintainer
  fs: Remove aops->freepage
  secretmem: Convert to free_folio
  nfs: Convert to free_folio
  orangefs: Convert to free_folio
  fs: Add free_folio address space operation
  fs: Convert drop_buffers() to use a folio
  fs: Change try_to_free_buffers() to take a folio
  jbd2: Convert release_buffer_page() to use a folio
  jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
  reiserfs: Convert release_buffer_page() to use a folio
  fs: Remove last vestiges of releasepage
  ubifs: Convert to release_folio
  reiserfs: Convert to release_folio
  orangefs: Convert to release_folio
  ocfs2: Convert to release_folio
  nilfs2: Remove comment about releasepage
  nfs: Convert to release_folio
  jfs: Convert to release_folio
  ...
2022-05-24 19:55:07 -07:00
Linus Torvalds
850f6033cd Description for this pull request:
- fix referencing wrong parent directory information during rename.
  - introduce a sys_tz mount option to use system timezone.
  - improve performance while zeroing a cluster with dirsync mount option.
  - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmKLhPkWHGxpbmtpbmpl
 b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCOeXD/9nJofEx/n9KK0pA1WP2zKoLBnP
 YgLfnWTBlXfErInklW4kg057S6q8M0pDm0iASLw9P6GZNe0VFV5PigrTyAjy5ghW
 hm3JiFAHIZgaOlOk2NQd/1Qv/IdlnkbRkngXqHcizxEX/LcKZpP+pb1mdW6NzWrt
 /HagLClFTUhb0Su3DT7TCqiam5lI+lkarRI0Jo4Scstgsn4aT+25jk9N0bfUih7f
 hGKJpii+5UWCLlBJnyyghrBRQiiPdsETadJdRnHgeDdzKg/UNWxMP0C+G6PmSko/
 mScrR+FeH2toURSUESi1Q558z1+3Fhb8rMbl3aWV70FJDmzMwn9YyhPgBrX3x6Gb
 AF7UBHFvORStYRUmmSMbX9XkY2gNoI9qZMXghDRlgF8t/WWY8VeVnyslaWwqDQhw
 qXyOIThiCuhLfKTD+r+MM08oUPcyFBtuGvdzDOH7/b56zEDwzab+hSHZz94xPWEz
 ESk0hNhaJCEvcgEr7IlSSF4k5Ff+hWVKN4R/DD78yxmjHveOuNMibTeE2YgDIFEX
 SiKFaAiXUuWGVsAPuAeJ+np/7rW3OEBG8yKhtri5vsoXk+Mqd56rp0EkEHzqbVHQ
 Ki5gua549KNRuxnbXRtWLjCKucwN86mE45WD0P0ORnBOjlfmgg8adp6BBSW5yVD2
 SZkfgI0FL8rWdBwRcQ==
 =fr78
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - fix referencing wrong parent directory information during rename

 - introduce a sys_tz mount option to use system timezone

 - improve performance while zeroing a cluster with dirsync mount option

 - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot

* tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: check if cluster num is valid
  exfat: reduce block requests when zeroing a cluster
  block: add sync_blockdev_range()
  exfat: introduce mount option 'sys_tz'
  exfat: fix referencing wrong parent directory information after renaming
2022-05-24 18:30:27 -07:00
Linus Torvalds
5dc921868c for-5.19/drivers-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrTcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph/REAC0/7odRfJeTJ1PkJhSKFc7dhyS7rK4du2s
 3+z+H6Yeua2yVIJb0mYYGEJcOUUQ9nD2T9424n3NzDOw88U4y8Vg2YEH+UiJBuj4
 AJoxPNkQdxL7WzmwHmRNLCcOOFhISLqWiCJSr45d+LP1f6aO24Q9lewYWxtNA4TW
 mqb7Ne7e3Z77m9rmsCsZ26bzQHg1EEQ6qgjZM9tqMhOeTqYhmrqfrD9KtG8TIkpK
 N8277E5QcequHf7v6VpKqEOzf3d2kx55JaZdu+oxLPVMED3wJJFwcYF1/xmM7Fgx
 tp7xCjqqUHXwKvJNCFJpnvw+cXu0Ct7cWOIG4ROCvaTD4vBI1KzZLc0gO7pKFW0Y
 hNIlMXr4n8PmonS81tMV4TqmRWxedX/jxuaeJCVNr89PqYU4luPpigJZqv7rlGry
 KZUlktQot22M/7FC2MS6KhgbQKLPrRGTAEyY/JNwBHckCZiduWQFlmKLQ926xQIJ
 6vdjSzHK5MrT/d+yow3bGFxAJWloGJ+L+RsH0b+WikF81+6ic9P3AoStgbVilfKD
 6sbjcju8SShDlQ+W/Ocm0rHC+i/RDKT3QqItXgfhA/1FfMPODQGc/xcZg+AdTswn
 VSnUIkvk9/mTO0StilVfNJDfG1QkSpJ5Ilvs/DnIahZj6IG4QbJvtnVNbmQX6ptz
 AUB4DdGwXg==
 =geQL
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the driver updates queued up for 5.19. This contains:

   - NVMe pull requests via Christoph:
       - tighten the PCI presence check (Stefan Roese)
       - fix a potential NULL pointer dereference in an error path (Kyle
         Miller Smith)
       - fix interpretation of the DMRSL field (Tom Yan)
       - relax the data transfer alignment (Keith Busch)
       - verbose error logging improvements (Max Gurtovoy, Chaitanya
         Kulkarni)
       - misc cleanups (Chaitanya Kulkarni, Christoph)
       - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni)
       - add support for TP4084 - Time-to-Ready Enhancements (Christoph)

   - MD pull request via Song:
       - Improve annotation in raid5 code, by Logan Gunthorpe
       - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk
       - Other small fixes/cleanups

   - null_blk series making the configfs side much saner (Damien)

   - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng,
     Arnd, Cai)

   - Avoid using the system workqueue (and hence flushing it) in rnbd
     (Jack)

   - Avoid using the system workqueue (and hence flushing it) in aoe
     (Tetsuo)

   - Series fixing discard_alignment issues in drivers (Christoph)

   - Small series fixing drivers poking at disk->part0 for openers
     information (Christoph)

   - Series fixing deadlocks in loop (Christoph, Tetsuo)

   - Remove loop.h and add SPDX headers (Christoph)

   - Various fixes and cleanups (Julia, Xie, Yu)"

* tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits)
  mtip32xx: fix typo in comment
  nvme: set non-mdts limits in nvme_scan_work
  nvme: add support for TP4084 - Time-to-Ready Enhancements
  nvme: split the enum used for various register constants
  nbd: Fix hung on disconnect request if socket is closed before
  nvme-fabrics: add a request timeout helper
  nvme-pci: harden drive presence detect in nvme_dev_disable()
  nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
  nvme: mark internal passthru request RQF_QUIET
  nvme: remove unneeded include from constants file
  nvme: add missing status values to verbose logging
  nvme: set dma alignment to dword
  nvme: fix interpretation of DMRSL
  loop: remove most the top-of-file boilerplate comment from the UAPI header
  loop: remove most the top-of-file boilerplate comment
  loop: add a SPDX header
  loop: remove loop.h
  block: null_blk: Improve device creation with configfs
  block: null_blk: Cleanup messages
  block: null_blk: Cleanup device creation and deletion
  ...
2022-05-23 14:04:14 -07:00
Linus Torvalds
115cd47132 for-5.19/block-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17
 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s
 NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ
 aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A
 zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d
 sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE
 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2
 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td
 nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA
 sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv
 EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis
 iomvJ4yk0Q==
 =0Rw1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Here are the core block changes for 5.19. This contains:

   - blk-throttle accounting fix (Laibin)

   - Series removing redundant assignments (Michal)

   - Expose bio cache via the bio_set, so that DM can use it (Mike)

   - Finish off the bio allocation interface cleanups by dealing with
     the weirdest member of the family. bio_kmalloc combines a kmalloc
     for the bio and bio_vecs with a hidden bio_init call and magic
     cleanup semantics (Christoph)

   - Clean up the block layer API so that APIs consumed by file systems
     are (almost) only struct block_device based, so that file systems
     don't have to poke into block layer internals like the
     request_queue (Christoph)

   - Clean up the blk_execute_rq* API (Christoph)

   - Clean up various lose end in the blk-cgroup code to make it easier
     to follow in preparation of reworking the blkcg assignment for bios
     (Christoph)

   - Fix use-after-free issues in BFQ when processes with merged queues
     get moved to different cgroups (Jan)

   - BFQ fixes (Jan)

   - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
     Wolfgang, me)"

* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...
2022-05-23 13:56:39 -07:00
Linus Torvalds
9836e93c0a for-5.19/io_uring-passthrough-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKovAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv9oD/4qCs7k3bPZZWZ6xoWb4EObyyWOUifi26lp
 vpsJHFUbA67S/i4++LV9H18SazWJ7h08ac4bjgZ+NQz40/1WkTN8/Fa76jo+BnNK
 7T10Wp4Ak6uwWVrKaA81pnT+G9+xmHlJ3X27aKxzLuT7BEPpShZ6ouFVjTkx9CzN
 LrLjuCDTOBBN+ZoaroWYfdLwTQX2VCAl9B15lOtQIlFvuuU8VlrvLboY+80K8TvY
 1wvTA2HTjnXoYx+/cTTMIFZIwQH3r1hsbwEDD8/YJj1+ouhSRQ1b0p/nk2pA+3ws
 HF5r/YS/rLBjlPF094IzeOBaUyA433AN1VhZqnII8ek7ViT3W3x+BRrgE9O6ZkWT
 0AjX1BXReI5rdFmxBmwsSdBnrSoGaJOf2GdsCCdubXBIi+F/RvyajrPf7PTB5zbW
 9WEK/uy3xvZsRVkUGAzOb9QGdvjcllgMzwPJsDegDCw5PdcPdT3mzy6KGIWipFLp
 j8R+br7hRMpOJv/YpihJDMzSDkQ/r1/SCwR4fpLid/QdSHG/eRTQK6c4Su5bNYEy
 QDy2F6kQdBVtEJCQHcEOsbhXzSTNBcdB+ujUUM5653FkaHe6y4JbomLrsNx407Id
 i/4ROwA5K1dioJx503Eap+OhbI5rV+PFytJTwxvLrNyVGccwbH2YOVq80fsVBP2e
 cZbn6EX4Vg==
 =/peE
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring NVMe command passthrough from Jens Axboe:
 "On top of everything else, this adds support for passthrough for
  io_uring.

  The initial feature for this is NVMe passthrough support, which allows
  non-filesystem based IO commands and admin commands.

  To support this, io_uring grows support for SQE and CQE members that
  are twice as big, allowing to pass in a full NVMe command without
  having to copy data around. And to complete with more than just a
  single 32-bit value as the output"

* tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits)
  io_uring: cleanup handling of the two task_work lists
  nvme: enable uring-passthrough for admin commands
  nvme: helper for uring-passthrough checks
  blk-mq: fix passthrough plugging
  nvme: add vectored-io support for uring-cmd
  nvme: wire-up uring-cmd support for io-passthru on char-device.
  nvme: refactor nvme_submit_user_cmd()
  block: wire-up support for passthrough plugging
  fs,io_uring: add infrastructure for uring-cmd
  io_uring: support CQE32 for nop operation
  io_uring: enable CQE32
  io_uring: support CQE32 in /proc info
  io_uring: add tracing for additional CQE32 fields
  io_uring: overflow processing for CQE32
  io_uring: flush completions for CQE32
  io_uring: modify io_get_cqe for CQE32
  io_uring: add CQE32 completion processing
  io_uring: add CQE32 setup processing
  io_uring: change ring size calculation for CQE32
  io_uring: store add. return values for CQE32
  ...
2022-05-23 13:06:15 -07:00
Ming Lei
5d05426e2d blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx
blk_mq_run_hw_queues() could be run when there isn't queued request and
after queue is cleaned up, at that time tagset is freed, because tagset
lifetime is covered by driver, and often freed after blk_cleanup_queue()
returns.

So don't touch ->tagset for figuring out current default hctx by the mapping
built in request queue, so use-after-free on tagset can be avoided. Meantime
this way should be fast than retrieving mapping from tagset.

Cc: "yukuai (C)" <yukuai3@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Fixes: b6e68ee825 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-23 06:28:28 -06:00
Yuezhang Mo
97d6fb1b48 block: add sync_blockdev_range()
sync_blockdev_range() is to support syncing multiple sectors
with as few block device requests as possible, it is helpful
to make the block device to give full play to its performance.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23 11:17:30 +09:00
Julia Lawall
2aaf516084 blk-mq: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220521111145.81697-29-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-21 06:32:16 -06:00
Jan Kara
a249ca7dfb bfq: Remove bfq_requeue_request_body()
The function has only a single caller and two lines. Just remove it
since it is pointless and just harming readability.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:52:36 -06:00
Jan Kara
e79cf8892e bfq: Remove superfluous conversion from RQ_BIC()
We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq().
Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no
harm currently because struct io_iq is the first one in struct
bfq_io_cq.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:52:36 -06:00
Jan Kara
c5ac56bb61 bfq: Allow current waker to defend against a tentative one
The code in bfq_check_waker() ignores wake up events from the current
waker. This makes it more likely we select a new tentative waker
although the current one is generating more wake up events. Treat
current waker the same way as any other process and allow it to reset
the waker detection logic.

Fixes: 71217df39d ("block, bfq: make waker-queue detection more robust")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:52:36 -06:00
Jan Kara
f950667356 bfq: Relax waker detection for shared queues
Currently we look for waker only if current queue has no requests. This
makes sense for bfq queues with a single process however for shared
queues when there is a larger number of processes the condition that
queue has no requests is difficult to meet because often at least one
process has some request in flight although all the others are waiting
for the waker to do the work and this harms throughput. Relax the "no
queued request for bfq queue" condition to "the current task has no
queued requests yet". For this, we also need to start tracking number of
requests in flight for each task.

This patch (together with the following one) restores the performance
for dbench with 128 clients that regressed with commit c65e6fd460
("bfq: Do not let waker requests skip proper accounting") because
this commit makes requests of wakers properly enter BFQ queues and thus
these queues become ineligible for the old waker detection logic.
Dbench results:

         Vanilla 5.18-rc3        5.18-rc3 + revert      5.18-rc3 patched
Mean     1237.36 (   0.00%)      950.16 *  23.21%*      988.35 *  20.12%*

Numbers are time to complete workload so lower is better.

Fixes: c65e6fd460 ("bfq: Do not let waker requests skip proper accounting")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:52:33 -06:00
Jens Axboe
1305e2c9d9 blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
A previous commit got rid of unnecessary rcu_read_lock() inside the
IRQ disabling queue_lock, but this debug statement was left. It's now
firing since we are indeed not inside a RCU read lock, but we don't
need to be as we're still preempt safe.

Get rid of the check, as we have a lockdep assert for holding the
queue lock right after it anyway.

Link: https://lore.kernel.org/linux-block/46253c48-81cb-0787-20ad-9133afdd9e21@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: 77c570a1ea ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 16:32:00 -06:00
Laibin Qiu
5a011f889b blk-throttle: Set BIO_THROTTLED when bio has been throttled
1.In current process, all bio will set the BIO_THROTTLED flag
after __blk_throtl_bio().

2.If bio needs to be throttled, it will start the timer and
stop submit bio directly. Bio will submit in
blk_throtl_dispatch_work_fn() when the timer expires.But in
the current process, if bio is throttled. The BIO_THROTTLED
will be set to bio after timer start. If the bio has been
completed, it may cause use-after-free blow.

BUG: KASAN: use-after-free in blk_throtl_bio+0x12f0/0x2c70
Read of size 2 at addr ffff88801b8902d4 by task fio/26380

 dump_stack+0x9b/0xce
 print_address_description.constprop.6+0x3e/0x60
 kasan_report.cold.9+0x22/0x3a
 blk_throtl_bio+0x12f0/0x2c70
 submit_bio_checks+0x701/0x1550
 submit_bio_noacct+0x83/0xc80
 submit_bio+0xa7/0x330
 mpage_readahead+0x380/0x500
 read_pages+0x1c1/0xbf0
 page_cache_ra_unbounded+0x471/0x6f0
 do_page_cache_ra+0xda/0x110
 ondemand_readahead+0x442/0xae0
 page_cache_async_ra+0x210/0x300
 generic_file_buffered_read+0x4d9/0x2130
 generic_file_read_iter+0x315/0x490
 blkdev_read_iter+0x113/0x1b0
 aio_read+0x2ad/0x450
 io_submit_one+0xc8e/0x1d60
 __se_sys_io_submit+0x125/0x350
 do_syscall_64+0x2d/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Allocated by task 26380:
 kasan_save_stack+0x19/0x40
 __kasan_kmalloc.constprop.2+0xc1/0xd0
 kmem_cache_alloc+0x146/0x440
 mempool_alloc+0x125/0x2f0
 bio_alloc_bioset+0x353/0x590
 mpage_alloc+0x3b/0x240
 do_mpage_readpage+0xddf/0x1ef0
 mpage_readahead+0x264/0x500
 read_pages+0x1c1/0xbf0
 page_cache_ra_unbounded+0x471/0x6f0
 do_page_cache_ra+0xda/0x110
 ondemand_readahead+0x442/0xae0
 page_cache_async_ra+0x210/0x300
 generic_file_buffered_read+0x4d9/0x2130
 generic_file_read_iter+0x315/0x490
 blkdev_read_iter+0x113/0x1b0
 aio_read+0x2ad/0x450
 io_submit_one+0xc8e/0x1d60
 __se_sys_io_submit+0x125/0x350
 do_syscall_64+0x2d/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
 kasan_save_stack+0x19/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x1b/0x30
 __kasan_slab_free+0x111/0x160
 kmem_cache_free+0x94/0x460
 mempool_free+0xd6/0x320
 bio_free+0xe0/0x130
 bio_put+0xab/0xe0
 bio_endio+0x3a6/0x5d0
 blk_update_request+0x590/0x1370
 scsi_end_request+0x7d/0x400
 scsi_io_completion+0x1aa/0xe50
 scsi_softirq_done+0x11b/0x240
 blk_mq_complete_request+0xd4/0x120
 scsi_mq_done+0xf0/0x200
 virtscsi_vq_done+0xbc/0x150
 vring_interrupt+0x179/0x390
 __handle_irq_event_percpu+0xf7/0x490
 handle_irq_event_percpu+0x7b/0x160
 handle_irq_event+0xcc/0x170
 handle_edge_irq+0x215/0xb20
 common_interrupt+0x60/0x120
 asm_common_interrupt+0x1e/0x40

Fix this by move BIO_THROTTLED set into the queue_lock.

Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220301123919.2381579-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17 19:32:10 -06:00
Fanjun Kong
77c570a1ea blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
spin_lock_irq/spin_unlock_irq contains preempt_disable/enable().
Which can serve as RCU read-side critical region, so remove
rcu_read_lock/unlock().

Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220516173930.159535-1-bh1scw@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17 06:12:23 -06:00
Wolfgang Bumiller
3607849df4 blk-cgroup: always terminate io.stat lines
With the removal of seq_get_buf in blkcg_print_one_stat, we
cannot make adding the newline conditional on there being
relevant stats because the name was already written out
unconditionally.
Otherwise we may end up with multiple device names in one
line which is confusing and doesn't follow the nested-keyed
file format.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 252c651a4c ("blk-cgroup: stop using seq_get_buf")
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220111083159.42340-1-w.bumiller@proxmox.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17 06:11:17 -06:00
Yu Kuai
ddc25c86b4 block, bfq: make bfq_has_work() more accurate
bfq_has_work() is using busy_queues currently, which is not accurate
because bfq_queue is busy doesn't represent that it has requests. Since
bfqd aready has a counter 'queued' to record how many requests are in
bfq, use it instead of busy_queues.

Noted that bfq_has_work() can be called with 'bfqd->lock' held, thus the
lock can't be held in bfq_has_work() to protect 'bfqd->queued'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220513023507.2625717-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16 11:39:20 -06:00
Yu Kuai
181490d532 block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
If bfq_schedule_dispatch() is called from bfq_idle_slice_timer_body(),
then 'bfqd->queued' is read without holding 'bfqd->lock'. This is
wrong since it can be wrote concurrently.

Fix the problem by holding 'bfqd->lock' in such case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220513023507.2625717-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16 11:39:20 -06:00
Christoph Hellwig
a3e7689bfa block: cleanup the VM accounting in submit_bio
submit_bio uses some extremely convoluted checks and confusing comments
to only account REQ_OP_READ/REQ_OP_WRITE comments.  Just switch to the
plain obvious checks instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220516063654.2782792-1-hch@lst.de
[axboe: fixup WRITE -> REQ_OP_WRITE]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16 11:37:50 -06:00
Bart Van Assche
725f22a147 block/mq-deadline: Set the fifo_time member also if inserting at head
Before commit 322cff70d4 the fifo_time member of requests on a dispatch
list was not used. Commit 322cff70d4 introduces code that reads the
fifo_time member of requests on dispatch lists. Hence this patch that sets
the fifo_time member when adding a request to a dispatch list.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Fixes: 322cff70d4 ("block/mq-deadline: Prioritize high-priority requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 17:02:46 -06:00
Ming Lei
a327c341dc blk-mq: fix passthrough plugging
First we can't add request into plug list in blk_mq_request_bypass_insert
which may be called when flushing plug list, so nested plug is caused.

Second if polled passthrough request is inserted via blk_execute_rq(),
it can't be added to plug list too since io polling needs the request
to be issued to driver.

Fixes the two by moving plugging into blk_execute_rq_no_wait().

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 1c2d2fff6d ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-12 08:45:39 -06:00
Chengming Zhou
2a371f7d5f blk-iocost: combine local_stat and desc_stat to stat
When we flush usage, wait, indebt stat in iocg_flush_stat(), we use
local_stat and desc_stat, which has no point since the leaf iocg
only has local_stat and the inner iocg only has desc_stat. Also
we don't need to flush percpu abs_vusage for these inner iocgs.

This patch combine local_stat and desc_stat to stat, only flush
percpu abs_vusage for active leaf iocgs, then build inner walk
list to propagate.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220510034757.21761-1-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 17:05:52 -06:00
Jens Axboe
1c2d2fff6d block: wire-up support for passthrough plugging
Add support for plugging in passthrough path. When plugging is enabled, the
requests are added to a plug instead of getting dispatched to the driver.
And when the plug is finished, the whole batch gets dispatched via
->queue_rqs which turns out to be more efficient. Otherwise dispatching
used to happen via ->queue_rq, one request at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 07:40:58 -06:00
Matthew Wilcox (Oracle)
2c69e20579 fs: Convert block_read_full_page() to block_read_full_folio()
This function is NOT converted to handle large folios, so include
an assert that the filesystem isn't passing one in.  Otherwise, use
the folio functions instead of the page functions, where they exist.
Convert all filesystems which use block_read_full_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 16:21:44 -04:00
Matthew Wilcox (Oracle)
9d6b0cd757 fs: Remove flags parameter from aops->write_begin
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
b3992d1e2e fs: Remove aop flags parameter from block_write_begin()
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Christoph Hellwig
069adbac2c block: improve the error message from bio_check_eod
Print the start sector and length separately instead of the combined
value to help with debugging.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220504143355.568660-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-04 18:30:10 -06:00
Christoph Hellwig
7ecc56c62b block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
Device mapper wants to allocate a bio before knowing the device it
gets send to, so add explicit support for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220504142950.567582-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-04 18:29:52 -06:00
Christoph Hellwig
513616843d block: remove superfluous calls to blkcg_bio_issue_init
blkcg_bio_issue_init is called in submit_bio.  There is no need to have
extra calls that just get overriden in __bio_clone and the two places
that copy and pasted from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220504142950.567582-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-04 18:29:52 -06:00
Christoph Hellwig
82778259eb blk-cgroup: cleanup blkcg_maybe_throttle_current
Use blkcg_css instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
d200ca143a blk-cgroup: cleanup blk_cgroup_congested
Use blkcg_css instead of open coding it, and switch to a slightly
more natural for loop.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
bc5fee91f2 blk-cgroup: move blkcg_css to blk-cgroup.c
blkcg_css is only used in blk-cgroup.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
c97ab27157 blk-cgroup: remove unneeded includes from <linux/blk-cgroup.h>
Remove all the includes that aren't actually needed from
<linux/blk-cgroup.h> and push them to the actual source files where
needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
7f20ba7c42 blk-cgroup: remove pointless CONFIG_BLOCK ifdefs
No need to make BLK_CGROUP stubs conditional on CONFIG_BLOCK as they
can't be used without that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
bbb1ebe7a9 blk-cgroup: replace bio_blkcg with bio_blkcg_css
All callers of bio_blkcg actually want the CSS, so replace it with an
interface that does return the CSS.  This now allows to move
struct blkcg_gq to block/blk-cgroup.h instead of exposing it in a
public header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
f4a6a61cb6 blktrace: cleanup the __trace_note_message interface
Pass the cgroup_subsys_state instead of a the blkg so that blktrace
doesn't need to poke into blk-cgroup internals, and give the name a
blk prefix as the current name is way too generic for a public
interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
dec223c92a blk-cgroup: move struct blkcg to block/blk-cgroup.h
There is no real need to expose the blkcg structure to the whole kernel.
Move it to the private header an expose a helper to let the writeback
code access the cgwb_list member.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
397c9f46ee blk-cgroup: move blkcg_{pin,unpin}_online out of line
Move these two functions out of line as there is no good reason
to inline them.  Also switch to passing a cgroup_subsys_state
instead of doing the conversion in the caller to prepare for making
the blkcg structure private to blk-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
216889aad3 blk-cgroup: move blk_cgroup_congested out line
There is no urgent need to inline this function, so move it out of line.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
db05628435 blk-cgroup: move blkcg_{get,set}_fc_appid out of line
No need to have these helpers inline.  Also remove the stubs and just use
an IS_ENABLED for the get side (the set side already is only built
conditionlly).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig
2524a5783e blk-cgroup: remove __bio_blkcg
Remove the unused and deprecated __bio_blkcg helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Ming Lei
9650b453a3 block: ignore RWF_HIPRI hint for sync dio
So far bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed
from userspace sync io interface, then block layer tries to poll until
the bio is completed. But the current implementation calls
blk_io_schedule() if bio_poll() returns 0, and this way causes io hang or
timeout easily.

But looks no one reports this kind of issue, which should have been
triggered in normal io poll sanity test or blktests block/007 as
observed by Changhui, that means it is very likely that no one uses it
or no one cares it.

Also after io_uring is invented, io poll for sync dio becomes legacy
interface.

So ignore RWF_HIPRI hint for sync dio.

CC: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Reported-by: Changhui Zhong <czhong@redhat.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220420143110.2679002-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 10:07:42 -06:00
Ming Lei
285d5731a0 Revert "block: release rq qos structures for queue without disk"
This reverts commit daaca3522a.

Commit daaca3522a ("block: release rq qos structures for queue without
disk") is only needed for v5.15~v5.17, and isn't needed for v5.18, so
revert it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220426024936.3321341-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 10:06:40 -06:00
Jan Kara
09df6a75ff bfq: Fix warning in bfqq_request_over_limit()
People are occasionally reporting a warning bfqq_request_over_limit()
triggering reporting that BFQ's idea of cgroup hierarchy (and its depth)
does not match what generic blkcg code thinks. This can actually happen
when bfqq gets moved between BFQ groups while bfqq_request_over_limit()
is running. Make sure the code is safe against BFQ queue being moved to
a different BFQ group.

Fixes: 76f1df88bb ("bfq: Limit number of requests consumed by each cgroup")
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/all/CAJCQCtTw_2C7ZSz7as5Gvq=OmnDiio=HRkQekqWpKot84sQhFA@mail.gmail.com/
Reported-by: Chris Murphy <lists@colorremedies.com>
Reported-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220407140738.9723-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29 06:45:37 -06:00
Tejun Heo
4cddeacad6 Revert "block: inherit request start time from bio for BLK_CGROUP"
This reverts commit 0006707723. It has a
couple problems:

* bio_issue_time() is stored in bio->bi_issue truncated to 51 bits. This
  overflows in slightly over 26 days. Setting rq->io_start_time_ns with it
  means that io duration calculation would yield >26days after 26 days of
  uptime. This, for example, confuses kyber making it cause high IO
  latencies.

* rq->io_start_time_ns should record the time that the IO is issued to the
  device so that on-device latency can be measured. However,
  bio_issue_time() is set before the bio goes through the rq-qos controllers
  (wbt, iolatency, iocost), so when the bio gets throttled in any of the
  mechanisms, the measured latencies make no sense - on-device latencies end
  up higher than request-alloc-to-completion latencies.

We'll need a smarter way to avoid calling ktime_get_ns() repeatedly
back-to-back. For now, let's revert the commit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.16+
Link: https://lore.kernel.org/r/YmmeOLfo5lzc+8yI@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-27 13:51:28 -06:00