Function blk_mq_sched_alloc_tags() does same as
__blk_mq_alloc_map_and_request(), so give a similar name to be consistent.
Similarly rename label err_free_tags -> err_free_map_and_rqs.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's easier to read:
if (x)
X;
else
Y;
over:
if (!x)
Y;
else
X;
No functional change intended.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-5-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For shared sbitmap, if the call to blk_mq_tag_update_depth() was
successful for any hctx when hctx->sched_tags is not set, then it would be
successful for all (due to nature in which blk_mq_tag_update_depth()
fails).
As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
each hctx. So relocate the call until after the hctx iteration under the
!q->elevator check, which is equivalent (to !hctx->sched_tags).
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is a bit confusing that there is BLKDEV_MAX_RQ and MAX_SCHED_RQ, as
the name BLKDEV_MAX_RQ would imply the max requests always, which it is
not.
Rename to BLKDEV_MAX_RQ to BLKDEV_DEFAULT_RQ, matching its usage - that being
the default number of requests assigned when allocating a request queue.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The original code in commit 24d2f90309 ("blk-mq: split out tag
initialization, support shared tags") would check tags->rqs is non-NULL and
then dereference tags->rqs[].
Then in commit 2af8cbe305 ("blk-mq: split tag ->rqs[] into two"), we
started to dereference tags->static_rqs[], but continued to check non-NULL
tags->rqs.
Check tags->static_rqs as non-NULL instead, which is more logical.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the bad sector information a little more useful by printing
current->comm to identify the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20210928052755.113016-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to reverting commit 7b05bf7710 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.
This patch has been tested as follows:
Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.
Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.
Ran the following script:
set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
cd /sys/fs/cgroup
echo restrict-to-be >io.prio.class
echo +io > cgroup.subtree_control
else
cd /sys/fs/cgroup/blkio/
echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
echo none-to-rt >io.prio.class
else
echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job
If the prio aging expiry time is changed from 100s into 0, the IOPS results
change into 6712 and 6796 IOPS.
The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org
[axboe: @latest -> @latest_start]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calculating the sum over all CPUs of per-CPU counters frequently is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check a statistics invariant at module unload time. When running
blktests, the invariant is verified every time a request queue is
removed and hence is verified at least once per test.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.
Fixes: 38ba64d12d ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct request is only used by blk-mq drivers, so move it and all
related declarations to blk-mq.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the integrity/metadata handling definitions out into a new header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are block-layer internal helpers, so move them to block/blk.h and
block/blk-merge.c. Also update a comment a bit to use better grammar.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIqAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppbBEACDewLUv7bg1VFIGdroRN51OGiOv1oV+8HP
ruY7O9CPtV7wcb3lA1Zy9igICuzuC5culHjbRJrNIUeTdWCQHCFk/sfKSD6VGMoT
cFTqpKxV7M3vYr9G2m5TFWgY2mfS+I5fxyDZxK2z2esHCFw6TZ7A5W13xScVXKP+
QdNFSlTrGkpggsSIEeHApG+NLsIecnkT4qzm8zPfUodUtQ3A8JMjQjnYUFEAWfWv
l9x9zDIzaGjPtXf5soFEvmdh1ALh3WWiYb1kIwK1FeP/PYX0JV/3zCMgqOwpK+4b
69OM3Q0NPHvu2TgSRK+ghekAtz5qgPDMCrzdhSgLYJEL/PGAOboqjrB9E+wWoEjd
IKrYLx4Xao2TUZLJF2y34hHfODGdasx7d+wS191UpVFEZHFhDhIaazZ2rDd5xnQK
LdzQw1JQF/igJovHauhSkGFIdJWBSDneLQoMimBnitZlsWARUmFSZej34FFRLZsW
8ZXfqipn/x+fh4sQ/HdEfWxnGHtveDpU+0Ka5bMUe/tJ9RPtmn/Ye7nFjYecC6NY
4UzFSNn+4e9DpHaDuP3I/eA1YBmVlcB5Hum3ve7X6ovwpjArYg3dgJOEi8uCZjfb
hdMANmkVptcPiEO9njEHhC7S8+Nm3t+8o3qQceN81j6Vcjgzt/Y/n3Z6UkKeSlkn
Ila+cZI1oA==
=J/e4
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Bigger than usual for this point in time, the majority is fixing some
issues around BDI lifetimes with the move from the request_queue to
the disk in this release. In detail:
- Series on draining fs IO for del_gendisk() (Christoph)
- NVMe pull request via Christoph:
- fix the abort command id (Keith Busch)
- nvme: fix per-namespace chardev deletion (Adam Manzanares)
- brd locking scope fix (Tetsuo)
- BFQ fix (Paolo)"
* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
block, bfq: reset last_bfqq_created on group change
block: warn when putting the final reference on a registered disk
brd: reduce the brd_devices_mutex scope
kyber: avoid q->disk dereferences in trace points
block: keep q_usage_counter in atomic mode after del_gendisk
block: drain file system I/O on del_gendisk
block: split bio_queue_enter from blk_queue_enter
block: factor out a blk_try_enter_queue helper
block: call submit_bio_checks under q_usage_counter
nvme: fix per-namespace chardev deletion
block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
nvme-pci: Fix abort command id
Since commit 430a67f9d6 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d6 ("block, bfq: merge bursts
of newly-created queues").
Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].
Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Warn when the last reference on a live disk is put without calling
del_gendisk first. There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
q->disk becomes invalid after the gendisk is removed. Work around this
by caching the dev_t for the tracepoints. The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices. After removing the disk only passthrough
commands can be send anyway.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called. This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.
Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter. This also removes the
pointless flags translation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out the code to try to get q_usage_counter without blocking into
a separate helper. Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure all bios check the current values of the queue under freeze
protection, i.e. to make sure the zero capacity set by del_gendisk
is actually seen before dispatching to the driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While debugging an issue we've found that $DEBUGFS/block/$disk/state
doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical
value.
Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get
decoded properly.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFXvGIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprAgD/9WKQIVPyzXC3Ui8OkSb9DM+dnU47+4b6sL
Naud4khgrvMjTIdZC9YjwAhKBRNsXbcTXRLbBsxVzk7HhGI31jR711qaOZ/jMRbc
dPwcpF82/Yr8BlIl5i3SBvxFNW8FAJWLTVvPvJXSjUBPECTgY6MnchojOjYz1CrN
a0TmtHs5Feu+nLyzKgQlPMv3P8tvrS1hQNpfvbK1oWeEQuyONGacK82nl853ak30
c0uywVsvO+yiBY2xWhiTqperbPwnLxCYOMOdBqbNvlPlGIkEbkW1YWhm2/5f3vRi
37C3ibmt2SCnJaAyiNUKrYSTCK9BoeWFqtxVwyGB8WxWsUXrdocEOoBdohxDHrcp
gHIqxPVUD2hyEroRQseq4NTvERkRvOiF4jAFA3lHOw5zLtPGNEvC0EUFRMm3Ur3/
DyviDBPy321pvrbqGg5uIL3o6wuXDQNnwxdXa2wElWDLuTYp8tB25rqAe2IRJ+iS
k32AsgfL+y5fhjsA6ros/hpQ6FQTAKSW3i6S/0L+QB6H0jYb1ue9JPrA+3MoizvI
p+PdmpLeFVQxnfthT2J+6pzQADoxcMwg5ALgyLFvhhL1YAqyhHkKgdAIS3gKXxYm
GbQLd2hkv7wPZHEsS/XiGCj+zzyawjsr5hSHRqkbZ1QlyfnuTz5uI65WzST68o4d
KOAvQTx4aQ==
=SMoX
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few block fixes for this release:
- Revert a BFQ commit that causes breakage for people. Unfortunately
it was auto-selected for stable as well, so now 5.14.7 suffers from
it too. Hopefully stable will pick up this revert quickly too, so
we can remove the issue on that end as well.
- Add a quirk for Apple NVMe controllers, which due to their
non-compliance broke due to the introduction of command sequences
(Keith)
- Use shifts in nbd, fixing a __divdi3 issue (Nick)"
* tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block:
nbd: use shifts rather than multiplies
Revert "block, bfq: honor already-setup queue merges"
nvme: add command id quirk for apple controllers
syzbot is reporting use-after-free read at bdev_free_inode() [1], for
kfree() from __alloc_disk_node() is called before bdev_free_inode()
(which is called after RCU grace period) reads bdev->bd_disk and calls
kfree(bdev->bd_disk).
Fix use-after-free read followed by double kfree() problem
by making sure that bdev->bd_disk is NULL when calling iput().
Link: https://syzkaller.appspot.com/bug?extid=8281086e8a6fbfbd952a [1]
Reported-by: syzbot <syzbot+8281086e8a6fbfbd952a@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e6dd13c5-8db0-4392-6e78-a42ee5d2a1c4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 2d52c58b9c.
We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.
As no resolution seems to be forthcoming right now, revert the patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b9c ("block, bfq: honor already-setup queue merges")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thirty Three fixes, I'm afraid. Essentially the build up from the
last couple of weeks while I've been dealling with Linux Plumbers
conference infrastructure issues. It's mostly the usual assortment of
spelling fixes and minor corrections. The only core relevant changes
are to the sd driver to reduce the spin up message spew and fix a
small memory leak on the freeing path.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJsEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYU+VfiYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishVuMAP9Y2+rE
AyQ3Y9dubie/AyKQ/2ZEFO/G2wz+jEJvfppXJwD4ulhhDFh/l4iNPVa7GW9M/ti3
gd+6gMzzRC9/B+u2gQ==
=+K+l
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Thirty-three fixes, I'm afraid.
Essentially the build up from the last couple of weeks while I've been
dealling with Linux Plumbers conference infrastructure issues. It's
mostly the usual assortment of spelling fixes and minor corrections.
The only core relevant changes are to the sd driver to reduce the spin
up message spew and fix a small memory leak on the freeing path"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (33 commits)
scsi: ses: Retry failed Send/Receive Diagnostic commands
scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT"
scsi: lpfc: Fix gcc -Wstringop-overread warning, again
scsi: lpfc: Use correct scnprintf() limit
scsi: lpfc: Fix sprintf() overflow in lpfc_display_fpin_wwpn()
scsi: core: Remove 'current_tag'
scsi: acornscsi: Remove tagged queuing vestiges
scsi: fas216: Kill scmd->tag
scsi: qla2xxx: Restore initiator in dual mode
scsi: ufs: core: Unbreak the reset handler
scsi: sd_zbc: Support disks with more than 2**32 logical blocks
scsi: ufs: core: Revert "scsi: ufs: Synchronize SCSI and UFS error handling"
scsi: bsg: Fix device unregistration
scsi: sd: Make sd_spinup_disk() less noisy
scsi: ufs: ufs-pci: Fix Intel LKF link stability
scsi: mpt3sas: Clean up some inconsistent indenting
scsi: megaraid: Clean up some inconsistent indenting
scsi: sr: Fix spelling mistake "does'nt" -> "doesn't"
scsi: Remove SCSI CDROM MAINTAINERS entry
scsi: megaraid: Fix Coccinelle warning
...
When running ->fallocate(), blkdev_fallocate() should hold
mapping->invalidate_lock to prevent page cache from being accessed,
otherwise stale data may be read in page cache.
Without this patch, blktests block/009 fails sometimes. With this patch,
block/009 can pass always.
Also as Jan pointed out, no pages can be created in the discarded area
while you are holding the invalidate_lock, so remove the 2nd
truncate_bdev_range().
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rq_qos framework is only applied on request based driver, so:
1) rq_qos_done_bio() needn't to be called for bio based driver
2) rq_qos_done_bio() needn't to be called for bio which isn't tracked,
such as bios ended from error handling code.
Especially in bio_endio():
1) request queue is referred via bio->bi_bdev->bd_disk->queue, which
may be gone since request queue refcount may not be held in above two
cases
2) q->rq_qos may be freed in blk_cleanup_queue() when calling into
__rq_qos_done_bio()
Fix the potential kernel panic by not calling rq_qos_ops->done_bio if
the bio isn't tracked. This way is safe because both ioc_rqos_done_bio()
and blkcg_iolatency_done_bio() are nop if the bio isn't tracked.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Cc: tj@kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the integrity profile is unregistered there can still be integrity
reads queued up which could see a NULL verify_fn as shown by the race
window below:
CPU0 CPU1
process_one_work nvme_validate_ns
bio_integrity_verify_fn nvme_update_ns_info
nvme_update_disk_info
blk_integrity_unregister
---set queue->integrity as 0
bio_integrity_process
--access bi->profile->verify_fn(bi is a pointer of queue->integity)
Before calling blk_integrity_unregister in nvme_update_disk_info, we must
make sure that there is no work item in the kintegrityd_wq. Just call
blk_flush_integrity to flush the work queue so the bug can be resolved.
Signed-off-by: Lihong Kou <koulihong@huawei.com>
[hch: split up and shortened the changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While clearing the profile itself is harmless, we really should not clear
the stable writes flag if it wasn't set due to a registered integrity
profile.
Reported-by: Lihong Kou <koulihong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
device_initialize() is used to take a refcount on the device. However,
put_device() is not called during device teardown. This leads to a
leak of private data of the driver core, dev_name(), etc. This is
reported by kmemleak at boot time if we compile kernel with
DEBUG_TEST_DRIVER_REMOVE.
Fix memory leaks during unregistration and implement a release
function.
Link: https://lore.kernel.org/r/20210911105306.1511-1-yuzenghui@huawei.com
Fixes: ead09dd3ae ("scsi: bsg: Simplify device registration")
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.
So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].
scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.
Fixes the issue by not iterating over stale request.
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE8ueIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkSYD/9eaQ1Hxc+X+4eVb3A9Cpy36Qy/uY/hArnT
kSUDtQitrRigqhStaD0MGpknWFnZE4cSojbYN0OoEWL7GC8idSZXx7KrVJpSHGbM
XGVEflohvjDLNPkV99gmlzF2o6zPlWESApU1/HO2x+Ws1oKaYDAfFVf0CPGPe2C6
MRerU5v3HSmTC0eFZxU246bwwX/phNuNDokndR27rrsjK0mLF5UoMKySeqy3INp5
6mj3R+HNIW5j8eQk/HJPW7dgiKpWYneWV2Z90DuOLbcJ+wnx7s07wT1yRnOFUTsb
p2ojVWmXtCJ1kRex6bK/eeIJC5TYvT3bNwsnIRmJHd9btHqhm2uKy77m3S1AuE7w
K8bN581aXlr/3pUbFyYZDZQbYshUn25YP9OlyS9r4pklCh9C5KneL1b4xswWTDTB
whvPZlkot3rGD8LHDpV5xVVzeaAcbSXanIRROjxHqQSRRTA9BjG3E4A2cDh8nmYD
mRGEimfZcoojF2EQJYswPOQ24cZwpnihPpJO9NkOodRqfasn6XakAGg6SONFYyQ0
Ewa6QzIOCebBgOVGbzMtpoDpnySE12ONmrDCbSEiYFJLXBMMiqgNON/Xaq0tmXHT
lsDpyz3ytWAB9OZ3M0/9arZzlFf/E+FRqt4ExelmwxiutKRb1dIKQq8xip/YxdA+
Y86kwUoAXQ==
=1ajD
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph:
- fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
- update a MAINTAINERS email address (Chaitanya Kulkarni)
- set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
- handle errors from add_disk() (Luis Chamberlain)
- update the keep alive interval when kato is modified (Tatsuya Sasaki)
- fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
- do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
- only call synchronize_srcu when clearing current path (Daniel Wagner)
- revalidate paths during rescan (Hannes Reinecke)
- Split out the fs/block_dev into block/fops.c and block/bdev.c, which
has been long overdue. Do this now before -rc1, to avoid annoying
conflicts due to this (Christoph)
- blk-throtl use-after-free fix (Li)
- Improve plug depth for multi-device plugs, greatly increasing md
resync performance (Song)
- blkdev_show() locking fix (Tetsuo)
- n64cart error check fix (Yang)
* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
n64cart: fix return value check in n64cart_probe()
blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
block: move fs/block_dev.c to block/bdev.c
block: split out operations on block special files
blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
block: genhd: don't call blkdev_show() with major_names_lock held
nvme: update MAINTAINERS email address
nvme: add error handling support for add_disk()
nvme: only call synchronize_srcu when clearing current path
nvme: update keep alive interval when kato is modified
nvme-tcp: Do not reset transport on data digest errors
nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
nvmet: looks at the passthrough controller when initializing CAP
nvme: move nvme_multi_css into nvme.h
nvme-multipath: revalidate paths during rescan
nvme-multipath: set QUEUE_FLAG_NOWAIT
Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts
performance for large md arrays. [1] shows resync speed of md array drops
for md array with more than 16 HDDs.
Fix this by allowing more request at plug queue. The multiple_queue flag
is used to only apply higher limit to multiple queue cases.
[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The pending timer has been set up in blk_throtl_init(). However, the
timer is not deleted in blk_throtl_exit(). This means that the timer
handler may still be running after freeing the timer, which would
result in a use-after-free.
Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().
Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE1hQcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprjfEACdG+medwQOPpKNSoAvQmYyQnZRMbPjiruv
A4nW2L6MKaExO59qQLVbBYHaH2+ng2UR/p5jNi2AKm+hrQEYllxlNvuCkRBIn97J
r45R48mzBbHjR4kE3Fdu1mOFpBWOuU9JrtzHI+JF/Sl/qPIxKYNHf5E66T6l90Fz
0hJkorAoVB7+hQYixdmkM9quZy11D5SY3aM+bG8r2uNjZTBEHMfmOen8o1giR0vC
EOHzObuC6WLjLGQInNW+Cq2//vVVybQa79mhOUMp93z5nhDMtwUu7MH4B4kmGpix
GLjDa1DukUZe7nGcnsRKmjjXQ+BpG6YF52Z2RfVZpWZn83t5c4YQsq++TPZ8KfpK
4NAFFuSbGM/+QWwEiiyWu00syvpzrEJ4ZIJyZX3FYEeKyKWVRGHqlMDcS9LstYOk
4OfgQUcJ7f/fXeedwi0OGJS1BLr6fi8RnazIafCNIIJLe1XIwTsNufPCNxWYqDAi
0XhH+uYGD38VoUiR5JymZku6frwY4kxssA1khPPE5jWbzCZXiHprwwzaP4hBNNeZ
c5cn9/1ZQSoTE3ebrX9pzTn5wRZwAL+iDhZ2SpLlN2Ji1BJ4EM9H8qFGj3U/CSM4
OWKY2c0VwJYQUhjO4QDBx0MblJgNy8HsvmqGETuxUlk56j3Q1Mx3ViPV43amP9eM
OM4mGige3Q==
=4SCA
-----END PGP SIGNATURE-----
Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Was going to send this one in later this week, but given that -Werror
is now enabled (or at least available), the mq-deadline fix really
should go in for the folks hitting that.
- Ensure dd_queued() is only there if needed (Geert)
- Fix a kerneldoc warning for bio_alloc_kiocb()
- BFQ fix for queue merging
- loop locking fix (Tetsuo)"
* tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
loop: reduce the loop_ctl_mutex scope
bio: fix kerneldoc documentation for bio_alloc_kiocb()
block, bfq: honor already-setup queue merges
block/mq-deadline: Move dd_queued() to fix defined but not used warning
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.
The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:
1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible
Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Apparently the last fixup got butter fingered a bit, the correct variable
name is 'nr_vecs', not 'nr_iovecs'.
Link: https://lore.kernel.org/lkml/20210903164939.02f6e8c5@canb.auug.org.au/
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>