// SPDX-License-Identifier: GPL-2.0
/*
 * gendisk handling
 *
 * Portions Copyright (C) 2020 Christoph Hellwig
 */

#include <linux/module.h>
#include <linux/ctype.h>
#include <linux/fs.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/kmod.h>
#include <linux/major.h>
#include <linux/mutex.h>
#include <linux/idr.h>

implement in-kernel gendisk events handling

Currently, media presence polling for removable block devices is done
from userland.  There are several issues with this.

* Polling is done by periodically opening the device.  For SCSI
  devices, the command sequence generated by such action involves a
  few different commands including TEST_UNIT_READY.  This behavior,
  while perfectly legal, is different from Windows, which only issues
  a single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
  ATAPI devices lock up after being periodically queried with such
  command sequences.

* There is no reliable and unintrusive way for a userland program to
  tell whether the target device is safe for media presence polling.
  For example, polling for media presence during an on-going burning
  session can make it fail.  The polling program can avoid this by
  opening the device with O_EXCL, but then it risks making a valid
  exclusive user of the device fail with -EBUSY.

* Userland polling is unnecessarily heavy; an in-kernel implementation
  is lighter and better coordinated (workqueue, timer slack).

This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supersedes ->media_changed().
  It should check whether there's any pending event and return it if
  so.  Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
  DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
  called in parallel.

* gendisk->events and ->async_events are added.  These should be
  initialized by the block driver before passing the device to
  add_disk().  The former contains the mask of all supported events and
  the latter the mask of all events which the device can report without
  polling.  /sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
  polling interval (default is 0, which means disabled) and
  /sys/block/*/events_poll_msecs controls polling intervals for
  individual devices (default is -1, meaning use the system setting).
  Note that if a device can report all supported events asynchronously
  and its polling interval isn't explicitly set, the device won't be
  polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
  is automatically disabled until all write exclusive accesses are
  released.

* There are event 'clearing' events.  For example, both of the currently
  defined events are cleared after the device has been successfully
  opened.  This information is passed to the ->check_events() callback
  using the @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
  slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
  not ->check_events().  Going forward, all drivers will be converted
  to ->check_events() and ->media_changed() will be dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09 03:57:37 +08:00
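
The polling-interval rules in the commit message above (per-device setting, system default, and the "fully asynchronous devices are not polled" exception) can be summarised in a few lines. The following is a minimal userspace sketch, not the kernel's implementation: the function name `effective_poll_msecs`, its parameters and the `EV_*` bit values are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two event bits named in the commit message. */
#define EV_MEDIA_CHANGE   (1u << 0)
#define EV_EJECT_REQUEST  (1u << 1)

/*
 * Return the polling interval in milliseconds, or 0 when the device
 * should not be polled.
 *
 * events:         mask of all events the device supports
 * async_events:   subset the device can report without polling
 * dev_poll_msecs: per-device setting, -1 means "use the system setting"
 * sys_poll_msecs: system-wide default, 0 means polling is disabled
 */
static long effective_poll_msecs(unsigned int events, unsigned int async_events,
				 long dev_poll_msecs, long sys_poll_msecs)
{
	/* Reject values outside the ranges described above. */
	assert(dev_poll_msecs >= -1 && sys_poll_msecs >= 0);

	/* An explicitly set per-device interval always wins. */
	if (dev_poll_msecs >= 0)
		return dev_poll_msecs;

	/* Use the system interval only if some event actually needs polling. */
	if (events & ~async_events)
		return sys_poll_msecs;

	/* Every supported event is reported asynchronously: do not poll. */
	return 0;
}
```

Under these assumptions, a device that reports every supported event asynchronously is polled only when its own interval has been set explicitly, which matches the note in the commit message.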
#include <linux/log2.h>
#include <linux/pm_runtime.h>
#include <linux/badblocks.h>
#include <linux/part_stat.h>
#include <linux/blktrace_api.h>

#include "blk-throttle.h"
#include "blk.h"
#include "blk-mq-sched.h"
#include "blk-rq-qos.h"
#include "blk-cgroup.h"

static struct kobject *block_depr;

block: add disk sequence number

Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.

Block devices do not have exclusive owners in userspace; any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events thus cannot reliably tell
whether an event relates to the device it just set up or to an earlier
instance with the same name.

Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow orderings to be derived from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.

Associating a unique, monotonically increasing sequential number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, solves the race conditions with
uevents, and also lets userspace processes know whether they should
wait for the uevent they need or whether it was dropped and they should
move on.

Additionally, increment the disk sequence number when the media changes,
i.e. on a DISK_EVENT_MEDIA_CHANGE event.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 07:05:25 +08:00
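
To illustrate how a monotonically increasing sequence number lets userspace decide whether to keep waiting, here is a minimal sketch of one plausible policy for uevents that carry the same device name. It is an assumption-laden illustration, not code from the patch: `classify_uevent`, the enum and the "0 means not set" convention are all hypothetical, and the caller is assumed to have already obtained both numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical verdicts for a uevent seen for the device name we care about. */
enum uevent_verdict {
	UEVENT_STALE,	/* from an earlier instance with the same name: ignore */
	UEVENT_MATCH,	/* from the instance we set up */
	UEVENT_NEWER,	/* from a later instance: stop waiting for ours */
};

/*
 * our_diskseq:   number read back right after setting the device up
 * event_diskseq: number attached to the uevent being examined
 */
static enum uevent_verdict classify_uevent(uint64_t our_diskseq,
					   uint64_t event_diskseq)
{
	/* Assumption of this sketch: 0 is treated as "not set". */
	assert(our_diskseq != 0 && event_diskseq != 0);

	if (event_diskseq < our_diskseq)
		return UEVENT_STALE;
	if (event_diskseq == our_diskseq)
		return UEVENT_MATCH;
	return UEVENT_NEWER;
}
```

A UUID alone supports only the equality test; the ordering is what makes the "stale" and "newer" cases distinguishable, which is the point the commit message makes.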
/*
 * Unique, monotonically increasing sequential number associated with block
 * devices instances (i.e. incremented each time a device is attached).
 * Associating uevents with block devices in userspace is difficult and racy:
 * the uevent netlink socket is lossy, and on slow and overloaded systems has
 * a very high latency.
 * Block devices do not have exclusive owners in userspace, any process can set
 * one up (e.g. loop devices). Moreover, device names can be reused (e.g. loop0
 * can be reused again and again).
 * A userspace process setting up a block device and watching for its events
 * cannot thus reliably tell whether an event relates to the device it just set
 * up or another earlier instance with the same name.
 * This sequential number allows userspace processes to solve this problem, and
 * uniquely associate an uevent to the lifetime to a device.
 */
static atomic64_t diskseq;

/* for extended dynamic devt allocation, currently only one major is used */
#define NR_EXT_DEVT		(1 << MINORBITS)
static DEFINE_IDA(ext_devt_ida);

void set_capacity(struct gendisk *disk, sector_t sectors)
{
	bdev_set_nr_sectors(disk->part0, sectors);
}
EXPORT_SYMBOL(set_capacity);

/*
 * Set disk capacity and notify if the size is not currently zero and will not
 * be set to zero. Returns true if a uevent was sent, otherwise false.
 */
bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
{
	sector_t capacity = get_capacity(disk);
	char *envp[] = { "RESIZE=1", NULL };

	set_capacity(disk, size);

	/*
	 * Only print a message and send a uevent if the gendisk is user visible
	 * and alive. This avoids spamming the log and udev when setting the
	 * initial capacity during probing.
	 */
	if (size == capacity ||
	    !disk_live(disk) ||
	    (disk->flags & GENHD_FL_HIDDEN))
		return false;

	pr_info("%s: detected capacity change from %lld to %lld\n",
		disk->disk_name, capacity, size);

	/*
	 * Historically we did not send a uevent for changes to/from an empty
	 * device.
	 */
	if (!capacity || !size)
		return false;
	kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
	return true;
}
EXPORT_SYMBOL_GPL(set_capacity_and_notify);

static void part_stat_read_all(struct block_device *part,
		struct disk_stats *stat)
{
	int cpu;

	memset(stat, 0, sizeof(struct disk_stats));
	for_each_possible_cpu(cpu) {
		struct disk_stats *ptr = per_cpu_ptr(part->bd_stats, cpu);
		int group;

		for (group = 0; group < NR_STAT_GROUPS; group++) {
			stat->nsecs[group] += ptr->nsecs[group];
			stat->sectors[group] += ptr->sectors[group];
			stat->ios[group] += ptr->ios[group];
			stat->merges[group] += ptr->merges[group];
		}

		stat->io_ticks += ptr->io_ticks;
	}
}

block: support to account io_ticks precisely

Currently, io_ticks is accounted based on sampling: update_io_ticks()
always accounts io_ticks by 1 jiffy from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate. For example (HZ is 250):

Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms

Test result: util is about 90%, while the disk is really idle.

This behaviour was introduced by commit 5b18b5a73760 ("block: delete
part_round_stats and switch to less precise counting"). However, a key
point was missed: that commit also improved performance a lot.

Before the commit:

part_round_stats:
  if (part->stamp != now)
    stats |= 1;

  part_in_flight()
  -> there can be lots of tasks here in 1 jiffy.
  part_round_stats_single()
    __part_stat_add()
  part->stamp = now;

After the commit:

update_io_ticks:
  stamp = part->bd_stamp;
  if (time_after(now, stamp))
    if (try_cmpxchg())
      __part_stat_add()
      -> only one task can reach here in 1 jiffy.

Hence, in order to account io_ticks precisely, we only need to know whether
there are IOs in flight at most once per jiffy. Note that for
rq-based devices, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(); hence
part_stat_lock_inc/dec() and part_in_flight() are used to track inflight IO.

The additional overhead is quite small:

- per-cpu add/dec for each IO for rq-based devices;
- per-cpu sum once per jiffy;

And it has been verified with null-blk that there is no performance
degradation under heavy IO pressure.

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:16 +08:00
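
The idea in the commit message above, that a compare-and-exchange on the timestamp lets at most one caller per jiffy account time, and that the caller then adds the elapsed time only if IO was actually in flight, can be modelled in userspace with C11 atomics. This is a simplified sketch under stated assumptions, not the kernel function: `struct part_model`, `update_io_ticks_model` and the single `inflight` counter are hypothetical stand-ins for the per-CPU kernel data, and jiffies wraparound is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-partition accounting state. */
struct part_model {
	_Atomic unsigned long stamp;	/* last jiffy that was examined */
	_Atomic unsigned long io_ticks;	/* accumulated busy time, in jiffies */
	unsigned int inflight;		/* stand-in for the per-CPU in-flight sum */
};

/*
 * Called when an IO starts (end == false) or completes (end == true).
 * Plain '>' replaces time_after(), so wraparound is not handled here.
 */
static void update_io_ticks_model(struct part_model *p, unsigned long now,
				  bool end)
{
	unsigned long stamp = atomic_load(&p->stamp);

	/*
	 * Only the caller that moves the stamp forward wins the exchange,
	 * so at most one caller per jiffy reaches the accounting step.
	 * On success 'stamp' still holds the previous value.
	 */
	if (now > stamp &&
	    atomic_compare_exchange_strong(&p->stamp, &stamp, now) &&
	    (end || p->inflight > 0))
		atomic_fetch_add(&p->io_ticks, now - stamp);
}
```

In this model an IO that starts on an idle device only advances the stamp, so idle gaps between requests (the `thinktime` case in the test script) are not counted as busy time.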
unsigned int part_in_flight(struct block_device *part)
{
	unsigned int inflight = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		inflight += part_stat_local_read_cpu(part, in_flight[0], cpu) +
			    part_stat_local_read_cpu(part, in_flight[1], cpu);
	}
	if ((int)inflight < 0)
		inflight = 0;

	return inflight;
}

static void part_in_flight_rw(struct block_device *part,
		unsigned int inflight[2])
{
	int cpu;

	inflight[0] = 0;
	inflight[1] = 0;
	for_each_possible_cpu(cpu) {
		inflight[0] += part_stat_local_read_cpu(part, in_flight[0], cpu);
		inflight[1] += part_stat_local_read_cpu(part, in_flight[1], cpu);
	}
	if ((int)inflight[0] < 0)
		inflight[0] = 0;
	if ((int)inflight[1] < 0)
		inflight[1] = 0;
}

/*
 * Can be deleted altogether. Later.
 *
 */
#define BLKDEV_MAJOR_HASH_SIZE 255
static struct blk_major_name {
	struct blk_major_name *next;
	int major;
	char name[16];
#ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
	void (*probe)(dev_t devt);
#endif
} *major_names[BLKDEV_MAJOR_HASH_SIZE];
static DEFINE_MUTEX(major_names_lock);

block: genhd: don't call blkdev_show() with major_names_lock held

If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep complains about a circular locking dependency at
__loop_clr_fd(), because major_names_lock serves as a locking dependency
aggregating hub across multiple block modules.

======================================================
WARNING: possible circular locking dependency detected
5.14.0+ #757 Tainted: G E
------------------------------------------------------
systemd-udevd/7568 is trying to acquire lock:
ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560
but task is already holding lock:
ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&lo->lo_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_killable_nested+0x17/0x20
lo_open+0x23/0x50 [loop]
blkdev_get_by_dev+0x199/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #5 (&disk->open_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
bd_register_pending_holders+0x20/0x100
device_add_disk+0x1ae/0x390
loop_add+0x29c/0x2d0 [loop]
blk_request_module+0x5a/0xb0
blkdev_get_no_open+0x27/0xa0
blkdev_get_by_dev+0x5f/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (major_names_lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
blkdev_show+0x19/0x80
devinfo_show+0x52/0x60
seq_read_iter+0x2d5/0x3e0
proc_reg_read_iter+0x41/0x80
vfs_read+0x2ac/0x330
ksys_read+0x6b/0xd0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&p->lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
seq_read_iter+0x37/0x3e0
generic_file_splice_read+0xf3/0x170
splice_direct_to_actor+0x14e/0x350
do_splice_direct+0x84/0xd0
do_sendfile+0x263/0x430
__se_sys_sendfile64+0x96/0xc0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#3){.+.+}-{0:0}:
lock_acquire+0xbe/0x1f0
lo_write_bvec+0x96/0x280 [loop]
loop_process_work+0xa68/0xc10 [loop]
process_one_work+0x293/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
lock_acquire+0xbe/0x1f0
process_one_work+0x280/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
validate_chain+0x1f0d/0x33e0
__lock_acquire+0x92d/0x1030
lock_acquire+0xbe/0x1f0
flush_workqueue+0x8c/0x560
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
__loop_clr_fd+0xb4/0x400 [loop]
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
2 locks held by systemd-udevd/7568:
#0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
#1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
stack backtrace:
CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x79/0xbf
print_circular_bug+0x5d6/0x5e0
? stack_trace_save+0x42/0x60
? save_trace+0x3d/0x2d0
check_noncircular+0x10b/0x120
validate_chain+0x1f0d/0x33e0
? __lock_acquire+0x953/0x1030
? __lock_acquire+0x953/0x1030
__lock_acquire+0x92d/0x1030
? flush_workqueue+0x70/0x560
lock_acquire+0xbe/0x1f0
? flush_workqueue+0x70/0x560
flush_workqueue+0x8c/0x560
? flush_workqueue+0x70/0x560
? sched_clock_cpu+0xe/0x1a0
? drain_workqueue+0x41/0x140
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
? blk_mq_freeze_queue_wait+0xac/0xd0
__loop_clr_fd+0xb4/0x400 [loop]
? __mutex_unlock_slowpath+0x35/0x230
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0fd4c661f7
Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050

Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") is for
breaking the "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module results in a circular locking dependency
due to the shared major_names_lock mutex.

The simplest fix is to call the probe function without holding
major_names_lock [1], but Christoph Hellwig does not like that idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break the
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.

Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 19:52:13 +08:00
static DEFINE_SPINLOCK(major_names_spinlock);

/* index in the above - for now: assume no multimajor ranges */
static inline int major_to_index(unsigned major)
{
	return major % BLKDEV_MAJOR_HASH_SIZE;
}

#ifdef CONFIG_PROC_FS
void blkdev_show(struct seq_file *seqf, off_t offset)
{
	struct blk_major_name *dp;

	spin_lock(&major_names_spinlock);
	for (dp = major_names[major_to_index(offset)]; dp; dp = dp->next)
		if (dp->major == offset)
			seq_printf(seqf, "%3d %s\n", dp->major, dp->name);
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.
Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 19:52:13 +08:00
|
|
|
spin_unlock(&major_names_spinlock);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2006-03-31 18:30:32 +08:00
|
|
|
#endif /* CONFIG_PROC_FS */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-02-20 15:12:51 +08:00
|
|
|
/**
|
2020-11-15 01:08:21 +08:00
|
|
|
* __register_blkdev - register a new block device
|
2009-02-20 15:12:51 +08:00
|
|
|
*
|
2018-02-06 10:25:27 +08:00
|
|
|
* @major: the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If
|
|
|
|
* @major = 0, try to allocate any unused major number.
|
2009-02-20 15:12:51 +08:00
|
|
|
* @name: the name of the new block device as a NUL-terminated string
|
2021-11-04 07:04:34 +08:00
|
|
|
* @probe: pre-devtmpfs / pre-udev callback used to create disks when their
|
|
|
|
* pre-created device node is accessed. When a probe call uses
|
|
|
|
* add_disk() and it fails, the driver must clean up resources. This
|
|
|
|
* interface may soon be removed.
|
2009-02-20 15:12:51 +08:00
|
|
|
*
|
|
|
|
* The @name must be unique within the system.
|
|
|
|
*
|
2017-03-31 04:11:36 +08:00
|
|
|
* The return value depends on the @major input parameter:
|
|
|
|
*
|
2018-02-06 10:25:27 +08:00
|
|
|
* - if a major device number was requested in range [1..BLKDEV_MAJOR_MAX-1]
|
|
|
|
* then the function returns zero on success, or a negative error code
|
2017-03-31 04:11:36 +08:00
|
|
|
* - if any unused major number was requested with @major = 0 parameter
|
2009-02-20 15:12:51 +08:00
|
|
|
* then the return value is the allocated major number in range
|
2018-02-06 10:25:27 +08:00
|
|
|
* [1..BLKDEV_MAJOR_MAX-1] or a negative error code otherwise
|
|
|
|
*
|
|
|
|
* See Documentation/admin-guide/devices.txt for the list of allocated
|
|
|
|
* major numbers.
|
2020-11-15 01:08:21 +08:00
|
|
|
*
|
|
|
|
* Use register_blkdev instead for any new code.
|
2009-02-20 15:12:51 +08:00
|
|
|
*/
|
2020-10-29 22:58:28 +08:00
|
|
|
int __register_blkdev(unsigned int major, const char *name,
|
|
|
|
void (*probe)(dev_t devt))
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct blk_major_name **n, *p;
|
|
|
|
int index, ret = 0;
|
|
|
|
|
2020-10-29 22:58:26 +08:00
|
|
|
mutex_lock(&major_names_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/* temporary */
|
|
|
|
if (major == 0) {
|
|
|
|
for (index = ARRAY_SIZE(major_names)-1; index > 0; index--) {
|
|
|
|
if (major_names[index] == NULL)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (index == 0) {
|
2019-02-17 23:21:56 +08:00
|
|
|
printk("%s: failed to get major for %s\n",
|
|
|
|
__func__, name);
|
2005-04-17 06:20:36 +08:00
|
|
|
ret = -EBUSY;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
major = index;
|
|
|
|
ret = major;
|
|
|
|
}
|
|
|
|
|
2017-06-17 07:48:21 +08:00
|
|
|
if (major >= BLKDEV_MAJOR_MAX) {
|
2019-02-17 23:21:56 +08:00
|
|
|
pr_err("%s: major requested (%u) is greater than the maximum (%u) for %s\n",
|
|
|
|
__func__, major, BLKDEV_MAJOR_MAX-1, name);
|
2017-06-17 07:48:21 +08:00
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
p = kmalloc(sizeof(struct blk_major_name), GFP_KERNEL);
|
|
|
|
if (p == NULL) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
p->major = major;
|
2022-01-04 15:16:47 +08:00
|
|
|
#ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
|
2020-10-29 22:58:28 +08:00
|
|
|
p->probe = probe;
|
2022-01-04 15:16:47 +08:00
|
|
|
#endif
|
2023-05-30 23:56:08 +08:00
|
|
|
strscpy(p->name, name, sizeof(p->name));
|
2005-04-17 06:20:36 +08:00
|
|
|
p->next = NULL;
|
|
|
|
index = major_to_index(major);
|
|
|
|
|
block: genhd: don't call blkdev_show() with major_names_lock held
With CONFIG_BLK_DEV_LOOP && CONFIG_MTD enabled (at least; there might be other
combinations), lockdep complains about a circular locking dependency at
__loop_clr_fd(), because major_names_lock serves as a locking dependency
aggregating hub across multiple block modules.
======================================================
WARNING: possible circular locking dependency detected
5.14.0+ #757 Tainted: G E
------------------------------------------------------
systemd-udevd/7568 is trying to acquire lock:
ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560
but task is already holding lock:
ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&lo->lo_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_killable_nested+0x17/0x20
lo_open+0x23/0x50 [loop]
blkdev_get_by_dev+0x199/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #5 (&disk->open_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
bd_register_pending_holders+0x20/0x100
device_add_disk+0x1ae/0x390
loop_add+0x29c/0x2d0 [loop]
blk_request_module+0x5a/0xb0
blkdev_get_no_open+0x27/0xa0
blkdev_get_by_dev+0x5f/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (major_names_lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
blkdev_show+0x19/0x80
devinfo_show+0x52/0x60
seq_read_iter+0x2d5/0x3e0
proc_reg_read_iter+0x41/0x80
vfs_read+0x2ac/0x330
ksys_read+0x6b/0xd0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&p->lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
seq_read_iter+0x37/0x3e0
generic_file_splice_read+0xf3/0x170
splice_direct_to_actor+0x14e/0x350
do_splice_direct+0x84/0xd0
do_sendfile+0x263/0x430
__se_sys_sendfile64+0x96/0xc0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#3){.+.+}-{0:0}:
lock_acquire+0xbe/0x1f0
lo_write_bvec+0x96/0x280 [loop]
loop_process_work+0xa68/0xc10 [loop]
process_one_work+0x293/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
lock_acquire+0xbe/0x1f0
process_one_work+0x280/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
validate_chain+0x1f0d/0x33e0
__lock_acquire+0x92d/0x1030
lock_acquire+0xbe/0x1f0
flush_workqueue+0x8c/0x560
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
__loop_clr_fd+0xb4/0x400 [loop]
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&lo->lo_mutex);
                               lock(&disk->open_mutex);
                               lock(&lo->lo_mutex);
  lock((wq_completion)loop0);
*** DEADLOCK ***
2 locks held by systemd-udevd/7568:
#0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
#1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
stack backtrace:
CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x79/0xbf
print_circular_bug+0x5d6/0x5e0
? stack_trace_save+0x42/0x60
? save_trace+0x3d/0x2d0
check_noncircular+0x10b/0x120
validate_chain+0x1f0d/0x33e0
? __lock_acquire+0x953/0x1030
? __lock_acquire+0x953/0x1030
__lock_acquire+0x92d/0x1030
? flush_workqueue+0x70/0x560
lock_acquire+0xbe/0x1f0
? flush_workqueue+0x70/0x560
flush_workqueue+0x8c/0x560
? flush_workqueue+0x70/0x560
? sched_clock_cpu+0xe/0x1a0
? drain_workqueue+0x41/0x140
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
? blk_mq_freeze_queue_wait+0xac/0xd0
__loop_clr_fd+0xb4/0x400 [loop]
? __mutex_unlock_slowpath+0x35/0x230
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0fd4c661f7
Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050
Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") was meant to
break the "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module still forms a circular locking dependency
through the shared major_names_lock mutex.
The simplest fix is to call the probe function without holding
major_names_lock [1], but Christoph Hellwig does not like that idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break the
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.
Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 19:52:13 +08:00
|
|
|
spin_lock(&major_names_spinlock);
|
2005-04-17 06:20:36 +08:00
|
|
|
for (n = &major_names[index]; *n; n = &(*n)->next) {
|
|
|
|
if ((*n)->major == major)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (!*n)
|
|
|
|
*n = p;
|
|
|
|
else
|
|
|
|
ret = -EBUSY;
|
2021-09-07 19:52:13 +08:00
|
|
|
spin_unlock(&major_names_spinlock);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
if (ret < 0) {
|
2018-02-06 10:25:27 +08:00
|
|
|
printk("register_blkdev: cannot get major %u for %s\n",
|
2005-04-17 06:20:36 +08:00
|
|
|
major, name);
|
|
|
|
kfree(p);
|
|
|
|
}
|
|
|
|
out:
|
2020-10-29 22:58:26 +08:00
|
|
|
mutex_unlock(&major_names_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2020-10-29 22:58:28 +08:00
|
|
|
EXPORT_SYMBOL(__register_blkdev);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
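The registration logic of __register_blkdev() can be modeled in user space to make its return-value convention concrete. What follows is a minimal sketch, not kernel code: model_register_blkdev(), model_major_names and the MODEL_* constants are hypothetical names, the sizes 255 and 512 are assumed stand-ins for the kernel's hash table size and BLKDEV_MAJOR_MAX, and both locks as well as the probe callback are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed stand-ins for the kernel constants; not taken from this file. */
#define MODEL_MAJOR_HASH_SIZE 255
#define MODEL_MAJOR_MAX 512

/* Hypothetical user-space mirror of struct blk_major_name (no probe). */
struct model_major_name {
	struct model_major_name *next;
	unsigned int major;
	char name[16];
};

static struct model_major_name *model_major_names[MODEL_MAJOR_HASH_SIZE];

static int model_major_to_index(unsigned int major)
{
	return major % MODEL_MAJOR_HASH_SIZE;
}

/*
 * Returns 0 for a successfully registered fixed major, the allocated
 * major (> 0) when major == 0 was passed, or a negative errno value.
 */
int model_register_blkdev(unsigned int major, const char *name)
{
	struct model_major_name **n, *p;
	int index, ret = 0;

	if (major == 0) {
		/* Scan from the top of the table for an empty bucket. */
		for (index = MODEL_MAJOR_HASH_SIZE - 1; index > 0; index--)
			if (model_major_names[index] == NULL)
				break;
		if (index == 0)
			return -EBUSY;
		major = index;
		ret = (int)major;
	}

	if (major >= MODEL_MAJOR_MAX)
		return -EINVAL;

	p = malloc(sizeof(*p));
	if (p == NULL)
		return -ENOMEM;
	p->major = major;
	snprintf(p->name, sizeof(p->name), "%s", name);
	p->next = NULL;

	/* Walk the bucket's chain; stop early if the major is already taken. */
	index = model_major_to_index(major);
	for (n = &model_major_names[index]; *n; n = &(*n)->next)
		if ((*n)->major == major)
			break;
	if (!*n) {
		*n = p;		/* append at the tail of the chain */
	} else {
		ret = -EBUSY;	/* duplicate major */
		free(p);
	}
	return ret;
}
```

In this model, registering major 8 returns 0, registering it a second time returns -EBUSY, and passing 0 returns the number of the highest empty bucket. Two majors that hash to the same bucket (8 and 263 under the assumed table size) coexist on one chain.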
2007-07-17 19:03:47 +08:00
|
|
|
void unregister_blkdev(unsigned int major, const char *name)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct blk_major_name **n;
|
|
|
|
struct blk_major_name *p = NULL;
|
|
|
|
int index = major_to_index(major);
|
|
|
|
|
2020-10-29 22:58:26 +08:00
|
|
|
mutex_lock(&major_names_lock);
|
2021-09-07 19:52:13 +08:00
|
|
|
spin_lock(&major_names_spinlock);
|
2005-04-17 06:20:36 +08:00
|
|
|
for (n = &major_names[index]; *n; n = &(*n)->next)
|
|
|
|
if ((*n)->major == major)
|
|
|
|
break;
|
2007-07-17 19:03:45 +08:00
|
|
|
if (!*n || strcmp((*n)->name, name)) {
|
|
|
|
WARN_ON(1);
|
|
|
|
} else {
|
2005-04-17 06:20:36 +08:00
|
|
|
p = *n;
|
|
|
|
*n = p->next;
|
|
|
|
}
|
block: genhd: don't call blkdev_show() with major_names_lock held
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep complains circular locking dependency at
__loop_clr_fd(), for major_names_lock serves as a locking dependency
aggregating hub across multiple block modules.
======================================================
WARNING: possible circular locking dependency detected
5.14.0+ #757 Tainted: G E
------------------------------------------------------
systemd-udevd/7568 is trying to acquire lock:
ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560
but task is already holding lock:
ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&lo->lo_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_killable_nested+0x17/0x20
lo_open+0x23/0x50 [loop]
blkdev_get_by_dev+0x199/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #5 (&disk->open_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
bd_register_pending_holders+0x20/0x100
device_add_disk+0x1ae/0x390
loop_add+0x29c/0x2d0 [loop]
blk_request_module+0x5a/0xb0
blkdev_get_no_open+0x27/0xa0
blkdev_get_by_dev+0x5f/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (major_names_lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
blkdev_show+0x19/0x80
devinfo_show+0x52/0x60
seq_read_iter+0x2d5/0x3e0
proc_reg_read_iter+0x41/0x80
vfs_read+0x2ac/0x330
ksys_read+0x6b/0xd0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&p->lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
seq_read_iter+0x37/0x3e0
generic_file_splice_read+0xf3/0x170
splice_direct_to_actor+0x14e/0x350
do_splice_direct+0x84/0xd0
do_sendfile+0x263/0x430
__se_sys_sendfile64+0x96/0xc0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#3){.+.+}-{0:0}:
lock_acquire+0xbe/0x1f0
lo_write_bvec+0x96/0x280 [loop]
loop_process_work+0xa68/0xc10 [loop]
process_one_work+0x293/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
lock_acquire+0xbe/0x1f0
process_one_work+0x280/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
validate_chain+0x1f0d/0x33e0
__lock_acquire+0x92d/0x1030
lock_acquire+0xbe/0x1f0
flush_workqueue+0x8c/0x560
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
__loop_clr_fd+0xb4/0x400 [loop]
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
2 locks held by systemd-udevd/7568:
#0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
#1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
stack backtrace:
CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x79/0xbf
print_circular_bug+0x5d6/0x5e0
? stack_trace_save+0x42/0x60
? save_trace+0x3d/0x2d0
check_noncircular+0x10b/0x120
validate_chain+0x1f0d/0x33e0
? __lock_acquire+0x953/0x1030
? __lock_acquire+0x953/0x1030
__lock_acquire+0x92d/0x1030
? flush_workqueue+0x70/0x560
lock_acquire+0xbe/0x1f0
? flush_workqueue+0x70/0x560
flush_workqueue+0x8c/0x560
? flush_workqueue+0x70/0x560
? sched_clock_cpu+0xe/0x1a0
? drain_workqueue+0x41/0x140
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
? blk_mq_freeze_queue_wait+0xac/0xd0
__loop_clr_fd+0xb4/0x400 [loop]
? __mutex_unlock_slowpath+0x35/0x230
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0fd4c661f7
Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050
Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") was meant to
break the "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module still forms a circular locking dependency
through the shared major_names_lock mutex.
The simplest fix would be to call the probe function without holding
major_names_lock [1], but Christoph Hellwig does not like that idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break the
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.
Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
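The locking split described in this commit message can be illustrated with a small user-space sketch. It assumes pthread mutexes in place of the kernel's mutex and spinlock, and every name in it (`registry_lock`, `show_lock`, `registry_add`, `registry_show`) is hypothetical rather than taken from the kernel. The point is only that the listing path takes a narrow lock of its own, so it never joins a dependency chain that runs through the wide registration lock.

```c
#include <pthread.h>
#include <stdio.h>

#define MAX_NAMES 16

/* Wide lock: serializes registration (and, in the kernel, probe calls). */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
/* Narrow lock: protects only the name table for the "show" path. */
static pthread_mutex_t show_lock = PTHREAD_MUTEX_INITIALIZER;

static char names[MAX_NAMES][16];
static int nr_names;

/* Writers hold the wide lock and take the narrow one only while the
 * table itself is being modified. Returns 0 on success, -1 if full. */
int registry_add(const char *name)
{
	int ret = -1;

	pthread_mutex_lock(&registry_lock);
	if (nr_names < MAX_NAMES) {
		pthread_mutex_lock(&show_lock);
		snprintf(names[nr_names], sizeof(names[0]), "%s", name);
		nr_names++;
		pthread_mutex_unlock(&show_lock);
		ret = 0;
	}
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

/* The show path never touches registry_lock. Returns the number of
 * names written to buf, one per line. */
int registry_show(char *buf, size_t len)
{
	size_t off = 0;
	int i, shown = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	pthread_mutex_lock(&show_lock);
	for (i = 0; i < nr_names; i++) {
		int w = snprintf(buf + off, len - off, "%s\n", names[i]);

		if (w < 0 || (size_t)w >= len - off) {
			buf[off] = '\0';	/* drop a partially written entry */
			break;
		}
		off += (size_t)w;
		shown++;
	}
	pthread_mutex_unlock(&show_lock);
	return shown;
}
```

A reader that only calls `registry_show()` can therefore never block behind a slow operation that holds `registry_lock`, which is the property the commit is after.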
2021-09-07 19:52:13 +08:00
|
|
|
spin_unlock(&major_names_spinlock);
|
2020-10-29 22:58:26 +08:00
|
|
|
mutex_unlock(&major_names_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
kfree(p);
|
|
|
|
}
|
|
|
|
|
|
|
|
EXPORT_SYMBOL(unregister_blkdev);
|
|
|
|
|
2021-05-21 13:50:51 +08:00
|
|
|
int blk_alloc_ext_minor(void)
|
2008-08-25 18:47:22 +08:00
|
|
|
{
|
2013-02-28 09:03:57 +08:00
|
|
|
int idx;
|
2008-08-25 18:47:22 +08:00
|
|
|
|
2022-03-26 22:50:46 +08:00
|
|
|
idx = ida_alloc_range(&ext_devt_ida, 0, NR_EXT_DEVT - 1, GFP_KERNEL);
|
2021-08-24 15:52:16 +08:00
|
|
|
if (idx == -ENOSPC)
|
|
|
|
return -EBUSY;
|
|
|
|
return idx;
|
2008-08-25 18:47:22 +08:00
|
|
|
}
|
|
|
|
|
2021-05-21 13:50:51 +08:00
|
|
|
void blk_free_ext_minor(unsigned int minor)
|
2008-08-25 18:47:22 +08:00
|
|
|
{
|
2021-08-24 15:52:16 +08:00
|
|
|
ida_free(&ext_devt_ida, minor);
|
2019-04-02 20:06:34 +08:00
|
|
|
}
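As a rough user-space illustration of the allocate/free pair above, the sketch below hands out the lowest free ID from a fixed pool and reports exhaustion as -EBUSY. It is not the kernel's IDA: the names `id_alloc`, `id_free` and `NR_IDS` are hypothetical, the pool is a plain array, and no locking is done.

```c
#include <errno.h>
#include <stdbool.h>

#define NR_IDS 8	/* hypothetical pool size, stands in for NR_EXT_DEVT */

static bool id_used[NR_IDS];

/* Return the lowest free ID, or -EBUSY when the pool is exhausted. */
int id_alloc(void)
{
	int i;

	for (i = 0; i < NR_IDS; i++) {
		if (!id_used[i]) {
			id_used[i] = true;
			return i;
		}
	}
	return -EBUSY;
}

/* Release an ID so that a later id_alloc() can hand it out again. */
void id_free(unsigned int id)
{
	if (id < NR_IDS)
		id_used[id] = false;
}
```

Returning the error as a negative value in the same `int` as the ID is what lets a caller write `ret = alloc(); if (ret < 0) goto out;`, as device_add_disk() does with blk_alloc_ext_minor().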
|
|
|
|
|
2021-01-24 18:02:39 +08:00
|
|
|
void disk_uevent(struct gendisk *disk, enum kobject_action action)
|
|
|
|
{
|
|
|
|
struct block_device *part;
|
2021-04-06 14:23:02 +08:00
|
|
|
unsigned long idx;
|
2021-01-24 18:02:39 +08:00
|
|
|
|
2021-04-06 14:23:02 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
xa_for_each(&disk->part_tbl, idx, part) {
|
|
|
|
if (bdev_is_partition(part) && !bdev_nr_sectors(part))
|
|
|
|
continue;
|
2021-07-01 16:16:37 +08:00
|
|
|
if (!kobject_get_unless_zero(&part->bd_device.kobj))
|
2021-04-06 14:23:02 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
2021-01-24 18:02:39 +08:00
|
|
|
kobject_uevent(bdev_kobj(part), action);
|
2021-07-01 16:16:37 +08:00
|
|
|
put_device(&part->bd_device);
|
2021-04-06 14:23:02 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2021-01-24 18:02:39 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(disk_uevent);
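disk_uevent() only works on a partition after kobject_get_unless_zero() has succeeded, which is what makes it safe to drop the RCU read lock around the uevent. A minimal user-space sketch of that "take a reference unless the count is already zero" step is shown below, assuming C11 atomics; `struct obj`, `obj_get_unless_zero` and `obj_put` are hypothetical names, not kernel APIs.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refcount;
};

/* Take a reference only if the object is still alive (count > 0). */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the last reference went away. */
bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

An object whose count has already reached zero is skipped rather than revived, matching the `continue` in the loop above.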
|
|
|
|
|
2023-06-08 19:02:55 +08:00
|
|
|
int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
|
2020-09-21 15:19:46 +08:00
|
|
|
{
|
2024-01-23 21:26:20 +08:00
|
|
|
struct file *file;
|
2023-02-17 10:22:00 +08:00
|
|
|
int ret = 0;
|
2020-09-21 15:19:46 +08:00
|
|
|
|
2024-05-02 21:00:32 +08:00
|
|
|
if (!disk_has_partscan(disk))
|
2022-05-27 13:58:06 +08:00
|
|
|
return -EINVAL;
|
2021-11-22 21:06:16 +08:00
|
|
|
if (disk->open_partitions)
|
|
|
|
return -EBUSY;
|
2020-09-21 15:19:46 +08:00
|
|
|
|
2023-02-17 10:22:00 +08:00
|
|
|
/*
|
|
|
|
* If the device is opened exclusively by current thread already, it's
|
|
|
|
* safe to scan partitions, otherwise, use bd_prepare_to_claim() to
|
|
|
|
* synchronize with other exclusive openers and other partition
|
|
|
|
* scanners.
|
|
|
|
*/
|
2023-06-08 19:02:55 +08:00
|
|
|
if (!(mode & BLK_OPEN_EXCL)) {
|
2023-06-01 17:44:52 +08:00
|
|
|
ret = bd_prepare_to_claim(disk->part0, disk_scan_partitions,
|
|
|
|
NULL);
|
2023-02-17 10:22:00 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-03-22 11:59:26 +08:00
|
|
|
set_bit(GD_NEED_PART_SCAN, &disk->state);
|
2024-01-23 21:26:20 +08:00
|
|
|
file = bdev_file_open_by_dev(disk_devt(disk), mode & ~BLK_OPEN_EXCL,
|
|
|
|
NULL, NULL);
|
|
|
|
if (IS_ERR(file))
|
|
|
|
ret = PTR_ERR(file);
|
2023-02-17 10:22:00 +08:00
|
|
|
else
|
2024-01-23 21:26:20 +08:00
|
|
|
fput(file);
|
2023-02-17 10:22:00 +08:00
|
|
|
|
2023-03-22 11:59:26 +08:00
|
|
|
/*
|
|
|
|
* If bdev_file_open_by_dev() failed early, GD_NEED_PART_SCAN is still
|
|
|
|
* set, which would cause re-assembling a partitioned raid device to
|
|
|
|
* create partitions for the underlying disk.
|
|
|
|
*/
|
|
|
|
clear_bit(GD_NEED_PART_SCAN, &disk->state);
|
2023-06-08 19:02:55 +08:00
|
|
|
if (!(mode & BLK_OPEN_EXCL))
|
2023-02-17 10:22:00 +08:00
|
|
|
bd_abort_claiming(disk->part0, disk_scan_partitions);
|
|
|
|
return ret;
|
2020-09-21 15:19:46 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2021-08-04 17:41:47 +08:00
|
|
|
* device_add_disk - add disk information to kernel list
|
2016-06-16 09:17:27 +08:00
|
|
|
* @parent: parent device for the disk
|
2005-04-17 06:20:36 +08:00
|
|
|
* @disk: per-device partitioning information
|
2018-09-28 14:17:19 +08:00
|
|
|
* @groups: Additional per-device sysfs groups
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
* This function registers the partitioning information in @disk
|
|
|
|
* with the kernel.
|
|
|
|
*/
|
2021-11-10 08:29:49 +08:00
|
|
|
int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
|
|
|
|
const struct attribute_group **groups)
|
2021-08-04 17:41:47 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-08-18 22:45:33 +08:00
|
|
|
struct device *ddev = disk_to_dev(disk);
|
2021-05-21 13:50:51 +08:00
|
|
|
int ret;
|
2008-04-30 15:54:32 +08:00
|
|
|
|
2022-03-05 10:08:03 +08:00
|
|
|
/* Only makes sense for bio-based to set ->poll_bio */
|
|
|
|
if (queue_is_mq(disk->queue) && disk->fops->poll_bio)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2019-09-05 17:51:33 +08:00
|
|
|
/*
|
|
|
|
* The disk queue should now be all set with enough information about
|
|
|
|
* the device for the elevator code to pick an adequate default
|
|
|
|
* elevator if one is needed, that is, for devices requesting queue
|
|
|
|
* registration.
|
|
|
|
*/
|
2021-08-04 17:41:47 +08:00
|
|
|
elevator_init_mq(disk->queue);
|
2019-09-05 17:51:33 +08:00
|
|
|
|
2023-04-14 21:32:02 +08:00
|
|
|
/* Mark bdev as having a submit_bio, if needed */
|
2024-04-12 13:21:45 +08:00
|
|
|
if (disk->fops->submit_bio)
|
|
|
|
bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO);
|
2023-04-14 21:32:02 +08:00
|
|
|
|
2021-05-21 13:50:51 +08:00
|
|
|
/*
|
|
|
|
* If the driver provides an explicit major number it also must provide
|
|
|
|
* the number of minor numbers supported, and those will be used to
|
|
|
|
* set up the gendisk.
|
|
|
|
* Otherwise just allocate the device numbers for both the whole device
|
|
|
|
* and all partitions from the extended dev_t space.
|
2008-08-25 18:56:17 +08:00
|
|
|
*/
|
2022-10-22 10:16:15 +08:00
|
|
|
ret = -EINVAL;
|
2021-05-21 13:50:51 +08:00
|
|
|
if (disk->major) {
|
2021-08-18 22:45:40 +08:00
|
|
|
if (WARN_ON(!disk->minors))
|
2022-10-22 10:16:15 +08:00
|
|
|
goto out_exit_elevator;
|
2021-05-21 13:50:52 +08:00
|
|
|
|
|
|
|
if (disk->minors > DISK_MAX_PARTS) {
|
|
|
|
pr_err("block: can't allocate more than %d partitions\n",
|
|
|
|
DISK_MAX_PARTS);
|
|
|
|
disk->minors = DISK_MAX_PARTS;
|
|
|
|
}
|
2023-12-19 15:59:42 +08:00
|
|
|
if (disk->first_minor > MINORMASK ||
|
|
|
|
disk->minors > MINORMASK + 1 ||
|
|
|
|
disk->first_minor + disk->minors > MINORMASK + 1)
|
2022-10-22 10:16:15 +08:00
|
|
|
goto out_exit_elevator;
|
2021-05-21 13:50:51 +08:00
|
|
|
} else {
|
2021-08-18 22:45:40 +08:00
|
|
|
if (WARN_ON(disk->minors))
|
2022-10-22 10:16:15 +08:00
|
|
|
goto out_exit_elevator;
|
2008-08-25 18:56:17 +08:00
|
|
|
|
2021-05-21 13:50:51 +08:00
|
|
|
ret = blk_alloc_ext_minor();
|
2021-08-18 22:45:40 +08:00
|
|
|
if (ret < 0)
|
2022-10-22 10:16:15 +08:00
|
|
|
goto out_exit_elevator;
|
2021-05-21 13:50:51 +08:00
|
|
|
disk->major = BLOCK_EXT_MAJOR;
|
2021-08-24 15:52:15 +08:00
|
|
|
disk->first_minor = ret;
|
2008-08-25 18:56:17 +08:00
|
|
|
}
|
2021-05-21 13:50:51 +08:00
|
|
|
|
2021-08-18 22:45:33 +08:00
|
|
|
/* delay uevents until we have scanned the partition table */
|
|
|
|
dev_set_uevent_suppress(ddev, 1);
|
|
|
|
|
|
|
|
ddev->parent = parent;
|
|
|
|
ddev->groups = groups;
|
|
|
|
dev_set_name(ddev, "%s", disk->disk_name);
|
2021-08-18 22:45:34 +08:00
|
|
|
if (!(disk->flags & GENHD_FL_HIDDEN))
|
|
|
|
ddev->devt = MKDEV(disk->major, disk->first_minor);
|
2021-08-18 22:45:40 +08:00
|
|
|
ret = device_add(ddev);
|
|
|
|
if (ret)
|
2021-12-22 00:18:51 +08:00
|
|
|
goto out_free_ext_minor;
|
|
|
|
|
|
|
|
ret = disk_alloc_events(disk);
|
|
|
|
if (ret)
|
|
|
|
goto out_device_del;
|
|
|
|
|
driver core: remove CONFIG_SYSFS_DEPRECATED and CONFIG_SYSFS_DEPRECATED_V2
CONFIG_SYSFS_DEPRECATED was added in commit 88a22c985e35
("CONFIG_SYSFS_DEPRECATED") in 2006 to allow systems with older versions
of some tools (i.e. Fedora 3's version of udev) to boot properly. Four
years later, in 2010, the option was attempted to be removed as most of
userspace should have been fixed up properly by then, but some kernel
developers clung to those old systems and refused to update, so we added
CONFIG_SYSFS_DEPRECATED_V2 in commit e52eec13cd6b ("SYSFS: Allow boot
time switching between deprecated and modern sysfs layout") to allow
them to continue to boot properly, and we allowed a boot time parameter
to be used to switch back to the old format if needed.
Over time, the logic that was covered under these config options was
slowly and successfully removed from individual driver subsystems,
and the only thing that is now left in the kernel are some changes in
the block layer's representation in sysfs where real directories are
used instead of symlinks like normal.
Because the original changes were done to userspace tools in 2006, and
all distros that use those tools are long end-of-life, and older
non-udev-based systems do not care about the block layer's sysfs
representation, it is time to finally remove this old logic and the
config entries from the kernel.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: linux-block@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20230223073326.2073220-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-23 15:33:26 +08:00
|
|
|
ret = sysfs_create_link(block_depr, &ddev->kobj,
|
|
|
|
kobject_name(&ddev->kobj));
|
|
|
|
if (ret)
|
|
|
|
goto out_device_del;
|
2021-08-18 22:45:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* avoid probable deadlock caused by allocating memory with
|
|
|
|
* GFP_KERNEL in the runtime_resume callbacks of all its ancestor
|
|
|
|
* devices
|
|
|
|
*/
|
|
|
|
pm_runtime_set_memalloc_noio(ddev, true);
|
|
|
|
|
|
|
|
disk->part0->bd_holder_dir =
|
|
|
|
kobject_create_and_add("holders", &ddev->kobj);
|
2021-11-04 00:40:23 +08:00
|
|
|
if (!disk->part0->bd_holder_dir) {
|
|
|
|
ret = -ENOMEM;
|
2023-03-19 01:36:25 +08:00
|
|
|
goto out_del_block_link;
|
2021-11-04 00:40:23 +08:00
|
|
|
}
|
2021-08-18 22:45:33 +08:00
|
|
|
disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
|
2021-11-04 00:40:23 +08:00
|
|
|
if (!disk->slave_dir) {
|
|
|
|
ret = -ENOMEM;
|
2021-08-18 22:45:40 +08:00
|
|
|
goto out_put_holder_dir;
|
2021-11-04 00:40:23 +08:00
|
|
|
}
|
2021-08-18 22:45:33 +08:00
|
|
|
|
2021-08-18 22:45:40 +08:00
|
|
|
ret = blk_register_queue(disk);
|
|
|
|
if (ret)
|
|
|
|
goto out_put_slave_dir;
|
2021-08-18 22:45:37 +08:00
|
|
|
|
2021-11-22 21:06:23 +08:00
|
|
|
if (!(disk->flags & GENHD_FL_HIDDEN)) {
|
2021-08-18 22:45:34 +08:00
|
|
|
ret = bdi_register(disk->bdi, "%u:%u",
|
|
|
|
disk->major, disk->first_minor);
|
2021-08-18 22:45:40 +08:00
|
|
|
if (ret)
|
|
|
|
goto out_unregister_queue;
|
2021-08-18 22:45:34 +08:00
|
|
|
bdi_set_owner(disk->bdi, ddev);
|
2021-08-18 22:45:40 +08:00
|
|
|
ret = sysfs_create_link(&ddev->kobj,
|
|
|
|
&disk->bdi->dev->kobj, "bdi");
|
|
|
|
if (ret)
|
|
|
|
goto out_unregister_bdi;
|
2021-08-18 22:45:34 +08:00
|
|
|
|
2023-02-17 10:22:00 +08:00
|
|
|
/* Make sure the first partition scan will proceed */
|
2024-05-02 21:00:32 +08:00
|
|
|
if (get_capacity(disk) && disk_has_partscan(disk))
|
2023-02-17 10:22:00 +08:00
|
|
|
set_bit(GD_NEED_PART_SCAN, &disk->state);
|
|
|
|
|
2021-08-18 22:45:35 +08:00
|
|
|
bdev_add(disk->part0, ddev->devt);
|
2021-11-22 21:06:16 +08:00
|
|
|
if (get_capacity(disk))
|
2023-06-08 19:02:55 +08:00
|
|
|
disk_scan_partitions(disk, BLK_OPEN_READ);
|
2021-08-18 22:45:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Announce the disk and partitions after all partitions are
|
2021-08-18 22:45:34 +08:00
|
|
|
* created. (for hidden disks uevents remain suppressed forever)
|
2021-08-18 22:45:33 +08:00
|
|
|
*/
|
|
|
|
dev_set_uevent_suppress(ddev, 0);
|
|
|
|
disk_uevent(disk, KOBJ_ADD);
|
2022-10-10 21:18:57 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Even if the block_device for a hidden gendisk is not
|
|
|
|
* registered, it needs to have a valid bd_dev so that the
|
|
|
|
* freeing of the dynamic major works.
|
|
|
|
*/
|
|
|
|
disk->part0->bd_dev = MKDEV(disk->major, disk->first_minor);
|
2021-08-18 22:45:33 +08:00
|
|
|
}
|
|
|
|
|
2021-08-18 22:45:37 +08:00
|
|
|
disk_update_readahead(disk);
|
2010-12-09 03:57:37 +08:00
|
|
|
disk_add_events(disk);
|
2022-02-15 17:45:10 +08:00
|
|
|
set_bit(GD_ADDED, &disk->state);
|
2021-08-18 22:45:40 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unregister_bdi:
|
|
|
|
if (!(disk->flags & GENHD_FL_HIDDEN))
|
|
|
|
bdi_unregister(disk->bdi);
|
|
|
|
out_unregister_queue:
|
|
|
|
blk_unregister_queue(disk);
|
2022-10-29 15:13:55 +08:00
|
|
|
rq_qos_exit(disk->queue);
|
2021-08-18 22:45:40 +08:00
|
|
|
out_put_slave_dir:
|
|
|
|
kobject_put(disk->slave_dir);
|
2022-11-15 22:10:45 +08:00
|
|
|
disk->slave_dir = NULL;
|
2021-08-18 22:45:40 +08:00
|
|
|
out_put_holder_dir:
|
|
|
|
kobject_put(disk->part0->bd_holder_dir);
|
|
|
|
out_del_block_link:
|
2023-02-23 15:33:26 +08:00
|
|
|
sysfs_remove_link(block_depr, dev_name(ddev));
|
2023-12-11 15:53:56 +08:00
|
|
|
pm_runtime_set_memalloc_noio(ddev, false);
|
2021-08-18 22:45:40 +08:00
|
|
|
out_device_del:
|
|
|
|
device_del(ddev);
|
|
|
|
out_free_ext_minor:
|
|
|
|
if (disk->major == BLOCK_EXT_MAJOR)
|
|
|
|
blk_free_ext_minor(disk->first_minor);
|
2022-10-22 10:16:15 +08:00
|
|
|
out_exit_elevator:
|
|
|
|
if (disk->queue->elevator)
|
|
|
|
elevator_exit(disk->queue);
|
2021-11-10 08:29:49 +08:00
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2016-06-16 09:17:27 +08:00
|
|
|
EXPORT_SYMBOL(device_add_disk);
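The error handling in device_add_disk() follows the common goto-unwind pattern: each failure jumps to a label that undoes only the steps already completed, in reverse order. The self-contained sketch below shows the shape of that pattern with placeholder steps; `step`, `undo`, `setup`, `undo_log` and `undo_count` are hypothetical names used purely for illustration.

```c
#define NR_STEPS 4

/* Records the order in which teardown steps ran, for inspection. */
int undo_log[NR_STEPS];
int undo_count;

/* Placeholder for one setup step: fails only when nr == fail_at. */
static int step(int nr, int fail_at)
{
	return nr == fail_at ? -1 : 0;
}

/* Placeholder for the matching teardown of step nr. */
static void undo(int nr)
{
	undo_log[undo_count++] = nr;
}

/* fail_at selects which step (1..4) fails; 0 means every step succeeds. */
int setup(int fail_at)
{
	int ret;

	undo_count = 0;

	ret = step(1, fail_at);
	if (ret)
		return ret;	/* nothing to undo yet */
	ret = step(2, fail_at);
	if (ret)
		goto out_undo_1;
	ret = step(3, fail_at);
	if (ret)
		goto out_undo_2;
	ret = step(4, fail_at);
	if (ret)
		goto out_undo_3;
	return 0;

	/* Labels fall through, so teardown runs newest-first. */
out_undo_3:
	undo(3);
out_undo_2:
	undo(2);
out_undo_1:
	undo(1);
	return ret;
}
```

Naming each label after the first action it performs, as the `out_unregister_bdi` and `out_device_del` labels do, keeps every `goto` readable without having to scroll to the end of the function.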
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2023-08-11 18:08:25 +08:00
|
|
|
static void blk_report_disk_dead(struct gendisk *disk, bool surprise)
|
2023-06-01 17:44:53 +08:00
|
|
|
{
|
|
|
|
struct block_device *bdev;
|
|
|
|
unsigned long idx;
|
|
|
|
|
2023-10-18 02:48:22 +08:00
|
|
|
/*
|
|
|
|
* On surprise disk removal, bdev_mark_dead() may call into file
|
|
|
|
* systems below. Make it clear that we're expecting to not hold
|
|
|
|
* disk->open_mutex.
|
|
|
|
*/
|
|
|
|
lockdep_assert_not_held(&disk->open_mutex);
|
|
|
|
|
2023-06-01 17:44:53 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
xa_for_each(&disk->part_tbl, idx, bdev) {
|
|
|
|
if (!kobject_get_unless_zero(&bdev->bd_device.kobj))
|
|
|
|
continue;
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2023-08-11 18:08:25 +08:00
|
|
|
bdev_mark_dead(bdev, surprise);
|
2023-06-01 17:44:53 +08:00
|
|
|
|
|
|
|
put_device(&bdev->bd_device);
|
|
|
|
rcu_read_lock();
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2023-08-11 18:08:25 +08:00
|
|
|
static void __blk_mark_disk_dead(struct gendisk *disk)
|
2022-02-17 15:52:31 +08:00
|
|
|
{
|
2023-06-01 17:44:47 +08:00
|
|
|
/*
|
|
|
|
* Fail any new I/O.
|
|
|
|
*/
|
2023-06-01 17:44:48 +08:00
|
|
|
if (test_and_set_bit(GD_DEAD, &disk->state))
|
|
|
|
return;
|
|
|
|
|
2023-06-01 17:44:47 +08:00
|
|
|
if (test_bit(GD_OWNS_QUEUE, &disk->state))
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
|
2022-11-01 23:00:37 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Stop buffered writers from dirtying pages that can't be written out.
|
|
|
|
*/
|
2023-06-01 17:44:47 +08:00
|
|
|
set_capacity(disk, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent new I/O from crossing bio_queue_enter().
|
|
|
|
*/
|
|
|
|
blk_queue_start_drain(disk->queue);
|
2023-08-11 18:08:25 +08:00
|
|
|
}
|
2023-06-01 17:44:53 +08:00
|
|
|
|
2023-08-11 18:08:25 +08:00
|
|
|
/**
|
|
|
|
* blk_mark_disk_dead - mark a disk as dead
|
|
|
|
* @disk: disk to mark as dead
|
|
|
|
*
|
|
|
|
* Mark a disk as dead (e.g. surprise removed) and don't accept any new I/O
|
|
|
|
* to this disk.
|
|
|
|
*/
|
|
|
|
void blk_mark_disk_dead(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
__blk_mark_disk_dead(disk);
|
|
|
|
blk_report_disk_dead(disk, true);
|
2022-02-17 15:52:31 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
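__blk_mark_disk_dead() relies on test_and_set_bit() so that the transition to the dead state happens exactly once, however many callers race to request it. A hedged user-space sketch of the same idea, using a C11 atomic flag, follows; `struct fake_disk` and its helpers are hypothetical and model only the "first caller wins" behaviour.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fake_disk {
	atomic_bool dead;	/* set once the disk has been marked dead */
	unsigned long capacity;
	int teardown_runs;	/* how many times the one-time work ran */
};

void fake_disk_init(struct fake_disk *d, unsigned long capacity)
{
	atomic_init(&d->dead, false);
	d->capacity = capacity;
	d->teardown_runs = 0;
}

/* Returns true only for the caller that actually performed the transition. */
bool fake_disk_mark_dead(struct fake_disk *d)
{
	/* atomic_exchange returns the previous value: true means already dead. */
	if (atomic_exchange(&d->dead, true))
		return false;

	d->capacity = 0;	/* analogous to set_capacity(disk, 0) */
	d->teardown_runs++;
	return true;
}
```

Because the flag is tested and set in one atomic step, a second call returns early without repeating the one-time work.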
|
|
|
|
|
2020-06-20 04:47:23 +08:00
|
|
|
/**
|
|
|
|
* del_gendisk - remove the gendisk
|
|
|
|
* @disk: the struct gendisk to remove
|
|
|
|
*
|
|
|
|
* Removes the gendisk and all its associated resources. This deletes the
|
|
|
|
* partitions associated with the gendisk, and unregisters the associated
|
|
|
|
* request_queue.
|
|
|
|
*
|
|
|
|
* This is the counterpart to the respective device_add_disk() call.
|
|
|
|
*
|
|
|
|
* The final removal of the struct gendisk happens when its refcount reaches 0
|
|
|
|
* with put_disk(), which should be called after del_gendisk(), if
|
|
|
|
* device_add_disk() was used.
|
2020-06-20 04:47:25 +08:00
|
|
|
*
|
|
|
|
* Drivers exist which depend on the release of the gendisk to be synchronous,
|
|
|
|
* it should not be deferred.
|
|
|
|
*
|
|
|
|
* Context: can sleep
|
2020-06-20 04:47:23 +08:00
|
|
|
*/
|
2010-12-09 03:57:36 +08:00
|
|
|
void del_gendisk(struct gendisk *disk)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-09-29 15:12:40 +08:00
|
|
|
struct request_queue *q = disk->queue;
|
2023-06-01 17:44:50 +08:00
|
|
|
struct block_device *part;
|
|
|
|
unsigned long idx;
|
2021-09-29 15:12:40 +08:00
|
|
|
|
2020-06-20 04:47:25 +08:00
|
|
|
might_sleep();
|
|
|
|
|
2021-08-24 22:43:10 +08:00
|
|
|
if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
|
2020-10-29 22:58:24 +08:00
|
|
|
return;
|
|
|
|
|
2010-12-09 03:57:37 +08:00
|
|
|
disk_del_events(disk);
|
|
|
|
|
2023-06-01 17:44:50 +08:00
|
|
|
/*
|
2023-08-11 18:08:25 +08:00
|
|
|
* Prevent new openers by unlinking the bdev inode.
|
2023-06-01 17:44:50 +08:00
|
|
|
*/
|
2021-05-25 14:12:56 +08:00
|
|
|
mutex_lock(&disk->open_mutex);
|
2023-08-11 18:08:25 +08:00
|
|
|
xa_for_each(&disk->part_tbl, idx, part)
|
2024-04-29 07:01:39 +08:00
|
|
|
bdev_unhash(part);
|
2021-05-25 14:12:56 +08:00
|
|
|
mutex_unlock(&disk->open_mutex);
|
2021-04-06 14:22:56 +08:00
|
|
|
|
2023-08-11 18:08:25 +08:00
|
|
|
/*
|
|
|
|
* Tell the file system to write back all dirty data and shut down if
|
|
|
|
* it hasn't been notified earlier.
|
|
|
|
*/
|
|
|
|
if (!test_bit(GD_DEAD, &disk->state))
|
|
|
|
blk_report_disk_dead(disk, false);
|
|
|
|
__blk_mark_disk_dead(disk);
|
2021-09-29 15:12:40 +08:00
|
|
|
|
2023-06-01 17:44:50 +08:00
|
|
|
/*
|
|
|
|
* Drop all partitions now that the disk is marked dead.
|
|
|
|
*/
|
|
|
|
mutex_lock(&disk->open_mutex);
|
|
|
|
xa_for_each_start(&disk->part_tbl, idx, part, 1)
|
|
|
|
drop_partition(part);
|
|
|
|
mutex_unlock(&disk->open_mutex);
|
|
|
|
|
2020-10-29 22:58:24 +08:00
|
|
|
if (!(disk->flags & GENHD_FL_HIDDEN)) {
|
2017-11-03 02:29:53 +08:00
|
|
|
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
|
2020-10-29 22:58:24 +08:00
|
|
|
|
2017-03-09 00:48:33 +08:00
|
|
|
/*
|
|
|
|
* Unregister bdi before releasing device numbers (as they can
|
|
|
|
* get reused and we'd get clashes in sysfs).
|
|
|
|
*/
|
2021-08-09 22:17:43 +08:00
|
|
|
bdi_unregister(disk->bdi);
|
2017-03-09 00:48:33 +08:00
|
|
|
}
|
2010-12-09 03:57:36 +08:00
|
|
|
|
2020-10-29 22:58:24 +08:00
|
|
|
blk_unregister_queue(disk);
|
2010-12-09 03:57:36 +08:00
|
|
|
|
2020-11-27 01:47:17 +08:00
|
|
|
kobject_put(disk->part0->bd_holder_dir);
|
2010-12-09 03:57:36 +08:00
|
|
|
kobject_put(disk->slave_dir);
|
2022-11-15 22:10:45 +08:00
|
|
|
disk->slave_dir = NULL;
|
2010-12-09 03:57:36 +08:00
|
|
|
|
2020-11-24 16:36:54 +08:00
|
|
|
part_stat_set_all(disk->part0, 0);
|
2020-11-27 01:47:17 +08:00
|
|
|
disk->part0->bd_stamp = 0;
|
2023-02-23 15:33:26 +08:00
|
|
|
sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
|
2013-02-23 08:34:13 +08:00
|
|
|
pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
|
2010-12-09 03:57:36 +08:00
|
|
|
device_del(disk_to_dev(disk));
|
2021-10-26 18:12:04 +08:00
|
|
|
|
2022-09-19 22:40:49 +08:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
|
|
|
|
2022-09-22 02:04:58 +08:00
|
|
|
blk_throtl_cancel_bios(disk);
|
2022-03-18 21:01:44 +08:00
|
|
|
|
2021-10-26 18:12:04 +08:00
|
|
|
blk_sync_queue(q);
|
|
|
|
blk_flush_integrity();
|
2022-10-30 17:47:30 +08:00
|
|
|
|
|
|
|
if (queue_is_mq(q))
|
|
|
|
blk_mq_cancel_work_sync(q);
|
2022-06-14 15:48:24 +08:00
|
|
|
|
|
|
|
blk_mq_quiesce_queue(q);
|
|
|
|
if (q->elevator) {
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
elevator_exit(q);
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
}
|
|
|
|
rq_qos_exit(q);
|
|
|
|
blk_mq_unquiesce_queue(q);
|
|
|
|
|
2021-10-26 18:12:04 +08:00
|
|
|
/*
|
2022-06-19 14:05:51 +08:00
|
|
|
* If the disk does not own the queue, allow using passthrough requests
|
|
|
|
* again. Else leave the queue frozen to fail all I/O.
|
2021-10-26 18:12:04 +08:00
|
|
|
*/
|
2022-06-19 14:05:51 +08:00
|
|
|
if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
|
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
|
|
|
|
__blk_mq_unfreeze_queue(q, true);
|
|
|
|
} else {
|
|
|
|
if (queue_is_mq(q))
|
|
|
|
blk_mq_exit_queue(q);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2010-12-09 03:57:36 +08:00
|
|
|
EXPORT_SYMBOL(del_gendisk);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-09-22 20:37:08 +08:00
|
|
|
/**
|
|
|
|
* invalidate_disk - invalidate the disk
|
|
|
|
* @disk: the struct gendisk to invalidate
|
|
|
|
*
|
|
|
|
* A helper to invalidate the disk. It will clean the disk's associated
|
|
|
|
* buffer/page caches and reset its internal states so that the disk
|
|
|
|
* can be reused by the drivers.
|
|
|
|
*
|
|
|
|
* Context: can sleep
|
|
|
|
*/
|
|
|
|
void invalidate_disk(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
struct block_device *bdev = disk->part0;
|
|
|
|
|
|
|
|
invalidate_bdev(bdev);
|
2024-04-11 22:53:37 +08:00
|
|
|
bdev->bd_mapping->wb_err = 0;
|
2021-09-22 20:37:08 +08:00
|
|
|
set_capacity(disk, 0);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(invalidate_disk);
|
|
|
|
|
2016-01-10 00:36:51 +08:00
|
|
|
/* sysfs access to bad-blocks list. */
|
|
|
|
static ssize_t disk_badblocks_show(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
char *page)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
if (!disk->bb)
|
|
|
|
return sprintf(page, "\n");
|
|
|
|
|
|
|
|
return badblocks_show(disk->bb, page, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t disk_badblocks_store(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
const char *page, size_t len)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
if (!disk->bb)
|
|
|
|
return -ENXIO;
|
|
|
|
|
|
|
|
return badblocks_store(disk->bb, page, len, 0);
|
|
|
|
}
|
|
|
|
|
2022-01-04 15:16:47 +08:00
|
|
|
#ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
|
2020-11-26 16:23:26 +08:00
|
|
|
void blk_request_module(dev_t devt)
|
2020-10-29 22:58:27 +08:00
|
|
|
{
|
2020-10-29 22:58:28 +08:00
|
|
|
unsigned int major = MAJOR(devt);
|
|
|
|
struct blk_major_name **n;
|
|
|
|
|
|
|
|
mutex_lock(&major_names_lock);
|
|
|
|
for (n = &major_names[major_to_index(major)]; *n; n = &(*n)->next) {
|
|
|
|
if ((*n)->major == major && (*n)->probe) {
|
|
|
|
(*n)->probe(devt);
|
|
|
|
mutex_unlock(&major_names_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_unlock(&major_names_lock);
|
|
|
|
|
2020-10-29 22:58:27 +08:00
|
|
|
if (request_module("block-major-%d-%d", MAJOR(devt), MINOR(devt)) > 0)
|
|
|
|
/* Make old-style 2.4 aliases work */
|
|
|
|
request_module("block-major-%d", MAJOR(devt));
|
|
|
|
}
|
2022-01-04 15:16:47 +08:00
|
|
|
#endif /* CONFIG_BLOCK_LEGACY_AUTOLOAD */
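The legacy autoload path above builds two module alias names from the device number: a specific "block-major-<major>-<minor>" alias first, and the old-style "block-major-<major>" alias if the first request_module() call returns a positive value. The following is a minimal userspace sketch of just the name formatting. The SKETCH_* macros and sketch_* functions are hypothetical stand-ins, not kernel API, and the 20-bit minor split is an assumption that mirrors the kernel's internal dev_t layout.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's MAJOR()/MINOR()/MKDEV() helpers,
 * assuming a 20-bit minor field. */
#define SKETCH_MINORBITS 20
#define SKETCH_MINORMASK ((1U << SKETCH_MINORBITS) - 1)
#define SKETCH_MAJOR(dev) ((unsigned int)((dev) >> SKETCH_MINORBITS))
#define SKETCH_MINOR(dev) ((unsigned int)((dev) & SKETCH_MINORMASK))
#define SKETCH_MKDEV(ma, mi) (((unsigned int)(ma) << SKETCH_MINORBITS) | (mi))

/* Format the specific alias that is tried first. */
static int sketch_alias_specific(char *buf, size_t len, unsigned int devt)
{
	return snprintf(buf, len, "block-major-%u-%u",
			SKETCH_MAJOR(devt), SKETCH_MINOR(devt));
}

/* Format the old-style 2.4 alias used as the fallback. */
static int sketch_alias_fallback(char *buf, size_t len, unsigned int devt)
{
	return snprintf(buf, len, "block-major-%u", SKETCH_MAJOR(devt));
}
```

For a device number built as SKETCH_MKDEV(8, 1), the two helpers produce "block-major-8-1" and "block-major-8".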
|
2020-10-29 22:58:27 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifdef CONFIG_PROC_FS
|
|
|
|
/* iterator */
|
2008-09-03 14:57:12 +08:00
|
|
|
static void *disk_seqf_start(struct seq_file *seqf, loff_t *pos)
|
2008-05-23 05:21:08 +08:00
|
|
|
{
|
2008-09-03 14:57:12 +08:00
|
|
|
loff_t skip = *pos;
|
|
|
|
struct class_dev_iter *iter;
|
|
|
|
struct device *dev;
|
2008-05-23 05:21:08 +08:00
|
|
|
|
2008-08-28 15:27:42 +08:00
|
|
|
iter = kmalloc(sizeof(*iter), GFP_KERNEL);
|
2008-09-03 14:57:12 +08:00
|
|
|
if (!iter)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
seqf->private = iter;
|
|
|
|
class_dev_iter_init(iter, &block_class, NULL, &disk_type);
|
|
|
|
do {
|
|
|
|
dev = class_dev_iter_next(iter);
|
|
|
|
if (!dev)
|
|
|
|
return NULL;
|
|
|
|
} while (skip--);
|
|
|
|
|
|
|
|
return dev_to_disk(dev);
|
2008-05-23 05:21:08 +08:00
|
|
|
}
|
|
|
|
|
2008-09-03 14:57:12 +08:00
|
|
|
static void *disk_seqf_next(struct seq_file *seqf, void *v, loff_t *pos)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-05-22 04:08:01 +08:00
|
|
|
struct device *dev;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-09-03 14:57:12 +08:00
|
|
|
(*pos)++;
|
|
|
|
dev = class_dev_iter_next(seqf->private);
|
2008-09-03 14:53:37 +08:00
|
|
|
if (dev)
|
2008-05-23 05:21:08 +08:00
|
|
|
return dev_to_disk(dev);
|
2008-09-03 14:53:37 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2008-09-03 14:57:12 +08:00
|
|
|
static void disk_seqf_stop(struct seq_file *seqf, void *v)
|
2008-05-23 05:21:08 +08:00
|
|
|
{
|
2008-09-03 14:57:12 +08:00
|
|
|
struct class_dev_iter *iter = seqf->private;
|
2008-05-23 05:21:08 +08:00
|
|
|
|
2008-09-03 14:57:12 +08:00
|
|
|
/* stop is called even after start failed :-( */
|
|
|
|
if (iter) {
|
|
|
|
class_dev_iter_exit(iter);
|
|
|
|
kfree(iter);
|
2016-07-29 16:40:31 +08:00
|
|
|
seqf->private = NULL;
|
2008-08-16 20:30:30 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
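The start/next/stop helpers above follow the usual seq_file iterator protocol: start re-creates the iterator and skips forward to the requested position, next advances by one, and a NULL return ends the walk. Below is a userspace sketch of the same skip loop over a plain array. The sketch_iter type and sketch_* functions are hypothetical and only model the control flow, not the class_dev_iter API.

```c
#include <stddef.h>

/* Hypothetical iterator over an int array, standing in for class_dev_iter. */
struct sketch_iter {
	const int *items;
	size_t count;
	size_t next;
};

/* Return the next element, or NULL once the array is exhausted. */
static const int *sketch_next(struct sketch_iter *it)
{
	if (it->next >= it->count)
		return NULL;
	return &it->items[it->next++];
}

/* Mirror of the do/while skip loop in disk_seqf_start(): consume pos + 1
 * elements and return the last one, or NULL if the array ends first. */
static const int *sketch_start(struct sketch_iter *it, const int *items,
			       size_t count, long pos)
{
	const int *cur;
	long skip = pos;

	it->items = items;
	it->count = count;
	it->next = 0;
	do {
		cur = sketch_next(it);
		if (!cur)
			return NULL;
	} while (skip--);

	return cur;
}
```

Because start always rewinds and skips, a reader that resumes at position N pays for N extra steps. That is the trade-off of keeping no state between read() calls other than the position.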
|
|
|
|
|
2008-09-03 14:57:12 +08:00
|
|
|
static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
block: Don't use static to define "void *p" in show_partition_start()
I met an odd problem: read /proc/partitions may return zero.
I wrote a file test.c:
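#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
	char buff[4096];
	int ret;
	int fd;
	printf("pid=%d\n", getpid());
	while (1) {
		fd = open("/proc/partitions", O_RDONLY);
		if (fd < 0) {
			printf("open error %s\n", strerror(errno));
			return 0;
		}
		ret = read(fd, buff, 4096);
		/* An empty or failed read is the symptom being reproduced. */
		if (ret <= 0)
			printf("ret=%d, %s, %ld\n", ret,
			       strerror(errno), (long)lseek(fd, 0, SEEK_CUR));
		close(fd);
	}
	return 0;
}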
You can reproduce by:
1:while true;do cat /proc/partitions > /dev/null ;done
2:./test
I reviewed the code and found:
>> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
>> {
>> static void *p;
>>
>> p = disk_seqf_start(seqf, pos);
>> if (!IS_ERR_OR_NULL(p) && !*pos)
>> seq_puts(seqf, "major minor #blocks name\n\n");
>> return p;
>> }
test cat /proc/partitions
p = disk_seqf_start()(Not NULL)
p = disk_seqf_start()(NULL because pos)
if (!IS_ERR_OR_NULL(p) && !*pos)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-03 16:42:00 +08:00
|
|
|
void *p;
|
2008-09-03 14:57:12 +08:00
|
|
|
|
|
|
|
p = disk_seqf_start(seqf, pos);
|
2010-12-17 15:58:36 +08:00
|
|
|
if (!IS_ERR_OR_NULL(p) && !*pos)
|
2008-09-03 14:57:12 +08:00
|
|
|
seq_puts(seqf, "major minor #blocks name\n\n");
|
|
|
|
return p;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-09-03 15:01:09 +08:00
|
|
|
static int show_partition(struct seq_file *seqf, void *v)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct gendisk *sgp = v;
|
2020-11-24 16:52:59 +08:00
|
|
|
struct block_device *part;
|
2021-04-06 14:23:00 +08:00
|
|
|
unsigned long idx;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-11-22 21:06:21 +08:00
|
|
|
if (!get_capacity(sgp) || (sgp->flags & GENHD_FL_HIDDEN))
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
|
|
|
|
2021-04-06 14:23:00 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
xa_for_each(&sgp->part_tbl, idx, part) {
|
|
|
|
if (!bdev_nr_sectors(part))
|
|
|
|
continue;
|
2021-07-27 14:25:15 +08:00
|
|
|
seq_printf(seqf, "%4d %7d %10llu %pg\n",
|
2020-11-24 16:52:59 +08:00
|
|
|
MAJOR(part->bd_dev), MINOR(part->bd_dev),
|
2021-07-27 14:25:15 +08:00
|
|
|
bdev_nr_sectors(part) >> 1, part);
|
2021-04-06 14:23:00 +08:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
|
|
|
}
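Each row printed by show_partition() has the form "major minor #blocks name", where the block count is the sector count shifted right by one (512-byte sectors to 1 KiB blocks). A userspace consumer could parse such a row roughly as follows. This is only a sketch: struct sketch_part and the sketch_* functions are hypothetical names, and the 63-character name limit is an arbitrary assumption.

```c
#include <stdio.h>

/* Hypothetical holder for one parsed /proc/partitions row. */
struct sketch_part {
	unsigned int major;
	unsigned int minor;
	unsigned long long blocks;	/* 1 KiB units, as printed */
	char name[64];
};

/* Parse one data row; returns 0 on success, -1 for headers or blank lines. */
static int sketch_parse_partition(const char *line, struct sketch_part *out)
{
	int n = sscanf(line, "%u %u %llu %63s",
		       &out->major, &out->minor, &out->blocks, out->name);

	return n == 4 ? 0 : -1;
}

/* Undo the ">> 1" applied when printing: 1 KiB blocks to 512-byte sectors. */
static unsigned long long sketch_blocks_to_sectors(unsigned long long blocks)
{
	return blocks << 1;
}
```

The header line and the blank line that follows it fail the four-field match, so a caller can feed every line of the file through the parser and simply skip the ones that return -1.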
|
|
|
|
|
2008-10-05 03:53:21 +08:00
|
|
|
static const struct seq_operations partitions_op = {
|
2008-09-03 14:57:12 +08:00
|
|
|
.start = show_partition_start,
|
|
|
|
.next = disk_seqf_next,
|
|
|
|
.stop = disk_seqf_stop,
|
2007-05-22 04:08:01 +08:00
|
|
|
.show = show_partition
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static int __init genhd_device_init(void)
|
|
|
|
{
|
2008-04-22 01:51:07 +08:00
|
|
|
int error;
|
|
|
|
|
|
|
|
error = class_register(&block_class);
|
2008-03-12 08:13:15 +08:00
|
|
|
if (unlikely(error))
|
|
|
|
return error;
|
2005-04-17 06:20:36 +08:00
|
|
|
blk_dev_init();
|
2007-05-22 04:08:01 +08:00
|
|
|
|
block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash
We run into system boot failure with kernel 2.6.28-rc. We found it on a
couple of machines, including T61 notebook, nehalem machine, and another
HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9.
With kernel prior to 2.6.28-rc, system boot doesn't fail.
I debug it and locate the root cause. Pls. see
http://bugzilla.kernel.org/show_bug.cgi?id=11899
https://bugzilla.redhat.com/show_bug.cgi?id=471517
As a matter of fact, there are 2 bugs.
1)root=/dev/sda1, system boot randomly fails. Mostly, boot for 5 times
and fails once. nash has a bug. Some of its functions misuse return
value 0. Sometimes, 0 means timeout and no uevent available. Sometimes,
0 means nash gets an uevent, but the uevent isn't block-related (for
example, usb). If by coincidence, kernel tells nash that uevents are
available, but kernel also set timeout, nash might stop collecting
other uevents in queue if current uevent isn't block-related. I work
out a patch for nash to fix it.
http://bugzilla.kernel.org/attachment.cgi?id=18858
2) root=LABEL=/, system always can't boot. initrd init reports
switchroot fails. Here is an execution path of nash when booting:
(1) nash read /sys/block/sda/dev; Assume major is 8 (on my desktop)
(2) nash query /proc/devices with the major number; It found line
"8 sd";
(3) nash use 'sd' to search its own probe table to find device (DISK)
type for the device and add it to its own list;
(4) Later on, it probes all devices in its list to get filesystem
labels; scsi register "8 sd" always.
When major is 259, nash fails to find the device(DISK) type. I enables
CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling kernel, so 259 is picked up
for device /dev/sda1, which causes nash to fail to find device (DISK)
type.
To fix issue 2), I create a patch for nash and another patch for
kernel.
http://bugzilla.kernel.org/attachment.cgi?id=18859
http://bugzilla.kernel.org/attachment.cgi?id=18837
Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
block device in proc/devices.
With 2 patches on nash and 1 patch on kernel, I boot my machines for
dozens of times without failure.
Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-14 15:26:30 +08:00
|
|
|
register_blkdev(BLOCK_EXT_MAJOR, "blkext");
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
/* create top-level block dir */
|
driver core: remove CONFIG_SYSFS_DEPRECATED and CONFIG_SYSFS_DEPRECATED_V2
CONFIG_SYSFS_DEPRECATED was added in commit 88a22c985e35
("CONFIG_SYSFS_DEPRECATED") in 2006 to allow systems with older versions
of some tools (i.e. Fedora 3's version of udev) to boot properly. Four
years later, in 2010, the option was attempted to be removed as most of
userspace should have been fixed up properly by then, but some kernel
developers clung to those old systems and refused to update, so we added
CONFIG_SYSFS_DEPRECATED_V2 in commit e52eec13cd6b ("SYSFS: Allow boot
time switching between deprecated and modern sysfs layout") to allow
them to continue to boot properly, and we allowed a boot time parameter
to be used to switch back to the old format if needed.
Over time, the logic that was covered under these config options was
slowly removed from individual driver subsystems successfully, removed,
and the only thing that is now left in the kernel are some changes in
the block layer's representation in sysfs where real directories are
used instead of symlinks like normal.
Because the original changes were done to userspace tools in 2006, and
all distros that use those tools are long end-of-life, and older
non-udev-based systems do not care about the block layer's sysfs
representation, it is time to finally remove this old logic and the
config entries from the kernel.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: linux-block@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20230223073326.2073220-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-23 15:33:26 +08:00
|
|
|
block_depr = kobject_create_and_add("block", NULL);
|
2007-11-07 02:36:58 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
subsys_initcall(genhd_device_init);
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
static ssize_t disk_range_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-05-22 04:08:01 +08:00
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
return sprintf(buf, "%d\n", disk->minors);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2008-08-25 18:47:23 +08:00
|
|
|
static ssize_t disk_ext_range_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
2021-11-22 21:06:22 +08:00
|
|
|
return sprintf(buf, "%d\n",
|
|
|
|
(disk->flags & GENHD_FL_NO_PART) ? 1 : DISK_MAX_PARTS);
|
2008-08-25 18:47:23 +08:00
|
|
|
}
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
static ssize_t disk_removable_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2005-10-01 20:49:43 +08:00
|
|
|
{
|
2007-05-22 04:08:01 +08:00
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
2005-10-01 20:49:43 +08:00
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
return sprintf(buf, "%d\n",
|
|
|
|
(disk->flags & GENHD_FL_REMOVABLE ? 1 : 0));
|
2005-10-01 20:49:43 +08:00
|
|
|
}
|
|
|
|
|
2017-11-03 02:29:53 +08:00
|
|
|
static ssize_t disk_hidden_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
return sprintf(buf, "%d\n",
|
|
|
|
(disk->flags & GENHD_FL_HIDDEN ? 1 : 0));
|
|
|
|
}
|
|
|
|
|
2008-06-13 15:41:00 +08:00
|
|
|
static ssize_t disk_ro_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
2008-08-25 18:56:10 +08:00
|
|
|
return sprintf(buf, "%d\n", get_disk_ro(disk) ? 1 : 0);
|
2008-06-13 15:41:00 +08:00
|
|
|
}
|
|
|
|
|
2020-03-24 15:25:13 +08:00
|
|
|
ssize_t part_size_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
2020-11-27 23:43:51 +08:00
|
|
|
return sprintf(buf, "%llu\n", bdev_nr_sectors(dev_to_bdev(dev)));
|
2020-03-24 15:25:13 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ssize_t part_stat_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
2020-11-27 23:43:51 +08:00
|
|
|
struct block_device *bdev = dev_to_bdev(dev);
|
2020-03-25 21:07:06 +08:00
|
|
|
struct disk_stats stat;
|
2020-03-24 15:25:13 +08:00
|
|
|
unsigned int inflight;
|
|
|
|
|
block: fix that util can be greater than 100%
util means the percentage that disk has IO, and theoretically it should
not be greater than 100%. However, there is a gap for rq-based disk:
io_ticks will be updated when rq is allocated, however, before such rq
dispatch to driver, it will not be account as inflight from
blk_mq_start_request() hence diskstats_show()/part_stat_show() will not
update io_ticks. For example:
1) at t0, issue a new IO, rq is allocated, and blk_account_io_start()
update io_ticks;
2) something is wrong with drivers, and the rq can't be dispatched;
3) at t0 + 10s, drivers recovers and rq is dispatched and done, io_ticks
is updated;
Then if user is using "iostat 1" to monitor "util", between t0 - t0+9s,
util will be zero, and between t0+9s - t0+10s, util will be 1000%.
Fix this problem by updating io_ticks from diskstats_show() and
part_stat_show() if there are rq allocated.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:17 +08:00
|
|
|
inflight = part_in_flight(bdev);
|
2022-02-17 14:42:47 +08:00
|
|
|
if (inflight) {
|
|
|
|
part_stat_lock();
|
|
|
|
update_io_ticks(bdev, jiffies, true);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
part_stat_read_all(bdev, &stat);
|
2020-03-24 15:25:13 +08:00
|
|
|
return sprintf(buf,
|
|
|
|
"%8lu %8lu %8llu %8u "
|
|
|
|
"%8lu %8lu %8llu %8u "
|
|
|
|
"%8u %8u %8u "
|
|
|
|
"%8lu %8lu %8llu %8u "
|
|
|
|
"%8lu %8u"
|
|
|
|
"\n",
|
2020-03-25 21:07:06 +08:00
|
|
|
stat.ios[STAT_READ],
|
|
|
|
stat.merges[STAT_READ],
|
|
|
|
(unsigned long long)stat.sectors[STAT_READ],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_READ], NSEC_PER_MSEC),
|
|
|
|
stat.ios[STAT_WRITE],
|
|
|
|
stat.merges[STAT_WRITE],
|
|
|
|
(unsigned long long)stat.sectors[STAT_WRITE],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
|
2020-03-24 15:25:13 +08:00
|
|
|
inflight,
|
2020-03-25 21:07:06 +08:00
|
|
|
jiffies_to_msecs(stat.io_ticks),
|
2020-03-25 21:07:08 +08:00
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_READ] +
|
|
|
|
stat.nsecs[STAT_WRITE] +
|
|
|
|
stat.nsecs[STAT_DISCARD] +
|
|
|
|
stat.nsecs[STAT_FLUSH],
|
|
|
|
NSEC_PER_MSEC),
|
2020-03-25 21:07:06 +08:00
|
|
|
stat.ios[STAT_DISCARD],
|
|
|
|
stat.merges[STAT_DISCARD],
|
|
|
|
(unsigned long long)stat.sectors[STAT_DISCARD],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_DISCARD], NSEC_PER_MSEC),
|
|
|
|
stat.ios[STAT_FLUSH],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_FLUSH], NSEC_PER_MSEC));
|
2020-03-24 15:25:13 +08:00
|
|
|
}
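The io_ticks value printed by part_stat_show() is a running total of milliseconds during which the device had I/O in flight. The "util" figure discussed in the commit message above can be modelled as the change in io_ticks divided by the sampling interval. The sketch below works through that arithmetic. The function name is hypothetical, and the formula is an assumption about how a monitoring tool such as iostat derives the percentage, not code taken from it.

```c
/* Hypothetical helper: utilisation in percent over one sampling interval,
 * computed from two io_ticks samples given in milliseconds. */
static unsigned int sketch_util_percent(unsigned int io_ticks_prev_ms,
					unsigned int io_ticks_now_ms,
					unsigned int interval_ms)
{
	unsigned long long delta = io_ticks_now_ms - io_ticks_prev_ms;

	if (!interval_ms)
		return 0;
	return (unsigned int)(delta * 100 / interval_ms);
}
```

With the numbers from the commit message, a request stuck for 10 seconds whose io_ticks are only accounted at completion gives a delta of 10000 ms inside a single 1000 ms interval, which this formula reports as 1000%. Updating io_ticks on every read of the stat file spreads that time across the intervals in which it elapsed.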
|
|
|
|
|
|
|
|
ssize_t part_inflight_show(struct device *dev, struct device_attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
2020-11-27 23:43:51 +08:00
|
|
|
struct block_device *bdev = dev_to_bdev(dev);
|
2021-10-14 22:03:30 +08:00
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
2020-03-24 15:25:13 +08:00
|
|
|
unsigned int inflight[2];
|
|
|
|
|
2020-05-13 18:49:33 +08:00
|
|
|
if (queue_is_mq(q))
|
2020-11-27 23:43:51 +08:00
|
|
|
blk_mq_in_flight_rw(q, bdev, inflight);
|
2020-05-13 18:49:33 +08:00
|
|
|
else
|
2020-11-27 23:43:51 +08:00
|
|
|
part_in_flight_rw(bdev, inflight);
|
2020-05-13 18:49:33 +08:00
|
|
|
|
2020-03-24 15:25:13 +08:00
|
|
|
return sprintf(buf, "%8u %8u\n", inflight[0], inflight[1]);
|
|
|
|
}
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
static ssize_t disk_capability_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
2007-05-24 04:57:38 +08:00
|
|
|
{
|
2023-02-03 23:02:09 +08:00
|
|
|
dev_warn_once(dev, "the capability attribute has been deprecated.\n");
|
|
|
|
return sprintf(buf, "0\n");
|
2007-05-24 04:57:38 +08:00
|
|
|
}
|
2007-05-22 04:08:01 +08:00
|
|
|
|
2009-05-23 05:17:53 +08:00
|
|
|
static ssize_t disk_alignment_offset_show(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
2022-04-15 12:52:48 +08:00
|
|
|
return sprintf(buf, "%d\n", bdev_alignment_offset(disk->part0));
|
2009-05-23 05:17:53 +08:00
|
|
|
}
|
|
|
|
|
2009-11-10 18:50:21 +08:00
|
|
|
static ssize_t disk_discard_alignment_show(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
2022-04-15 12:52:50 +08:00
|
|
|
return sprintf(buf, "%d\n", bdev_discard_alignment(disk->part0));
|
2009-11-10 18:50:21 +08:00
|
|
|
}
|
|
|
|
|
2021-07-13 07:05:28 +08:00
|
|
|
static ssize_t diskseq_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
return sprintf(buf, "%llu\n", disk->diskseq);
|
|
|
|
}
|
|
|
|
|
2024-05-02 21:00:33 +08:00
|
|
|
static ssize_t partscan_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
|
|
|
return sprintf(buf, "%u\n", disk_has_partscan(dev_to_disk(dev)));
|
|
|
|
}
|
|
|
|
|
2018-05-25 03:38:59 +08:00
|
|
|
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
|
|
|
|
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
|
|
|
|
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
|
|
|
|
static DEVICE_ATTR(hidden, 0444, disk_hidden_show, NULL);
|
|
|
|
static DEVICE_ATTR(ro, 0444, disk_ro_show, NULL);
|
|
|
|
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
|
|
|
|
static DEVICE_ATTR(alignment_offset, 0444, disk_alignment_offset_show, NULL);
|
|
|
|
static DEVICE_ATTR(discard_alignment, 0444, disk_discard_alignment_show, NULL);
|
|
|
|
static DEVICE_ATTR(capability, 0444, disk_capability_show, NULL);
|
|
|
|
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
|
|
|
|
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
|
|
|
|
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
|
2021-07-13 07:05:28 +08:00
|
|
|
static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
|
2024-05-02 21:00:33 +08:00
|
|
|
static DEVICE_ATTR(partscan, 0444, partscan_show, NULL);
|
2020-03-24 15:25:13 +08:00
|
|
|
|
2006-12-08 18:39:46 +08:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
2020-03-24 15:25:13 +08:00
|
|
|
ssize_t part_fail_show(struct device *dev,
|
|
|
|
struct device_attribute *attr, char *buf)
|
|
|
|
{
|
2024-04-28 12:15:07 +08:00
|
|
|
return sprintf(buf, "%d\n",
|
|
|
|
bdev_test_flag(dev_to_bdev(dev), BD_MAKE_IT_FAIL));
|
2020-03-24 15:25:13 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ssize_t part_fail_store(struct device *dev,
|
|
|
|
struct device_attribute *attr,
|
|
|
|
const char *buf, size_t count)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2024-04-28 12:15:07 +08:00
|
|
|
if (count > 0 && sscanf(buf, "%d", &i) > 0) {
|
|
|
|
if (i)
|
|
|
|
bdev_set_flag(dev_to_bdev(dev), BD_MAKE_IT_FAIL);
|
|
|
|
else
|
|
|
|
bdev_clear_flag(dev_to_bdev(dev), BD_MAKE_IT_FAIL);
|
|
|
|
}
|
2020-03-24 15:25:13 +08:00
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
static struct device_attribute dev_attr_fail =
|
2018-05-25 03:38:59 +08:00
|
|
|
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
|
2020-03-24 15:25:13 +08:00
|
|
|
#endif /* CONFIG_FAIL_MAKE_REQUEST */
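part_fail_store() above accepts any integer: a non-zero value sets the flag, zero clears it, and unparsable input leaves the flag untouched while still consuming the whole write. A userspace sketch of the same parsing rule follows, with a plain bool standing in for the BD_MAKE_IT_FAIL bit. The function name and the bool flag are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the store handler: update *flag only when the
 * buffer starts with a parsable integer, and always report count consumed. */
static size_t sketch_fail_store(const char *buf, size_t count, bool *flag)
{
	int i;

	if (count > 0 && sscanf(buf, "%d", &i) > 0)
		*flag = (i != 0);

	return count;
}
```

Returning count even for bad input means a writer never sees an error for a malformed value. This matches the behaviour of the handler shown above rather than being a recommendation for new sysfs attributes.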
|
|
|
|
|
2008-09-14 20:56:33 +08:00
|
|
|
#ifdef CONFIG_FAIL_IO_TIMEOUT
|
|
|
|
static struct device_attribute dev_attr_fail_timeout =
|
2018-05-25 03:38:59 +08:00
|
|
|
__ATTR(io-timeout-fail, 0644, part_timeout_show, part_timeout_store);
|
2008-09-14 20:56:33 +08:00
|
|
|
#endif
|
2007-05-22 04:08:01 +08:00
|
|
|
|
|
|
|
static struct attribute *disk_attrs[] = {
|
|
|
|
&dev_attr_range.attr,
|
2008-08-25 18:47:23 +08:00
|
|
|
&dev_attr_ext_range.attr,
|
2007-05-22 04:08:01 +08:00
|
|
|
&dev_attr_removable.attr,
|
2017-11-03 02:29:53 +08:00
|
|
|
&dev_attr_hidden.attr,
|
2008-06-13 15:41:00 +08:00
|
|
|
&dev_attr_ro.attr,
|
2007-05-22 04:08:01 +08:00
|
|
|
&dev_attr_size.attr,
|
2009-05-23 05:17:53 +08:00
|
|
|
&dev_attr_alignment_offset.attr,
|
2009-11-10 18:50:21 +08:00
|
|
|
&dev_attr_discard_alignment.attr,
|
2007-05-22 04:08:01 +08:00
|
|
|
&dev_attr_capability.attr,
|
|
|
|
&dev_attr_stat.attr,
|
block: Seperate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-07 02:16:55 +08:00
|
|
|
&dev_attr_inflight.attr,
|
2016-01-10 00:36:51 +08:00
|
|
|
&dev_attr_badblocks.attr,
|
2021-06-24 15:38:43 +08:00
|
|
|
&dev_attr_events.attr,
|
|
|
|
&dev_attr_events_async.attr,
|
|
|
|
&dev_attr_events_poll_msecs.attr,
|
2021-07-13 07:05:28 +08:00
|
|
|
&dev_attr_diskseq.attr,
|
2024-05-02 21:00:33 +08:00
|
|
|
&dev_attr_partscan.attr,
|
2007-05-22 04:08:01 +08:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
|
|
|
&dev_attr_fail.attr,
|
2008-09-14 20:56:33 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_FAIL_IO_TIMEOUT
|
|
|
|
&dev_attr_fail_timeout.attr,
|
2007-05-22 04:08:01 +08:00
|
|
|
#endif
|
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
2017-04-28 05:46:26 +08:00
|
|
|
static umode_t disk_visible(struct kobject *kobj, struct attribute *a, int n)
|
|
|
|
{
|
|
|
|
struct device *dev = container_of(kobj, typeof(*dev), kobj);
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
if (a == &dev_attr_badblocks.attr && !disk->bb)
|
|
|
|
return 0;
|
|
|
|
return a->mode;
|
|
|
|
}
|
|
|
|
|
2007-05-22 04:08:01 +08:00
|
|
|
static struct attribute_group disk_attr_group = {
|
|
|
|
.attrs = disk_attrs,
|
2017-04-28 05:46:26 +08:00
|
|
|
.is_visible = disk_visible,
|
2007-05-22 04:08:01 +08:00
|
|
|
};
|
|
|
|
|
2009-06-25 01:06:31 +08:00
|
|
|
static const struct attribute_group *disk_attr_groups[] = {
|
2007-05-22 04:08:01 +08:00
|
|
|
&disk_attr_group,
|
2022-06-29 01:18:45 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_IO_TRACE
|
|
|
|
&blk_trace_attr_group,
|
2023-03-19 01:36:25 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
&blk_integrity_attr_group,
|
2022-06-29 01:18:45 +08:00
|
|
|
#endif
|
2007-05-22 04:08:01 +08:00
|
|
|
NULL
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2020-06-20 04:47:23 +08:00
|
|
|
/**
|
|
|
|
* disk_release - releases all allocated resources of the gendisk
|
|
|
|
* @dev: the device representing this disk
|
|
|
|
*
|
|
|
|
* This function releases all allocated resources of the gendisk.
|
|
|
|
*
|
|
|
|
* Drivers which used __device_add_disk() have a gendisk with a request_queue
|
|
|
|
* assigned. Since the request_queue sits on top of the gendisk for these
|
|
|
|
* drivers we also call blk_put_queue() for them, and we expect the
|
|
|
|
* request_queue refcount to reach 0 at this point, and so the request_queue
|
|
|
|
* will also be freed prior to the disk.
|
2020-06-20 04:47:25 +08:00
|
|
|
*
|
|
|
|
* Context: can sleep
|
2020-06-20 04:47:23 +08:00
|
|
|
*/
|
2007-05-22 04:08:01 +08:00
|
|
|
static void disk_release(struct device *dev)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-05-22 04:08:01 +08:00
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
2020-06-20 04:47:25 +08:00
|
|
|
might_sleep();
|
2021-10-14 21:02:31 +08:00
|
|
|
WARN_ON_ONCE(disk_live(disk));
|
2020-06-20 04:47:25 +08:00
|
|
|
|
2023-06-10 10:20:03 +08:00
|
|
|
blk_trace_remove(disk->queue);
|
|
|
|
|
2022-07-20 21:05:41 +08:00
|
|
|
/*
|
|
|
|
* To undo all the initialization from blk_mq_init_allocated_queue in
|
|
|
|
* case of a probe failure where add_disk is never called we have to
|
|
|
|
* call blk_mq_exit_queue here. We can't do this for the more common
|
|
|
|
* teardown case (yet) as the tagset can be gone by the time the disk
|
|
|
|
* is released once it was added.
|
|
|
|
*/
|
|
|
|
if (queue_is_mq(disk->queue) &&
|
|
|
|
test_bit(GD_OWNS_QUEUE, &disk->state) &&
|
|
|
|
!test_bit(GD_ADDED, &disk->state))
|
|
|
|
blk_mq_exit_queue(disk->queue);
|
|
|
|
|
2023-02-15 02:33:06 +08:00
|
|
|
blkcg_exit_disk(disk);
|
|
|
|
|
2022-07-28 00:22:57 +08:00
|
|
|
bioset_exit(&disk->bio_split);
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
To avoid slowing down queue destruction, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
dispatch work is delayed until blk_release_queue().
However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
|
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy; an in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return it if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called concurrently.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs controls polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of the currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09 03:57:37 +08:00
|
|
|
disk_release_events(disk);
|
2005-04-17 06:20:36 +08:00
|
|
|
kfree(disk->random);
|
block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
This mechanism allows to:
- Untangle zone write ordering from block IO schedulers. This allows
removing the restriction on using mq-deadline for writing to zoned
block devices. Any block IO scheduler, including "none" can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged
BIOs waiting for execution thus do not hold scheduling tags and thus
are not preventing other BIOs from executing (reads or writes to
other zones). Depending on the workload, this can significantly
improve the device use (higher queue depth operation) and
performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices
(e.g. device mapper) can use zone write plugging. It is mandatory
for the former but optional for the latter. BIO-based drivers can
use zone write plugging to implement write ordering guarantees, or
the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to
blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plug structures are
managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables zone write plugging by default for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of flagged BIOs and requests triggers calls to the
functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(), respectively. The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.
To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plug resources when a gendisk
is allocated.
In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08 09:41:07 +08:00
|
|
|
disk_free_zone_resources(disk);
|
2021-01-24 18:02:41 +08:00
|
|
|
xa_destroy(&disk->part_tbl);
|
2022-03-08 13:51:55 +08:00
|
|
|
|
2021-08-16 21:46:24 +08:00
|
|
|
disk->queue->disk = NULL;
|
2021-08-16 21:19:09 +08:00
|
|
|
blk_put_queue(disk->queue);
|
2022-02-15 17:45:10 +08:00
|
|
|
|
|
|
|
if (test_bit(GD_ADDED, &disk->state) && disk->fops->free_disk)
|
|
|
|
disk->fops->free_disk(disk);
|
|
|
|
|
2024-04-29 07:01:39 +08:00
|
|
|
bdev_drop(disk->part0); /* frees the disk */
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2021-07-13 07:05:26 +08:00
|
|
|
|
2022-11-23 20:25:19 +08:00
|
|
|
static int block_uevent(const struct device *dev, struct kobj_uevent_env *env)
|
2021-07-13 07:05:26 +08:00
|
|
|
{
|
2022-11-23 20:25:19 +08:00
|
|
|
const struct gendisk *disk = dev_to_disk(dev);
|
2021-07-13 07:05:26 +08:00
|
|
|
|
|
|
|
return add_uevent_var(env, "DISKSEQ=%llu", disk->diskseq);
|
|
|
|
}
|
|
|
|
|
2024-03-06 03:32:16 +08:00
|
|
|
const struct class block_class = {
|
2007-05-22 04:08:01 +08:00
|
|
|
.name = "block",
|
2021-07-13 07:05:26 +08:00
|
|
|
.dev_uevent = block_uevent,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2023-01-11 19:30:08 +08:00
|
|
|
static char *block_devnode(const struct device *dev, umode_t *mode,
|
2023-01-05 05:44:02 +08:00
|
|
|
kuid_t *uid, kgid_t *gid)
|
|
|
|
{
|
|
|
|
struct gendisk *disk = dev_to_disk(dev);
|
|
|
|
|
|
|
|
if (disk->fops->devnode)
|
|
|
|
return disk->fops->devnode(disk, mode);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-06-02 04:12:05 +08:00
|
|
|
const struct device_type disk_type = {
|
2007-05-22 04:08:01 +08:00
|
|
|
.name = "disk",
|
|
|
|
.groups = disk_attr_groups,
|
|
|
|
.release = disk_release,
|
2023-01-05 05:44:02 +08:00
|
|
|
.devnode = block_devnode,
|
2005-04-17 06:20:36 +08:00
|
|
|
};
|
|
|
|
|
2008-05-24 00:44:11 +08:00
|
|
|
#ifdef CONFIG_PROC_FS
|
2008-09-03 15:01:09 +08:00
|
|
|
/*
|
|
|
|
* aggregate disk stat collector. Uses the same stats that the sysfs
|
|
|
|
* entries do, above, but makes them available through one seq_file.
|
|
|
|
*
|
|
|
|
* The output looks suspiciously like /proc/partitions with a bunch of
|
|
|
|
* extra fields.
|
|
|
|
*/
|
|
|
|
static int diskstats_show(struct seq_file *seqf, void *v)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
struct gendisk *gp = v;
|
2020-11-24 16:52:59 +08:00
|
|
|
struct block_device *hd;
|
2018-12-07 00:41:21 +08:00
|
|
|
unsigned int inflight;
|
2020-03-25 21:07:06 +08:00
|
|
|
struct disk_stats stat;
|
2021-04-06 14:23:01 +08:00
|
|
|
unsigned long idx;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
2008-08-25 18:56:05 +08:00
|
|
|
if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
|
2008-09-03 15:01:09 +08:00
|
|
|
seq_puts(seqf, "major minor name"
|
2005-04-17 06:20:36 +08:00
|
|
|
" rio rmerge rsect ruse wio wmerge "
|
|
|
|
"wsect wuse running use aveq"
|
|
|
|
"\n\n");
|
|
|
|
*/
|
2011-06-13 16:45:43 +08:00
|
|
|
|
2021-04-06 14:23:01 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
xa_for_each(&gp->part_tbl, idx, hd) {
|
|
|
|
if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
|
|
|
|
continue;
|
2020-03-25 21:07:06 +08:00
|
|
|
|
block: fix that util can be greater than 100%
util is the percentage of time during which the disk has IO, and in theory it should
not be greater than 100%. However, there is a gap for rq-based disks:
io_ticks is updated when a rq is allocated, but until that rq is
dispatched to the driver it is not accounted as inflight (inflight is
counted from blk_mq_start_request()), hence
diskstats_show()/part_stat_show() will not update io_ticks. For example:
1) at t0, a new IO is issued, a rq is allocated, and blk_account_io_start()
updates io_ticks;
2) something is wrong with the driver, and the rq can't be dispatched;
3) at t0 + 10s, the driver recovers and the rq is dispatched and done, io_ticks
is updated;
Then if the user is using "iostat 1" to monitor "util", between t0 and t0+9s,
util will be zero, and between t0+9s and t0+10s, util will be 1000%.
Fix this problem by updating io_ticks from diskstats_show() and
part_stat_show() if there are rqs allocated.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:17 +08:00
|
|
|
inflight = part_in_flight(hd);
|
2022-02-17 14:42:47 +08:00
|
|
|
if (inflight) {
|
|
|
|
part_stat_lock();
|
|
|
|
update_io_ticks(hd, jiffies, true);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
part_stat_read_all(hd, &stat);
|
2021-07-27 14:25:13 +08:00
|
|
|
seq_printf(seqf, "%4d %7d %pg "
|
2018-07-18 19:47:40 +08:00
|
|
|
"%lu %lu %lu %u "
|
|
|
|
"%lu %lu %lu %u "
|
|
|
|
"%u %u %u "
|
2019-11-21 18:40:26 +08:00
|
|
|
"%lu %lu %lu %u "
|
|
|
|
"%lu %u"
|
|
|
|
"\n",
|
2021-07-27 14:25:13 +08:00
|
|
|
MAJOR(hd->bd_dev), MINOR(hd->bd_dev), hd,
|
2020-03-25 21:07:06 +08:00
|
|
|
stat.ios[STAT_READ],
|
|
|
|
stat.merges[STAT_READ],
|
|
|
|
stat.sectors[STAT_READ],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_READ],
|
|
|
|
NSEC_PER_MSEC),
|
|
|
|
stat.ios[STAT_WRITE],
|
|
|
|
stat.merges[STAT_WRITE],
|
|
|
|
stat.sectors[STAT_WRITE],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_WRITE],
|
|
|
|
NSEC_PER_MSEC),
|
2018-12-07 00:41:21 +08:00
|
|
|
inflight,
|
2020-03-25 21:07:06 +08:00
|
|
|
jiffies_to_msecs(stat.io_ticks),
|
2020-03-25 21:07:08 +08:00
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_READ] +
|
|
|
|
stat.nsecs[STAT_WRITE] +
|
|
|
|
stat.nsecs[STAT_DISCARD] +
|
|
|
|
stat.nsecs[STAT_FLUSH],
|
|
|
|
NSEC_PER_MSEC),
|
2020-03-25 21:07:06 +08:00
|
|
|
stat.ios[STAT_DISCARD],
|
|
|
|
stat.merges[STAT_DISCARD],
|
|
|
|
stat.sectors[STAT_DISCARD],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_DISCARD],
|
|
|
|
NSEC_PER_MSEC),
|
|
|
|
stat.ios[STAT_FLUSH],
|
|
|
|
(unsigned int)div_u64(stat.nsecs[STAT_FLUSH],
|
|
|
|
NSEC_PER_MSEC)
|
2008-02-08 18:04:56 +08:00
|
|
|
);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2021-04-06 14:23:01 +08:00
|
|
|
rcu_read_unlock();
|
2011-06-13 16:45:43 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-10-06 16:55:38 +08:00
|
|
|
static const struct seq_operations diskstats_op = {
|
2008-09-03 14:57:12 +08:00
|
|
|
.start = disk_seqf_start,
|
|
|
|
.next = disk_seqf_next,
|
|
|
|
.stop = disk_seqf_stop,
|
2005-04-17 06:20:36 +08:00
|
|
|
.show = diskstats_show
|
|
|
|
};
|
2008-10-05 03:53:21 +08:00
|
|
|
|
|
|
|
static int __init proc_genhd_init(void)
|
|
|
|
{
|
2018-04-14 01:44:18 +08:00
|
|
|
proc_create_seq("diskstats", 0, NULL, &diskstats_op);
|
|
|
|
proc_create_seq("partitions", 0, NULL, &partitions_op);
|
2008-10-05 03:53:21 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
module_init(proc_genhd_init);
|
2008-05-24 00:44:11 +08:00
|
|
|
#endif /* CONFIG_PROC_FS */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-05-25 14:13:00 +08:00
|
|
|
dev_t part_devt(struct gendisk *disk, u8 partno)
|
|
|
|
{
|
2021-05-25 14:13:01 +08:00
|
|
|
struct block_device *part;
|
2021-05-25 14:13:00 +08:00
|
|
|
dev_t devt = 0;
|
|
|
|
|
2021-05-25 14:13:01 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
part = xa_load(&disk->part_tbl, partno);
|
|
|
|
if (part)
|
2021-05-25 14:13:00 +08:00
|
|
|
devt = part->bd_dev;
|
2021-05-25 14:13:01 +08:00
|
|
|
rcu_read_unlock();
|
2021-05-25 14:13:00 +08:00
|
|
|
|
|
|
|
return devt;
|
|
|
|
}
|
|
|
|
|
2021-08-16 21:19:08 +08:00
|
|
|
struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
|
|
|
|
struct lock_class_key *lkclass)
|
2005-06-23 15:08:19 +08:00
|
|
|
{
|
|
|
|
struct gendisk *disk;
|
|
|
|
|
2013-08-30 06:21:42 +08:00
|
|
|
disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id);
|
2020-09-01 02:02:37 +08:00
|
|
|
if (!disk)
|
2022-08-12 07:23:37 +08:00
|
|
|
return NULL;
|
2011-01-07 15:43:37 +08:00
|
|
|
|
2022-07-28 00:22:57 +08:00
|
|
|
if (bioset_init(&disk->bio_split, BIO_POOL_SIZE, 0, 0))
|
|
|
|
goto out_free_disk;
|
|
|
|
|
2021-08-09 22:17:43 +08:00
|
|
|
disk->bdi = bdi_alloc(node_id);
|
|
|
|
if (!disk->bdi)
|
2022-07-28 00:22:57 +08:00
|
|
|
goto out_free_bioset;
|
2021-08-09 22:17:43 +08:00
|
|
|
|
2021-10-14 22:03:26 +08:00
|
|
|
/* bdev_alloc() might need the queue, set before the first call */
|
|
|
|
disk->queue = q;
|
|
|
|
|
2020-11-27 01:47:17 +08:00
|
|
|
disk->part0 = bdev_alloc(disk, 0);
|
|
|
|
if (!disk->part0)
|
2021-08-09 22:17:43 +08:00
|
|
|
goto out_free_bdi;
|
2020-11-26 16:23:26 +08:00
|
|
|
|
2020-09-01 02:02:37 +08:00
|
|
|
disk->node_id = node_id;
|
2021-05-25 14:12:56 +08:00
|
|
|
mutex_init(&disk->open_mutex);
|
2021-01-24 18:02:41 +08:00
|
|
|
xa_init(&disk->part_tbl);
|
|
|
|
if (xa_insert(&disk->part_tbl, 0, disk->part0, GFP_KERNEL))
|
|
|
|
goto out_destroy_part_tbl;
|
2020-09-01 02:02:37 +08:00
|
|
|
|
2023-02-15 02:33:06 +08:00
|
|
|
if (blkcg_init_disk(disk))
|
|
|
|
goto out_erase_part0;
|
|
|
|
|
block: Introduce zone write plugging
2024-04-08 09:41:07 +08:00
|
|
|
disk_init_zone_resources(disk);
|
2020-09-01 02:02:37 +08:00
|
|
|
rand_initialize_disk(disk);
|
|
|
|
disk_to_dev(disk)->class = &block_class;
|
|
|
|
disk_to_dev(disk)->type = &disk_type;
|
|
|
|
device_initialize(disk_to_dev(disk));
|
block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow deriving orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequence number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, makes it possible to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media changes,
i.e. on a DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 07:05:25 +08:00
|
|
|
inc_diskseq(disk);
|
2021-08-16 21:46:24 +08:00
|
|
|
q->disk = disk;
|
2021-08-16 21:19:05 +08:00
|
|
|
lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
|
2021-08-04 17:41:42 +08:00
|
|
|
#ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
|
|
|
|
INIT_LIST_HEAD(&disk->slave_bdevs);
|
|
|
|
#endif
|
2005-04-17 06:20:36 +08:00
|
|
|
return disk;
|
2020-09-01 02:02:37 +08:00
|
|
|
|
2023-02-15 02:33:06 +08:00
|
|
|
out_erase_part0:
|
|
|
|
xa_erase(&disk->part_tbl, 0);
|
2021-01-24 18:02:41 +08:00
|
|
|
out_destroy_part_tbl:
|
|
|
|
xa_destroy(&disk->part_tbl);
|
2021-10-02 17:23:02 +08:00
|
|
|
disk->part0->bd_disk = NULL;
|
2024-04-29 07:01:39 +08:00
|
|
|
bdev_drop(disk->part0);
|
2021-08-09 22:17:43 +08:00
|
|
|
out_free_bdi:
|
|
|
|
bdi_put(disk->bdi);
|
2022-07-28 00:22:57 +08:00
|
|
|
out_free_bioset:
|
|
|
|
bioset_exit(&disk->bio_split);
|
2020-09-01 02:02:37 +08:00
|
|
|
out_free_disk:
|
|
|
|
kfree(disk);
|
|
|
|
return NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2024-02-15 15:10:47 +08:00
|
|
|
struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node,
|
|
|
|
struct lock_class_key *lkclass)
|
2021-05-21 13:50:55 +08:00
|
|
|
{
|
2024-02-15 15:10:47 +08:00
|
|
|
struct queue_limits default_lim = { };
|
2021-05-21 13:50:55 +08:00
|
|
|
struct request_queue *q;
|
|
|
|
struct gendisk *disk;
|
|
|
|
|
2024-02-15 15:10:47 +08:00
|
|
|
q = blk_alloc_queue(lim ? lim : &default_lim, node);
|
2024-02-13 15:34:18 +08:00
|
|
|
if (IS_ERR(q))
|
2024-02-15 15:10:47 +08:00
|
|
|
return ERR_CAST(q);
|
2021-05-21 13:50:55 +08:00
|
|
|
|
2021-08-16 21:19:08 +08:00
|
|
|
disk = __alloc_disk_node(q, node, lkclass);
|
2021-05-21 13:50:55 +08:00
|
|
|
if (!disk) {
|
2022-06-19 14:05:51 +08:00
|
|
|
blk_put_queue(q);
|
2024-02-15 15:10:47 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2021-05-21 13:50:55 +08:00
|
|
|
}
|
2022-06-19 14:05:51 +08:00
|
|
|
set_bit(GD_OWNS_QUEUE, &disk->state);
|
2021-05-21 13:50:55 +08:00
|
|
|
return disk;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__blk_alloc_disk);
|
|
|
|
|
2020-06-20 04:47:23 +08:00
|
|
|
/**
|
|
|
|
* put_disk - decrements the gendisk refcount
|
2020-07-31 09:42:30 +08:00
|
|
|
* @disk: the struct gendisk to decrement the refcount for
|
2020-06-20 04:47:23 +08:00
|
|
|
*
|
|
|
|
* This decrements the refcount for the struct gendisk. When this reaches 0
|
|
|
|
* we'll have disk_release() called.
|
2020-06-20 04:47:25 +08:00
|
|
|
*
|
2022-07-20 21:05:41 +08:00
|
|
|
* Note: for blk-mq disk put_disk must be called before freeing the tag_set
|
|
|
|
* when handling probe errors (that is before add_disk() is called).
|
|
|
|
*
|
2020-06-20 04:47:25 +08:00
|
|
|
* Context: Any context, but the last reference must not be dropped from
|
|
|
|
* atomic context.
|
2020-06-20 04:47:23 +08:00
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
void put_disk(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
if (disk)
|
2020-11-10 14:25:37 +08:00
|
|
|
put_device(disk_to_dev(disk));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(put_disk);
|
|
|
|
|
2009-07-28 15:13:13 +08:00
|
|
|
static void set_disk_ro_uevent(struct gendisk *gd, int ro)
|
|
|
|
{
|
|
|
|
char event[] = "DISK_RO=1";
|
|
|
|
char *envp[] = { event, NULL };
|
|
|
|
|
|
|
|
if (!ro)
|
|
|
|
event[8] = '0';
|
|
|
|
kobject_uevent_env(&disk_to_dev(gd)->kobj, KOBJ_CHANGE, envp);
|
|
|
|
}
|
|
|
|
|
2021-01-09 18:42:51 +08:00
|
|
|
/**
|
|
|
|
* set_disk_ro - set a gendisk read-only
|
|
|
|
* @disk: gendisk to operate on
|
2021-01-29 12:55:05 +08:00
|
|
|
* @read_only: %true to set the disk read-only, %false to set the disk read/write
|
2021-01-09 18:42:51 +08:00
|
|
|
*
|
|
|
|
* This function is used to indicate whether a given disk device should have its
|
|
|
|
* read-only flag set. set_disk_ro() is typically used by device drivers to
|
|
|
|
* indicate whether the underlying physical device is write-protected.
|
|
|
|
*/
|
|
|
|
void set_disk_ro(struct gendisk *disk, bool read_only)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-01-09 18:42:51 +08:00
|
|
|
if (read_only) {
|
|
|
|
if (test_and_set_bit(GD_READ_ONLY, &disk->state))
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
if (!test_and_clear_bit(GD_READ_ONLY, &disk->state))
|
|
|
|
return;
|
2009-07-28 15:13:13 +08:00
|
|
|
}
|
2021-01-09 18:42:51 +08:00
|
|
|
set_disk_ro_uevent(disk, read_only);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(set_disk_ro);
|
|
|
|
|
block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media changes,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 07:05:25 +08:00
|
|
|
void inc_diskseq(struct gendisk *disk)
|
|
|
|
{
|
|
|
|
disk->diskseq = atomic64_inc_return(&diskseq);
|
|
|
|
}
|