linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-25 13:14:07 +08:00

Author	SHA1	Message	Date
Ming Lei	88022d7201	blk-mq: don't handle failure in .get_budget It is enough to just check if we can get the budget via .get_budget(). And we don't need to deal with device state change in .get_budget(). For SCSI, one issue to be fixed is that we have to call scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails to handle the request. And it isn't enough to simply call blk_mq_end_request() to do that if this request is marked as RQF_DONTPREP. Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq) Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-04 12:31:08 -06:00
Ming Lei	826a70a08b	SCSI: don't get target/host busy_count in scsi_mq_get_budget() It is very expensive to atomic_inc/atomic_dec the host wide counter of host->busy_count, and it should have been avoided via blk-mq's mechanism of getting driver tag, which uses the more efficient way of sbitmap queue. Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget() and don't run queue if the counter becomes zero, so IO hang may be caused if all requests are completed just before the current SCSI device is added to shost->starved_list. Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq) Reported-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-04 08:19:25 -06:00
Christoph Hellwig	e4f36b249b	block: fix peeking requests during PM We need to look for an active PM request until the next softbarrier instead of looking for the first non-PM request. Otherwise any cause of request reordering might starve the PM request(s). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-04 08:17:06 -06:00
Bart Van Assche	21e768b442	blk-mq: Make blk_mq_get_request() error path less confusing blk_mq_get_tag() can modify data->ctx. This means that in the error path of blk_mq_get_request() data->ctx should be passed to blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx() ignores its argument, this patch does not change any functionality. References: commit `1ad43c0078` ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 12:33:56 -06:00
weiping zhang	c2e82a2348	blk-mq: fix nr_requests wrong value when modify it from sysfs if blk-mq use "none" io scheduler, nr_request get a wrong value when input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get the smaller one min(nr, set->queue_depth), and then q->nr_request get a wrong value. Reproduce: echo none > /sys/block/nvme0n1/queue/scheduler echo 1000000 > /sys/block/nvme0n1/queue/nr_requests cat /sys/block/nvme0n1/queue/nr_requests 1000000 Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 12:21:06 -06:00
Jens Axboe	a116895fc7	cdrom: hide CONFIG_CDROM menu selection We don't need to expose this. The point is that drivers select the uniform CDROM layer, if they need it, the user should not have to make a conscious decision on whether to include this separately or not. Fixes: `2a750166a5` ("block: Rework drivers/cdrom/Makefile") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 11:02:15 -06:00
Christoph Hellwig	ea435e1b93	block: add a poll_fn callback to struct request_queue That we we can also poll non blk-mq queues. Mostly needed for the NVMe multipath code, but could also be useful elsewhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	8ddcd65325	block: introduce GENHD_FL_HIDDEN With this flag a driver can create a gendisk that can be used for I/O submission inside the kernel, but which is not registered as user facing block device. This will be useful for the NVMe multipath implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	517bf3c306	block: don't look at the struct device dev_t in disk_devt The hidden gendisks introduced in the next patch need to keep the dev field in their struct device empty so that udev won't try to create block device nodes for them. To support that rewrite disk_devt to look at the major and first_minor fields in the gendisk itself instead of looking into the struct device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	ef71de8b15	block: add a blk_steal_bios helper This helpers allows to bounce steal the uncompleted bios from a request so that they can be reissued on another path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	f421e1d9ad	block: provide a direct_make_request helper This helper allows reinserting a bio into a new queue without much overhead, but requires all queue limits to be the same for the upper and lower queues, and it does not provide any recursion preventions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	96222bcc73	block: add REQ_DRV bit Set aside a bit in the request/bio flags for driver use. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Christoph Hellwig	8977f56384	block: move REQ_NOWAIT This flag should be before the operation-specific REQ_NOUNMAP bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-03 10:31:48 -06:00
Jens Axboe	3e2cb3ad47	Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/block Pull NVMe changes from Christoph: "Below are the currently queue nvme updates for Linux 4.15. There are a few more things that could make it for this merge window, but I'd like to get things into linux-next, especially for the unlikely case that Linus decided to cut -rc8. Highlights: - support for SGLs in the PCIe driver (Chaitanya Kulkarni) - disable I/O schedulers for the admin queue (Israel Rukshin) - various Fibre Channel fixes and enhancements (James Smart) - various refactoring for better code sharing between transports (Sagi Grimberg and me) as well as lots of little bits from various contributors."	2017-11-03 10:28:51 -06:00
Minwoo Im	a806c6c81e	nvme: comment typo fixed in clearing AER a tiny typo fixed in a comment. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-03 10:02:20 +03:00
Arnd Bergmann	474f5da235	skd: use ktime_get_real_seconds() Like many storage drivers, skd uses an unsigned 32-bit number for interchanging the current time with the firmware. This will overflow in y2106 and is otherwise safe. However, the get_seconds() function is generally considered deprecated since the behavior is different between 32-bit and 64-bit architectures, and using it may indicate a bigger problem. To annotate that we've thought about this, let's add a comment here and migrate to the ktime_get_real_seconds() function that consistently returns a 64-bit number. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-02 08:27:21 -06:00
Arnd Bergmann	c091fbe9a2	block: fix CDROM dependency on BLK_DEV After the cdrom cleanup, I get randconfig warnings for some configurations: warning: (BLK_DEV_IDECD && BLK_DEV_SR) selects CDROM which has unmet direct dependencies (BLK_DEV) This adds an explicit BLK_DEV dependency for both drivers. The other drivers that select 'CDROM' already have this and don't need a change. Fixes: `2a750166a5` ("block: Rework drivers/cdrom/Makefile") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-02 08:26:55 -06:00
Keith Busch	3639efef8f	nvme: Remove unused headers Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:43 +01:00
James Smart	a96d4bd867	nvmet: fix fatal_err_work deadlock Below is a stack trace for an issue that was reported. What's happening is that the nvmet layer had it's controller kato timeout fire, which causes it to schedule its fatal error handler via the fatal_err_work element. The error handler is invoked, which calls the transport delete_ctrl() entry point, and as the transport tears down the controller, nvmet_sq_destroy ends up doing the final put on the ctlr causing it to enter its free routine. The ctlr free routine does a cancel_work_sync() on fatal_err_work element, which then does a flush_work and wait_for_completion. But, as the wait is in the context of the work element being flushed, its in a catch-22 and the thread hangs. [ 326.903131] nvmet: ctrl 1 keep-alive timer (15 seconds) expired! [ 326.909832] nvmet: ctrl 1 fatal error occurred! [ 327.643100] lpfc 0000:04:00.0: 0:6313 NVMET Defer ctx release xri x114 flg x2 [ 494.582064] INFO: task kworker/0:2:243 blocked for more than 120 seconds. [ 494.589638] Not tainted 4.14.0-rc1.James+ #1 [ 494.594986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 494.603718] kworker/0:2 D 0 243 2 0x80000000 [ 494.609839] Workqueue: events nvmet_fatal_error_handler [nvmet] [ 494.616447] Call Trace: [ 494.619177] __schedule+0x28d/0x890 [ 494.623070] schedule+0x36/0x80 [ 494.626571] schedule_timeout+0x1dd/0x300 [ 494.631044] ? dequeue_task_fair+0x592/0x840 [ 494.635810] ? pick_next_task_fair+0x23b/0x5c0 [ 494.640756] wait_for_completion+0x121/0x180 [ 494.645521] ? wake_up_q+0x80/0x80 [ 494.649315] flush_work+0x11d/0x1a0 [ 494.653206] ? wake_up_worker+0x30/0x30 [ 494.657484] __cancel_work_timer+0x10b/0x190 [ 494.662249] cancel_work_sync+0x10/0x20 [ 494.666525] nvmet_ctrl_put+0xa3/0x100 [nvmet] [ 494.671482] nvmet_sq_:q+0x64/0xd0 [nvmet] [ 494.676540] nvmet_fc_delete_target_queue+0x202/0x220 [nvmet_fc] [ 494.683245] nvmet_fc_delete_target_assoc+0x6d/0xc0 [nvmet_fc] [ 494.689743] nvmet_fc_delete_ctrl+0x137/0x1a0 [nvmet_fc] [ 494.695673] nvmet_fatal_error_handler+0x30/0x40 [nvmet] [ 494.701589] process_one_work+0x149/0x360 [ 494.706064] worker_thread+0x4d/0x3c0 [ 494.710148] kthread+0x109/0x140 [ 494.713751] ? rescuer_thread+0x380/0x380 [ 494.718214] ? kthread_park+0x60/0x60 Correct by having the fc transport convert to a different workq context for the actual controller teardown which may call the cancel_work_sync. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:42 +01:00
James Smart	2b632970da	nvme-fc: add dev_loss_tmo timeout and remoteport resume support When a remoteport is unregistered (connectivity lost), the following actions are taken: - the remoteport is marked DELETED - the time when dev_loss_tmo would expire is set in the remoteport - all controllers on the remoteport are reset. After a controller resets, it will stall in a RECONNECTING state waiting for one of the following: - the controller will continue to attempt reconnect per max_retries and reconnect_delay. As no remoteport connectivity, the reconnect attempt will immediately fail. If max reconnects has not been reached, a new reconnect_delay timer will be schedule. If the current time plus another reconnect_delay exceeds when dev_loss_tmo expires on the remote port, then the reconnect_delay will be shortend to schedule no later than when dev_loss_tmo expires. If max reconnect attempts are reached (e.g. ctrl_loss_tmo reached) or dev_loss_tmo ix exceeded without connectivity, the controller is deleted. - the remoteport is re-registered prior to dev_loss_tmo expiring. The resume of the remoteport will immediately attempt to reconnect each of its suspended controllers. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> [hch: updated to use nvme_delete_ctrl] Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:41 +01:00
James Smart	3cec7f9de4	nvme: allow controller RESETTING to RECONNECTING transition Transport will typically transition from LIVE to RESETTING when initially performing a reset or recovering from an error. Adding this transition allows a transport to transition to RECONNECTING when it checks/waits for connectivity then creates new transport connections and reinits the controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:40 +01:00
James Smart	96e2480105	nvme-fc: check connectivity before initiating reconnects Check remoteport connectivity before initiating reconnects Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:39 +01:00
James Smart	ac7fe82b6f	nvme-fc: add a dev_loss_tmo field to the remoteport Add a dev_loss_tmo value, paralleling the SCSI FC transport, for device connectivity loss. The transport initializes the value in the nvme_fc_register_remoteport() call. If the value is not set, a default of 60s is set. Add a new routine to the api, nvme_fc_set_remoteport_devloss() routine, which allows the lldd to dynamically update the value on an existing remoteport. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:34:36 +01:00
James Smart	44c6ec77e1	nvme-fc: change ctlr state assignments during reset/reconnect Clean up some of the controller state checks and add the RESETTING->RECONNECTING state transition. Specifically: - the movement of the RESETTING state change and schedule of reset_work to core doesn't work wiht nvme_fc_error_recovery setting state to RECONNECTING before attempting to reset. Remove the state change as the reset request does it. - In the rare cases where an error occurs right as we're transitioning to LIVE, defer the controller start actions. - In error handling on teardown of associations while performing initial controller creation - avoid quiesce calls on the admin_q. They are unneeded. - Add the RESETTING->RECONNECTING transition in the reset handler. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:23 +01:00
Sagi Grimberg	4054637c9b	nvme: flush reset_work before safely continuing with delete operation Prevent racing controller reset and delete flows. reset_work must not ever self-requeue so flushing it suffices. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:17 +01:00
Sagi Grimberg	12fa1304ba	nvme-rdma: reuse nvme_delete_ctrl when reconnect attempts expire instead of just queueing delete work Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:07 +01:00
Christoph Hellwig	6cd53d14aa	nvme: consolidate common code from ->reset_work No change in behavior except that the FC code cancels two work items a little later now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:06 +01:00
Christoph Hellwig	e9bc25874c	nvme-rdma: remove nvme_rdma_remove_ctrl It is only used in two places, and some of the work done by it will be taken into common code soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:05 +01:00
Christoph Hellwig	c5017e8570	nvme: move controller deletion to common code Move the ->delete_work and the associated helpers to common code instead of duplicating them in every driver. This also adds the missing reference get/put for the loop driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:04 +01:00
Christoph Hellwig	29c0964873	nvme-fc: merge __nvme_fc_schedule_delete_work into __nvme_fc_del_ctrl No need to have two functions doing the same thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:04 +01:00
James Smart	71c691fd06	nvme-fc: avoid workqueue flush stalls There's no need to wait for the full nvme_wq, which is now shared, to flush. flush only the delete_work item. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sgi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-11-01 16:28:03 +01:00
Bart Van Assche	2a750166a5	block: Rework drivers/cdrom/Makefile Instead of referring from inside drivers/cdrom/Makefile to all the drivers that use this driver, let these drivers select the cdrom driver. This change makes the cdrom build code follow the approach that is used for most other drivers, namely refer from the higher layers to the lower layer instead of from the lower layer to the higher layers. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:33:57 -06:00
Ming Lei	1f460b63d4	blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE SCSI restarts its queue in scsi_end_request() automatically, so we don't need to handle this case in blk-mq. Especailly any request won't be dequeued in this case, we needn't to worry about IO hang caused by restart vs. dispatch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:34 -06:00
Ming Lei	358a3a6bcc	blk-mq: don't handle TAG_SHARED in restart Now restart is used in the following cases, and TAG_SHARED is for SCSI only. 1) .get_budget() returns BLK_STS_RESOURCE - if resource in target/host level isn't satisfied, this SCSI device will be added in shost->starved_list, and the whole queue will be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any request initiated from this host/targe is completed. Forget to mention, host level resource can't be an issue for blk-mq at all. - the same is true if resource in the queue level isn't satisfied. - if there isn't outstanding request on this queue, then SCSI's RESTART can't work(blk-mq's can't work too), and the queue will be run after SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's RESTART when this request is finished 2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE - if there isn't onprogressing request on this queue, the queue will be run after SCSI_QUEUE_DELAY - otherwise, SCSI's RESTART covers the rerun. 3) blk_mq_get_driver_tag() failed - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver allocation. In one word, SCSI's built-in RESTART is enough to cover the queue rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart. In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when running I/O on these 8 luns concurrently. Aslo Roman Pen reported the current RESTART is very expensive especialy when there are lots of LUNs attached in one host, such as in his test, RESTART causes half of IOPS be cut. Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2 Fixes: `6d8c6c0f97` ("blk-mq: Restart a single queue if tag sets are shared") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:33 -06:00
Ming Lei	0df21c86bd	scsi: implement .get_budget and .put_budget for blk-mq We need to tell blk-mq to reserve resources before queuing one request, so implement these two callbacks. Then blk-mq can avoid to dequeue request too early, and IO merging can be improved a lot. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	aeec77629a	scsi: allow passing in null rq to scsi_prep_state_check() In the following patch, we will implement scsi_get_budget() which need to call scsi_prep_state_check() when rq isn't dequeued yet. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	b347689ffb	blk-mq-sched: improve dispatching from sw queue SCSI devices use host-wide tagset, and the shared driver tag space is often quite big. However, there is also a queue depth for each lun( .cmd_per_lun), which is often small, for example, on both lpfc and qla2xxx, .cmd_per_lun is just 3. So lots of requests may stay in sw queue, and we always flush all belonging to same hw queue and dispatch them all to driver. Unfortunately it is easy to cause queue busy because of the small .cmd_per_lun. Once these requests are flushed out, they have to stay in hctx->dispatch, and no bio merge can happen on these requests, and sequential IO performance is harmed. This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request from a sw queue, so that we can dispatch them in scheduler's way. We can then avoid dequeueing too many requests from sw queue, since we don't flush ->dispatch completely. This patch improves dispatching from sw queue by using the .get_budget and .put_budget callbacks. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	de14829740	blk-mq: introduce .get_budget and .put_budget in blk_mq_ops For SCSI devices, there is often a per-request-queue depth, which needs to be respected before queuing one request. Currently blk-mq always dequeues the request first, then calls .queue_rq() to dispatch the request to lld. One obvious issue with this approach is that I/O merging may not be successful, because when the per-request-queue depth can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch list. This means it never gets a chance to be merged with other IO. This patch introduces .get_budget and .put_budget callback in blk_mq_ops, then we can try to get reserved budget first before dequeuing request. If the budget for queueing I/O can't be satisfied, we don't need to dequeue request at all. Hence the request can be left in the IO scheduler queue, for more merging opportunities. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	63ba8e31c3	block: kyber: check if there are requests in ctx in kyber_has_work() There may be request in sw queue, and not fetched to domain queue yet, so check it in kyber_has_work(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	7930d0a00f	sbitmap: introduce __sbitmap_for_each_set() For blk-mq, we need to be able to iterate software queues starting from any queue in a round robin fashion, so introduce this helper. Reviewed-by: Omar Sandoval <osandov@fb.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	caf8eb0d60	blk-mq-sched: move actual dispatching into one helper So that it becomes easy to support to dispatch from sw queue in the following patch. No functional change. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Ming Lei	5e3d02bbaf	blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch When the hw queue is busy, we shouldn't take requests from the scheduler queue any more, otherwise it is difficult to do IO merge. This patch fixes the awful IO performance on some SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if hw queue is busy. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-11-01 08:20:02 -06:00
Scott Bauer	0e7d3a8d4e	MAINTAINERS: Remove Rafael from Opal maintainers. He is no longer working on storage. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-31 14:39:20 -06:00
Liang Chen	330a4db89d	bcache: explicitly destroy mutex while exiting mutex_destroy does nothing most of time, but it's better to call it to make the code future proof and it also has some meaning for like mutex debug. As Coly pointed out in a previous review, bcache_exit() may not be able to handle all the references properly if userspace registers cache and backing devices right before bch_debug_init runs and bch_debug_init failes later. So not exposing userspace interface until everything is ready to avoid that issue. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 15:57:54 -06:00
tang.junhui	c157313791	bcache: fix wrong cache_misses statistics Currently, Cache missed IOs are identified by s->cache_miss, but actually, there are many situations that missed IOs are not assigned a value for s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO (s->iop.bypass = 1), or the cache_bio allocate failed. In these situations, it will go to out_put or out_submit, and s->cache_miss is null, which leads bch_mark_cache_accounting() to treat this IO as a hit IO. [ML: applied by 3-way merge] Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 15:57:54 -06:00
Tang Junhui	d44c2f9e7c	bcache: update bucket_in_use in real time bucket_in_use is updated in gc thread which triggered by invalidating or writing sectors_to_gc dirty data, It's a long interval. Therefore, when we use it to compare with the threshold, it is often not timely, which leads to inaccurate judgment and often results in bucket depletion. We have send a patch before, by the means of updating bucket_in_use periodically In gc thread, which Coly thought that would lead high latency, In this patch, we add avail_nbuckets to record the count of available buckets, and we calculate bucket_in_use when alloc or free bucket in real time. [edited by ML: eliminated some whitespace errors] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 15:57:54 -06:00
Elena Reshetova	3b304d24a7	bcache: convert cached_dev.count from atomic_t to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable cached_dev.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 15:57:54 -06:00
Coly Li	d59b237959	bcache: only permit to recovery read error when cache device is clean When bcache does read I/Os, for example in writeback or writethrough mode, if a read request on cache device is failed, bcache will try to recovery the request by reading from cached device. If the data on cached device is not synced with cache device, then requester will get a stale data. For critical storage system like database, providing stale data from recovery may result an application level data corruption, which is unacceptible. With this patch, for a failed read request in writeback or writethrough mode, recovery a recoverable read request only happens when cache device is clean. That is to say, all data on cached device is up to update. For other cache modes in bcache, read request will never hit cached_dev_read_error(), they don't need this patch. Please note, because cache mode can be switched arbitrarily in run time, a writethrough mode might be switched from a writeback mode. Therefore checking dc->has_data in writethrough mode still makes sense. Changelog: V4: Fix parens error pointed by Michael Lyle. v3: By response from Kent Oversteet, he thinks recovering stale data is a bug to fix, and option to permit it is unnecessary. So this version the sysfs file is removed. v2: rename sysfs entry from allow_stale_data_on_failure to allow_stale_data_on_failure, and fix the confusing commit log. v1: initial patch posted. [small change to patch comment spelling by mlyle] Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Arne Wolf <awolf@lenovo.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Nix <nix@esperi.org.uk> Cc: Kai Krakow <hurikhan77@gmail.com> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 15:57:54 -06:00
Bart Van Assche	4e9b6f2082	block: Fix a race between blk_cleanup_queue() and timeout handling Make sure that if the timeout timer fires after a queue has been marked "dying" that the affected requests are finished. Reported-by: chenxiang (M) <chenxiang66@hisilicon.com> Fixes: commit `287922eb0b` ("block: defer timeouts to a workqueue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: chenxiang (M) <chenxiang66@hisilicon.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-10-30 13:28:10 -06:00
James Smart	ecad0d2cb8	nvme-fc: remove NVME_FC_MAX_SEGMENTS The define is an arbitrary limit to the io size on the initiator, capping the io to 1MB-4KB. Remove the define from the transport. I/O size will solely be limited by the LLDD sg limits. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2017-10-27 09:25:35 +03:00

1 2 3 4 5 ...

706550 Commits