Commit Graph

3565 Commits

Author SHA1 Message Date
Ming Lei
88022d7201 blk-mq: don't handle failure in .get_budget
It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:31:08 -06:00
Christoph Hellwig
e4f36b249b block: fix peeking requests during PM
We need to look for an active PM request until the next softbarrier
instead of looking for the first non-PM request.  Otherwise any cause
of request reordering might starve the PM request(s).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 08:17:06 -06:00
Bart Van Assche
21e768b442 blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:33:56 -06:00
weiping zhang
c2e82a2348 blk-mq: fix nr_requests wrong value when modify it from sysfs
if blk-mq use "none" io scheduler, nr_request get a wrong value when
input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
the smaller one min(nr, set->queue_depth), and then q->nr_request get a
wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:21:06 -06:00
Christoph Hellwig
ea435e1b93 block: add a poll_fn callback to struct request_queue
That we we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
8ddcd65325 block: introduce GENHD_FL_HIDDEN
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
517bf3c306 block: don't look at the struct device dev_t in disk_devt
The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
ef71de8b15 block: add a blk_steal_bios helper
This helpers allows to bounce steal the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig
f421e1d9ad block: provide a direct_make_request helper
This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Jens Axboe
3e2cb3ad47 Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/block
Pull NVMe changes from Christoph:

"Below are the currently queue nvme updates for Linux 4.15.  There are
a few more things that could make it for this merge window, but I'd
like to get things into linux-next, especially for the unlikely case
that Linus decided to cut -rc8.

Highlights:
 - support for SGLs in the PCIe driver (Chaitanya Kulkarni)
 - disable I/O schedulers for the admin queue (Israel Rukshin)
 - various Fibre Channel fixes and enhancements (James Smart)
 - various refactoring for better code sharing between transports
   (Sagi Grimberg and me)

as well as lots of little bits from various contributors."
2017-11-03 10:28:51 -06:00
Ming Lei
1f460b63d4 blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:34 -06:00
Ming Lei
358a3a6bcc blk-mq: don't handle TAG_SHARED in restart
Now restart is used in the following cases, and TAG_SHARED is for
SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if resource in target/host level isn't satisfied, this SCSI device
will be added in shost->starved_list, and the whole queue will be rerun
(via SCSI's built-in RESTART) in scsi_end_request() after any request
initiated from this host/targe is completed. Forget to mention, host level
resource can't be an issue for blk-mq at all.

- the same is true if resource in the queue level isn't satisfied.

- if there isn't outstanding request on this queue, then SCSI's RESTART
can't work(blk-mq's can't work too), and the queue will be run after
SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
RESTART when this request is finished

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't onprogressing request on this queue, the queue
will be run after SCSI_QUEUE_DELAY

- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
allocation.

In one word, SCSI's built-in RESTART is enough to cover the queue
rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
running I/O on these 8 luns concurrently.

Aslo Roman Pen reported the current RESTART is very expensive especialy
when there are lots of LUNs attached in one host, such as in his
test, RESTART causes half of IOPS be cut.

Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:33 -06:00
Ming Lei
b347689ffb blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
de14829740 blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
63ba8e31c3 block: kyber: check if there are requests in ctx in kyber_has_work()
There may be request in sw queue, and not fetched to domain queue
yet, so check it in kyber_has_work().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
caf8eb0d60 blk-mq-sched: move actual dispatching into one helper
So that it becomes easy to support to dispatch from sw queue in the
following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei
5e3d02bbaf blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch
When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.

This patch fixes the awful IO performance on some SCSI devices(lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Bart Van Assche
4e9b6f2082 block: Fix a race between blk_cleanup_queue() and timeout handling
Make sure that if the timeout timer fires after a queue has been
marked "dying" that the affected requests are finished.

Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
Fixes: commit 287922eb0b ("block: defer timeouts to a workqueue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: chenxiang (M) <chenxiang66@hisilicon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-30 13:28:10 -06:00
Jens Axboe
4d740bc9f0 mq-deadline: add 'deadline' as a name alias
The scheduler framework now supports looking up the appropriate
scheduler with the {name,mq} tupple. We can register mq-deadline
with the alias of 'deadline', so that switching to 'deadline'
will do the right thing based on the type of driver attached to
it.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:55 -06:00
Jens Axboe
8ac0d9a81e elevator: allow name aliases
Since we now lookup elevator types with the appropriate multiqueue
capability, allow schedulers to register with an alias alongside
the real name. This is in preparation for allowing 'mq-deadline'
to register an alias of 'deadline' as well.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:50 -06:00
Jens Axboe
2527d99789 elevator: lookup mq vs non-mq elevators
If an IO scheduler is selected via elevator= and it doesn't match
the driver in question wrt blk-mq support, then we fail to boot.

The elevator= parameter is deprecated and only supported for
non-mq devices. Augment the elevator lookup API so that we
pass in if we're looking for an mq capable scheduler or not,
so that we only ever return a valid type for the queue in
question.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:36:45 -06:00
Ilya Dryomov
d5ce4c31d6 block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
is set.  Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

  $ fallocate -zn -l 1k /dev/sdg
  fallocate: fallocate failed: Remote I/O error
  $ fallocate -zn -l 1k /dev/sdg  # OK
  $ fallocate -zn -l 1k /dev/sdg  # OK

The following calls succeed because sd_done() sets ->no_write_same in
response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
__blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
specified.  For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
sd_done() has just set ->no_write_same thus indicating lack of offload
support.

Fixes: c20cfc27a4 ("block: stop using blkdev_issue_write_same for zeroing")
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:28:23 -06:00
Ilya Dryomov
425a4dba79 block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:28:22 -06:00
Ilya Dryomov
bb749b31c2 block: move CAP_SYS_ADMIN check in blkdev_roset()
Check for CAP_SYS_ADMIN before calling into the driver, similar to
blkdev_flushbuf().  This is safer and can spare a check in the driver.

(Currently BLKROSET is overridden by md and rbd, rbd is missing the
check.  md has the check, but it covers a lot more than BLKROSET.)

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-25 12:25:00 -06:00
Dmitry Monakhov
351499a172 block: Invalidate cache on discard v2
It is reasonable drop page cache on discard, otherwise that pages may
be written by writeback second later, so thin provision devices will
not be happy. This seems to be a  security leak in case of secure discard case.

Also add check for queue_discard flag on early stage.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-24 18:44:57 -06:00
Sagi Grimberg
dab7487bdf block: remove blk_mq_reinit_tagset
No callers left.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18 19:27:49 +02:00
Sagi Grimberg
149e10f8ff block: introduce blk_mq_tagset_iter
Iterator helper to apply a function on all the
tags in a given tagset. export it as it will be used
outside the block layer later on.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18 19:27:48 +02:00
Omar Sandoval
8cf4666020 kyber: fix hang on domain token wait queue
When we're getting a domain token, if we fail to get a token on our
first attempt, we put the current hardware queue on a wait queue and
then try again just in case a token was freed after our initial attempt
but before we got on the wait queue. If this second attempt succeeds, we
currently leave the hardware queue on the wait queue. Usually this is
okay; we'll just run the hardware queue one extra time when another
token is freed. However, if the hardware queue doesn't have any other
requests waiting, then when it it gets the extra wakeup, it won't have
anything to free and therefore won't wake up any other hardware queues.
If tokens are limited, then we won't make forward progress and the
device will hang.

Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-17 16:18:11 -06:00
Randy Dunlap
519c8e9ffd block: fix Sphinx kernel-doc warning
Sphinx treats symbols that end with '_' as a kind of special
documentation indicator, so fix that by adding an ending '*'
to it.

../block/bio.c:404: ERROR: Unknown target name: "gfp".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16 13:00:12 -06:00
Shaohua Li
85acb3ba2f block: set request_list for request
Legacy queue sets request's request_list, mq doesn't. This makes mq does
the same thing, so we can find cgroup of a request. Note, we really
only use blkg field of request_list, it's pointless to allocate mempool
for request_list in mq case.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:48:16 -06:00
Shaohua Li
eca8b53a67 blk-stat: delete useless code
Fix two issues:
- the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except
  sum it to global stat. We can do the calculation there. The flush just
  wastes cpu time.
- some fields are signed int/s64. I don't see the point.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:48:14 -06:00
Jiufei Xue
53cfdc10a9 blk-throttle: fix null pointer dereference while throttling writeback IOs
A null pointer dereference can occur when blkcg is removed manually
with writeback IOs inflight. This is caused by the following case:

Writeback kworker submit the bio and set bio->bi_cg_private to tg
in blk_throtl_assoc_bio.
Then we remove the block cgroup manually, the blkg and tg would be
freed if there is no request inflight.
When the submitted bio come back, blk_throtl_bio_endio() fetch the tg
which was already freed.

Fix this by increasing the refcount of blkg in funcion
blk_throtl_assoc_bio() so that the blkg will not be freed until the
bio_endio called.

Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:09:34 -06:00
weiping zhang
58a9edce0a blkcg: check pol->cpd_free_fn before free cpd
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 09:04:47 -06:00
Paolo Valente
99fead8d38 block, bfq: fix unbalanced decrements of burst size
The commit "block, bfq: decrease burst size when queues in burst
exit" introduced the decrement of burst_size on the removal of a
bfq_queue from the burst list. Unfortunately, this decrement can
happen to be performed even when burst size is already equal to 0,
because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of the
way how this wrong assumption leads to unbalanced decrements.

The wrong assumption is that a bfq_queue can exit only if the process
associated with the bfq_queue has exited. This is false, because a
bfq_queue, say Q, may exit also as a consequence of a merge with
another bfq_queue. In this case, Q exits because the I/O of its
associated process has been redirected to another bfq_queue.

The decrement unbalance occurs because Q may then be re-created after
a split, and added back to the current burst list, *without*
incrementing burst_size. burst_size is not incremented because Q is
not a new bfq_queue added to the burst list, but a bfq_queue only
temporarily removed from the list, and, before the commit "bfq-sq,
bfq-mq: decrease burst size when queues in burst exit", burst_size was
not decremented when Q was removed.

This commit addresses this issue by just checking whether the exiting
bfq_queue is a merged bfq_queue, and, in that case, not decrementing
burst_size. Unfortunately, this still leaves room for unbalanced
decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than that it was
removed from when merged. If this happens, the number of elements in
the new burst list becomes higher than burst_size (by one). When the
bfq_queue then exits, it is of course not in a merged state any
longer, thus burst_size is decremented, which results in an unbalanced
decrement.  To handle this sporadic, unlucky case in a simple way,
this commit also checks that burst_size is larger than 0 before
decrementing it.

Finally, this commit removes an useless, extra check: the check that
the bfq_queue is sync, performed before checking whether the bfq_queue
is in the burst list. This extra check is redundant, because only sync
bfq_queues can be inserted into the burst list.

Fixes: 7cb04004fa ("block, bfq: decrease burst size when queues in burst exit")
Reported-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Philip Müller <philm@manjaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09 09:54:58 -06:00
Luca Miccio
b5dc5d4d1f block,bfq: Disable writeback throttling
Similarly to CFQ, BFQ has its write-throttling heuristics, and it
is better not to combine them with further write-throttling
heuristics of a different nature.
So this commit disables write-back throttling for a device if BFQ
is used as I/O scheduler for that device.

Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09 08:29:21 -06:00
Tim Hansen
4078def82f block/bio: Remove null checks before mempool_destroy in bioset_free
This patch removes redundant checks for null values on bio_pool and
bvec_pool.

Found using make coccicheck M=block/ on linux-net tree on the
next-20170929 tag.

Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06 13:03:14 -06:00
Tim Hansen
4b14a5c5d5 block: remove unnecessary NULL checks in bioset_integrity_free()
mempool_destroy() already checks for a NULL value being passed in, this
eliminates duplicate checks.

This was caught by running make coccicheck M=block/ on linus' tree on
commit 77ede3a014 (current head as of this
patch).

Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06 13:03:12 -06:00
Christoph Hellwig
5fdee2127f block: remove QUEUE_FLAG_STACKABLE
We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-05 15:22:59 -06:00
Jens Axboe
fc13457f74 blk-mq: document the need to have STARTED and COMPLETED share a byte
For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:22:24 -06:00
Peter Zijlstra
a7af0af321 blk-mq: attempt to fix atomic flag memory ordering
Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Fixes: 538b753418 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:20:11 -06:00
Christoph Hellwig
9c9883744d block: move __elv_next_request to blk-core.c
No need to have this helper inline in a header.  Also drop the __ prefix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:43:04 -06:00
Paolo Valente
7cb04004fa block, bfq: decrease burst size when queues in burst exit
If many queues belonging to the same group happen to be created
shortly after each other, then the concurrent processes associated
with these queues have typically a common goal, and they get it done
as soon as possible if not hampered by device idling.  Examples are
processes spawned by git grep, or by systemd during boot. As for
device idling, this mechanism is currently necessary for weight
raising to succeed in its goal: privileging I/O.  In view of these
facts, BFQ does not provide the above queues with either weight
raising or device idling.

On the other hand, a burst of queue creations may be caused also by
the start-up of a complex application. In this case, these queues need
usually to be served one after the other, and as quickly as possible,
to maximise responsiveness. Therefore, in this case the best strategy
is to weight-raise all the queues created during the burst, i.e., the
exact opposite of the strategy for the above case.

To distinguish between the two cases, BFQ uses an empirical burst-size
threshold, found through extensive tests and monitoring of daily
usage. Only large bursts, i.e., burst with a size above this
threshold, are considered as generated by a high number of parallel
processes. In this respect, upstart-based boot proved to be rather
hard to detect as generating a large burst of queue creations, because
with upstart most of the queues created in a burst exit *before* the
next queues in the same burst are created. To address this issue, I
changed the burst-detection mechanism so as to not decrease the size
of the current burst even if one of the queues in the burst is
eliminated.

Unfortunately, this missing decrease causes false positives on very
fast systems: on the start-up of a complex application, such as
libreoffice writer, so many queues are created, served and exited
shortly after each other, that a large burst of queue creations is
wrongly detected as occurring. These false positives just disappear if
the size of a burst is decreased when one of the queues in the burst
exits. This commit restores the missing burst-size decrease, relying
of the fact that upstart is apparently unlikely to be used on systems
running this and future versions of the kernel.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:58 -06:00
Paolo Valente
894df937e0 block, bfq: let early-merged queues be weight-raised on split too
A just-created bfq_queue, say Q, may happen to be merged with another
bfq_queue on the very first invocation of the function
__bfq_insert_request. In such a case, even if Q would clearly deserve
interactive weight raising (as it has just been created), the function
bfq_add_request does not make it to be invoked for Q, and thus to
activate weight raising for Q. As a consequence, when the state of Q
is saved for a possible future restore, after a split of Q from the
other bfq_queue(s), such a state happens to be (unjustly)
non-weight-raised. Then the bfq_queue will not enjoy any weight
raising on the split, even if should still be in an interactive
weight-raising period when the split occurs.

This commit solves this problem as follows, for a just-created
bfq_queue that is being early-merged: it stores directly, in the saved
state of the bfq_queue, the weight-raising state that would have been
assigned to the bfq_queue if not early-merged.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:57 -06:00
Paolo Valente
3e2bdd6dff block, bfq: check and switch back to interactive wr also on queue split
As already explained in the message of commit "block, bfq: fix
wrong init of saved start time for weight raising", if a soft
real-time weight-raising period happens to be nested in a larger
interactive weight-raising period, then BFQ restores the interactive
weight raising at the end of the soft real-time weight raising. In
particular, BFQ checks whether the latter has ended only on request
dispatches.

Unfortunately, the above scheme fails to restore interactive weight
raising in the following corner case: if a bfq_queue, say Q,
1) Is merged with another bfq_queue while it is in a nested soft
real-time weight-raising period. The weight-raising state of Q is
then saved, and not considered any longer until a split occurs.
2) Is split from the other bfq_queue(s) at a time instant when its
soft real-time weight raising is already finished.
On the split, while resuming the previous, soft real-time
weight-raised state of the bfq_queue Q, BFQ checks whether the
current soft real-time weight-raising period is actually over. If so,
BFQ switches weight raising off for Q, *without* checking whether the
soft real-time period was actually nested in a non-yet-finished
interactive weight-raising period.

This commit addresses this issue by adding the above missing check in
bfq_queue splits, and restoring interactive weight raising if needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:56 -06:00
Paolo Valente
4baa8bb13f block, bfq: fix wrong init of saved start time for weight raising
This commit fixes a bug that causes bfq to fail to guarantee a high
responsiveness on some drives, if there is heavy random read+write I/O
in the background. More precisely, such a failure allowed this bug to
be found [1], but the bug may well cause other yet unreported
anomalies.

BFQ raises the weight of the bfq_queues associated with soft real-time
applications, to privilege the I/O, and thus reduce latency, for these
applications. This mechanism is named soft-real-time weight raising in
BFQ. A soft real-time period may happen to be nested into an
interactive weight raising period, i.e., it may happen that, when a
bfq_queue switches to a soft real-time weight-raised state, the
bfq_queue is already being weight-raised because deemed interactive
too. In this case, BFQ saves in a special variable
wr_start_at_switch_to_srt, the time instant when the interactive
weight-raising period started for the bfq_queue, i.e., the time
instant when BFQ started to deem the bfq_queue interactive. This value
is then used to check whether the interactive weight-raising period
would still be in progress when the soft real-time weight-raising
period ends.  If so, interactive weight raising is restored for the
bfq_queue. This restore is useful, in particular, because it prevents
bfq_queues from losing their interactive weight raising prematurely,
as a consequence of spurious, short-lived soft real-time
weight-raising periods caused by wrong detections as soft real-time.

If, instead, a bfq_queue switches to soft-real-time weight raising
while it *is not* already in an interactive weight-raising period,
then the variable wr_start_at_switch_to_srt has no meaning during the
following soft real-time weight-raising period. Unfortunately the
handling of this case is wrong in BFQ: not only the variable is not
flagged somehow as meaningless, but it is also set to the time when
the switch to soft real-time weight-raising occurs. This may cause an
interactive weight-raising period to be considered mistakenly as still
in progress, and thus a spurious interactive weight-raising period to
start for the bfq_queue, at the end of the soft-real-time
weight-raising period. In particular the spurious interactive
weight-raising period will be considered as still in progress, if the
soft-real-time weight-raising period does not last very long. The
bfq_queue will then be wrongly privileged and, if I/O bound, will
unjustly steal bandwidth to truly interactive or soft real-time
bfq_queues, harming responsiveness and low latency.

This commit fixes this issue by just setting wr_start_at_switch_to_srt
to minus infinity (farthest past time instant according to jiffies
macros): when the soft-real-time weight-raising period ends, certainly
no interactive weight-raising period will be considered as still in
progress.

[1] Background I/O Type: Random - Background I/O mix: Reads and writes
- Application to start: LibreOffice Writer in
http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:40:54 -06:00
Jens Axboe
7beb2f845b blk-mq: wire up completion notifier for laptop mode
For some reason, the laptop mode IO completion notified was never wired
up for blk-mq. Ensure that we trigger the callback appropriately, to arm
the laptop mode flush timer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:38:17 -06:00
Jens Axboe
5385fa47d8 blk-mq-tag: kill unused tag enums
We don't have any notion of a tagging cache anymore, and haven't
for a long time. Kill off the unused enums.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-01 01:26:21 -06:00
weiping zhang
547248736a blk-mq: remove unused function hctx_allow_merges
since 9bddeb2a5b "blk-mq: make per-sw-queue bio merge as default .bio_merge"
there is no caller for this function.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-30 02:17:37 -06:00
Shaohua Li
af551fb3be blkcg: delete unused APIs
Nobody uses the APIs right now.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-26 07:41:22 -06:00
Shaohua Li
f5c156c4c2 block: fix a crash caused by wrong API
part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2("block: pass in queue to inflight accounting")
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00