Remove a race condition which causes a warning in disk_clear_events. This
is a race between disk_clear_events() and disk_flush_events().
ev->clearing will be altered by disk_flush_events() even though we are
blocking event checking through disk_flush_events(). If this happens
after ev->clearing was cleared for disk_clear_events(), this can cause the
WARN_ON_ONCE() in that function to be triggered.
This change also has disk_clear_events() not go through a workqueue.
Since we have to wait for the work to complete, we should just call the
function directly. Also, since this work cannot be put on a freezable
workqueue, it will have to contend with increased demand, so calling the
function directly avoids this.
[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In disk_clear_events, do not put work on system_nrt_freezable_wq.
Instead, put it on system_nrt_wq.
There is a race between probing a usb and suspending the device. Since
probing a usb calls disk_clear_events, which puts work on a frozen
workqueue, probing cannot finish after the workqueue is frozen. However,
suspending cannot finish until the usb probe is finished, so we get a
deadlock, causing the system to reboot.
The way to reproduce this bug is to wake up from suspend with a usb
storage device plugged in, or plugging in a usb storage device right
before suspend. The window of time is on the order of time it takes to
probe the usb device. As long as the workqueues are frozen before the
call to add_disk within sd_probe_async finishes, there will be a deadlock
(which calls blkdev_get, sd_open, check_disk_change, then
disk_clear_events). This is not difficult to reproduce after figuring out
the timings.
[akpm@linux-foundation.org: fix up comment]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull block layer core updates from Jens Axboe:
"Here are the core block IO bits for 3.8. The branch contains:
- The final version of the surprise device removal fixups from Bart.
- Don't hide EFI partitions under advanced partition types. It's
fairly wide spread these days. This is especially dangerous for
systems that have both msdos and efi partition tables, where you
want to keep them in sync.
- Cleanup of using -1 instead of the proper NUMA_NO_NODE
- Export control of bdi flusher thread CPU mask and default to using
the home node (if known) from Jeff.
- Export unplug tracepoint for MD.
- Core improvements from Shaohua. Reinstate the recursive merge, as
the original bug has been fixed. Add plugging for discard and also
fix a problem handling non pow-of-2 discard limits.
There's a trivial merge in block/blk-exec.c due to a fix that went
into 3.7-rc at a later point than -rc4 where this is based."
* 'for-3.8/core' of git://git.kernel.dk/linux-block:
block: export block_unplug tracepoint
block: add plug for blkdev_issue_discard
block: discard granularity might not be power of 2
deadline: Allow 0ms deadline latency, increase the read speed
partitions: enable EFI/GPT support by default
bsg: Remove unused function bsg_goose_queue()
block: Make blk_cleanup_queue() wait until request_fn finished
block: Avoid scheduling delayed work on a dead queue
block: Avoid that request_fn is invoked on a dead queue
block: Let blk_drain_queue() caller obtain the queue lock
block: Rename queue dead flag
bdi: add a user-tunable cpu_list for the bdi flusher threads
block: use NUMA_NO_NODE instead of -1
block: recursive merge requests
block CFQ: avoid moving request to different queue
Last post of this patch appears lost, so I resend this.
Now discard merge works, add plug for blkdev_issue_discard. This will help
discard request merge especially for raid0 case. In raid0, a big discard
request is split to small requests, and if correct plug is added, such small
requests can be merged in low layer.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In MD raid case, discard granularity might not be power of 2, for example, a
4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for
such cases.
Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup changes from Tejun Heo:
"A lot of activities on cgroup side. The big changes are focused on
making cgroup hierarchy handling saner.
- cgroup_rmdir() had peculiar semantics - it allowed cgroup
destruction to be vetoed by individual controllers and tried to
drain refcnt synchronously. The vetoing never worked properly and
caused good deal of contortions in cgroup. memcg was the last
reamining user. Michal Hocko removed the usage and cgroup_rmdir()
path has been simplified significantly. This was done in a
separate branch so that the memcg people can base further memcg
changes on top.
- The above allowed cleaning up cgroup lifecycle management and
implementation of generic cgroup iterators which are used to
improve hierarchy support.
- cgroup_freezer updated to allow migration in and out of a frozen
cgroup and handle hierarchy. If a cgroup is frozen, all descendant
cgroups are frozen.
- netcls_cgroup and netprio_cgroup updated to handle hierarchy
properly.
- Various fixes and cleanups.
- Two merge commits. One to pull in memcg and rmdir cleanups (needed
to build iterators). The other pulled in cgroup/for-3.7-fixes for
device_cgroup fixes so that further device_cgroup patches can be
stacked on top."
Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)
* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
cgroup: update Documentation/cgroups/00-INDEX
cgroup_rm_file: don't delete the uncreated files
cgroup: remove subsystem files when remounting cgroup
cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
cgroup: warn about broken hierarchies only after css_online
cgroup: list_del_init() on removed events
cgroup: fix lockdep warning for event_control
cgroup: move list add after list head initilization
netprio_cgroup: allow nesting and inherit config on cgroup creation
netprio_cgroup: implement netprio[_set]_prio() helpers
netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
netprio_cgroup: reimplement priomap expansion
netprio_cgroup: shorten variable names in extend_netdev_table()
netprio_cgroup: simplify write_priomap()
netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
cgroup: remove obsolete guarantee from cgroup_task_migrate.
cgroup: add cgroup->id
cgroup, cpuset: remove cgroup_subsys->post_clone()
cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
...
Change a timer compare from after to after-equals, thus allowing
0 timeout and making deadline schedule FIFO.
Signed-off-by: xiaobing tu <xiaobing.tu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The Kconfig currently enables MSDOS partitions by default because they
are assumed to be essential, but it's necessary to enable "advanced
partition selection" in order to get GPT support. IMO GPT partitions
are becoming common enought to deserve the same treatment MSDOS
partitions get.
(Side note: I got bit by a disk that had MSDOS and GPT partition
tables, but for some reason the MSDOS table was different from the
GPT one. I was stupid enought to disable "advanced partition
selection" in my .config, which disabled GPT partitioning and made
my btrfs pool unbootable because it couldn't find the partitions)
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function bsg_goose_queue() does not have any in-tree callers,
so let's remove it.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some request_fn implementations, e.g. scsi_request_fn(), unlock
the queue lock internally. This may result in multiple threads
executing request_fn for the same queue simultaneously. Keep
track of the number of active request_fn calls and make sure that
blk_cleanup_queue() waits until all active request_fn invocations
have finished. A block driver may start cleaning up resources
needed by its request_fn as soon as blk_cleanup_queue() finished,
so blk_cleanup_queue() must wait for all outstanding request_fn
invocations to finish.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Chanho Min <chanho.min@lge.com>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Running a queue must continue after it has been marked dying until
it has been marked dead. So the function blk_run_queue_async() must
not schedule delayed work after blk_cleanup_queue() has marked a queue
dead. Hence add a test for that queue state in blk_run_queue_async()
and make sure that queue_unplugged() invokes that function with the
queue lock held. This avoids that the queue state can change after
it has been tested and before mod_delayed_work() is invoked. Drop
the queue dying test in queue_unplugged() since it is now
superfluous: __blk_run_queue() already tests whether or not the
queue is dead.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A block driver may start cleaning up resources needed by its
request_fn as soon as blk_cleanup_queue() finished, so request_fn
must not be invoked after draining finished. This is important
when blk_run_queue() is invoked without any requests in progress.
As an example, if blk_drain_queue() and scsi_run_queue() run in
parallel, blk_drain_queue() may have finished all requests after
scsi_run_queue() has taken a SCSI device off the starved list but
before that last function has had a chance to run the queue.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Chanho Min <chanho.min@lge.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let the caller of blk_drain_queue() obtain the queue lock to improve
readability of the patch called "Avoid that request_fn is invoked on
a dead queue".
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
stop. After this flag has been set queue draining starts. However,
during the queue draining phase it is still safe to invoke the
queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
flag.
This patch has been generated by running the following command
over the kernel source tree:
git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
-e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
include/linux/blkdev.h; \
sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
-e 's/Dead queue/A dying queue/' block/blk-core.c
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After we've done __elv_add_request() and __blk_run_queue() in
blk_execute_rq_nowait(), the request might finish and be freed
immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
isn't safe afterwards, because if it isn't, rq might be gone.
Instead, check beforehand and stash the result in a temporary.
This fixes crashes in blk_execute_rq_nowait() I get occasionally when
running with lots of memory debugging options enabled -- I think this
race is usually harmless because the window for rq to be reallocated
is so small.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The MSDOS/MBR partition table includes a 32-bit unique ID, often referred
to as the NT disk signature. When combined with a partition number within
the table, this can form a unique ID similar in concept to EFI/GPT's
partition UUID. Constructing and recording this value in struct
partition_meta_info allows MSDOS partitions to be referred to on the
kernel command-line using the following syntax:
root=PARTUUID=0002dd75-01
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will allow other types of UUID to be stored here, aside from true
UUIDs. This also simplifies code that uses this field, since it's usually
constructed from a, used as a, or compared to other, strings.
Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
/PARTNROFF option was found to be present. However, this modifies the
input string, and causes subsequent calls to devt_from_partuuid() not to
see the /PARTNROFF option, which causes different results. In order to
avoid misleading future maintainers, this parameter is marked const.
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
In a workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,....
When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2
and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged.
If we do recursive merge for such interleave access, some workloads throughput
get improvement. A recent worload I'm checking on is swap, below change
boostes the throughput around 5% ~ 10%.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
request is queued in cfqq->fifo list. Looks it's possible we are moving a
request from one cfqq to another in request merge case. In such case, adjusting
the fifo list order doesn't make sense and is impossible if we don't iterate
the whole fifo list.
My test does hit one case the two cfqq are different, but didn't cause kernel
crash, maybe it's because fifo list isn't used frequently. Anyway, from the
code logic, this is buggy.
I thought we can re-enable the recusive merge logic after this is fixed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top. This pull created a trivial conflict between the
following two commits.
8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")
The former added a field to cgroup_subsys and the latter removed one
from it. They happen to be colocated causing the conflict. Keeping
what's added and removing what's removed resolves the conflict.
Signed-off-by: Tejun Heo <tj@kernel.org>
All ->pre_destory() implementations return 0 now, which is the only
allowed return value. Make it return void.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
My workload is a raid5 which had 16 disks. And used our filesystem to
write using direct-io mode.
I used the blktrace to find those message:
8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
8,16 0 0 2.453853661 0 m N cfq2579 insert_request
8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453854439 0 m N cfq2579 insert_request
8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
8,16 0 0 2.454795160 0 m N cfq schedule dispatch
From above messages,we can find rq[W 7493144 + 104] and rq[W
7493120 + 24] do not merge.
Because the bio order is:
8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
The bio(7493144) first and bio(7493120) later.So the subsequent
bios will be divided into two parts.
When flushing plug-list,because elv_attempt_insert_merge only support
backmerge,not supporting frontmerge.
So rq[7493120 + 24] can't merge with rq[7493144 + 104].
From my test,i found those situation can count 25% in our system.
Using this patch, there is no this situation.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
CC:Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This config item has not carried much meaning for a while now and is
almost always enabled by default. As agreed during the Linux kernel
summit, remove it.
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__blk_queue_next_rl() finds next request list based on blkg_list
while skipping root_blkg in the list.
OTOH, root_rl is special as it may exist even without root_blkg.
Though the later part of the function handles such a case correctly,
exiting early is good for readability of the code.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block IO update from Jens Axboe:
"Core block IO bits for 3.7. Not a huge round this time, it contains:
- First series from Kent cleaning up and generalizing bio allocation
and freeing.
- WRITE_SAME support from Martin.
- Mikulas patches to prevent O_DIRECT crashes when someone changes
the block size of a device.
- Make bio_split() work on data-less bio's (like trim/discards).
- A few other minor fixups."
Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton. It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6: "mm: replace vma prio_tree with an interval tree").
So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.
* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
block: makes bio_split support bio without data
scatterlist: refactor the sg_nents
scatterlist: add sg_nents
fs: fix include/percpu-rwsem.h export error
percpu-rw-semaphore: fix documentation typos
fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
blockdev: turn a rw semaphore into a percpu rw semaphore
Fix a crash when block device is read and block size is changed at the same time
block: fix request_queue->flags initialization
block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
block: ioctl to zero block ranges
block: Make blkdev_issue_zeroout use WRITE SAME
block: Implement support for WRITE SAME
block: Consolidate command flag and queue limit checks for merges
block: Clean up special command handling logic
block/blk-tag.c: Remove useless kfree
block: remove the duplicated setting for congestion_threshold
block: reject invalid queue attribute values
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
...
Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.
People at least seem to agree that the broken hierarhcy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.
This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."
* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
In some usage scenarios it is desireable to work with disk images or
virtualized DASD devices. One problem that prevents such applications
is the partition detection in ibm.c. Currently it works only for
devices that support the BIODASDINFO2 ioctl, in other words, it only
works for devices that belong to the DASD device driver.
The information gained from the BIODASDINFO2 ioctl is only for a small
set of legacy cases abolutely necessary. All current VOL1, LNX1 and
CMS1 type of disk labels can be interpreted correctly without this
information, as long as the generic HDIO_GETGEO ioctl works and
provides a correct disk geometry.
This patch makes the ibm.c partition detection as independent as
possible from the BIODASDINFO2 ioctl. Only the following two cases are
still restricted to real DASDs:
- An FBA DASD, or LDL formatted ECKD DASD without any disk label.
- An old style LNX1 label (without large volume support) on a disk
with inconsistent device geometry.
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
A queue newly allocated with blk_alloc_queue_node() has only
QUEUE_FLAG_BYPASS set. For request-based drivers,
blk_init_allocated_queue() is called and q->queue_flags is overwritten
with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
initial bypass is still in effect.
In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags
instead of overwriting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
request_queues bypassed on allocation to avoid switching on and off
bypass mode on a queue being initialized. Some drivers allocate and
then destroy a lot of queues without fully initializing them and
incurring bypass latency overhead on each of them could add upto
significant overhead.
Unfortunately, blk_init_allocated_queue() is never used by queues of
bio-based drivers, which means that all bio-based driver queues are in
bypass mode even after initialization and registration complete
successfully.
Due to the limited way request_queues are used by bio drivers, this
problem is hidden pretty well but it shows up when blk-throttle is
used in combination with a bio-based driver. Trying to configure
(echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
indefinitely in blkg_conf_prep() waiting for bypass mode to end.
This patch moves the initial blk_queue_bypass_end() call from
blk_init_allocated_queue() to blk_register_queue() which is called for
any userland-visible queues regardless of its type.
I believe this is correct because I don't think there is any block
driver which needs or wants working elevator and blk-cgroup on a queue
which isn't visible to userland. If there are such users, we need a
different solution.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
Cc: stable@vger.kernel.org
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by
way of blkdev_issue_zeroout().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the device supports WRITE SAME, use that to optimize zeroing of
blocks. If the device does not support WRITE SAME or if the operation
fails, fall back to writing zeroes the old-fashioned way.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The WRITE SAME command supported on some SCSI devices allows the same
block to be efficiently replicated throughout a block range. Only a
single logical block is transferred from the host and the storage device
writes the same data to all blocks described by the I/O.
This patch implements support for WRITE SAME in the block layer. The
blkdev_issue_write_same() function can be used by filesystems and block
drivers to replicate a buffer across a block range. This can be used to
efficiently initialize software RAID devices, etc.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- blk_check_merge_flags() verifies that cmd_flags / bi_rw are
compatible. This function is called for both req-req and req-bio
merging.
- blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
to query the maximum sector count for a given request or queue. The
calls will return the right value from the queue limits given the
type of command (RW, discard, write same, etc.)
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.
bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
65536 should be ludicrous anyway but without it we overflow the
memory computation doing the allocation and badness occurs.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.
These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.
This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.
* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.
* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.
For now, start with a single warning message. We can whine louder
later on.
v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Remove useless kfree() and clean up code related to the removal.
The semantic patch that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
position p1,p2;
expression x;
@@
if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; }
@unchanged exists@
position r.p1,r.p2;
expression e <= r.x,x,e1;
iterator I;
statement S;
@@
if (x@p1 == NULL) { ... when != I(x,...) S
when != e = e1
when != e += e1
when != e -= e1
when != ++e
when != --e
when != e++
when != e--
when != &e
kfree@p2(x); ... return ...; }
@ok depends on unchanged exists@
position any r.p1;
position r.p2;
expression x;
@@
... when != true x@p1 == NULL
kfree@p2(x);
@depends on !ok && unchanged@
position r.p2;
expression x;
@@
*kfree@p2(x);
// </smpl>
Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before call the blk_queue_congestion_threshold(),
the blk_queue_congestion_threshold() is already called at blk_queue_make_rquest().
Because this code is the duplicated, it has removed.
Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of using simple_strtoul which "converts" invalid numbers to 0,
use strict_strtoul and perform error checking to ensure that userspace
passes us a valid unsigned long. This addresses problems with functions
such as writev, which might want to write a trailing newline -- the
newline should rightfully be rejected, but the value preceeding it
should be preserved.
Fixes BZ#46981.
Signed-off-by: Dave Reisner <dreisner@archlinux.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().
This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.
This will also help in a later patch changing how bio cloning works.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.
This patch also makes bio_free() static, since without bi_destructor
there should be no need for it to be called anywhere else.
bio_free() is now only called from bio_put, so we can refactor those a
bit - move some code from bio_put() to bio_free() and kill the redundant
bio->bi_next = NULL.
v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
v7: No #define BIO_KMALLOC_POOL anymore
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that bios keep track of where they were allocated from,
bio_integrity_alloc_bioset() becomes redundant.
Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
related functions and make them use bio->bi_pool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When performing a cable pull test w/ active stress I/O using fio over
a dual port Intel 82599 FCoE CNA, w/ 256LUNs on one port and about 32LUNs
on the other, it is observed that the system becomes not usable due to
scsi-ml being busy printing the error messages for all the failing commands.
I don't believe this problem is specific to FCoE and these commands are
anyway failing due to link being down (DID_NO_CONNECT), just rate-limit
the messages here to solve this issue.
v2->v1: use __ratelimit() as Tomas Henzl mentioned as the proper way for
rate-limit per function. However, in this case, the failed i/o gets to
blk_end_request_err() and then blk_update_request(), which also has to
be rate-limited, as added in the v2 of this patch.
v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip.
Signed-off-by: Yi Zou <yi.zou@intel.com>
Cc: www.Open-FCoE.org <devel@open-fcoe.org>
Cc: Tomas Henzl <thenzl@redhat.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work(). Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().
Most conversions are straight-forward except for the following.
* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
elaborate dancing around its delayed_work. Collapse it such that
linkwatch_work is queued for immediate execution if LW_URGENT and
existing timer is kept otherwise.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().
Most conversions are straight-forward. Ones worth mentioning are,
* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
use mod_delayed_work() and cancel loop in
edac_mc_reset_delay_period() is dropped.
* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
watchdog is active or not. @fan_watchdog_active and related code
dropped.
* drivers/power/charger-manager.c: Seemingly a lot of
delayed_work_pending() abuse going on here.
[delayed_]work_pending() are unsynchronized and racy when used like
this. I converted one instance in fullbatt_handler(). Please
conver the rest so that it invokes workqueue APIs for the intended
target state rather than trying to game work item pending state
transitions. e.g. if timer should be modified - call
mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
simplified. Note that round_jiffies() calls in this function are
meaningless. round_jiffies() work on absolute jiffies not delta
delay used by delayed_work.
v2: Tomi pointed out that __cancel_delayed_work() users can't be
safely converted to mod_delayed_work(). They could be calling it
from irq context and if that happens while delayed_work_timer_fn()
is running, it could deadlock. __cancel_delayed_work() users are
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
I met a odd prblem:read /proc/partitions may return zero.
I wrote a file test.c:
int main()
{
char buff[4096];
int ret;
int fd;
printf("pid=%d\n",getpid());
while (1) {
fd = open("/proc/partitions", O_RDONLY);
if (fd < 0) {
printf("open error %s\n", strerror(errno));
return 0;
}
ret = read(fd, buff, 4096);
if (ret <= 0)
printf("ret=%d, %s, %ld\n", ret,
strerror(errno), lseek(fd,0,SEEK_CUR));
close(fd);
}
exit(0);
}
You can reproduce by:
1:while true;do cat /proc/partitions > /dev/null ;done
2:./test
I reviewed the code and found:
>> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
>> {
>> static void *p;
>>
>> p = disk_seqf_start(seqf, pos);
>> if (!IS_ERR_OR_NULL(p) && !*pos)
>> seq_puts(seqf, "major minor #blocks name\n\n");
>> return p;
>> }
test cat /proc/partitions
p = disk_seqf_start()(Not NULL)
p = disk_seqf_start()(NULL because pos)
if (!IS_ERR_OR_NULL(p) && !*pos)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to map a bio to a scatterlist, modelled after
blk_rq_map_sg.
This helper is useful for any driver that wants to create
a scatterlist from its ->make_request_fn method.
Changes in v2:
- Use __blk_segment_map_sg to avoid duplicated code
- Add cocbook style function comment
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the mapping code in blk_rq_map_sg() to a helper
__blk_segment_map_sg(), so that other mapping function, e.g.
blk_bio_map_sg(), can share the code.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment. In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.
Another example that helps showing the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128. A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded. With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257. The patch will also
take into account the discard_alignment.
At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors. Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mostly a preparation for the next patch.
In principle this fixes an infinite loop if max_discard_sectors < granularity,
but that really shouldn't happen.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
Pull core block IO bits from Jens Axboe:
"The most complicated part if this is the request allocation rework by
Tejun, which has been queued up for a long time and has been in
for-next ditto as well.
There are a few commits from yesterday and today, mostly trivial and
obvious fixes. So I'm pretty confident that it is sound. It's also
smaller than usual."
* 'for-3.6/core' of git://git.kernel.dk/linux-block:
block: remove dead func declaration
block: add partition resize function to blkpg ioctl
block: uninitialized ioc->nr_tasks triggers WARN_ON
block: do not artificially constrain max_sectors for stacking drivers
blkcg: implement per-blkg request allocation
block: prepare for multiple request_lists
block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
blkcg: inline bio_blkcg() and friends
block: allocate io_context upfront
block: refactor get_request[_wait]()
block: drop custom queue draining used by scsi_transport_{iscsi|fc}
mempool: add @gfp_mask to mempool_create_node()
blkcg: make root blkcg allocation use %GFP_KERNEL
blkcg: __blkg_lookup_create() doesn't need radix preload
__generic_unplug_device() function is removed with commit
7eaceaccab, which forgot to
remove the declaration at meantime. Here remove it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.
This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.
Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().
v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Phillip Susi <psusi@ubuntu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hi,
I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):
[ 10.886893] ------------[ cut here ]------------
[ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[ 10.886905] Hardware name: Bochs
[ 10.886906] Modules linked in:
[ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[ 10.886908] Call Trace:
[ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560
[ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340
[ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30
[ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20
[ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[ 10.886928] ---[ end trace 32a14af7ee6a590b ]---
Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:
dump -o -f /tmp/foo /
You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.
I bisected it down to the following commit:
commit f6e8d01bee
Author: Tejun Heo <tj@kernel.org>
Date: Mon Mar 5 13:15:26 2012 -0800
block: add io_context->active_ref
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.
Tejun, is the right thing here to add back the init, or should something else
be done?
The below patch removes the warning, but I haven't done any more extensive
testing on it.
Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.
It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).
Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MD raid1 prepares to dispatch request in unplug callback. If make_request in
low level queue also uses unplug callback to dispatch request, the low level
queue's unplug callback will not be called. Recheck the callback list helps
this case.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the queue is dead blk_execute_rq_nowait() doesn't invoke the done()
callback function. That will result in blk_execute_rq() being stuck
in wait_for_completion(). Avoid this by initializing rq->end_io to the
done() callback before we check the queue state. Also, make sure the
queue lock is held around the invocation of the done() callback. Found
this through source code review.
Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Currently, request_queue has one request_list to allocate requests
from regardless of blkcg of the IO being issued. When the unified
request pool is used up, cfq proportional IO limits become meaningless
- whoever grabs the next request being freed wins the race regardless
of the configured weights.
This can be easily demonstrated by creating a blkio cgroup w/ very low
weight, put a program which can issue a lot of random direct IOs there
and running a sequential IO from a different cgroup. As soon as the
request pool is used up, the sequential IO bandwidth crashes.
This patch implements per-blkg request_list. Each blkg has its own
request_list and any IO allocates its request from the matching blkg
making blkcgs completely isolated in terms of request allocation.
* Root blkcg uses the request_list embedded in each request_queue,
which was renamed to @q->root_rl from @q->rq. While making blkcg rl
handling a bit harier, this enables avoiding most overhead for root
blkcg.
* Queue fullness is properly per request_list but bdi isn't blkcg
aware yet, so congestion state currently just follows the root
blkcg. As writeback isn't aware of blkcg yet, this works okay for
async congestion but readahead may get the wrong signals. It's
better than blkcg completely collapsing with shared request_list but
needs to be improved with future changes.
* After this change, each block cgroup gets a full request pool making
resource consumption of each cgroup higher. This makes allowing
non-root users to create cgroups less desirable; however, note that
allowing non-root users to directly manage cgroups is already
severely broken regardless of this patch - each block cgroup
consumes kernel memory and skews IO weight (IO weights are not
hierarchical).
v2: queue-sysfs.txt updated and patch description udpated as suggested
by Vivek.
v3: blk_get_rl() wasn't checking error return from
blkg_lookup_create() and may cause oops on lookup failure. Fix it
by falling back to root_rl on blkg lookup failures. This problem
was spotted by Rakesh Iyer <rni@google.com>.
v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
request waitqueue". blk_drain_queue() now wakes up waiters on all
blkg->rl on the target queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.
* Make queue full state per request_list. blk_*queue_full() functions
are renamed to blk_*rl_full() and takes @rl instead of @q.
* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
instead of @q. Also add @gfp_mask parameter.
* Add blk_exit_rl() instead of destroying rl directly from
blk_release_queue().
* Add request_list->q and make request alloc/free functions -
blk_free_request(), [__]freed_request(), __get_request() - take @rl
instead of @q.
This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
to use q->nr_rqs[] instead of q->rq.count[].
These counters separates queue-wide request statistics from the
request list and allow implementation of per-queue request allocation.
While at it, properly indent fields of struct request_list.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make bio_blkcg() and friends inline. They all are very simple and
used only in few places.
This patch is to prepare for further updates to request allocation
path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block layer very lazy allocation of ioc. It waits until the moment
ioc is absolutely necessary; unfortunately, that time could be inside
queue lock and __get_request() performs unlock - try alloc - retry
dancing.
Just allocate it up-front on entry to block layer. We're not saving
the rain forest by deferring it to the last possible moment and
complicating things unnecessarily.
This patch is to prepare for further updates to request allocation
path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, there are two request allocation functions - get_request()
and get_request_wait(). The former tries to allocate a request once
and the latter keeps retrying until it succeeds. The latter wraps the
former and keeps retrying until allocation succeeds.
The combination of two functions deliver fallible non-wait allocation,
fallible wait allocation and unfailing wait allocation. However,
given that forward progress is guaranteed, fallible wait allocation
isn't all that useful and in fact nobody uses it.
This patch simplifies the interface as follows.
* get_request() is renamed to __get_request() and is only used by the
wrapper function.
* get_request_wait() is renamed to get_request(). It now takes
@gfp_mask and retries iff it contains %__GFP_WAIT.
This patch doesn't introduce any functional change and is to prepare
for further updates to request allocation path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iscsi_remove_host() uses bsg_remove_queue() which implements custom
queue draining. fc_bsg_remove() open-codes mostly identical logic.
The draining logic isn't correct in that blk_stop_queue() doesn't
prevent new requests from being queued - it just stops processing, so
nothing prevents new requests to be queued after the logic determines
that the queue is drained.
blk_cleanup_queue() now implements proper queue draining and these
custom draining logics aren't necessary. Drop them and use
bsg_unregister_queue() + blk_cleanup_queue() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@emulex.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags - add @gfp_mask argument to the function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
from __blkg_lookup_create() for root blkcg creation. This could make
policy fail unnecessarily.
Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
preload radix tree and preallocate blkg with %GFP_KERNEL before trying
to create the root blkg.
v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
instead of ERR_PTR() value. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no point in calling radix_tree_preload() if preloading doesn't
use more permissible GFP mask. Drop preloading from
__blkg_lookup_create().
While at it, drop sparse locking annotation which no longer applies.
v2: Vivek pointed out the odd preload usage. Instead of updating,
just drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sometimes, warnings about ioctls to partition happen often enough that they
form majority of the warnings in the kernel log and users complain. In some
cases warnings are about ioctls such as SG_IO so it's not good to get rid of
the warnings completely as they can ease debugging of userspace problems
when ioctl is refused.
Since I have seen warnings from lots of commands, including some proprietary
userspace applications, I don't think disallowing the ioctls for processes
with CAP_SYS_RAWIO will happen in the near future if ever. So lets just
stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: James Bottomley <JBottomley@parallels.com>
CC: linux-scsi@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function was only used by btrfs code in btrfs_abort_devices()
(seems in a wrong way).
It was removed in commit d07eb91170,
So, Let's remove the dead code to avoid any confusion.
Changes in v2: update commit log, btrfs_abort_devices() was removed
already.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 777eb1bf15 disconnects externally
supplied queue_lock before blk_drain_queue(). Switching the lock would
introduce lock unbalance because theads which have taken the external
lock might unlock the internal lock in the during the queue drain. This
patch mitigate this by disconnecting the lock after the queue draining
since queue draining makes a lot of request_queue users go away.
However, please note, this patch only makes the problem less likely to
happen. Anyone who still holds a ref might try to issue a new request on
a dead queue after the blk_cleanup_queue() finishes draining, the lock
unbalance might still happen in this case.
=====================================
[ BUG: bad unlock balance detected! ]
3.4.0+ #288 Not tainted
-------------------------------------
fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
[<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
but there are no more locks to release!
other info that might help us debug this:
1 lock held by fio/17706:
#0: (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
get_request_wait+0x19a/0x250
stack backtrace:
Pid: 17706, comm: fio Not tainted 3.4.0+ #288
Call Trace:
[<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
[<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
[<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
[<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
[<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
[<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
[<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
[<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
[<ffffffff810e0079>] lock_release+0xd9/0x250
[<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
[<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
[<ffffffff81328faa>] generic_make_request+0xca/0x100
[<ffffffff81329056>] submit_bio+0x76/0xf0
[<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
[<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
[<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
[<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
[<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
[<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
[<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
[<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
[<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
[<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
[<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
[<ffffffff811e8250>] ? aio_fsync+0x30/0x30
[<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
[<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
[<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff811eae50>] sys_io_submit+0x10/0x20
[<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
Changes since v2: Update commit log to explain how the code is still
broken even if we delay the lock switching after the drain.
Changes since v1: Update commit log as Tejun suggested.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After hot-unplug a stressed disk, I found that rl->wait[] is not empty
while rl->count[] is empty and there are theads still sleeping on
get_request after the queue cleanup. With simple debug code, I found
there are exactly nr_sleep - nr_wakeup of theads in D state. So there
are missed wakeup.
$ dmesg | grep nr_sleep
[ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
$ vmstat 1
1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0
To quote Tejun:
Ah, okay, freed_request() wakes up single waiter with the assumption
that after the wakeup there will at least be one successful allocation
which in turn will continue the wakeup chain until the wait list is
empty - ie. waiter wakeup is dependent on successful request
allocation happening after each wakeup. With queue marked dead, any
woken up waiter fails the allocation path, so the wakeup chaining is
lost and we're left with hung waiters. What we need is wake_up_all()
after drain completion.
This patch fixes the missed wakeup by waking up all the theads which
are sleeping on wait queue after queue drain.
Changes in v2: Drop waitqueue_active() optimization
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Asias He <asias@redhat.com>
Fixed a bug by me, where stacked devices would oops on calling
blk_drain_queue() since ->rq.wait[] do not get initialized unless
it's a full queue setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg_destroy() caches @blkg->q in local variable @q. While there are
two places which needs @blkg->q, only lockdep_assert_held() used the
local variable leading to unused local variable warning if lockdep is
configured out. Drop the local variable and just use @blkg->q
directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rakesh Iyer <rni@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When policy data allocation fails in the middle, blkg_alloc() invokes
blkg_free() to destroy the half constructed blkg. This ends up
calling pd_exit_fn() on policy datas which didn't go through
pd_init_fn(). Fix it by making blkg_alloc() call pd_init_fn()
immediately after each policy data allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq may be built w/ or w/o blkcg support depending on
CONFIG_CFQ_CGROUP_IOSCHED. If blkcg support is disabled, most of
related code is ifdef'd out but some part is left dangling -
blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
calls are made on it.
Feeding zero filled policy to blkcg_policy_register() is incorrect and
triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
!CONFIG_CFQ_GROUP_IOSCHED.
------------[ cut here ]------------
WARNING: at block/blk-cgroup.c:867
Modules linked in:
Modules linked in:
CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
Krnl Code: 00000000003d76be: a7280001 lhi %r2,1
00000000003d76c2: a7f4ffdf brc 15,3d7680
#00000000003d76c6: a7f40001 brc 15,3d76c8
>00000000003d76ca: a7c8ffea lhi %r12,-22
00000000003d76ce: a7f4ffce brc 15,3d766a
00000000003d76d2: a7f40001 brc 15,3d76d4
00000000003d76d6: a7c80000 lhi %r12,0
00000000003d76da: a7f4ffc2 brc 15,3d765e
Call Trace:
([<0000000000b6f000>] initcall_debug+0x0/0x4)
[<0000000000989e8a>] cfq_init+0x62/0xd4
[<00000000001000ba>] do_one_initcall+0x3a/0x170
[<000000000096fb60>] kernel_init+0x214/0x2bc
[<0000000000623202>] kernel_thread_starter+0x6/0xc
[<00000000006231fc>] kernel_thread_starter+0x0/0xc
no locks held by swapper/0/1.
Last Breaking-Event-Address:
[<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
---[ end trace b8ef4903fcbf9dd3 ]---
This patch fixes the problem by ensuring all blkcg support code is
inside CONFIG_CFQ_GROUP_IOSCHED.
* blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is
dropped from blkcg_policy_cfq decl.
* blkcg_deactivate_poilcy() invocation is moved inside ifdef. This
also makes the activation logic match cfq_init_queue().
* All blkcg_policy_[un]register() invocations are moved inside ifdef.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq_init() would return zero after kmem cache creation failure. Fix
so that it returns -ENOMEM.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling get_task_io_context() on a exiting task which isn't %current can
loop forever. This triggers at boot time on my dev machine.
BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]
Fix this by making create_task_io_context() returns -EBUSY in this case
to break the loop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge block/IO core bits from Jens Axboe:
"This is a bit bigger on the core side than usual, but that is purely
because we decided to hold off on parts of Tejun's submission on 3.4
to give it a bit more time to simmer. As a consequence, it's seen a
long cycle in for-next.
It contains:
- Bug fix from Dan, wrong locking type.
- Relax splice gifting restriction from Eric.
- A ton of updates from Tejun, primarily for blkcg. This improves
the code a lot, making the API nicer and cleaner, and also includes
fixes for how we handle and tie policies and re-activate on
switches. The changes also include generic bug fixes.
- A simple fix from Vivek, along with a fix for doing proper delayed
allocation of the blkcg stats."
Fix up annoying conflict just due to different merge resolution in
Documentation/feature-removal-schedule.txt
* 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
blkcg: tg_stats_alloc_lock is an irq lock
vmsplice: relax alignement requirements for SPLICE_F_GIFT
blkcg: use radix tree to index blkgs from blkcg
blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
block: fix elvpriv allocation failure handling
block: collapse blk_alloc_request() into get_request()
blkcg: collapse blkcg_policy_ops into blkcg_policy
blkcg: embed struct blkg_policy_data in policy specific data
blkcg: mass rename of blkcg API
blkcg: style cleanups for blk-cgroup.h
blkcg: remove blkio_group->path[]
blkcg: blkg_rwstat_read() was missing inline
blkcg: shoot down blkgs if all policies are deactivated
blkcg: drop stuff unused after per-queue policy activation update
blkcg: implement per-queue policy activation
blkcg: add request_queue->root_blkg
blkcg: make request_queue bypassing on allocation
blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
blkcg: make blkg_conf_prep() take @pol and return with queue lock held
blkcg: remove static policy ID enums
...
tg_stats_alloc_lock nests inside queue lock and should always be held
with irq disabled. throtl_pd_{init|exit}() were using non-irqsafe
spinlock ops which triggered inverse lock ordering via irq warning via
RCU freeing of blkg invoking throtl_pd_exit() w/o disabling IRQ.
Update both functions to use irq safe operations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
LKML-Reference: <1335339396.16988.80.camel@lappy>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup updates from Tejun Heo:
"cgroup file type addition / removal is updated so that file types are
added and removed instead of individual files so that dynamic file
type addition / removal can be implemented by cgroup and used by
controllers. blkio controller changes which will come through block
tree are dependent on this. Other changes include res_counter cleanup
and disallowing kthread / PF_THREAD_BOUND threads to be attached to
non-root cgroups.
There's a reported bug with the file type addition / removal handling
which can lead to oops on cgroup umount. The issue is being looked
into. It shouldn't cause problems for most setups and isn't a
security concern."
Fix up trivial conflict in Documentation/feature-removal-schedule.txt
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
res_counter: Account max_usage when calling res_counter_charge_nofail()
res_counter: Merge res_counter_charge and res_counter_charge_nofail
cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
cgroup: remove cgroup_subsys->populate()
cgroup: get rid of populate for memcg
cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
cgroup: make css->refcnt clearing on cgroup removal optional
cgroup: use negative bias on css->refcnt to block css_tryget()
cgroup: implement cgroup_rm_cftypes()
cgroup: introduce struct cfent
cgroup: relocate __d_cgrp() and __d_cft()
cgroup: remove cgroup_add_file[s]()
cgroup: convert memcg controller to the new cftype interface
memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
cgroup: convert all non-memcg controllers to the new cftype interface
cgroup: relocate cftype and cgroup_subsys definitions in controllers
cgroup: merge cft_release_agent cftype array into the base files array
cgroup: implement cgroup_add_cftypes() and friends
cgroup: build list of all cgroups under a given cgroupfs_root
cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
...
To avoid confusion while formatting a DASD device change the level of
the "Expected VOL1 label not found" message from warning to info.
Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
6d1d8050b4 "block, partition: add partition_meta_info to hd_struct"
added part_unpack_uuid() which assumes that the passed in buffer has
enough space for sprintfing "%pU" - 37 characters including '\0'.
Unfortunately, b5af921ec0 "init: add support for root devices
specified by partition UUID" supplied 33 bytes buffer to the function
leading to the following panic with stackprotector enabled.
Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
[<ffffffff815e226b>] panic+0xba/0x1c6
[<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
[<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
[<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
[<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
[<ffffffff81aee0fa>] mount_root+0x57/0x5b
[<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
[<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
[<ffffffff81aedd60>] kernel_init+0x155/0x15a
[<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
[<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
[<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
[<ffffffff815f4d20>] ? gs_change+0x13/0x13
Increase the buffer size, remove the dangerous part_unpack_uuid() and
use snprintf() directly from printk_all_partitions().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
=4d4w
-----END PGP SIGNATURE-----
Merge tag 'v3.4-rc5' into for-3.5/core
The core branch is behind driver commits that we want to build
on for 3.5, hence I'm pulling in a later -rc.
Linux 3.4-rc5
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkg lookup is currently performed by traversing linked list anchored
at blkcg->blkg_list. This is very unscalable and with blk-throttle
enabled and enough request queues on the system, this can get very
ugly quickly (blk-throttle performs look up on every bio submission).
This patch makes blkcg use radix tree to index blkgs combined with
simple last-looked-up hint. This is mostly identical to how icqs are
indexed from ioc.
Note that because __blkg_lookup() may be invoked without holding queue
lock, hint is only updated from __blkg_lookup_create(). Due to cfq's
cfqq caching, this makes hint updates overly lazy. This will be
improved with scheduled blkcg aware request allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Request allocation is mempool backed to guarantee forward progress
under memory pressure; unfortunately, this property got broken while
adding elvpriv data. Failures during elvpriv allocation, including
ioc and icq creation failures, currently make get_request() fail as
whole. There's no forward progress guarantee for these allocations -
they may fail indefinitely under memory pressure stalling IO and
deadlocking the system.
This patch updates get_request() such that elvpriv allocation failure
doesn't make the whole function fail. If elvpriv allocation fails,
the allocation is degraded into !ELVPRIV. This will force the request
to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
failures should be rare (nothing is per-request) and anything is
better than deadlocking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>