linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-02 16:44:10 +08:00

Author	SHA1	Message	Date
Christoph Hellwig	d852564f8c	blk-mq: remove blk_mq_alloc_request_pinned We now only have one caller left and can open code it there in a cleaner way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:27 -06:00
Christoph Hellwig	793597a6a9	blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request We already do a non-blocking allocation in blk_mq_map_request, no need to repeat it. Just call __blk_mq_alloc_request to wait directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:25 -06:00
Christoph Hellwig	a3bd77567c	blk-mq: remove blk_mq_wait_for_tags The current logic for blocking tag allocation is rather confusing, as we first allocated and then free again a tag in blk_mq_wait_for_tags, just to attempt a non-blocking allocation and then repeat if someone else managed to grab the tag before us. Instead change blk_mq_alloc_request_pinned to simply do a blocking tag allocation itself and use the request we get back from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:23 -06:00
Christoph Hellwig	5dee857720	blk-mq: initialize request in __blk_mq_alloc_request Both callers if __blk_mq_alloc_request want to initialize the request, so lift it into the common path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:21 -06:00
Christoph Hellwig	4ce01dd1a0	blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request Instead of having two almost identical copies of the same code just let the callers pass in the reserved flag directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:19 -06:00
Christoph Hellwig	6fca6a611c	blk-mq: add helper to insert requests from irq context Both the cache flush state machine and the SCSI midlayer want to submit requests from irq context, and the current per-request requeue_work unfortunately causes corruption due to sharing with the csd field for flushes. Replace them with a per-request_queue list of requests to be requeued. Based on an earlier test by Ming Lei. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 08:08:02 -06:00
Jens Axboe	95f0968499	blk-mq: allow non-softirq completions Right now we export two ways of completing a request: 1) blk_mq_complete_request(). This uses an IPI (if needed) and completes through q->softirq_done_fn(). It also works with timeouts. 2) blk_mq_end_io(). This completes inline, and ignores any timeout state of the request. Let blk_mq_complete_request() handle non-softirq_done_fn completions as well, by just completing inline. If a driver has enough completion ports to place completions correctly, it need not define a mq_ops->complete() and we can avoid an indirect function call by doing the completion inline. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 17:46:48 -06:00
Jens Axboe	f14bbe77a9	blk-mq: pass in suggested NUMA node to ->alloc_hctx() Drivers currently have to figure this out on their own, and they are missing information to do it properly. The ones that did attempt to do it, do it wrong. So just pass in the suggested node directly to the alloc function. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 12:06:53 -06:00
Ming Lei	3d2936f457	block: only allocate/free mq_usage_counter in blk-mq The percpu counter is only used for blk-mq, so move its allocation and free inside blk-mq, and don't allocate it for legacy queue device. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:08 -06:00
Ming Lei	624dbe4754	blk-mq: avoid code duplication blk_mq_exit_hw_queues() and blk_mq_free_hw_queues() are introduced to avoid code duplication. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:06 -06:00
Ming Lei	1f9f07e917	blk-mq: fix leak of hctx->ctx_map hctx->ctx_map should have been freed inside blk_mq_free_queue(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 08:34:45 -06:00
Fabian Frederick	35086784ca	block/blk-lib.c: make __blkdev_issue_zeroout static __blkdev_issue_zeroout is only used in blk-lib.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:39:09 -06:00
Christoph Hellwig	19c5d84f14	blk-mq: idle all hardware contexts before freeing a queue Without this we can leak the active_queues reference if a command is freed while it is considered active. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:21:51 -06:00
Jens Axboe	c22d9d8a60	blk-mq: allow setting of per-request timeouts Currently blk-mq uses the queue timeout for all requests. But for some commands, drivers may want to set a specific timeout for special requests. Allow this to be passed in through request->timeout, and use it if set. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 14:14:57 -06:00
Sam Bradshaw	edf866b380	blk-mq: export blk_mq_tag_busy_iter Export the blk-mq in-flight tag iterator for driver consumption. This is particularly useful in exception paths or SRSI where in-flight IOs need to be cancelled and/or reissued. The NVMe driver conversion will use this. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 13:30:16 -06:00
Jens Axboe	07068d5b8e	blk-mq: split make request handler for multi and single queue We want slightly different behavior from them: - On single queue devices, we currently use the per-process plug for deferred IO and for merging. - On multi queue devices, we don't use the per-process plug, but we want to go straight to hardware for SYNC IO. Split blk_mq_make_request() into a blk_sq_make_request() for single queue devices, and retain blk_mq_make_request() for multi queue devices. Then we don't need multiple checks for q->nr_hw_queues in the request mapping. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-22 10:43:07 -06:00
Jens Axboe	484b4061e6	blk-mq: save memory by freeing requests on unused hardware queues Depending on the topology of the machine and the number of queues exposed by a device, we can end up in a situation where some of the hardware queues are unused (as in, they don't map to any software queues). For this case, free up the memory used by the request map, as we will not use it. This can be a substantial amount of memory, depending on the number of queues vs CPUs and the queue depth of the device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 14:01:15 -06:00
Jens Axboe	e814e71ba4	blk-mq: allow the hctx cpu hotplug notifier to return errors Prepare this for the next patch which adds more smarts in the plugging logic, so that we can save some memory. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 13:59:08 -06:00
Robert Elliott	da41a589f5	blk-mq: Micro-optimize blk_queue_nomerges() check In blk_mq_make_request(), do the blk_queue_nomerges() check outside the call to blk_attempt_plug_merge() to eliminate function call overhead when nomerges=2 (disabled) Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:49:03 -06:00
Jens Axboe	eba7176826	blk-mq: initialize q->nr_requests after calling blk_queue_make_request() blk_queue_make_requests() overwrites our set value for q->nr_requests, turning it into the default of 128. Set this appropriately after initializing queue values in blk_queue_make_request(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:17:27 -06:00
Jens Axboe	e3a2b3f931	blk-mq: allow changing of queue depth through sysfs For request_fn based devices, the block layer exports a 'nr_requests' file through sysfs to allow adjusting of queue depth on the fly. Currently this returns -EINVAL for blk-mq, since it's not wired up. Wire this up for blk-mq, so that it now also always dynamic adjustments of the allowed queue depth for any given block device managed by blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 11:49:02 -06:00
Jens Axboe	719c555f44	block: move mm/bounce.c to block/ Continue moving some of the block files that are scattered around. bounce.c contains only code for bouncing the contents of a bio. It's block proper code, not mm code. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 20:01:52 -06:00
Jens Axboe	39a9f97e5e	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/blk-mq-tag.c	2014-05-19 11:52:35 -06:00
Jens Axboe	1429d7c946	blk-mq: switch ctx pending map to the sparser blk_align_bitmap Each hardware queue has a bitmap of software queues with pending requests. When new IO is queued on a software queue, the bit is set, and when IO is pruned on a hardware queue run, the bit is cleared. This causes a lot of traffic. Switch this from the regular BITS_PER_LONG bitmap to a sparser layout, similarly to what was done for blk-mq tagging. 20% performance increase was observed for single threaded IO, and about 15% performanc increase on multiple threads driving the same device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	e93ecf602b	blk-mq: move the cache friendly bitmap type of out blk-mq-tag We will use it for the pending list in blk-mq core as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	2667bcbbd5	block: move ioprio.c from fs/ to block/ Like commit `f9c78b2b`, move this block related file outside of fs/ and into the core block directory, block/. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:18 -06:00
Jens Axboe	f9c78b2be2	block: move bio.c and bio-integrity.c from fs/ to block/ They really belong in block/, especially now since it's not in drivers/block/ anymore. Additionally, the get_maintainer script gets it wrong when in fs/. Suggested-by: Christoph Hellwig <hch@infradead.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 08:34:46 -06:00
Tejun Heo	5c9d535b89	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:48 -04:00
Jens Axboe	0d2602ca30	blk-mq: improve support for shared tags maps This adds support for active queue tracking, meaning that the blk-mq tagging maintains a count of active users of a tag set. This allows us to maintain a notion of fairness between users, so that we can distribute the tag depth evenly without starving some users while allowing others to try unfair deep queues. If sharing of a tag set is detected, each hardware queue will track the depth of its own queue. And if this exceeds the total depth divided by the number of active queues, the user is actively throttled down. The active queue count is done lazily to avoid bouncing that data between submitter and completer. Each hardware queue gets marked active when it allocates its first tag, and gets marked inactive when 1) the last tag is cleared, and 2) the queue timeout grace period has passed. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-13 15:10:52 -06:00
Tejun Heo	451af504df	cgroup: replace cftype->write_string() with cftype->write() Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>	2014-05-13 12:16:21 -04:00
Tejun Heo	ec903c0c85	cgroup: rename css_tryget() to css_tryget_online() Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>	2014-05-13 12:11:01 -04:00
Jens Axboe	acb12e0a9c	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core	2014-05-10 15:44:42 -06:00
Ming Lei	1f236ab22c	blk-mq: bitmap tag: cleanup blk_mq_init_tags Both nr_cache and nr_tags arn't needed for bitmap tag anymore. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:44:00 -06:00
Ming Lei	9d3d21aeb4	blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1) The selected tag should be selected at random between 0 and (depth - 1) with probability 1/depth, instead between 0 and (depth - 2) with probability 1/(depth - 1). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:43:14 -06:00
Ming Lei	60f2df8a29	blk-mq: bitmap tag: remove barrier in bt_clear_tag() The barrier isn't necessary because both atomic_dec_and_test() and wake_up() implicate one barrier. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:42:13 -06:00
Ming Lei	0289b2e110	blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag() The unlock memory barrier need to order access to req in free path and clearing tag bit, otherwise either request free path may see a allocated request, or initialized request in allocate path might be modified by the ongoing free path. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:41:42 -06:00
Jens Axboe	7276d02e24	block: only calculate part_in_flight() once We first check if we have inflight IO, then retrieve that same number again. Usually this isn't that costly since the chance of having the data dirtied in between is small, but there's no reason for calling part_in_flight() twice. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 15:48:23 -06:00
Jens Axboe	cf4b50afc2	blk-mq: fix race in IO start accounting Commit `c6d600c6` opened up a small race where we could attempt to account IO completion on a request, racing with IO start accounting. Fix this up by ensuring that we've accounted for IO start before inserting the request. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 14:54:08 -06:00
Jens Axboe	59d13bf5f5	blk-mq: use sparser tag layout for lower queue depth For best performance, spreading tags over multiple cachelines makes the tagging more efficient on multicore systems. But since we have 8 * sizeof(unsigned long) tags per cacheline, we don't always get a nice spread. Attempt to spread the tags over at least 4 cachelines, using fewer number of bits per unsigned long if we have to. This improves tagging performance in setups with 32-128 tags. For higher depths, the spread is the same as before (BITS_PER_LONG tags per cacheline). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 13:41:15 -06:00
Jens Axboe	4bb659b156	blk-mq: implement new and more efficient tagging scheme blk-mq currently uses percpu_ida for tag allocation. But that only works well if the ratio between tag space and number of CPUs is sufficiently high. For most devices and systems, that is not the case. The end result if that we either only utilize the tag space partially, or we end up attempting to fully exhaust it and run into lots of lock contention with stealing between CPUs. This is not optimal. This new tagging scheme is a hybrid bitmap allocator. It uses two tricks to both be SMP friendly and allow full exhaustion of the space: 1) We cache the last allocated (or freed) tag on a per blk-mq software context basis. This allows us to limit the space we have to search. The key element here is not caching it in the shared tag structure, otherwise we end up dirtying more shared cache lines on each allocate/free operation. 2) The tag space is split into cache line sized groups, and each context will start off randomly in that space. Even up to full utilization of the space, this divides the tag users efficiently into cache line groups, avoiding dirtying the same one both between allocators and between allocator and freeer. This scheme shows drastically better behaviour, both on small tag spaces but on large ones as well. It has been tested extensively to show better performance for all the cases blk-mq cares about. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 09:36:49 -06:00
Christoph Hellwig	af76e555e5	blk-mq: initialize struct request fields individually This allows us to avoid a non-atomic memset over ->atomic_flags as well as killing lots of duplicate initializations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 08:43:49 -06:00
Jens Axboe	9fccfed8f0	blk-mq: update a hotplug comment for grammar Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-09 08:43:49 -06:00
Jens Axboe	506e931f92	blk-mq: add basic round-robin of what CPU to queue workqueue work on Right now we just pick the first CPU in the mask, but that can easily overload that one. Add some basic batching and round-robin all the entries in the mask instead. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-07 10:26:44 -06:00
Tejun Heo	36c38fb714	blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() During the recent conversion of cgroup to kernfs, cgroup_tree_mutex which nests above both the kernfs s_active protection and cgroup_mutex is added to synchronize cgroup file type operations as cgroup_mutex needed to be grabbed from some file operations and thus can't be put above s_active protection. While this arrangement mostly worked for cgroup, this triggered the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W ------------------------------------------------------- trinity-c173/9024 is trying to acquire lock: (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) but task is already holding lock: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (s_active#89){++++.+}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024) kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219) cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899) cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2)) rebind_subsystems (kernel/cgroup.c:1144) cgroup_setup_root (kernel/cgroup.c:1568) cgroup_mount (kernel/cgroup.c:1716) mount_fs (fs/super.c:1094) vfs_kern_mount (fs/namespace.c:899) do_mount (fs/namespace.c:2238 fs/namespace.c:2561) SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729) tracesys (arch/x86/kernel/entry_64.S:746) -> #1 (cgroup_tree_mutex){+.+.+.}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040) blkcg_policy_register (block/blk-cgroup.c:1106) throtl_init (block/blk-throttle.c:1694) do_one_initcall (init/main.c:789) kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003) kernel_init (init/main.c:935) ret_from_fork (arch/x86/kernel/entry_64.S:552) -> #0 (blkcg_pol_mutex){+.+.+.}: __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) tracesys (arch/x86/kernel/entry_64.S:746) other info that might help us debug this: Chain exists of: blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#89); lock(cgroup_tree_mutex); lock(s_active#89); lock(blkcg_pol_mutex); * DEADLOCK * 4 locks held by trinity-c173/9024: #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714) #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530) #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283) #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) stack backtrace: CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002 ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004 ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0 Call Trace: dump_stack (lib/dump_stack.c:52) print_circular_bug (kernel/locking/lockdep.c:1216) __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) This is a highly unlikely but valid circular dependency between "echo 1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going through further locking update which will remove this complication but for now let's use trylock on blkcg_pol_mutex and retry the file operation if the trylock fails. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com	2014-05-05 13:48:18 -04:00
Keith Busch	3291fa57cb	NVMe: Add tracepoints Adding tracepoints for bio_complete and block_split into nvme to help with gathering IO info using blktrace and blkparse. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2014-05-05 10:41:26 -04:00
Fabian Frederick	5cf8c22775	block/blk-throttle.c: fix return of 0/1 with return type bool Fix 4 coccinelle warnings. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:38:03 -06:00
Fabian Frederick	5214e33c8e	block/blk-iopoll.c: use iop instead of iopoll All blk_iopoll functions use iop for parent iopoll structure except blk_iopoll_complete.This also fixes one kernel-doc warning. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:37:41 -06:00
Jens Axboe	74814b1c55	blk-mq: remove extra requeue trace We already issue a blktrace requeue event in __blk_mq_requeue_request(), don't do it from the original caller as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-02 11:24:48 -06:00
Masanari Iida	176167ad9e	block: Fix format string mismatch in cfq-iosched.c Fix format string mismatch in cfq_var_show() Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 15:56:10 -06:00
Jens Axboe	c6d600c65e	blk-mq: refactor request insertion/merging Refactor the logic around adding a new bio to a software queue, so we nest the ctx->lock where we really need it (merge and insertion) and don't hold it when we don't (init and IO start accounting). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:43:56 -06:00
Jens Axboe	98bc1f272a	blk-mq remove debug BUG_ON() when draining software queues It's never been of any use, lets get rid of it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-30 13:43:08 -06:00
Jens Axboe	5810d903fa	blk-mq: fix waiting for reserved tags blk_mq_wait_for_tags() is only able to wait for "normal" tags, not reserved tags. Pass in which one we should attempt to get a tag for, so that waiting for reserved tags will work. Reserved tags are used for internal commands, which are usually serialized. Hence no waiting generally takes place, but we should ensure that it actually works if users need that functionality. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-29 20:49:48 -06:00
Christoph Hellwig	c4a634f432	block: fold __blk_add_timer into blk_add_timer Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-25 08:24:40 -06:00
Christoph Hellwig	3853520163	blk-mq: respect rq_affinity The blk-mq code is using it's own version of the I/O completion affinity tunables, which causes a few issues: - the rq_affinity sysfs file doesn't work for blk-mq devices, even if it still is present, thus breaking existing tuning setups. - the rq_affinity = 1 mode, which is the defauly for legacy request based drivers isn't implemented at all. - blk-mq drivers don't implement any completion affinity with the default flag settings. This patches removes the blk-mq ipi_redirect flag and sysfs file, as well as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that respects the queue-wide rq_affinity flags and also implements the rq_affinity = 1 mode. This means I/O completion affinity can now only be tuned block-queue wide instead of per context, which seems more sensible to me anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-25 08:24:07 -06:00
Jens Axboe	87ee7b1121	blk-mq: fix race with timeouts and requeue events If a requeue event races with a timeout, we can get into the situation where we attempt to complete a request from the timeout handler when it's not start anymore. This causes a crash. So have the timeout handler check that REQ_ATOM_STARTED is still set on the request - if not, we ignore the event. If this happens, the request has now been marked as complete. As a consequence, we need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(), as to maintain proper request state. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-24 08:51:47 -06:00
Jens Axboe	70ab0b2d51	Revert "blk-mq: initialize req->q in allocation" This reverts commit `6a3c8a3ac0`. We need selective clearing of the request to make the init-at-free time completely safe. Otherwise we end up stomping on rq->atomic_flags, which we don't want to do.	2014-04-24 08:50:38 -06:00
Ming Lei	981bd189f8	blk-mq: fix leak of set->tags set->tags should be freed in blk_mq_free_tag_set(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-23 10:08:22 -06:00
Fabian Frederick	8876e140ec	block/blk-throttle.c: add static to blk_throtl_dispatch_work_fn blk_throtl_dispatch_work_fn is only used in blk-throttle.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 19:30:28 -06:00
Ming Lei	6a3c8a3ac0	blk-mq: initialize req->q in allocation The patch basically reverts the patch of(blk-mq: initialize request on allocation) in Jens's tree(already in -next), and only initialize req->q in allocation for two reasons: - presumed cache hotness on completion - blk_rq_tagged(rq) depends on reset of req->mq_ctx Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:39 -06:00
Ming Lei	4ca085009f	blk-mq: user (1 << order) to implement order_to_size() Cc: Jörg-Volker Peetz <jvpeetz@web.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:38 -06:00
Ming Lei	4847900532	blk-mq: fix allocation of set->tags type of set->tags is struct blk_mq_tags **. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:36 -06:00
Ming Lei	11471e0d04	blk-mq: free hctx->ctx_map when init failed Avoid memory leak in the failure path. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-21 10:38:34 -06:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Jens Axboe	49fd524f95	bsg: update check for rq based driver for blk-mq bsg currently checks ->request_fn to check whether a queue can handle struct request. But with blk-mq, we don't have a request_fn yet are request based. Add a queue_is_rq_based() helper and use that in bsg, I'm guessing this is not the last place we need to update for this. Besides, it better explains what is being checked. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:46 -06:00
Jens Axboe	f793aa5378	block: relax when to modify the timeout timer Since we are now, by default, applying timer slack to expiry times, the logic for when to modify a timer in the block code is suboptimal. The block layer keeps a forward rolling timer per queue for all requests, and modifies this timer if a request has a shorter timeout than what the current expiry time is. However, this breaks down when our rounded timer values get applied slack. Then each new request ends up modifying the timer, since we're still a little in front of the timer + slack. Fix this by allowing a tolerance of HZ / 2, the timeout handling doesn't need to be very precise. This drastically cuts down the number of timer modifications we have to make. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	12120077b2	block: export blk_finish_request This allows to mirror the blk-mq code flow for more a more readable I/O completion handler in SCSI. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	f88a164b72	blk-mq: rename mq_flush_work struct request member We will use this work_struct to requeue scsi commands from the completion handler as well, so give it a more generic name. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	ed0791b2f8	blk-mq: add blk_mq_requeue_request This allows to requeue a request that has been accepted by ->queue_rq earlier. This is needed by the SCSI layer in various error conditions. The existing internal blk_mq_requeue_request is renamed to __blk_mq_requeue_request as it is a lower level building block for this funtionality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	2f26855656	blk-mq: add blk_mq_start_hw_queues Add a helper to unconditionally kick contexts of a queue. This will be needed by the SCSI layer to provide fair queueing between multiple devices on a single host. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	70f4db639c	blk-mq: add blk_mq_delay_queue Add a blk-mq equivalent to blk_delay_queue so that the scsi layer can ask to be kicked again after a delay. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified by me to kill the unnecessary preempt disable/enable in the delayed workqueue handler. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	1b4a325858	blk-mq: add async parameter to blk_mq_start_stopped_hw_queues Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	91b63639c7	blk-mq: bidi support Add two unlinkely branches to make sure the resid is initialized correctly for bidi request pairs, and the second request gets properly freed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Christoph Hellwig	63151a449e	blk-mq: allow drivers to hook into I/O completion Split out the bottom half of blk_mq_end_io so that drivers can perform work when they know a request has been completed, but before it has been freed. This also obsoletes blk_mq_end_io_partial as drivers can now pass any value to blk_update_request directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:25 -06:00
Jens Axboe	6700a678c0	blk-mq: kill preempt disable/enable in blk_mq_work_fn() blk_mq_work_fn() is always invoked off the bounded workqueues, so it can happily preempt among the queues in that set without causing any issues for blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:24 -06:00
Jens Axboe	fd1270d5df	blk-mq: don't use preempt_count() to check for right CPU UP or CONFIG_PREEMPT_NONE will return 0, and what we really want to check is whether or not we are on the right CPU. So don't make PREEMPT part of this, just test the CPU in the mask directly. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-16 14:15:24 -06:00
Christoph Hellwig	24d2f90309	blk-mq: split out tag initialization, support shared tags Add a new blk_mq_tag_set structure that gets set up before we initialize the queue. A single blk_mq_tag_set structure can be shared by multiple queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Modular export of blk_mq_{alloc,free}_tagset added by me. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:18:02 -06:00
Christoph Hellwig	ed44832dea	blk-mq: initialize request on allocation If we want to share tag and request allocation between queues we cannot initialize the request at init/free time, but need to initialize it at allocation time as it might get used for different queues over its lifetime. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:03 -06:00
Christoph Hellwig	e9b267d91f	blk-mq: add ->init_request and ->exit_request methods The current blk_mq_init_commands/blk_mq_free_commands interface has a two problems: 1) Because only the constructor is passed to blk_mq_init_commands there is no easy way to clean up when a comman initialization failed. The current code simply leaks the allocations done in the constructor. 2) There is no good place to call blk_mq_free_commands: before blk_cleanup_queue there is no guarantee that all outstanding commands have completed, so we can't free them yet. After blk_cleanup_queue the queue has usually been freed. This can be worked around by grabbing an unconditional reference before calling blk_cleanup_queue and dropping it after blk_mq_free_commands is done, although that's not exatly pretty and driver writers are guaranteed to get it wrong sooner or later. Both issues are easily fixed by making the request constructor and destructor normal blk_mq_ops methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:03 -06:00
Christoph Hellwig	8727af4b9d	blk-mq: make ->flush_rq fully transparent to drivers Drivers shouldn't have to care about the block layer setting aside a request to implement the flush state machine. We already override the mq context and tag to make it more transparent, but so far haven't deal with the driver private data in the request. Make sure to override this as well, and while we're at it add a proper helper sitting in blk-mq.c that implements the full impersonation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Christoph Hellwig	9d74e25737	blk-mq: do not initialize req->special Drivers can reach their private data easily using the blk_mq_rq_to_pdu helper and don't need req->special. By not initializing it code can be simplified nicely, and we also shave off a few more instructions from the I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Christoph Hellwig	742ee69b92	blk-mq: initialize resid_len Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Jens Axboe	b4f42e2831	block: remove struct request buffer member This was used in the olden days, back when onions were proper yellow. Basically it mapped to the current buffer to be transferred. With highmem being added more than a decade ago, most drivers map pages out of a bio, and rq->buffer isn't pointing at anything valid. Convert old style drivers to just use bio_data(). For the discard payload use case, just reference the page in the bio. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Jens Axboe	f89e0dd9d1	Linux 3.15-rc1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJTSv89AAoJEHm+PkMAQRiGd7AIAKL45dFRhX96W53uzGlpXtv7 Ecs9CMY5mtsB/rqtV/NSQaUELlNb4Ilb4lITh7NaLWAZxDJ12GwVsIbFoaBj7Ypx tfBNNxffGGnTTn2C1PpPQmLytLBvqXIVHMaPpDvnHYJl6g9BihshLzyMrsrizqnA DPJ0xMdgGp6BLC4qm0ZH1mS2q+TyXLN2ZCjJ1lp6NNYwBnwOe/ABjnUa0Ztu7aii 837N8h6nEuKHTr6DgrYHEgYVc3DPHPHaly/UJ8v4U30myzRv83YkD5DsBSUjSLj8 CzxJEnECa1XljLJK7SjRHy5pSP9tcUlmjx3Fk7IpQixmT+A15cov6fQ967jNlDw= =Hxnc -----END PGP SIGNATURE----- Merge tag 'v3.15-rc1' into for-3.16/core We don't like this, but things have diverged with the blk-mq fixes in 3.15-rc1. So merge it in.	2014-04-15 14:02:24 -06:00
Linus Torvalds	5166701b36	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it does shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...	2014-04-12 14:49:50 -07:00
Duan Jiong	21f9fcd815	block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-11 08:09:47 -06:00
Jens Axboe	360f92c244	block: fix regression with block enabled tagging Martin reported that his test system would not boot with current git, it oopsed with this: BUG: unable to handle kernel paging request at ffff88046c6c9e80 IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246 Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013 Workqueue: events_unbound async_run_entry_fn task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000 RIP: 0010:[<ffffffff812971e0>] [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP: 0018:ffff880273d03a58 EFLAGS: 00010092 RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6 RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410 R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0 R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e FS: 0000000000000000(0000) GS:ffff880277b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0 Stack: ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000 ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62 ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8 Call Trace: [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510 [<ffffffff81293167>] __blk_run_queue+0x37/0x50 [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130 [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0 [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0 [<ffffffff813bba35>] scsi_execute+0xe5/0x180 [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110 [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod] [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0 [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod] [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod] [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140 [<ffffffff8106a1c5>] process_one_work+0x175/0x410 [<ffffffff8106b373>] worker_thread+0x123/0x400 [<ffffffff8106b250>] ? manage_workers+0x160/0x160 [<ffffffff8107104e>] kthread+0xce/0xf0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0 [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70 Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89 df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89 58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d RIP [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150 RSP <ffff880273d03a58> CR2: ffff88046c6c9e80 Martin bisected and found this to be the problem patch; commit `6d113398dc` Author: Jan Kara <jack@suse.cz> Date: Mon Feb 24 16:39:54 2014 +0100 block: Stop abusing rq->csd.list in blk-softirq and the problem was immediately apparent. The patch states that it is safe to reuse queuelist at completion time, since it is no longer used. However, that is not true if a device is using block enabled tagging. If that is the case, then the queuelist is reused to keep track of busy tags. If a device also ended up using softirq completions, we'd reuse ->queuelist for the IPI handling while block tagging was still using it. Boom. Fix this by adding a new ipi_list list head, and share the memory used with the request hash table. The hash table is never used after the request is moved to the dispatch list, which happens long before any potential completion of the request. Add a new request bit for this, so we don't have cases that check rq->hash while it could potentially have been reused for the IPI completion. Reported-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 21:54:06 -06:00
Jens Axboe	cb2da43e3d	blk-mq: simplify blk_mq_hw_sysfs_cpus_show() Now that we have a cpu mask of CPUs that are mapped to a specific hardware queue, we can just iterate that to display the sysfs num-hw-queue/cpu_list file. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:53:21 -06:00
Jens Axboe	e4043dcf30	blk-mq: ensure that hardware queues are always run on the mapped CPUs Instead of providing soft mappings with no guarantees on hardware queues always being run on the right CPU, switch to a hard mapping guarantee that ensure that we always run the hardware queue on (one of, if more) the mapped CPU. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:18:23 -06:00
Jens Axboe	8ab14595b6	block: add kblockd_schedule_delayed_work_on() Same function as kblockd_schedule_delayed_work(), but allow the caller to pass in a CPU that the work should be executed on. This just directly extends and maps into the workqueue API, and will be used to make the blk-mq mappings more strict. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:17:03 -06:00
Jens Axboe	59c3d45e48	block: remove 'q' parameter from kblockd_schedule_*_work() The queue parameter is never used, just get rid of it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-09 10:17:00 -06:00
Jens Axboe	bccb5f7c8b	blk-mq: fix potential stall during CPU unplug with IO pending When a CPU is unplugged, we move the blk_mq_ctx request entries to the current queue. The current code forgets to remap the blk_mq_hw_ctx before marking the software context pending, which breaks if old-cpu and new-cpu don't map to the same hardware queue. Additionally, if we mark entries as pending in the new hardware queue, then make sure we schedule it for running. Otherwise request could be sitting there until someone else queues IO for that hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-07 08:17:18 -06:00
Linus Torvalds	32d01dc7be	Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot updates for cgroup: - The biggest one is cgroup's conversion to kernfs. cgroup took after the long abandoned vfs-entangled sysfs implementation and made it even more convoluted over time. cgroup's internal objects were fused with vfs objects which also brought in vfs locking and object lifetime rules. Naturally, there are places where vfs rules don't fit and nasty hacks, such as credential switching or lock dance interleaving inode mutex and cgroup_mutex with object serial number comparison thrown in to decide whether the operation is actually necessary, needed to be employed. After conversion to kernfs, internal object lifetime and locking rules are mostly isolated from vfs interactions allowing shedding of several nasty hacks and overall simplification. This will also allow implmentation of operations which may affect multiple cgroups which weren't possible before as it would have required nesting i_mutexes. - Various simplifications including dropping of module support, easier cgroup name/path handling, simplified cgroup file type handling and task_cg_lists optimization. - Prepatory changes for the planned unified hierarchy, which is still a patchset away from being actually operational. The dummy hierarchy is updated to serve as the default unified hierarchy. Controllers which aren't claimed by other hierarchies are associated with it, which BTW was what the dummy hierarchy was for anyway. - Various fixes from Li and others. This pull request includes some patches to add missing slab.h to various subsystems. This was triggered xattr.h include removal from cgroup.h. cgroup.h indirectly got included a lot of files which brought in xattr.h which brought in slab.h. There are several merge commits - one to pull in kernfs updates necessary for converting cgroup (already in upstream through driver-core), others for interfering changes in the fixes branch" * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits) cgroup: remove useless argument from cgroup_exit() cgroup: fix spurious lockdep warning in cgroup_exit() cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c cgroup: break kernfs active_ref protection in cgroup directory operations cgroup: fix cgroup_taskset walking order cgroup: implement CFTYPE_ONLY_ON_DFL cgroup: make cgrp_dfl_root mountable cgroup: drop const from @buffer of cftype->write_string() cgroup: rename cgroup_dummy_root and related names cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding cgroup: remove NULL checks from [pr_cont_]cgroup_{name\|path}() cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup: reorganize cgroup bootstrapping cgroup: relocate setting of CGRP_DEAD cpuset: use rcu_read_lock() to protect task_cs() cgroup_freezer: document freezer_fork() subtleties cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup: drop task_lock() protection around task->cgroups cgroup: update how a newly forked task gets associated with css_set ...	2014-04-03 13:05:42 -07:00
Linus Torvalds	cd6362befe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Here is my initial pull request for the networking subsystem during this merge window: 1) Support for ESN in AH (RFC 4302) from Fan Du. 2) Add full kernel doc for ethtool command structures, from Ben Hutchings. 3) Add BCM7xxx PHY driver, from Florian Fainelli. 4) Export computed TCP rate information in netlink socket dumps, from Eric Dumazet. 5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas Dichtel. 6) Convert many drivers to pci_enable_msix_range(), from Alexander Gordeev. 7) Record SKB timestamps more efficiently, from Eric Dumazet. 8) Switch to microsecond resolution for TCP round trip times, also from Eric Dumazet. 9) Clean up and fix 6lowpan fragmentation handling by making use of the existing inet_frag api for it's implementation. 10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss. 11) Auto size SKB lengths when composing netlink messages based upon past message sizes used, from Eric Dumazet. 12) qdisc dumps can take a long time, add a cond_resched(), From Eric Dumazet. 13) Sanitize netpoll core and drivers wrt. SKB handling semantics. Get rid of never-used-in-tree netpoll RX handling. From Eric W Biederman. 14) Support inter-address-family and namespace changing in VTI tunnel driver(s). From Steffen Klassert. 15) Add Altera TSE driver, from Vince Bridgers. 16) Optimizing csum_replace2() so that it doesn't adjust the checksum by checksumming the entire header, from Eric Dumazet. 17) Expand BPF internal implementation for faster interpreting, more direct translations into JIT'd code, and much cleaner uses of BPF filtering in non-socket ocntexts. From Daniel Borkmann and Alexei Starovoitov" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits) netpoll: Use skb_irq_freeable to make zap_completion_queue safe. net: Add a test to see if a skb is freeable in irq context qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port' net: ptp: move PTP classifier in its own file net: sxgbe: make "core_ops" static net: sxgbe: fix logical vs bitwise operation net: sxgbe: sxgbe_mdio_register() frees the bus Call efx_set_channels() before efx->type->dimension_resources() xen-netback: disable rogue vif in kthread context net/mlx4: Set proper build dependancy with vxlan be2net: fix build dependency on VxLAN mac802154: make csma/cca parameters per-wpan mac802154: allow only one WPAN to be up at any given time net: filter: minor: fix kdoc in __sk_run_filter netlink: don't compare the nul-termination in nla_strcmp can: c_can: Avoid led toggling for every packet. can: c_can: Simplify TX interrupt cleanup can: c_can: Store dlc private can: c_can: Reduce register access can: c_can: Make the code readable ...	2014-04-02 20:53:45 -07:00
Linus Torvalds	159d8133d0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual rocket science -- mostly documentation and comment updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: sparse: fix comment doc: fix double words isdn: capi: fix "CAPI_VERSION" comment doc: DocBook: Fix typos in xml and template file Bluetooth: add module name for btwilink driver core: unexport static function create_syslog_header mmc: core: typo fix in printk specifier ARM: spear: clean up editing mistake net-sysfs: fix comment typo 'CONFIG_SYFS' doc: Insert MODULE_ in module-signing macros Documentation: update URL to hfsplus Technote 1150 gpio: update path to documentation ixgbe: Fix format string in ixgbe_fcoe. Kconfig: Remove useless "default N" lines user_namespace.c: Remove duplicated word in comment CREDITS: fix formatting treewide: Fix typo in Documentation/DocBook mm: Fix warning on make htmldocs caused by slab.c ata: ata-samsung_cf: cleanup in header file idr: remove unused prototype of idr_free()	2014-04-02 16:23:38 -07:00
Al Viro	86d564c84c	constify blk_rq_map_user_iov() and friends sg_iovec array passed to it can be const Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-04-01 23:19:31 -04:00
Linus Torvalds	7a48837732	Merge branch 'for-3.15/core' of git://git.kernel.dk/linux-block Pull core block layer updates from Jens Axboe: "This is the pull request for the core block IO bits for the 3.15 kernel. It's a smaller round this time, it contains: - Various little blk-mq fixes and additions from Christoph and myself. - Cleanup of the IPI usage from the block layer, and associated helper code. From Frederic Weisbecker and Jan Kara. - Duplicate code cleanup in bio-integrity from Gu Zheng. This will give you a merge conflict, but that should be easy to resolve. - blk-mq notify spinlock fix for RT from Mike Galbraith. - A blktrace partial accounting bug fix from Roman Pen. - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li" * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits) blk-mq: add REQ_SYNC early rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock blk-mq: support partial I/O completions blk-mq: merge blk_mq_insert_request and blk_mq_run_request blk-mq: remove blk_mq_alloc_rq blk-mq: don't dump CPU -> hw queue map on driver load blk-mq: fix wrong usage of hctx->state vs hctx->flags blk-mq: allow blk_mq_init_commands() to return failure block: remove old blk_iopoll_enabled variable blktrace: fix accounting of partially completed requests smp: Rename __smp_call_function_single() to smp_call_function_single_async() smp: Remove wait argument from __smp_call_function_single() watchdog: Simplify a little the IPI call smp: Move __smp_call_function_single() below its safe version smp: Consolidate the various smp_call_function_single() declensions smp: Teach __smp_call_function_single() to check for offline cpus smp: Remove unused list_head from csd smp: Iterate functions through llist_for_each_entry_safe() block: Stop abusing rq->csd.list in blk-softirq block: Remove useless IPI struct initialization ...	2014-04-01 19:19:15 -07:00
David S. Miller	04f58c8854	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: Documentation/devicetree/bindings/net/micrel-ks8851.txt net/core/netpoll.c The net/core/netpoll.c conflict is a bug fix in 'net' happening to code which is completely removed in 'net-next'. In micrel-ks8851.txt we simply have overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-25 20:29:20 -04:00
Shaohua Li	27fbf4e87c	blk-mq: add REQ_SYNC early Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init is set correctly. Signed-off-by: Shaohua Li<shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:58 -06:00
Mike Galbraith	55c816e3df	rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock [ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674 [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1 [ 365.164043] no locks held by migration/1/26. [ 365.164044] irq event stamp: 6648 [ 365.164056] hardirqs last enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30 [ 365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120 [ 365.164070] softirqs last enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920 [ 365.164072] softirqs last disabled at (0): [< (null)>] (null) [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3 [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011 [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0 [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0 [ 365.164108] Call Trace: [ 365.164119] [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0 [ 365.164127] [<ffffffff81004872>] dump_trace+0x92/0x360 [ 365.164133] [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60 [ 365.164138] [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0 [ 365.164143] [<ffffffff81006160>] show_stack+0x20/0x50 [ 365.164153] [<ffffffff815367e6>] dump_stack+0x54/0x9a [ 365.164163] [<ffffffff8108919c>] __might_sleep+0xfc/0x140 [ 365.164173] [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70 [ 365.164182] [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70 [ 365.164191] [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70 [ 365.164201] [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10 [ 365.164207] [<ffffffff810567be>] cpu_notify+0x1e/0x40 [ 365.164217] [<ffffffff81525da2>] take_cpu_down+0x22/0x40 [ 365.164223] [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120 [ 365.164229] [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0 [ 365.164235] [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380 [ 365.164241] [<ffffffff8107cbf8>] kthread+0xc8/0xd0 [ 365.164250] [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0 [ 365.164429] smpboot: CPU 1 is now offline Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:56 -06:00
Christoph Hellwig	7237c740b0	blk-mq: support partial I/O completions Add a new blk_mq_end_io_partial function to partially complete requests as needed by the SCSI layer. We do this by reusing blk_update_request to advance the bio instead of having a simplified version of it in the blk-mq code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:55 -06:00
Christoph Hellwig	eeabc850b7	blk-mq: merge blk_mq_insert_request and blk_mq_run_request It's almost identical to blk_mq_insert_request, so fold the two into one slightly more generic function by making the flush special case a bit smarted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:57:37 -06:00
Christoph Hellwig	081241e592	blk-mq: remove blk_mq_alloc_rq There's only one caller, which is a straight wrapper and fits the naming scheme of the related functions a lot better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-21 08:41:45 -06:00
Dave Jones	708f04d2ab	block: free q->flush_rq in blk_init_allocated_queue error paths Commit `7982e90c3a` ("block: fix q->flush_rq NULL pointer crash on dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but neglected to free that allocation on the error paths that follow. Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-03-20 22:32:06 -07:00
Jens Axboe	676141e48a	blk-mq: don't dump CPU -> hw queue map on driver load Now that we are out of initial debug/bringup mode, remove the verbose dump of the mapping table. Provide the mapping table in sysfs, under the hardware queue directory, in the cpu_list file. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-20 13:31:44 -06:00
Jens Axboe	5d12f905cc	blk-mq: fix wrong usage of hctx->state vs hctx->flags BLK_MQ_F_* flags are for hctx->flags, and are non-atomic and set at registration time. BLK_MQ_S_* flags are dynamic and atomic, and are accessed through hctx->state. Some of the BLK_MQ_S_STOPPED uses were wrong. Additionally, the header file should not use a bit shift for the _S_ flags, as they are done through the set/test_bit functions. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-19 15:25:02 -06:00
Tejun Heo	4d3bb511b5	cgroup: drop const from @buffer of cftype->write_string() cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-03-19 10:23:54 -04:00
Eric W. Biederman	57a7744e09	net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq Replace the bh safe variant with the hard irq safe variant. We need a hard irq safe variant to deal with netpoll transmitting packets from hard irq context, and we need it in most if not all of the places using the bh safe variant. Except on 32bit uni-processor the code is exactly the same so don't bother with a bh variant, just have a hard irq safe variant that everyone can use. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-14 22:41:36 -04:00
Jens Axboe	95363efde1	blk-mq: allow blk_mq_init_commands() to return failure If drivers do dynamic allocation in the hardware command init path, then we need to be able to handle and return failures. And if they do allocations or mappings in the init command path, then we need a cleanup function to free up that space at exit time. So add blk_mq_free_commands() as the cleanup function. This is required for the mtip32xx driver conversion to blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-14 10:43:15 -06:00
Jens Axboe	89f8b33ca1	block: remove old blk_iopoll_enabled variable This was a debugging measure to toggle enabled/disabled when testing. But for real production setups, it's not safe to toggle this setting without either reloading drivers of quiescing IO first. Neither of which the toggle enforces. Additionally, it makes drivers deal with the conditional state. Remove it completely. It's up to the driver whether iopoll is enabled or not. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-13 09:38:42 -06:00
Mike Snitzer	10beafc190	block: change flush sequence list addition back to front add Commit `18741986` inadvertently changed the rq flush insertion from a head to a tail insertion. Fix that back up. Signed-off-by: Mike Snitzer <msnitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-08 20:31:31 -07:00
Mike Snitzer	7982e90c3a	block: fix q->flush_rq NULL pointer crash on dm-mpath flush Commit `1874198` ("blk-mq: rework flush sequencing logic") switched ->flush_rq from being an embedded member of the request_queue structure to being dynamically allocated in blk_init_queue_node(). Request-based DM multipath doesn't use blk_init_queue_node(), instead it uses blk_alloc_queue_node() + blk_init_allocated_queue(). Because commit `1874198` placed the dynamic allocation of ->flush_rq in blk_init_queue_node() any flush issued to a dm-mpath device would crash with a NULL pointer, e.g.: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 PGD bb3c7067 PUD bb01d067 PMD 0 Oops: 0002 [#1] SMP ... CPU: 5 PID: 5028 Comm: dt Tainted: G W O 3.14.0-rc3.snitm+ #10 ... task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000 RIP: 0010:[<ffffffff8125037e>] [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0 RSP: 0018:ffff880079565c98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030 RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001 R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000 R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000 FS: 00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0 Stack: 0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47 ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000 ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8 Call Trace: [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0 [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210 [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320 [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190 [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330 [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod] [<ffffffff812530c0>] generic_make_request+0xc0/0x100 [<ffffffff81253173>] submit_bio+0x73/0x140 [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80 [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0 [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60 [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20 [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20 [<ffffffff811b81f1>] do_fsync+0x41/0x80 [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80 [<ffffffff811b8260>] SyS_fsync+0x10/0x20 [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b Fix this by moving the ->flush_rq allocation from blk_init_queue_node() to blk_init_allocated_queue(). blk_init_queue_node() also calls blk_init_allocated_queue() so this change is functionality equivalent for all blk_init_queue_node() callers. Reported-by: Hannes Reinecke <hare@suse.de> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-08 17:20:01 -07:00
Shaohua Li	739c3eea71	blk-mq: add REQ_SYNC early Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init is set correctly. Signed-off-by: Shaohua Li<shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-07 08:15:28 -07:00
Roman Pen	af5040da01	blktrace: fix accounting of partially completed requests trace_block_rq_complete does not take into account that request can be partially completed, so we can get the following incorrect output of blkparser: C R 232 + 240 [0] C R 240 + 232 [0] C R 248 + 224 [0] C R 256 + 216 [0] but should be: C R 232 + 8 [0] C R 240 + 8 [0] C R 248 + 8 [0] C R 256 + 8 [0] Also, the whole output summary statistics of completed requests and final throughput will be incorrect. This patch takes into account real completion size of the request and fixes wrong completion accounting. Signed-off-by: Roman Pen <r.peniaev@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> CC: linux-kernel@vger.kernel.org Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-05 16:11:21 -07:00
Mike Galbraith	2a26ebef84	rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock [ 365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674 [ 365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1 [ 365.164043] no locks held by migration/1/26. [ 365.164044] irq event stamp: 6648 [ 365.164056] hardirqs last enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30 [ 365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120 [ 365.164070] softirqs last enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920 [ 365.164072] softirqs last disabled at (0): [< (null)>] (null) [ 365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF N 3.12.12-rt19-0.gcb6c4a2-rt #3 [ 365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011 [ 365.164091] 0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0 [ 365.164099] ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f [ 365.164107] ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0 [ 365.164108] Call Trace: [ 365.164119] [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0 [ 365.164127] [<ffffffff81004872>] dump_trace+0x92/0x360 [ 365.164133] [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60 [ 365.164138] [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0 [ 365.164143] [<ffffffff81006160>] show_stack+0x20/0x50 [ 365.164153] [<ffffffff815367e6>] dump_stack+0x54/0x9a [ 365.164163] [<ffffffff8108919c>] __might_sleep+0xfc/0x140 [ 365.164173] [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70 [ 365.164182] [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70 [ 365.164191] [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70 [ 365.164201] [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10 [ 365.164207] [<ffffffff810567be>] cpu_notify+0x1e/0x40 [ 365.164217] [<ffffffff81525da2>] take_cpu_down+0x22/0x40 [ 365.164223] [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120 [ 365.164229] [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0 [ 365.164235] [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380 [ 365.164241] [<ffffffff8107cbf8>] kthread+0xc8/0xd0 [ 365.164250] [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0 [ 365.164429] smpboot: CPU 1 is now offline Signed-off-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-03-03 09:34:10 -07:00
Frederic Weisbecker	c46fff2a3b	smp: Rename __smp_call_function_single() to smp_call_function_single_async() The name __smp_call_function_single() doesn't tell much about the properties of this function, especially when compared to smp_call_function_single(). The comments above the implementation are also misleading. The main point of this function is actually not to be able to embed the csd in an object. This is actually a requirement that result from the purpose of this function which is to raise an IPI asynchronously. As such it can be called with interrupts disabled. And this feature comes at the cost of the caller who then needs to serialize the IPIs on this csd. Lets rename the function and enhance the comments so that they reflect these properties. Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:47:15 -08:00
Frederic Weisbecker	fce8ad1568	smp: Remove wait argument from __smp_call_function_single() The main point of calling __smp_call_function_single() is to send an IPI in a pure asynchronous way. By embedding a csd in an object, a caller can send the IPI without waiting for a previous one to complete as is required by smp_call_function_single() for example. As such, sending this kind of IPI can be safe even when irqs are disabled. This flexibility comes at the expense of the caller who then needs to synchronize the csd lifecycle by himself and make sure that IPIs on a single csd are serialized. This is how __smp_call_function_single() works when wait = 0 and this usecase is relevant. Now there don't seem to be any usecase with wait = 1 that can't be covered by smp_call_function_single() instead, which is safer. Lets look at the two possible scenario: 1) The user calls __smp_call_function_single(wait = 1) on a csd embedded in an object. It looks like a nice and convenient pattern at the first sight because we can then retrieve the object from the IPI handler easily. But actually it is a waste of memory space in the object since the csd can be allocated from the stack by smp_call_function_single(wait = 1) and the object can be passed an the IPI argument. Besides that, embedding the csd in an object is more error prone because the caller must take care of the serialization of the IPIs for this csd. 2) The user calls __smp_call_function_single(wait = 1) on a csd that is allocated on the stack. It's ok but smp_call_function_single() can do it as well and it already takes care of the allocation on the stack. Again it's more simple and less error prone. Therefore, using the underscore prepend API version with wait = 1 is a bad pattern and a sign that the caller can do safer and more simple. There was a single user of that which has just been converted. So lets remove this option to discourage further users. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:47:09 -08:00
Jan Kara	6d113398dc	block: Stop abusing rq->csd.list in blk-softirq Abusing rq->csd.list for a list of requests to complete is rather ugly. We use rq->queuelist instead which is much cleaner. It is safe because queuelist is used by the block layer only for requests waiting to be submitted to a device. Thus it is unused when irq reports the request IO is finished. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:46:42 -08:00
Jan Kara	8b4922d317	block: Stop abusing csd.list for fifo_time Block layer currently abuses rq->csd.list.next for storing fifo_time. That is a terrible hack and completely unnecessary as well. Union achieves the same space saving in a cleaner way. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-24 14:46:32 -08:00
Christoph Hellwig	d6a25b3131	blk-mq: support partial I/O completions Add a new blk_mq_end_io_partial function to partially complete requests as needed by the SCSI layer. We do this by reusing blk_update_request to advance the bio instead of having a simplified version of it in the blk-mq code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:49 -08:00
Christoph Hellwig	feb71dae1f	blk-mq: merge blk_mq_insert_request and blk_mq_run_request It's almost identical to blk_mq_insert_request, so fold the two into one slightly more generic function by making the flush special case a bit smarted. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:48 -08:00
Christoph Hellwig	fd694131bb	blk-mq: remove blk_mq_alloc_rq There's only one caller, which is a straight wrapper and fits the naming scheme of the related functions a lot better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-21 08:58:47 -08:00
Jiri Kosina	d4263348f7	Merge branch 'master' into for-next	2014-02-20 14:54:28 +01:00
Masanari Iida	e227867f12	treewide: Fix typo in Documentation/DocBook This patch fix spelling typo in Documentation/DocBook. It is because .html and .xml files are generated by make htmldocs, I have to fix a typo within the source files. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-02-19 14:58:17 +01:00
Paul E. McKenney	ec6c676a08	block: Substitute rcu_access_pointer() for rcu_dereference_raw() (Trivial patch.) If the code is looking at the RCU-protected pointer itself, but not dereferencing it, the rcu_dereference() functions can be downgraded to rcu_access_pointer(). This commit makes this downgrade in blkg_destroy() and ioc_destroy_icq(), both of which simply compare the RCU-protected pointer against another pointer with no dereferencing. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-18 12:21:26 -08:00
Gideon Israel Dsouza	e3ebf0d457	block: Use macros from compiler.h instead of __attribute__((...)) To increase compiler portability there are several macros defined in <linux/compiler.h> for various gcc __attribute((..)) constructs. I've made sure gcc these specific were replaced with the right macro and an #include <linux/compiler.h> was placed where needed. Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-18 12:20:01 -08:00
Linus Torvalds	5e57dc8110	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block IO fixes from Jens Axboe: "Second round of updates and fixes for 3.14-rc2. Most of this stuff has been queued up for a while. The notable exception is the blk-mq changes, which are naturally a bit more in flux still. The pull request contains: - Two bug fixes for the new immutable vecs, causing crashes with raid or swap. From Kent. - Various blk-mq tweaks and fixes from Christoph. A fix for integrity bio's from Nic. - A few bcache fixes from Kent and Darrick Wong. - xen-blk{front,back} fixes from David Vrabel, Matt Rushton, Nicolas Swenson, and Roger Pau Monne. - Fix for a vec miscount with integrity vectors from Martin. - Minor annotations or fixes from Masanari Iida and Rashika Kheria. - Tweak to null_blk to do more normal FIFO processing of requests from Shlomo Pongratz. - Elevator switching bypass fix from Tejun. - Softlockup in blkdev_issue_discard() fix when !CONFIG_PREEMPT from me" * 'for-linus' of git://git.kernel.dk/linux-block: (31 commits) block: add cond_resched() to potentially long running ioctl discard loop xen-blkback: init persistent_purge_work work_struct blk-mq: pair blk_mq_start_request / blk_mq_requeue_request blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq block: Fix cloning of discard/write same bios block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show blk-mq: rework flush sequencing logic null_blk: use blk_complete_request and blk_mq_complete_request virtio_blk: use blk_mq_complete_request blk-mq: rework I/O completions fs: Add prototype declaration to appropriate header file include/linux/bio.h fs: Mark function as static in fs/bio-integrity.c block/null_blk: Fix completion processing from LIFO to FIFO block: Explicitly handle discard/write same segments block: Fix nr_vecs for inline integrity vectors blk-mq: Add bio_integrity setup to blk_mq_make_request blk-mq: initialize sg_reserved_size blk-mq: handle dma_drain_size blk-mq: divert __blk_put_request for MQ ops blk-mq: support at_head inserations for blk_execute_rq ...	2014-02-14 10:45:18 -08:00
Tejun Heo	924f0d9a20	cgroup: drop @skip_css from cgroup_taskset_for_each() If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching css. The intention of the interface is to make it easy to skip css's (cgroup_subsys_states) which already match the migration target; however, this is entirely unnecessary as migration taskset doesn't include tasks which are already in the target cgroup. Drop @skip_css from cgroup_taskset_for_each(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com>	2014-02-13 06:58:41 -05:00
Jens Axboe	c8123f8c9c	block: add cond_resched() to potentially long running ioctl discard loop When mkfs issues a full device discard and the device only supports discards of a smallish size, we can loop in blkdev_issue_discard() for a long time. If preempt isn't enabled, this can turn into a softlock situation and the kernel will start complaining. Add an explicit cond_resched() at the end of the loop to avoid that. Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-12 09:36:37 -07:00
Tejun Heo	e61734c55c	cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	5f46990787	cgroup: update the meaning of cftype->max_write_len cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Christoph Hellwig	49f5baa510	blk-mq: pair blk_mq_start_request / blk_mq_requeue_request Make sure we have a proper pairing between starting and requeueing requests. Move the dma drain and REQ_END setup into blk_mq_start_request, and make sure blk_mq_requeue_request properly undoes them, giving us a pair of function to prepare and unprepare a request without leaving side effects. Together this ensures we always clean up properly after BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-11 09:34:08 -07:00
Christoph Hellwig	1e93b8c274	blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq rq->errors never has been part of the communication protocol between drivers and the block stack and most drivers will not have initialized it. Return -EIO to upper layers when the driver returns BLK_MQ_RQ_QUEUE_ERROR unconditionally. If a driver want to return a different error it can easily do so by returning success after calling blk_mq_end_io itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-11 09:34:07 -07:00
Masanari Iida	11c9444407	block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show cppcheck detected following format string mismatch. [blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. Change "cpu" from int to unsigned int, because the cpu never become minus value. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 10:39:53 -07:00
Christoph Hellwig	18741986a4	blk-mq: rework flush sequencing logic Witch to using a preallocated flush_rq for blk-mq similar to what's done with the old request path. This allows us to set up the request properly with a tag from the actually allowed range and ->rq_disk as needed by some drivers. To make life easier we also switch to dynamic allocation of ->flush_rq for the old path. This effectively reverts most of "blk-mq: fix for flush deadlock" and "blk-mq: Don't reserve a tag for flush request" Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 09:29:00 -07:00
Christoph Hellwig	30a91cb4ef	blk-mq: rework I/O completions Rework I/O completions to work more like the old code path. blk_mq_end_io now stays out of the business of deferring completions to others CPUs and calling blk_mark_rq_complete. The latter is very important to allow completing requests that have timed out and thus are already marked completed, the former allows using the IPI callout even for driver specific completions instead of having to reimplement them. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-10 09:27:31 -07:00
Tejun Heo	073219e995	cgroup: clean up cgroup_subsys names and initialization cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>	2014-02-08 10:36:58 -05:00
Tejun Heo	3ed80a62bf	cgroup: drop module support With module supported dropped from net_prio, no controller is using cgroup module support. None of actual resource controllers can be built as a module and we aren't gonna add new controllers which don't control resources. This patch drops module support from cgroup. * cgroup_[un]load_subsys() and cgroup_subsys->module removed. * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(), cgroup_subsys.h now uses IS_ENABLED() directly. * enum cgroup_subsys_id now exactly matches the list of enabled controllers as ordered in cgroup_subsys.h. * cgroup_subsys[] is now a contiguously occupied array. Size specification is no longer necessary and dropped. * for_each_builtin_subsys() is removed and for_each_subsys() is updated to not require any locking. * module ref handling is removed from rebind_subsystems(). * Module related comments dropped. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). v3: Added {} around the if (need_forkexit_callback) block in cgroup_post_fork() for readability as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:36:58 -05:00
Kent Overstreet	5cb8850c9c	block: Explicitly handle discard/write same segments Immutable biovecs changed the way biovecs are interpreted - drivers no longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for using part of an existing segment without modifying it). This breaks with discards and write_same bios, since for those bi_size has nothing to do with segments in the biovec. So for now, we need a fairly gross hack - we fortunately know that there will never be more than one segment for the entire request, so we can special case discard/write_same. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 13:54:08 -07:00
Nicholas Bellinger	14ec77f352	blk-mq: Add bio_integrity setup to blk_mq_make_request This patch adds the missing bio_integrity_enabled() + bio_integrity_prep() setup into blk_mq_make_request() in order to use DIF protection with scsi-mq. Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 13:45:39 -07:00
Christoph Hellwig	1be036e946	blk-mq: initialize sg_reserved_size To behave the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	4f7f418c48	blk-mq: handle dma_drain_size Make blk-mq handle the dma_drain_size field the same way as the old request path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	6f5ba581c0	blk-mq: divert __blk_put_request for MQ ops __blk_put_request needs to call into the blk-mq code just like blk_put_request. As we don't have the queue lock in this case both end up calling the same function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Christoph Hellwig	72a0a36e28	blk-mq: support at_head inserations for blk_execute_rq This is neede for proper SG_IO operation as well as various uses of blk_execute_rq from the SCSI midlayer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 11:58:54 -07:00
Linus Torvalds	4e13c5d021	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "The highlights this round include: - add support for SCSI Referrals (Hannes) - add support for T10 DIF into target core (nab + mkp) - add support for T10 DIF emulation in FILEIO + RAMDISK backends (Sagi + nab) - add support for T10 DIF -> bio_integrity passthrough in IBLOCK backend (nab) - prep changes to iser-target for >= v3.15 T10 DIF support (Sagi) - add support for qla2xxx N_Port ID Virtualization - NPIV (Saurav + Quinn) - allow percpu_ida_alloc() to receive task state bitmask (Kent) - fix >= v3.12 iscsi-target session reset hung task regression (nab) - fix >= v3.13 percpu_ref se_lun->lun_ref_active race (nab) - fix a long-standing network portal creation race (Andy)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (51 commits) target: Fix percpu_ref_put race in transport_lun_remove_cmd target/iscsi: Fix network portal creation race target: Report bad sector in sense data for DIF errors iscsi-target: Convert gfp_t parameter to task state bitmask iscsi-target: Fix connection reset hang with percpu_ida_alloc percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask iscsi-target: Pre-allocate more tags to avoid ack starvation qla2xxx: Configure NPIV fc_vport via tcm_qla2xxx_npiv_make_lport qla2xxx: Enhancements to enable NPIV support for QLOGIC ISPs with TCM/LIO. qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure IB/isert: pass scatterlist instead of cmd to fast_reg_mr routine IB/isert: Move fastreg descriptor creation to a function IB/isert: Avoid frwr notation, user fastreg IB/isert: seperate connection protection domains and dma MRs tcm_loop: Enable DIF/DIX modes in SCSI host LLD target/rd: Add DIF protection into rd_execute_rw target/rd: Add support for protection SGL setup + release target/rd: Refactor rd_build_device_space + rd_release_device_space target/file: Add DIF protection support to fd_execute_rw target/file: Add DIF protection init/format support ...	2014-01-31 15:31:23 -08:00
Tejun Heo	556ee818c0	block: __elv_next_request() shouldn't call into the elevator if bypassing request_queue bypassing is used to suppress higher-level function of a request_queue so that they can be switched, reconfigured and shut down. A request_queue does the followings while bypassing. * bypasses elevator and io_cq association and queues requests directly to the FIFO dispatch queue. * bypasses block cgroup request_list lookup and always uses the root request_list. Once confirmed to be bypassing, specific elevator and block cgroup policy implementations can assume that nothing is in flight for them and perform various operations which would be dangerous otherwise. Such confirmation is acheived by short-circuiting all new requests directly to the dispatch queue and waiting for all the requests which were issued before to finish. Unfortunately, while the request allocating and draining sides were properly handled, we forgot to actually plug the request dispatch path. Even after bypassing mode is confirmed, if the attached driver tries to fetch a request and the dispatch queue is empty, __elv_next_request() would invoke the current elevator's elevator_dispatch_fn() callback. As all in-flight requests were drained, the elevator wouldn't contain any request but once bypass is confirmed we don't even know whether the elevator is even there. It might be in the process of being switched and half torn down. Frank Mayhar reports that this actually happened while switching elevators, leading to an oops. Let's fix it by making __elv_next_request() avoid invoking the elevator_dispatch_fn() callback if the queue is bypassing. It already avoids invoking the callback if the queue is dying. As a dying queue is guaranteed to be bypassing, we can simply replace blk_queue_dying() check with blk_queue_bypass(). Reported-by: Frank Mayhar <fmayhar@google.com> References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com Cc: stable@vger.kernel.org Tested-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-30 12:57:25 -07:00
Shaohua Li	f0276924fa	blk-mq: Don't reserve a tag for flush request Reserving a tag (request) for flush to avoid dead lock is a overkill. A tag is valuable resource. We can track the number of flush requests and disallow having too many pending flush requests allocated. With this patch, blk_mq_alloc_request_pinned() could do a busy nop (but not a dead loop) if too many pending requests are allocated and new flush request is allocated. But this should not be a problem, too many pending flush requests are very rare case. I verified this can fix the deadlock caused by too many pending flush requests. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-30 12:57:25 -07:00
Linus Torvalds	53d8ab29f8	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...	2014-01-30 11:40:10 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Andrew Morton	381d3ee33b	block/blk-mq-cpu.c: use hotcpu_notifier() Cleaner, reduces text size when cpu hotplug is disabled. Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-28 09:52:01 -07:00
Kent Overstreet	6f6b5d1ec5	percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask This patch changes percpu_ida_alloc() + callers to accept task state bitmask for prepare_to_wait() for code like target/iscsi that needs it for interruptible sleep, that is provided in a subsequent patch. It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep waiting for a new tag, or TASK_RUNNING when the caller cannot sleep, and is forced to return a negative value when no tags are available. v2 changes: - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes - Drop signal_pending_state() call v3 changes: - Only call prepare_to_wait() + finish_wait() when != TASK_RUNNING (PeterZ) Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: <stable@vger.kernel.org> #3.12+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>	2014-01-23 20:17:18 +00:00

1 2 3 4 5 ...

2291 Commits