Currently both blk-throttle and cfq-iosched implement their own
blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This
patch factors out the common code into blkg_lookup_create(), which
returns ERR_PTR value so that transitional failures due to queue
bypass can be distinguished from other failures.
* New plkio_policy_ops methods blkio_alloc_group_fn() and
blkio_link_group_fn added. Both are transitional and will be
removed once the blkg management code is fully moved into
blk-cgroup.c.
* blkio_alloc_group_fn() allocates policy-specific blkg which is
usually a larger data structure with blkg as the first entry and
intiailizes it. Note that initialization of blkg proper, including
percpu stats, is responsibility of blk-cgroup proper.
Note that default config (weight, bps...) initialization is done
from this method; otherwise, we end up violating locking order
between blkcg and q locks via blkcg_get_CONF() functions.
* blkio_link_group_fn() is called under queue_lock and responsible for
linking the blkg to the queue. blkcg side is handled by blk-cgroup
proper.
* The common blkg creation function is named blkg_lookup_create() and
blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
Also, throtl / cfq related functions are similarly [re]named for
consistency.
This simplifies blkcg policy implementations and enables further
cleanup.
-v2: Vivek noticed that blkg_lookup_create() incorrectly tested
blk_queue_dead() instead of blk_queue_bypass() leading a user of
the function ending up creating a new blkg on bypassing queue.
This is a bug introduced while relocating bypass patches before
this one. Fixed.
-v3: ERR_PTR patch folded into this one. @for_root added to
blkg_lookup_create() to allow creating root group on a bypassed
queue during elevator switch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup policies are maintained in a linked list and,
theoretically, multiple policies sharing the same policy ID are
allowed.
This patch temporarily restricts one policy per plid and adds
blkio_policy[] array which indexes registered policy types by plid.
Both the restriction and blkio_policy[] array are transitional and
will be removed once API cleanup is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkgio_group is association between a block cgroup and a queue for a
given policy. Using opaque void * for association makes things
confusing and hinders factoring of common code. Use request_queue *
and, if necessary, policy id instead.
This will help block cgroup API cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator switch may involve changes to blkcg policies. Implement
shoot down of blkio_groups.
Combined with the previous bypass updates, the end goal is updating
blkcg core such that it can ensure that blkcg's being affected become
quiescent and don't have any per-blkg data hanging around before
commencing any policy updates. Until queues are made aware of the
policies that applies to them, as an interim step, all per-policy blkg
data will be shot down.
* blk-throtl doesn't need this change as it can't be disabled for a
live queue; however, update it anyway as the scheduled blkg
unification requires this behavior change. This means that
blk-throtl configuration will be unnecessarily lost over elevator
switch. This oddity will be removed after blkcg learns to associate
individual policies with request_queues.
* blk-throtl dosen't shoot down root_tg. This is to ease transition.
Unified blkg will always have persistent root group and not shooting
down root_tg for now eases transition to that point by avoiding
having to update td->root_tg and is safe as blk-throtl can never be
disabled
-v2: Vivek pointed out that group list is not guaranteed to be empty
on return from clear function if it raced cgroup removal and
lost. Fix it by waiting a bit and retrying. This kludge will
soon be removed once locking is updated such that blkg is never
in limbo state between blkcg and request_queue locks.
blk-throtl no longer shoots down root_tg to avoid breaking
td->root_tg.
Also, Nest queue_lock inside blkio_list_lock not the other way
around to avoid introduce possible deadlock via blkcg lock.
-v3: blkcg_clear_queue() repositioned and renamed to
blkg_destroy_all() to increase consistency with later changes.
cfq_clear_queue() updated to check q->elevator before
dereferencing it to avoid NULL dereference on not fully
initialized queues (used by later change).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block cgroup core can be built as module; however, it isn't too useful
as blk-throttle can only be built-in and cfq-iosched is usually the
default built-in scheduler. Scheduled blkcg cleanup requires calling
into blkcg from block core. To simplify that, disallow building blkcg
as module by making CONFIG_BLK_CGROUP bool.
If building blkcg core as module really matters, which I doubt, we can
revisit it after blkcg API cleanup.
-v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
depend on BLK_CGROUP. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking
blkg->stats_lock.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Some of the stats are 64bit and updation will be non atomic on 32bit
architecture. Use sequence counters on 32bit arch to make reading
of stats safe.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we take blkg_stat lock for even updating the stats. So even if
a group has no throttling rules (common case for root group), we end
up taking blkg_lock, for updating the stats.
Make dispatch stats per cpu so that these can be updated without taking
blkg lock.
If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y.
there are some fields which are out side this config option. Fix that.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currentlly we first map the task to cgroup and then cgroup to
blkio_cgroup. There is a more direct way to get to blkio_cgroup
from task using task_subsys_state(). Use that.
The real reason for the fix is that it also avoids a race in generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do following with and rcu protection.
cgrp->subsys[i] = NULL;
That means if somebody got hold of cgroup under rcu and then it tried
to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
is wrong. I was running into this race condition with ltp running on a
upstream derived kernel and that lead to crash.
So ideally we should also fix cgroup generic code to wait for rcu
grace period before setting pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down moun/remount/umount operations.
So for the time being atleast fix the kernel crash by taking a more
direct route to blkio_cgroup.
One tester had reported a crash while running LTP on a derived kernel
and with this fix crash is no more seen while the test has been
running for over 6 days.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There are two kind of times that tasks are not charged for: the first
seek and the extra time slice used over the allocated timeslice. Both
of these exported as a new unaccounted_time stat.
I think it would be good to have this reported in 'time' as well, but
that is probably a separate discussion.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We've found that we still get good, useful isolation at weights this
low. I'd like to adjust the minimum so that any other changes can take
these values into account.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
- Limit max iops value to UINT_MAX and return error to user if value is more
than that instead of accepting bigger values and truncating implicitly.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o Currently any cgroup throttle limit changes are processed asynchronousy and
the change does not take affect till a new bio is dispatched from same group.
o It might happen that a user sets a redicuously low limit on throttling.
Say 1 bytes per second on reads. In such cases simple operations like mount
a disk can wait for a very long time.
o Once bio is throttled, there is no easy way to come out of that wait even if
user increases the read limit later.
o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to new limits.
o Can't take queueu lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
memory barrier code especially will help.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o cgroup chagnes for throttle policy.
o Introduces READ and WRITE bytes per second throttling rules.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
o This patch prepares the base for introducing new IO control policies.
Currently all the code is written knowing there is only one policy
and that is proportional bandwidth. Creating infrastructure for newer
policies to come in.
o Also there were many functions which were generated using macro. It was
very confusing. Got rid of those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch fixes few usability and configurability issues.
o All the cgroup based controller options are configurable from
"Genral Setup/Control Group Support/" menu. blkio is the only exception.
Hence make this option visible in above menu and make it configurable from
there to bring it inline with rest of the cgroup based controllers.
o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
This option currently does two things.
- Enable printing of cgroup paths in blktrace
- Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
files in cgroup.
If we are using group scheduling, blktrace data is of not really much use
if cgroup information is not present. To get this data, currently one has to
also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
all the additional debug stat files which is not desired.
Hence, this patch moves printing of cgroup paths under
CONFIG_CFQ_GROUP_IOSCHED.
This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
can be enabled through config menu.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Divyesh Shah <dpshah@google.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Changelog from v1:
o Call blkiocg_update_idle_time_stats() at cfq_rq_enqueued() instead of at
dispatch time.
Changelog from original patchset: (in response to Vivek Goyal's comments)
o group blkiocg_update_blkio_group_dequeue_stats() with other DEBUG functions
o rename blkiocg_update_set_active_queue_stats() to
blkiocg_update_avg_queue_size_stats()
o s/request/io/ in blkiocg_update_request_add_stats() and
blkiocg_update_request_remove_stats()
o Call cfq_del_timer() at request dispatch() instead of
blkiocg_update_idle_time_stats()
Signed-off-by: Divyesh Shah<dpshah@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, IO Controller makes use of blkio.weight to assign weight for
all devices. Here a new user interface "blkio.weight_device" is introduced to
assign different weights for different devices. blkio.weight becomes the
default value for devices which are not configured by "blkio.weight_device"
You can use the following format to assigned specific weight for a given
device:
#echo "major:minor weight" > blkio.weight_device
major:minor represents device number.
And you can remove weight for a given device as following:
#echo "major:minor 0" > blkio.weight_device
V1->V2 changes:
- use user interface "weight_device" instead of "policy" suggested by Vivek
- rename some struct suggested by Vivek
- rebase to 2.6-block "for-linus" branch
- remove an useless list_empty check pointed out by Li Zefan
- some trivial typo fix
V2->V3 changes:
- Move policy_*_node() functions up to get rid of forward declarations
- rename related functions by adding prefix "blkio_"
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
1) group_wait_time - This is the amount of time the cgroup had to wait to get a
timeslice for one of its queues from when it became busy, i.e., went from 0
to 1 request queued. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup waiting
in the scheduler queue. This stat is a great way to find out any jobs in the
fleet that are being starved or waiting for longer than what is expected (due
to an IO controller bug or any other issue).
2) empty_time - This is the amount of time a cgroup spends w/o any pending
requests. This stat is useful when a job does not seem to be able to use its
assigned disk share by helping check if that is happening due to an IO
controller bug or because the job is not submitting enough IOs.
3) idle_time - This is the amount of time spent by the IO scheduler idling
for a given cgroup in anticipation of a better request than the exising ones
from other queues/cgroups.
All these stats are recorded using start and stop events. When reading these
stats, we do not add the delta between the current time and the last start time
if we're between the start and stop events. We avoid doing this to make sure
that these numbers are always monotonically increasing when read. Since we're
using sched_clock() which may use the tsc as its source, it may induce some
inconsistency (due to tsc resync across cpus) if we included the current delta.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
These stats are useful for getting a feel for the queue depth of the cgroup,
i.e., how filled up its queues are at a given instant and over the existence of
the cgroup. This ability is useful when debugging problems in the wild as it
helps understand the application's IO pattern w/o having to read through the
userspace code (coz its tedious or just not available) or w/o the ability
to run blktrace (since you may not have root access and/or not want to disturb
performance).
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This includes both the number of bios merged into requests belonging to this
cgroup as well as the number of requests merged together.
In the past, we've observed different merging behavior across upstream kernels,
some by design some actual bugs. This stat helps a lot in debugging such
problems when applications report decreased throughput with a new kernel
version.
This needed adding an extra elevator function to capture bios being merged as I
did not want to pollute elevator code with blkiocg knowledge and hence needed
the accounting invocation to come from CFQ.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that include some minor fixes and addresses all comments.
Changelog: (most based on Vivek Goyal's comments)
o renamed blkiocg_reset_write to blkiocg_reset_stats
o more clarification in the documentation on io_service_time and io_wait_time
o Initialize blkg->stats_lock
o rename io_add_stat to blkio_add_stat and declare it static
o use bool for direction and sync
o derive direction and sync info from existing rq methods
o use 12 for major:minor string length
o define io_service_time better to cover the NCQ case
o add a separate reset_stats interface
o make the indexed stats a 2d array to simplify macro and function pointer code
o blkio.time now exports in jiffies as before
o Added stats description in patch description and
Documentation/cgroup/blkio-controller.txt
o Prefix all stats functions with blkio and make them static as applicable
o replace IO_TYPE_MAX with IO_TYPE_TOTAL
o Moved #define constant to top of blk-cgroup.c
o Pass dev_t around instead of char *
o Add note to documentation file about resetting stats
o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
statements
o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
rq_direction() and rq_sync() functions which are used by CFQ and when using
io-controller at a higher level, bio_* functions can be added.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes
These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.
Signed-off-by: Divyesh Shah<dpshah@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no need to take css reference here, for the caller
has already called rcu_read_lock() to prevent cgroup from
being removed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One of the goals of block IO controller is that it should be able to
support mulitple io control policies, some of which be operational at
higher level in storage hierarchy.
o To begin with, we had one io controlling policy implemented by CFQ, and
I hard coded the CFQ functions called by blkio. This created issues when
CFQ is compiled as module.
o This patch implements a basic dynamic io controlling policy registration
functionality in blkio. This is similar to elevator functionality where
ioschedulers register the functions dynamically.
o Now in future, when more IO controlling policies are implemented, these
can dynakically register with block IO controller.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o blkio controller is inside the kernel and cfq makes use of interfaces
exported by blkio. CFQ can be a module too, hence export symbols used
by CFQ.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Export disk time and sector used by a group to user space through cgroup
interface.
o Also export a "dequeue" interface to cgroup which keeps track of how many
a times a group was deleted from service tree. Helps in debugging.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o One can choose to change elevator or delete a cgroup. Implement group
reference counting so that both elevator exit and cgroup deletion can
take place gracefully.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o This is basic implementation of blkio controller cgroup interface. This is
the common interface visible to user space and should be used by different
IO control policies as we implement those.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>