This patch implements 'nice' support across physical cpus on SMP.
It introduces an extra runqueue variable prio_bias which is the sum of the
(inverted) static priorities of all the tasks on the runqueue.
This is then used to bias busy rebalancing between runqueues to obtain good
distribution of tasks of different nice values. By biasing the balancing only
during busy rebalancing we can avoid having any significant loss of throughput
by not affecting the carefully tuned idle balancing already in place. If all
tasks are running at the same nice level this code should also have minimal
effect. The code is optimised out in the !CONFIG_SMP case.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I didn't find any possible modular usage in the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace smp_processor_id() with any_online_cpu(cpu_online_map) in order to
avoid lots of "BUG: using smp_processor_id() in preemptible [00000001]
code:..." messages in case taking a cpu online fails.
All the traces start at the last notifier_call_chain(...) in kernel/cpu.c.
Since we hold the cpu_control semaphore it shouldn't be any problem to access
cpu_online_map.
The reason why cpu_up failed is simply that the cpu that was supposed to be
taken online wasn't even there. That is because on s390 we never know when a
new cpu comes and therefore cpu_possible_map consists of only ones and doesn't
reflect reality.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do not transfer remaining time slice to another cpu on process exit.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Simplify the UP (1 CPU) implementatin of set_cpus_allowed.
The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid
having to pick up the cpu_online_map.
Also, unexport cpu_online_map: it was only needed for set_cpus_allowed().
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With CONFIG_SMP=n:
*** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined!
due to set_cpus_allowed().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fix up the runqueue lock owner only if we truly did a context-switch
with the runqueue lock held. Impacts ia64, mips, sparc64 and arm.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
..and only enable them for ia64. The functions are only valid
when the whole system has been totally stopped and no scheduler
activity is ongoing on any CPU, and interrupts are globally
disabled.
In other words, they aren't useful for anything else. So make
sure that nobody can use them by mistake.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Scheduler hooks to see/change which process is deemed to be on a cpu.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Don't pull tasks from a group if that would cause the group's total load to
drop below its total cpu_power (ie. cause the group to start going idle).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jack Steiner brought this issue at my OLS talk.
Take a scenario where two tasks are pinned to two HT threads in a physical
package. Idle packages in the system will keep kicking migration_thread on
the busy package with out any success.
We will run into similar scenarios in the presence of CMP/NUMA.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In sys_sched_yield(), we cache current->array in the "array" variable, thus
there's no need to dereference "current" again later.
Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If an idle sibling of an HT queue encounters a busy sibling, then make
higher level load balancing of the non-idle variety.
Performance of multiprocessor HT systems with low numbers of tasks
(generally < number of virtual CPUs) can be significantly worse than the
exact same workloads when running in non-HT mode. The reason is largely
due to poor scheduling behaviour.
This patch improves the situation, making the performance gap far less
significant on one problematic test case (tbench).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
During periodic load balancing, don't hold this runqueue's lock while
scanning remote runqueues, which can take a non trivial amount of time
especially on very large systems.
Holding the runqueue lock will only help to stabilise ->nr_running, however
this doesn't do much to help because tasks being woken will simply get held
up on the runqueue lock, so ->nr_running would not provide a really
accurate picture of runqueue load in that case anyway.
What's more, ->nr_running (and possibly the cpu_load averages) of remote
runqueues won't be stable anyway, so load balancing is always an inexact
operation.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Similarly to the earlier change in load_balance, only lock the runqueue in
load_balance_newidle if the busiest queue found has a nr_running > 1. This
will reduce frequency of expensive remote runqueue lock aquisitions in the
schedule() path on some workloads.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
William Weston reported unusually high scheduling latencies on his x86 HT
box, on the -RT kernel. I managed to reproduce it on my HT box and the
latency tracer shows the incident in action:
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up)
..............................................................
... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
..............................................................
du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up)
................................................
... still on CPU#3, we send an IPI to CPU#1: ...
................................................
du-2803 3Dnh1 0us : resched_task (try_to_wake_up)
du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up)
du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule)
du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
...............................................
... 1 usec later, the IPI arrives on CPU#1: ...
...............................................
<idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0)
So far so good, this is the normal wakeup/preemption mechanism. But here
comes the scheduler anomaly on CPU#1:
<idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched)
<idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched)
<idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq)
<idle>-0 1Dnh. 3us : profile_hit (__schedule)
<idle>-0 1Dnh1 3us : sched_clock (__schedule)
<idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule)
<idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule)
<idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh1 5us : preempt_schedule (__schedule)
<idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule)
<idle>-0 1Dnh2 6us : find_next_bit (__schedule)
<idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule)
<idle>-0 1Dnh3 7us : find_next_bit (__schedule)
<idle>-0 1Dnh3 7us : find_next_bit (__schedule)
<idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh2 8us : preempt_schedule (__schedule)
<idle>-0 1Dnh2 8us : find_next_bit (__schedule)
<idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule)
<idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched)
<idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c)
<idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched)
<idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq)
<idle>-0 1Dnh. 11us < (0)
we didnt pick up pid 2778! It only gets scheduled much later:
<...>-2778 1Dnh2 412us : __switch_to (__schedule)
<...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73)
<...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule)
<...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule)
<...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched)
<...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1)
<...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched)
<...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule)
the reason for this anomaly is the following code in dependent_sleeper():
/*
* If a user task with lower static priority than the
* running task on the SMT sibling is trying to schedule,
* delay it till there is proportionately less timeslice
* left of the sibling task to prevent a lower priority
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
[...]
if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
100) > task_timeslice(p)))
ret = 1;
Note that in contrast to the comment above, we dont actually do the check
based on static priority, we do the check based on timeslices. But
timeslices go up and down, and even highprio tasks can randomly have very
low timeslices (just before their next refill) and can thus be judged as
'lowprio' by the above piece of code. This condition is clearly buggy.
The correct test is to check for static_prio _and_ to check for the
preemption priority. Even on different static priority levels, a
higher-prio interactive task should not be delayed due to a
higher-static-prio CPU hog.
There is a symmetric bug in the 'kick SMT sibling' code of this function as
well, which can be solved in a similar way.
The patch below (against the current scheduler queue in -mm) fixes both
bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks
still get properly throttled - so the dependent-sleeper logic is still in
action.
btw., these bugs pessimised the SMT scheduler because the 'delay wakeup'
property was applied too liberally, so this fix is likely a throughput
improvement as well.
I separated out a smt_slice() function to make the code easier to read.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
used by blocking points to mark the task's wait as "non-interactive". This
does not mean the task will be considered a CPU-hog - the wait will simply
not have an effect on the waiting task's priority - positive or negative
alike. Right now only pipe_wait() will make use of it, because it's a
common source of not-so-interactive waits (kernel compilation jobs, etc.).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make
them return only the groups that have allowed CPUs and allowed CPUs
respectively.
Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The hyperthread aware nice handling currently puts to sleep any non real
time task when a real time task is running on its sibling cpu. This can
lead to prolonged starvation by having the non real time task pegged to the
cpu with load balancing not pulling that task away.
Currently we force lower priority hyperthread tasks to run a percentage of
time difference based on timeslice differences which is meaningless when
comparing real time tasks to SCHED_NORMAL tasks. We can allow non real
time tasks to run with real time tasks on the sibling up to per_cpu_gain%
if we use jiffies as a counter.
Cleanups and micro-optimisations to the relevant code section should make
it more understandable as well.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch (written by me and also containing many suggestions of Arjan van
de Ven) does a major cleanup of the spinlock code. It does the following
things:
- consolidates and enhances the spinlock/rwlock debugging code
- simplifies the asm/spinlock.h files
- encapsulates the raw spinlock type and moves generic spinlock
features (such as ->break_lock) into the generic code.
- cleans up the spinlock code hierarchy to get rid of the spaghetti.
Most notably there's now only a single variant of the debugging code,
located in lib/spinlock_debug.c. (previously we had one SMP debugging
variant per architecture, plus a separate generic one for UP builds)
Also, i've enhanced the rwlock debugging facility, it will now track
write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which will work for both soft and hard
spin/rwlock lockups.
The arch-level include files now only contain the minimally necessary
subset of the spinlock code - all the rest that can be generalized now
lives in the generic headers:
include/asm-i386/spinlock_types.h | 16
include/asm-x86_64/spinlock_types.h | 16
I have also split up the various spinlock variants into separate files,
making it easier to see which does what. The new layout is:
SMP | UP
----------------------------|-----------------------------------
asm/spinlock_types_smp.h | linux/spinlock_types_up.h
linux/spinlock_types.h | linux/spinlock_types.h
asm/spinlock_smp.h | linux/spinlock_up.h
linux/spinlock_api_smp.h | linux/spinlock_api_up.h
linux/spinlock.h | linux/spinlock.h
/*
* here's the role of the various spinlock/rwlock related include files:
*
* on SMP builds:
*
* asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
* initializers
*
* linux/spinlock_types.h:
* defines the generic type and initializers
*
* asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel
* implementations, mostly inline assembly code
*
* (also included on UP-debug builds:)
*
* linux/spinlock_api_smp.h:
* contains the prototypes for the _spin_*() APIs.
*
* linux/spinlock.h: builds the final spin_*() APIs.
*
* on UP builds:
*
* linux/spinlock_type_up.h:
* contains the generic, simplified UP spinlock type.
* (which is an empty structure on non-debug builds)
*
* linux/spinlock_types.h:
* defines the generic type and initializers
*
* linux/spinlock_up.h:
* contains the __raw_spin_*()/etc. version of UP
* builds. (which are NOPs on non-debug, non-preempt
* builds)
*
* (included on UP-non-debug builds:)
*
* linux/spinlock_api_up.h:
* builds the _spin_*() APIs.
*
* linux/spinlock.h: builds the final spin_*() APIs.
*/
All SMP and UP architectures are converted by this patch.
arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via
crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should
be mostly fine.
From: Grant Grundler <grundler@parisc-linux.org>
Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
Builds 32-bit SMP kernel (not booted or tested). I did not try to build
non-SMP kernels. That should be trivial to fix up later if necessary.
I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids
some ugly nesting of linux/*.h and asm/*.h files. Those particular locks
are well tested and contained entirely inside arch specific code. I do NOT
expect any new issues to arise with them.
If someone does ever need to use debug/metrics with them, then they will
need to unravel this hairball between spinlocks, atomic ops, and bit ops
that exist only because parisc has exactly one atomic instruction: LDCW
(load and clear word).
From: "Luck, Tony" <tony.luck@intel.com>
ia64 fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjanv@infradead.org>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For architecture like ia64, the switch stack structure is fairly large
(currently 528 bytes). For context switch intensive application, we found
that significant amount of cache misses occurs in switch_to() function.
The following patch adds a hook in the schedule() function to prefetch
switch stack structure as soon as 'next' task is determined. This allows
maximum overlap in prefetch cache lines for that structure.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive
cpuset that includes only some, but not all, of the CPUs in a node will mangle
the sched domain structures.
Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc; Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE
consistent with getpriority before it becomes available in released glibc.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's the patch again to fix the code to handle if the values between
MAX_USER_RT_PRIO and MAX_RT_PRIO are different.
Without this patch, an SMP system will crash if the values are
different.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use
SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the
RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by
audio users.
Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt
from sched_setscheduler below:
/*
* Allow unprivileged RT tasks to decrease priority:
*/
if (!capable(CAP_SYS_NICE)) {
/* can't change policy */
if (policy != p->policy)
return -EPERM;
After the above unconditional test which causes sched_setscheduler to
fail with no regard to the RLIMIT_RTPRIO value the following check is made:
/* can't increase priority */
if (policy != SCHED_NORMAL &&
param->sched_priority > p->rt_priority &&
param->sched_priority >
p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
return -EPERM;
Thus I do believe that the RLIMIT_RTPRIO value must be taken into
account for the policy check, especially as the RLIMIT_RTPRIO limit is
of no use without this change.
The attached patch fixes this problem.
Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which
could trigger a second could trigger a second cond_resched() call. Bug
found by Hirofumi Ogawa.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch tweaks idle thread setup semantics a bit: instead of setting
NEED_RESCHED in init_idle(), we do an explicit schedule() before calling
into cpu_idle().
This patch, while having no negative side-effects, enables wider use of
cond_resched()s. (which might happen in the stock kernel too, but it's
particulary important for voluntary-preempt)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech. I would like to see this added to
-mm
o The main advantage with this feature is that it ensures that the scheduler
load balacing code only balances against the cpus that are in the sched
domain as defined by an exclusive cpuset and not all of the cpus in the
system. This removes any overhead due to load balancing code trying to
pull tasks outside of the cpu exclusive cpuset only to be prevented by
the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
workloads such as RT applications requiring low latency and HPC
applications that are throughput sensitive
o It provides a new API partition_sched_domains in sched.c
that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
Which means that the users can dynamically modify the sched domains
through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
the fast paths are affected
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.
But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.
The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.
This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...
The POSIX norm says that the permissions are implementation specific, so
I think we can do that.
In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.
From: Ingo Molnar <mingo@elte.hu>
cleaned up and merged to -mm.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU. Suresh reported:
I see system livelock's if for example I have 7000 processes
pinned onto one cpu (this is on the fastest 8-way system I
have access to).
After this patch, the machine is reported to go well above this number.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate balance-on-exec with balance-on-fork. This is made easy by the
sched-domains RCU patches.
As well as the general goodness of code reduction, this allows the runqueues
to be unlocked during balance-on-fork.
schedstats is a problem. Maybe just have balance-on-event instead of
distinguishing fork and exec?
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).
balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.
So bite the bullet and make sched-domains go completely RCU. This actually
simplifies the code quite a bit.
From: Ingo Molnar <mingo@elte.hu>
schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.
This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.
This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove degenerate scheduler domains during the sched-domain init.
For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.
With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL). We will endup loading HT
package completely first, letting active load balance kickin and correct it.
In general, this patch also makes sense with out recent Nick's fixes in -mm.
From: Nick Piggin <nickpiggin@yahoo.com.au>
Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the last 2 places that directly access a runqueue's sched-domain and
assume it cannot be NULL.
That allows the use of NULL for domain, instead of a dummy domain, to signify
no balancing is to happen. No functional changes.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.
Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag). This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reimplement the balance on exec balancing to be sched-domains aware. Use this
to also do balance on fork balancing. Make x86_64 do balance on fork over the
NUMA domain.
The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons. What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.
This gives large improvements to STREAM on such systems.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go. Hopefully we can regain
performance through other methods.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't don't move tasks the wrong way in an
imbalance condition. Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point. It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.
But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>