linux/include
Srivatsa Vaddagiri 6b2d770026 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
The current load balancing scheme isn't good enough for precise
group fairness.

For example: on a 8-cpu system, I created 3 groups as under:

	a = 8 tasks (cpu.shares = 1024)
	b = 4 tasks (cpu.shares = 1024)
	c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
	group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
        percentage. Col3 data is derived as:
		Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
		Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

	a = +24.47%
	b = -9.13%
	c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

	a = +0.67%
	b = -0.83%
	c = +0.17%

which is far better IMO. Most of other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with current scheuler?
=========================================================

Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

	CPU0	CPU1
---------------------------
	A (10)  B(5)
		C(5)
---------------------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

	A = 50%
	B = 25%
	C = 25%

which is not the desired result.

What's changing in the patch?
=============================

	- How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is
	  defined (see below)
	- API Change
		- Two tunables introduced in sysfs (under SCHED_DEBUG) to
		  control the frequency at which the load balance monitor
		  thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. Its now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
	W(K,i)  = Weight of group K on cpu i
	T(K,i)  = Task load present in group K's cfs_rq on cpu i
	T(K)    = Total task load of group K across various cpus
	S(K) 	= Shares allocated to group K
	NRCPUS	= Number of online cpus in the scheduler domain to
	 	  which group K is assigned.

Then,
	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate
at which it runs.

Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>

- don't start the load_balance_monitor when there is only a single cpu.
- rename the kthread because its currently longer than TASK_COMM_LEN

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:00 +01:00
..
acpi kobject: convert /sys/firmware/acpi/ to use kobject_create 2008-01-24 20:40:23 -08:00
asm-alpha alpha: build fixes 2007-12-17 19:28:16 -08:00
asm-arm [ARM] pxa: silence warnings from cpu_is_xxx() macros 2008-01-10 12:33:54 +00:00
asm-avr32 [AVR32] NMI debugging 2008-01-25 08:31:43 +01:00
asm-blackfin Blackfin SPI driver: move hard coded pin_req to board file 2007-12-05 09:21:20 -08:00
asm-cris CRIS v10: vmlinux.lds.S: ix kernel oops on boot and use common defines 2008-01-17 15:38:58 -08:00
asm-frv frv: Remove bogus NO_IRQ = -1 define 2007-11-09 15:11:44 -08:00
asm-generic Revert quicklist need->flush fix 2007-12-26 22:04:09 -08:00
asm-h8300 asm-h8300: parentheses around definition CLOCK_TICK_RATE 2007-12-10 19:43:54 -08:00
asm-ia64 [IA64] Update Altix BTE error return status patch 2008-01-03 13:18:58 -08:00
asm-m32r m32r: Update sys_rt_sigsuspend 2007-11-28 01:24:04 +09:00
asm-m68k Add CONFIG_DEBUG_SG sg validation 2007-10-22 21:20:03 +02:00
asm-m68knommu m68knommu: fix pread/pwrite defines 2007-11-05 15:12:33 -08:00
asm-mips [MIPS] SMTC: Fix build error. 2008-01-22 00:35:23 +00:00
asm-parisc [PARISC] print more than one character at a time for pdc console 2007-12-06 09:32:15 -08:00
asm-powerpc [POWERPC] Fix boot failure on POWER6 2008-01-15 17:30:58 +11:00
asm-ppc ppc: fix AT_VECTOR_SIZE on arch/ppc 2007-10-22 19:18:54 -07:00
asm-s390 [S390] pud_present/pmd_present bug. 2007-12-17 16:25:56 +01:00
asm-sh sh: Force __access_ok() to obey address space limit. 2008-01-11 13:18:16 +09:00
asm-sh64 sh: remove PTRACE_O_TRACESYSGOOD from asm/ptrace.h 2007-11-07 11:13:54 +09:00
asm-sparc [SPARC32]: Silence sparc32 warnings on missing syscalls. 2007-12-14 10:59:50 -08:00
asm-sparc64 [SPARC64]: Implement pci_resource_to_user() 2007-12-26 19:33:46 -08:00
asm-um uml: update address space affected by pud_clear 2007-11-14 18:45:37 -08:00
asm-v850 Add CONFIG_DEBUG_SG sg validation 2007-10-22 21:20:03 +02:00
asm-x86 x86: asm-x86/msr.h: pull in linux/types.h 2008-01-15 16:44:38 +01:00
asm-xtensa xtensa: dma-mapping.h is using linux/scatterlist.h functions, so include it 2007-10-24 13:28:40 +02:00
crypto [CRYPTO] api: Include sched.h for cond_resched in scatterwalk.h 2008-01-11 08:16:59 +11:00
keys KEYS: Make request_key() and co fundamentally asynchronous 2007-10-17 08:42:57 -07:00
linux sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups 2008-01-25 21:08:00 +01:00
math-emu
media V4L/DVB (6601): V4L: videobuf-core locking fixes and comments 2007-12-11 18:08:08 -02:00
mtd
net [SOCK]: Adds a rcu_dereference() in sk_filter 2008-01-08 23:41:28 -08:00
pcmcia [AVR32] pcmcia ioaddr_t should be 32 bits on AVR32 2007-11-15 13:47:19 +01:00
rdma Kobject: change drivers/infiniband to use kobject_init_and_add 2008-01-24 20:40:26 -08:00
rxrpc
scsi Revert "scsi: revert "[SCSI] Get rid of scsi_cmnd->done"" 2008-01-06 10:17:12 -08:00
sound [ALSA] version 1.0.15 2007-11-20 20:16:43 +01:00
video Make asm-x86/bootparam.h includable from userspace. 2007-10-23 15:49:47 +10:00
xen xen: fix incorrect vcpu_register_vcpu_info hypercall argument 2007-10-16 11:51:31 -07:00
Kbuild do not export /usr/include/scsi in make headers_install 2007-10-17 08:42:52 -07:00