linux/kernel
Paul Jackson 4225399a66 [PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.

NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset.  The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.

An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.

This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired.  The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.

Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan.  In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock).  So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.

Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.

When a task is moved to a different cpuset, it is easier, as there is only one
task involved.  It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.

It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice.  This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
..
irq [PATCH] Alpha: convert to generic irq framework (generic part) 2006-01-06 08:33:40 -08:00
power [PATCH] swsusp: save image header first 2006-01-06 08:33:43 -08:00
.gitignore gitignore: ignore more generated files 2006-01-03 11:35:26 +01:00
acct.c [PATCH] s390: cputime_t fixes 2006-01-06 08:33:49 -08:00
audit.c [PATCH] Add try_to_freeze to kauditd 2005-12-12 08:57:43 -08:00
auditsc.c [PATCH] gfp_t: kernel/* 2005-10-28 08:16:49 -07:00
capability.c [PATCH] kernel/capability.c: add kerneldoc 2005-07-27 16:26:06 -07:00
compat.c [PATCH] kernel: fix-up schedule_timeout() usage 2005-09-10 10:06:37 -07:00
configs.c update the email address of Randy Dunlap 2006-01-03 13:37:51 +01:00
cpu.c [PATCH] clean up lock_cpu_hotplug() in cpufreq 2005-11-28 14:42:23 -08:00
cpuset.c [PATCH] cpuset: rebind vma mempolicies fix 2006-01-08 20:13:44 -08:00
crash_dump.c [PATCH] kernel/crash_dump.c: add kerneldoc 2005-07-27 16:26:06 -07:00
dma.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exec_domain.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exit.c [PATCH] RCU signal handling 2006-01-08 20:13:40 -08:00
extable.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
fork.c [PATCH] cpuset: fork hook fix 2006-01-08 20:13:43 -08:00
futex.c [PATCH] FRV: Make futex code compilable on nommu [try #2] 2006-01-06 08:33:33 -08:00
intermodule.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
itimer.c [PATCH] itimer fixes 2005-07-27 16:25:51 -07:00
kallsyms.c [PATCH] fix missing includes 2005-10-30 17:37:32 -08:00
Kconfig.hz [PATCH] i386: Selectable Frequency of the Timer Interrupt 2005-06-23 09:45:10 -07:00
Kconfig.preempt [PATCH] sched: voluntary kernel preemption 2005-06-25 16:24:45 -07:00
kexec.c [PATCH] mm: split page table lock 2005-10-29 21:40:42 -07:00
kfifo.c [PATCH] gfp flags annotations - part 1 2005-10-08 15:00:57 -07:00
kmod.c [PATCH] Keys: Get rid of warning in kmod.c if keys disabled 2005-10-30 17:37:23 -08:00
kprobes.c [PATCH] kprobes: increment kprobe missed count for multiprobes 2005-12-12 08:57:45 -08:00
ksysfs.c [PATCH] kobject_uevent CONFIG_NET=n fix 2006-01-04 16:18:08 -08:00
kthread.c [PATCH] Add kthread_stop_sem() 2005-10-30 17:37:17 -08:00
Makefile [PATCH] RCU torture-testing kernel module 2005-10-30 17:37:27 -08:00
module.c [PATCH] kernel/module.c: removed dead code 2006-01-06 08:33:59 -08:00
panic.c [PATCH] s390: cleanup Kconfig 2006-01-06 08:33:53 -08:00
params.c [PATCH] kernel/params.c: fix sysfs access with CONFIG_MODULES=n 2005-12-20 10:31:33 -08:00
pid.c [PATCH] RCU signal handling 2006-01-08 20:13:40 -08:00
posix-cpu-timers.c [PATCH] Fix posix-cpu-timers sched_time accumulation 2006-01-06 20:23:04 -08:00
posix-timers.c [PATCH] timespec: normalize off by one errors 2005-11-13 18:14:17 -08:00
printk.c [PATCH] Fix crash in unregister_console() 2005-11-23 16:08:39 -08:00
profile.c [PATCH] mostly_read data section 2005-07-07 18:23:46 -07:00
ptrace.c [PATCH] Fix crash when ptrace poking hugepage areas 2005-11-29 19:47:03 -08:00
rcupdate.c [PATCH] RCU signal handling 2006-01-08 20:13:40 -08:00
rcutorture.c [PATCH] Fix bug in RCU torture test 2005-12-12 08:57:42 -08:00
resource.c [PATCH] introduce and use kzalloc 2005-09-07 16:57:45 -07:00
sched.c [PATCH] RCU signal handling 2006-01-08 20:13:40 -08:00
seccomp.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
signal.c [PATCH] Simpler signal-exit concurrency handling 2006-01-08 20:13:40 -08:00
softirq.c [PATCH] cpu hoptlug: avoid usage of smp_processor_id() in preemptible code 2005-11-07 07:53:29 -08:00
softlockup.c [PATCH] quieten softlockup at boot 2005-11-09 07:55:50 -08:00
spinlock.c [PATCH] spinlock consolidation 2005-09-10 10:06:21 -07:00
stop_machine.c [PATCH] stop_machine() vs. synchronous IPI send deadlock 2005-11-13 18:14:16 -08:00
sys_ni.c [PATCH] Swap Migration V5: sys_migrate_pages interface 2006-01-08 20:12:42 -08:00
sys.c [PATCH] kprobes: no probes on critical path 2005-12-12 08:57:45 -08:00
sysctl.c [PATCH] Make high and batch sizes of per_cpu_pagelists configurable 2006-01-08 20:12:40 -08:00
time.c [PATCH] Add getnstimestamp function 2005-12-12 08:57:42 -08:00
timer.c [PATCH] jiffies_64 cleanup 2005-10-30 17:37:25 -08:00
uid16.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
user.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
wait.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
workqueue.c [PATCH] add schedule_on_each_cpu() 2006-01-08 20:12:40 -08:00