* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
Commit 8f92054e7c ("CRED: Fix __task_cred()'s lockdep check and banner
comment") fixed the lockdep checks on __task_cred(). This has shown up
a place in the signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.
Fix group_send_sig_info() to get the RCU read lock around its call to
check_kill_permission().
Without this patch, the following warning can occur:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/signal.c:660 invoked rcu_dereference_check() without protection!
...
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM / Runtime: Add runtime PM statistics (v3)
PM / Runtime: Make runtime_status attribute not debug-only (v. 2)
PM: Do not use dynamically allocated objects in pm_wakeup_event()
PM / Suspend: Fix ordering of calls in suspend error paths
PM / Hibernate: Fix snapshot error code path
PM / Hibernate: Fix hibernation_platform_enter()
pm_qos: Get rid of the allocation in pm_qos_add_request()
pm_qos: Reimplement using plists
plist: Add plist_last
PM: Make it possible to avoid races between wakeup and system sleep
PNPACPI: Add support for remote wakeup
PM: describe kernel policy regarding wakeup defaults (v. 2)
PM / Hibernate: Fix typos in comments in kernel/power/swap.c
Comment in unregister_trace_probe() says probe_lock will be held when it
gets called. However there is a case where it might called without the
probe_lock being held. Also since we are traversing the probe_list and
deleting an element from the probe_list, probe_lock should be held.
This was first pointed in uprobes traceevent review by Frederic
Weisbecker here. (http://lkml.org/lkml/2010/5/12/106)
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
usleep_range is a finer precision implementations of msleep
and is designed to be a drop-in replacement for udelay where
a precise sleep / busy-wait is unnecessary.
Since an easy interface to hrtimers could lead to an undesired
proliferation of interrupts, we provide only a "range" API,
forcing the caller to think about an acceptable tolerance on
both ends and hopefully avoiding introducing another interrupt.
INTRO
As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not
precise enough for many drivers (yes, sleep precision is an unfair notion,
but consistently sleeping for ~an order of magnitude greater than requested
is worth fixing). This patch adds a usleep API so that udelay does not have
to be used. Obviously not every udelay can be replaced (those in atomic
contexts or being used for simple bitbanging come to mind), but there are
many, many examples of
mydriver_write(...)
/* Wait for hardware to latch */
udelay(100)
in various drivers where a busy-wait loop is neither beneficial nor
necessary, but msleep simply does not provide enough precision and people
are using a busy-wait loop instead.
CONCERNS FROM THE RFC
Why is udelay a problem / necessary? Most callers of udelay are in device/
driver initialization code, which is serial...
As I see it, there is only benefit to sleeping over a delay; the
notion of "refactoring" areas that use udelay was presented, but
I see usleep as the refactoring. Consider i2c, if the bus is busy,
you need to wait a bit (say 100us) before trying again, your
current options are:
* udelay(100)
* msleep(1) <-- As noted above, actually as high as ~20ms
on some platforms, so not really an option
* Manually set up an hrtimer to try again in 100us (which
is what usleep does anyway...)
People choose the udelay route because it is EASY; we need to
provide a better easy route.
Device / driver / boot code is *currently* serial, but every few
months someone makes noise about parallelizing boot, and IMHO, a
little forward-thinking now is one less thing to worry about
if/when that ever happens
udelay's could be preempted
Sure, but if udelay plans on looping 1000 times, and it gets
preempted on loop 200, whenever it's scheduled again, it is
going to do the next 800 loops.
Is the interruptible case needed?
Probably not, but I see usleep as a very logical parallel to msleep,
so it made sense to include the "full" API. Processors are getting
faster (albeit not as quickly as they are becoming more parallel),
so if someone wanted to be interruptible for a few usecs, why not
let them? If this is a contentious point, I'm happy to remove it.
OTHER THOUGHTS
I believe there is also value in exposing the usleep_range option; it gives
the scheduler a lot more flexibility and allows the programmer to express
his intent much more clearly; it's something I would hope future driver
writers will take advantage of.
To get the results in the NUMBERS section below, I literally s/udelay/usleep
the kernel tree; I had to go in and undo the changes to the USB drivers, but
everything else booted successfully; I find that extremely telling in and
of itself -- many people are using a delay API where a sleep will suit them
just fine.
SOME ATTEMPTS AT NUMBERS
It turns out that calculating quantifiable benefit on this is challenging,
so instead I will simply present the current state of things, and I hope
this to be sufficient:
How many udelay calls are there in 2.6.35-rc5?
udealy(ARG) >= | COUNT
1000 | 319
500 | 414
100 | 1146
20 | 1832
I am working on Android, so that is my focus for this. The following table
is a modified usleep that simply printk's the amount of time requested to
sleep; these tests were run on a kernel with udelay >= 20 --> usleep
"boot" is power-on to lock screen
"power collapse" is when the power button is pushed and the device suspends
"resume" is when the power button is pushed and the lock screen is displayed
(no touchscreen events or anything, just turning on the display)
"use device" is from the unlock swipe to clicking around a bit; there is no
sd card in this phone, so fail loading music, video, camera
ACTION | TOTAL NUMBER OF USLEEP CALLS | NET TIME (us)
boot | 22 | 1250
power-collapse | 9 | 1200
resume | 5 | 500
use device | 59 | 7700
The most interesting category to me is the "use device" field; 7700us of
busy-wait time that could be put towards better responsiveness, or at the
least less power usage.
Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org>
Cc: apw@canonical.com
Cc: corbet@lwn.net
Cc: arjan@linux.intel.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In some cases (for instance with kernel threads) it may be desireable to
use on-stack deferrable timers to get their power saving benefits. Add
interfaces to support this for the IPS driver.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.
Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <20100727210210.58d3118c@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We use synchronize_sched() to ensure a tracepoint won't be called
while/after we release the perf buffers it references.
But the tracepoint API has its own API for that:
tracepoint_synchronize_unregister(). Use it instead as it's
self-explanatory and eases maintainance.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Mark init_workqueues() as early_initcall() and thus it will be initialized
before smp bringup. init_workqueues() registers for the hotcpu notifier
and thus it should cope with the processors that are brought online after
the workqueues are initialized.
x86 smp bringup code uses workqueues and uses a workaround for the
cold boot process (as the workqueues are initialized post smp_init()).
Marking init_workqueues() as early_initcall() will pave the way for
cleaning up this code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
for_each_*cwq_cpu() are similar to regular CPU iterators except that
it also considers the pseudo CPU number used for unbound workqueues.
Explain them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
A function that copies the padata cpumasks to a user buffer
is a bit error prone. The cpumask can change any time so we
can't be sure to have the right cpumask when using this function.
A user who is interested in the padata cpumasks should register
to the padata cpumask notifier chain instead. Users of
padata_get_cpumask are already updated, so we can remove it.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We pass a pointer to the new padata cpumasks to the cpumask_change_notifier
chain. So users can access the cpumasks without the need of an extra
padata_get_cpumask function.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
padata_set_cpumask needs to be protected by a lock. We make
__padata_set_cpumasks unlocked and static. So this function
can be used by the exported and locked padata_set_cpumask and
padata_set_cpumasks functions.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We rename padata_alloc to padata_alloc_possible because this
function allocates a padata_instance and uses the cpu_possible
mask for parallel and serial workers. Also we rename __padata_alloc
to padata_alloc to avoid to export underlined functions. Underlined
functions are considered to be private to padata. Users are updated
accordingly.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.
What happens is that get_task_cred() can race with commit_creds():
TASK_1 TASK_2 RCU_CLEANER
-->get_task_cred(TASK_2)
rcu_read_lock()
__cred = __task_cred(TASK_2)
-->commit_creds()
old_cred = TASK_2->real_cred
TASK_2->real_cred = ...
put_cred(old_cred)
call_rcu(old_cred)
[__cred->usage == 0]
get_cred(__cred)
[__cred->usage == 1]
rcu_read_unlock()
-->put_cred_rcu()
[__cred->usage == 1]
panic()
However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.
We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.
Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:
kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
[<ffffffff810698cd>] put_cred+0x13/0x15
[<ffffffff81069b45>] commit_creds+0x16b/0x175
[<ffffffff8106aace>] set_current_groups+0x47/0x4e
[<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP [<ffffffff81069881>] __put_cred+0xc/0x45
RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A small number of users of IRQF_TIMER are using it for the implied no
suspend behaviour on interrupts which are not timer interrupts.
Therefore add a new IRQF_NO_SUSPEND flag, rename IRQF_TIMER to
__IRQF_TIMER and redefine IRQF_TIMER in terms of these new flags.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: xen-devel@lists.xensource.com
Cc: linux-input@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1280398595-29708-1-git-send-email-ian.campbell@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
The command
echo "file ec.c +p" >/sys/kernel/debug/dynamic_debug/control
causes an oops.
Move the call to ddebug_remove_module() down into free_module(). In this
way it should be called from all error paths. Currently, we are missing
the remove if the module init routine fails.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Reported-by: Thomas Renninger <trenn@suse.de>
Tested-by: Thomas Renninger <trenn@suse.de>
Cc: <stable@kernel.org> [2.6.32+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To properly handle clocksources that change frequencies
at the clocksource->enable() point, this patch adds
a method that will update the clocksource's mult/shift and
max_idle_ns values.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-12-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch makes xtime and wall_to_monotonic static, as planned in
Documentation/feature-removal-schedule.txt. This will allow for
further cleanups to the timekeeping core.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-10-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provides an accessor function to replace hrtimer.c's
direct access of wall_to_monotonic.
This will allow wall_to_monotonic to be made static as
planned in Documentation/feature-removal-schedule.txt
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-9-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
update_vsyscall() did not provide the wall_to_monotoinc offset,
so arch specific implementations tend to reference wall_to_monotonic
directly. This limits future cleanups in the timekeeping core, so
this patch fixes the update_vsyscall interface to provide
wall_to_monotonic, allowing wall_to_monotonic to be made static
as planned in Documentation/feature-removal-schedule.txt
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that all arches have been converted over to use generic time via
clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME
config option and simplify the generic code.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After accidentally misusing timespec_add_safe, I wanted to make sure
we don't accidently trip over that issue again, so I created a simple
timespec_add() function which we can use to replace the instances
of timespec_add_safe() that don't want the overflow detection.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-3-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that we allow to change the cpumasks from userspace, we have
to check for valid cpumasks in padata_do_parallel. This patch adds
the necessary check. This fixes a division by zero crash if the
parallel cpumask contains no active cpu.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The cpumask separation work assumes the cpumask dependend recources
present regardless of valid or invalid cpumasks. With this patch
we allocate the cpumask dependend recources in any case. This fixes
two NULL pointer dereference crashes in padata_replace and in
padata_get_cpumask.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The counting of the cpu index got lost with a recent commit.
This patch restores it. This fixes a hang of the parallel worker
threads on cpu hotplug.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
According to Oleg Nesterov:
We can move copy_to_user(created_timer_id) down after
"if (timer_event_spec)" block too. (but before CLOCK_DISPATCH(),
of course).
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
usleep[_range] are finer precision implementations of msleep
and are designed to be drop-in replacements for udelay where
a precise sleep / busy-wait is unnecessary. They also allow
an easy interface to specify slack when a precise (ish)
wakeup is unnecessary to help minimize wakeups
Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org>
Cc: akinobu.mita@gmail.com
Cc: sboyd@codeaurora.org
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <4C44CDD2.1070708@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Steal some text from 6e453a6751 "Add support for deferrable timers". A
reader shouldn't have to dig through the git logs for the basic
description of a deferrable timer.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Once a work starts execution, its data contains the cpu number it was
on instead of pointing to cwq. This is added by commit 7a22ad75
(workqueue: carry cpu number in work data once execution starts) to
reliably determine the work was last on even if the workqueue itself
was destroyed inbetween.
Whether data points to a cwq or contains a cpu number was
distinguished by comparing the value against PAGE_OFFSET. The
assumption was that a cpu number should be below PAGE_OFFSET while a
pointer to cwq should be above it. However, on architectures which
use separate address spaces for user and kernel spaces, this doesn't
hold as PAGE_OFFSET is zero.
Fix it by using an explicit flag, WORK_STRUCT_CWQ, to mark what the
data field contains. If the flag is set, it's pointing to a cwq;
otherwise, it contains a cpu number.
Reported on s390 and microblaze during linux-next testing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Reported-by: Michal Simek <michal.simek@petalogix.com>
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Tested-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Tested-by: Michal Simek <monstr@monstr.eu>
We need to add one to the strlen() return because of the NULL
character. The type->name here generally comes from the kernel and I
don't think any of them come close to being MAX_TRACER_SIZE (100)
characters long so this is basically a cleanup.
Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100710100644.GV19184@bicker>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The kdb code should not toggle the sysrq state in case an end user
wants to try and resume the normal kernel execution.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Immediately following an exit from the kdb shell the kgdb_connected
variable should be set to zero, unless there are breakpoints planted.
If the kgdb_connected variable is not zeroed out with kdb, it is
impossible to turn off kdb.
This patch is merely a work around for now, the real fix will check
for the breakpoints.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
In the process of merging kdb to the mainline, the kdb lsmod command
stopped printing the base load address of kernel modules. This is
needed for using kdb in conjunction with external tools such as gdb.
Simply restore the functionality by adding a kdb_printf for the base
load address of the kernel modules.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The gdbserial protocol handler should return an empty packet instead
of an error string when ever it responds to a command it does not
implement.
The problem cases come from a debugger client sending
qTBuffer, qTStatus, qSearch, qSupported.
The incorrect response from the gdbstub leads the debugger clients to
not function correctly. Recent versions of gdb will not detach correctly as a result of this behavior.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Without this patch the "ll" linked-list traversal command won't
terminate when you hit q/Q.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The sched_child_runs_first value in /proc/sched_debug is
currently displayed using a macro meant to split ns time values.
This patch uses the correct macro to display it as a plain
decimal value.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: peterz@infradead.org
Cc: juhlenko@akamai.com
Cc: efault@gmx.de
LKML-Reference: <1279567876-25418-1-git-send-email-johunt@akamai.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Documentation/trace/ftrace.txt says
buffer_size_kb:
This sets or displays the number of kilobytes each CPU
buffer can hold. The tracer buffers are the same size
for each CPU. The displayed number is the size of the
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested.
( Note, the size may not be a multiple of the page size
due to buffer management overhead. )
This can only be updated when the current_tracer
is set to "nop".
But it's incorrect. currently total memory consumption is
'buffer_size_kb x CPUs x 2'.
Why two times difference is there? because ftrace implicitly allocate
the buffer for max latency too.
That makes sad result when admin want to use large buffer. (If admin
want full logging and makes detail analysis). example, If admin
have 24 CPUs machine and write 200MB to buffer_size_kb, the system
consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is
usually unacceptable.
Fortunatelly, almost all users don't use max latency feature.
The max latency buffer can be disabled easily.
This patch shrink buffer size of the max latency buffer if
unnecessary.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__print_flags() and __print_symbolic() use percpu trace_seq:
1) Its memory is allocated at compile time, it wastes memory if we don't use tracing.
2) It is percpu data and it wastes more memory for multi-cpus system.
3) It disables preemption when it executes its core routine
"trace_seq_printf(s, "%s: ", #call);" and introduces latency.
So we move this trace_seq to struct trace_iterator.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4C078350.7090106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reorder structure to remove 8 bytes of padding on 64 bit builds.
This shrinks the size to 128 bytes so allowing allocation from a smaller
slab & needed one fewer cache lines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
LKML-Reference: <1269516456.2054.8.camel@localhost>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We found that even enabling a single trace event that will rarely be
triggered can add big overhead to context switch.
(lmbench context switch test)
-------------------------------------------------
2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
------ ------ ------ ------ ------ ------- -------
2.19 2.3 2.21 2.56 2.13 2.54 2.07
2.39 2.51 2.35 2.75 2.27 2.81 2.24
The overhead is 6% ~ 11%.
It's because when a trace event is enabled 3 tracepoints (sched_switch,
sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname.
We'd like to avoid this overhead, so add a trace option '(no)record-cmd'
to allow to disable cmdline recording.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Convert a few calls from kfree_skb to consume_skb
Noticed while I was working on dropwatch that I was detecting lots of internal
skb drops in several places. While some are legitimate, several were not,
freeing skbs that were at the end of their life, rather than being discarded due
to an error. This patch converts those calls sites from using kfree_skb to
consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
drops. Tested successfully by myself
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All cpumasks are assumed to have cpu 0 permanently set on UP, so it
can't be used to signify whether there's something to be done for the
CPU. workqueue was using cpumask to track which CPU requested rescuer
assistance and this led rescuer thread to think there always are
pending mayday requests on UP, which resulted in infinite busy loops.
This patch fixes the problem by introducing mayday_mask_t and
associated helpers which wrap cpumask on SMP and emulates its behavior
using bitops and unsigned long on UP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
The default for llseek will change to no_llseek,
so the tracing debugfs files need to add explicit
.llseek assignments. Since we're dealing with regular
files from a VFS perspective, use generic_file_llseek.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Special traces type was only used by sysprof. Lets remove it now
that sysprof ftrace plugin has been dropped.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
The sysprof ftrace plugin doesn't seem to be seriously used
somewhere. There is a branch in the sysprof tree that makes
an interface to it, but the real sysprof tool uses either its
own module or perf events.
Drop the sysprof ftrace plugin then, as it's mostly useless.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Commit f3421797 (workqueue: implement unbound workqueue) incorrectly
tested CONFIG_SMP as part of a C expression in alloc/free_cwqs(). As
CONFIG_SMP is not defined in UP, this breaks build. Fix it by using
Found during linux-next build test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
friends use the early_res functions for memory management when
NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
corresponding code paths for bootmem allocations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
pavel@suse.cz no longer works, replace it with working address.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Added sysfs primitives to padata subsystem. Now API user may
embedded kobject each padata instance contains into any sysfs
hierarchy. For now padata sysfs interface provides only
two objects:
serial_cpumask [RW] - cpumask for serial workers
parallel_cpumask [RW] - cpumask for parallel workers
Signed-off-by: Dan Kruchinin <dkruchinin@acm.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The aim of this patch is to make two separate cpumasks
for padata parallel and serial workers respectively.
It allows user to make more thin and sophisticated configurations
of padata framework. For example user may bind parallel and serial workers to non-intersecting
CPU groups to gain better performance. Also each padata instance has notifiers chain for its
cpumasks now. If either parallel or serial or both masks were changed all
interested subsystems will get notification about that. It's especially useful
if padata user uses algorithm for callback CPU selection according to serial cpumask.
Signed-off-by: Dan Kruchinin <dkruchinin@acm.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The ACPI suspend code calls suspend_nvs_free() at a wrong place,
which may lead to a memory leak if there's an error executing
acpi_pm_prepare(), because acpi_pm_finish() will not be called in
that case. However, the root cause of this problem is the
apparently confusing ordering of calls in suspend error paths that
needs to be fixed.
In addition to that, fix a typo in a label name in suspend.c.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
There is an inconsistency between hibernation_platform_enter()
and hibernation_snapshot(), because the latter calls
hibernation_ops->end() after failing hibernation_ops->begin(), while
the former doesn't do that. Make hibernation_snapshot() behave in
the same way as hibernation_platform_enter() in that respect.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
The hibernation_platform_enter() function calls dpm_suspend_noirq()
instead of dpm_resume_noirq() by mistake. Fix this.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
All current users of pm_qos_add_request() have the ability to supply
the memory required by the pm_qos routines, so make them do this and
eliminate the kmalloc() with pm_qos_add_request(). This has the
double benefit of making the call never fail and allowing it to be
called from atomic context.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: mark gross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
A lot of the pm_qos extremal value handling is really duplicating what a
priority ordered list does, just in a less efficient fashion. Simply
redoing the implementation in terms of a plist gets rid of a lot of this
junk (although there are several other strange things that could do with
tidying up, like pm_qos_request_list has to carry the pm_qos_class with
every node, simply because it doesn't get passed in to
pm_qos_update_request even though every caller knows full well what
parameter it's updating).
I think this redo is a win independent of android, so we should do
something like this now.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: mark gross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
One of the arguments during the suspend blockers discussion was that
the mainline kernel didn't contain any mechanisms making it possible
to avoid races between wakeup and system suspend.
Generally, there are two problems in that area. First, if a wakeup
event occurs exactly when /sys/power/state is being written to, it
may be delivered to user space right before the freezer kicks in, so
the user space consumer of the event may not be able to process it
before the system is suspended. Second, if a wakeup event occurs
after user space has been frozen, it is not generally guaranteed that
the ongoing transition of the system into a sleep state will be
aborted.
To address these issues introduce a new global sysfs attribute,
/sys/power/wakeup_count, associated with a running counter of wakeup
events and three helper functions, pm_stay_awake(), pm_relax(), and
pm_wakeup_event(), that may be used by kernel subsystems to control
the behavior of this attribute and to request the PM core to abort
system transitions into a sleep state already in progress.
The /sys/power/wakeup_count file may be read from or written to by
user space. Reads will always succeed (unless interrupted by a
signal) and return the current value of the wakeup events counter.
Writes, however, will only succeed if the written number is equal to
the current value of the wakeup events counter. If a write is
successful, it will cause the kernel to save the current value of the
wakeup events counter and to abort the subsequent system transition
into a sleep state if any wakeup events are reported after the write
has returned.
[The assumption is that before writing to /sys/power/state user space
will first read from /sys/power/wakeup_count. Next, user space
consumers of wakeup events will have a chance to acknowledge or
veto the upcoming system transition to a sleep state. Finally, if
the transition is allowed to proceed, /sys/power/wakeup_count will
be written to and if that succeeds, /sys/power/state will be written
to as well. Still, if any wakeup events are reported to the PM core
by kernel subsystems after that point, the transition will be
aborted.]
Additionally, put a wakeup events counter into struct dev_pm_info and
make these per-device wakeup event counters available via sysfs,
so that it's possible to check the activity of various wakeup event
sources within the kernel.
To illustrate how subsystems can use pm_wakeup_event(), make the
low-level PCI runtime PM wakeup-handling code use it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: markgross <markgross@thegnar.org>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
There are a few typos in kernel/power/swap.c. Fix them.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
As of commit dcce284 ("mm: Extend gfp masking to the page
allocator") and commit 7e85ee0 ("slab,slub: don't enable
interrupts during early boot"), the slab allocator makes
sure we don't attempt to sleep during boot.
Therefore, remove bootmem special cases from the scheduler
and use plain GFP_KERNEL instead.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Norbert reported that nohz_ratelimit() causes his laptop to burn about
4W (40%) extra. For now back out the change and see if we can adjust
the power management code to make better decisions.
Reported-by: Norbert Preining <preining@logic.at>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently we update cpu_power() too often, update_group_power() only
updates the local group's cpu_power but it gets called for all groups.
Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus
calling it, even though a slow update of cpu_power is sufficient.
Therefore move the update under 'idle != CPU_NEWLY_IDLE &&
local_group' to reduce superfluous invocations.
Reported-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1278612989.1900.176.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Suresh spotted that we don't update the rq->clock in the nohz
load-balancer path.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ksym (breakpoint) ftrace plugin has been superseded by perf
tools that are much more poweful to use the cpu breakpoints.
This tracer doesn't bring more feature. It has been deprecated
for a while now, lets remove it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
We count the number of processed objects on a percpu basis,
so we need to go through all the percpu reorder queues to calculate
the sequence number of the next object that needs serialization.
This patch changes this to count the number of processed objects
global. So we can calculate the sequence number and the percpu
reorder queue of the next object that needs serialization without
searching through the percpu reorder queues. This avoids some
accesses to memory of foreign cpus.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
To return -EINPROGRESS on success in padata_do_parallel was
considered to be odd. This patch changes this to return zero
on success. Also the only user of padata, pcrypt is adapted to
convert a return of zero to -EINPROGRESS within the crypto layer.
This also removes the pcrypt fallback if padata_do_parallel
was called on a not running padata instance as we can't handle it
anymore. This fallback was unused, so it's save to remove it.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch fixes a bug when the padata cpumask does not
intersect with the active cpumask. In this case we get a
division by zero in padata_alloc_pd and we end up with a
useless padata instance. Padata can end up with an empty
cpumask for two reasons:
1. A user removed the last cpu that belongs to the padata
cpumask and the active cpumask.
2. The last cpu that belongs to the padata cpumask and the
active cpumask goes offline.
We introduce a function padata_validate_cpumask to check if the padata
cpumask does intersect with the active cpumask. If the cpumasks do not
intersect we mark the instance as invalid, so it can't be used. We do not
allocate the cpumask dependend recources in this case. This fixes the
division by zero and keeps the padate instance in a consistent state.
It's not possible to trigger this bug by now because the only padata user,
pcrypt uses always the possible cpumask.
Reported-by: Dan Kruchinin <dkruchinin@acm.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch makes padata_stop to block until the padata
instance is unused. Also we split padata_stop to a locked
and a unlocked version. This is in preparation to be able
to change the cpumask after a call to patata stop.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch introduces the PADATA_INVALID flag which is
checked on padata start. This will be used to mark a padata
instance as invalid, if the padata cpumask does not intersect
with the active cpumask. we change padata_start to return an
error if the PADATA_INVALID is set. Also we adapt the only
padata user, pcrypt to this change.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
maybe_create_worker() mismanaged locking when worker creation fails
and it has to retry. Fix locking and simplify lock manipulation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yong Zhang <yong.zhang@windriver.com>
I have shown by code review that no driver takes
the BKL at init time any more, so whatever the
init code was locking against is no longer there
and it is now safe to remove the BKL there.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt <rostedt@goodmis>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Variable on the stack is not initialized to zero, do it
explicitly.
This bug was found by a compiler warning:
kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Support string type tracing and printing in kprobe-tracer.
This allows user to trace string data in kernel including __user data. Note
that sometimes __user data may not be accessed if it is paged-out (sorry, but
kprobes operation should be done in atomic, we can not wait for page-in).
Commiter note: Fixed up conflicts with b7e2ece.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We should initialize the module dynamic debug datastructures
only after determining that the module is not loaded yet. This
fixes a bug that introduced in 2.6.35-rc2, where when a trying
to load a module twice, we also load it's dynamic printing data
twice which causes all sorts of nasty issues. Also handle
the dynamic debug cleanup later on failure.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed a #ifdef)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full
ordering among works queued to a workqueue. The same can be achieved
using WQ_UNBOUND as unbound workqueues always use the gcwq for
WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu
locality isn't accessible anyway, serving them with unbound workqueues
should be fine.
Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most
single thread workqueue users will be converted to use multithread or
non-reentrant instead and only the ones which require strict ordering
will keep using WQ_UNBOUND + @max_active of 1.
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch implements unbound workqueue which can be specified with
WQ_UNBOUND flag on creation. An unbound workqueue has the following
properties.
* It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND.
This gcwq is always online and disassociated.
* Workers are not bound to any CPU and not concurrency managed. Works
are dispatched to workers as soon as possible and the only applied
limitation is @max_active. IOW, all unbound workqeueues are
implicitly high priority.
Unbound workqueues can be used as simple execution context provider.
Contexts unbound to any cpu are served as soon as possible.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
In preparation of WQ_UNBOUND addition, make the following changes.
* Add WORK_CPU_* constants for pseudo cpu id numbers used (currently
only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to
allow another pseudo cpu id for unbound cpu.
* Reorder WQ_* flags.
* Make workqueue_struct->cpu_wq a union which contains a percpu
pointer, regular pointer and an unsigned long value and use
kzalloc/kfree() in UP allocation path. This will be used to
implement unbound workqueues which will use only one cwq on SMPs.
* Move alloc_cwqs() allocation after initialization of wq fields, so
that alloc_cwqs() has access to wq->flags.
* Trivial relocation of wq local variables in freeze functions.
These changes don't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
When there's no pending work to do, worker_thread() goes back to sleep
after waking up without checking whether worker management is
necessary. This means that idle worker exit requests can be ignored
if the gcwq stays empty.
Fix it by making worker_thread() always check whether worker
management is necessary before going to sleep.
Signed-off-by: Tejun Heo <tj@kernel.org>
get_work_gcwq() was incorrectly triggering BUG_ON() if cpu number is
equal to or higher than num_possible_cpus() instead of nr_cpu_ids.
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
When one flusher is cascading to the next flusher, it first sets
wq->first_flusher to the next one and sets up the next flush cycle.
If there's nothing to do for the next cycle, it clears
wq->flush_flusher and proceeds to the one after that.
If the woken up flusher checks wq->first_flusher before it gets
cleared, it will incorrectly assume the role of the first flusher,
which triggers BUG_ON() sanity check.
Fix it by checking wq->first_flusher again after grabbing the mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set
the worker must be contributing to nr_running which is only true if
the worker is actually running.
As when called from self, it is guaranteed that the worker is running,
those functions can be safely used from the worker itself and they
aren't necessary from other places anyway. Make the following changes
to fix the bug.
* Make worker_set/clr_flags() whine if not called from self.
* Convert all places which called those functions from other tasks to
manipulate flags directly.
* Make trustee_thread() directly clear nr_running after setting
WORKER_ROGUE on all workers. This is the only place where
nr_running manipulation is necessary outside of workers themselves.
* While at it, add sanity check for nr_running in worker_enter_idle().
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us())
broke things by not making sure preemption was indeed disabled
by the callers of nr_iowait_cpu() which took the iowait value of
the current cpu.
This resulted in a heap of preempt warnings. Cure this by making
nr_iowait_cpu() take a cpu number and fix up the callers to pass
in the right number.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: linux-pm@lists.linux-foundation.org
LKML-Reference: <1277968037.1868.120.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
fs/fs-writeback.c
Merge reason: Resolve the conflict
Note, i picked the version from Linus's tree, which effectively reverts
the fs-writeback.c bits of:
b97181f: fs: remove all rcu head initializations, except on_stack initializations
As the upstream changes to this file changed this code heavily and the
first attempt to resolve the conflict resulted in a non-booting kernel.
It's safer to re-try this portion of the commit cleanly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
futex_find_get_task is currently used (through lookup_pi_state) from two
contexts, futex_requeue and futex_lock_pi_atomic. None of the paths
looks it needs the credentials check, though. Different (e)uids
shouldn't matter at all because the only thing that is important for
shared futex is the accessibility of the shared memory.
The credentail check results in glibc assert failure or process hang (if
glibc is compiled without assert support) for shared robust pthread
mutex with priority inheritance if a process tries to lock already held
lock owned by a process with a different euid:
pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
The problem is that futex_lock_pi_atomic which is called when we try to
lock already held lock checks the current holder (tid is stored in the
futex value) to get the PI state. It uses lookup_pi_state which in turn
gets task struct from futex_find_get_task. ESRCH is returned either
when the task is not found or if credentials check fails.
futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code,
however, doesn't expect that robust lock returns with ESRCH because it
should get either success or owner died.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>