- When tracing capture the kuid.
- When displaying the data to user space convert the kuid into the
user namespace of the process that opened the report file.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
BSD process accounting conveniently passes the file the accounting
records will be written into to do_acct_process. The file credentials
captured the user namespace of the opener of the file. Use the file
credentials to format the uid and the gid of the current process into
the user namespace of the user that started the bsd process
accounting.
Cc: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly limit exit task stat broadcast to the initial user and
pid namespaces, as it is already limited to the initial network
namespace.
- For broadcast task stats explicitly generate all of the idenitiers
in terms of the initial user namespace and the initial pid
namespace.
- For request stats report them in terms of the current user namespace
and the current pid namespace. Netlink messages are delivered
syncrhonously to the kernel allowing us to get the user namespace
and the pid namespace from the current task.
- Pass the namespaces for representing pids and uids and gids
into bacct_add_task.
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly format uids gids in audit messges in the initial user
namespace. This is safe because auditd is restrected to be in
the initial user namespace.
- Convert audit_sig_uid into a kuid_t.
- Enable building the audit code and user namespaces at the same time.
The net result is that the audit subsystem now uses kuid_t and kgid_t whenever
possible making it almost impossible to confuse a raw uid_t with a kuid_t
preventing bugs.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Always store audit loginuids in type kuid_t.
Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.
Modify audit_get_loginuid to return a kuid_t.
Modify audit_set_loginuid to take a kuid_t.
Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.
Modify /proc/<pid>/loginud on write to convert the loginuid
rom the user namespace of the opener of the file.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The audit filter code guarantees that uid are always compared with
uids and gids are always compared with gids, as the comparason
operations are type specific. Take advantage of this proper to define
audit_uid_comparator and audit_gid_comparator which use the type safe
comparasons from uidgid.h.
Build on audit_uid_comparator and audit_gid_comparator and replace
audit_compare_id with audit_compare_uid and audit_compare_gid. This
is one of those odd cases where being type safe and duplicating code
leads to simpler shorter and more concise code.
Don't allow bitmask operations in uid and gid comparisons in
audit_data_to_entry. Bitmask operations are already denined in
audit_rule_to_entry.
Convert constants in audit_rule_to_entry and audit_data_to_entry into
kuids and kgids when appropriate.
Convert the uid and gid field in struct audit_names to be of type
kuid_t and kgid_t respectively, so that the new uid and gid comparators
can be applied in a type safe manner.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only place we use the uid and the pid that we calculate in
audit_receive_msg is in audit_log_common_recv_msg so move the
calculation of these values into the audit_log_common_recv_msg.
Simplify the calcuation of the current pid and uid by
reading them from current instead of reading them from
NETLINK_CREDS.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For user generated audit messages set the portid field in the netlink
header to the netlink port where the user generated audit message came
from. Reporting the process id in a port id field was just nonsense.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use current instead of looking up the current up the current task by
process identifier. Netlink requests are processed in trhe context of
the sending task so this is safe.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that netlink messages are processed in the context of the sender
tty_audit_push_task can be called directly and audit_prepare_user_tty
which only added looking up the task of the tty by process id is
not needed.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Get caller process uid and gid and pid values from the current task
instead of the NETLINK_CB. This is simpler than passing NETLINK_CREDS
from from audit_receive_msg to audit_filter_user_rules and avoid the
chance of being hit by the occassional bugs in netlink uid/gid
credential passing. This is a safe changes because all netlink
requests are processed in the task of the sending process.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This allows the code to safely make the assumption that all of the
uids gids and pids that need to be send in audit messages are in the
initial namespaces.
If someone cares we may lift this restriction someday but start with
limiting access so at least the code is always correct.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This merge is necessary as Lai's CPU hotplug restructuring series
depends on the CPU hotplug bug fixes in for-3.6-fixes.
The merge creates one trivial conflict between the following two
commits.
96e65306b8 "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic"
e2b6a6d570 "workqueue: use system_highpri_wq for highpri workers in rebind_workers()"
Both add local variable definitions to the same block and can be
merged in any order.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull another workqueue fix from Tejun Heo:
"Unfortunately, yet another late fix. This too is discovered and fixed
by Lai. This bug was introduced during this merge window by commit
25511a4776 ("workqueue: reimplement CPU online rebinding to handle
idle workers") which started using WORKER_REBIND flag for idle rebind
too.
The bug is relatively easy to trigger if the CPU rapidly goes through
off, on and then off (and stay off). The fix is on the safer side.
This hasn't been on linux-next yet but I'm pushing early so that it
can get more exposure before v3.6 release."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.
The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pm-sleep:
properly __init-annotate pm_sysrq_init()
PM / wakeup: Use irqsave/irqrestore for events_lock
PM / Freezer: Fix small typo "regrigerator"
PM / Sleep: Print name of wakeup source that aborts suspend
* pm-timers:
PM: Do not use the syscore flag for runtime PM
sh: MTU2: Basic runtime PM support
sh: CMT: Basic runtime PM support
sh: TMU: Basic runtime PM support
PM / Domains: Do not measure start time for "irq safe" devices
PM / Domains: Move syscore flag from subsys data to struct device
PM / Domains: Rename the always_on device flag to syscore
PM / Runtime: Allow helpers to be called by early platform drivers
PM: Reorganize device PM initialization
sh: MTU2: Introduce clock events suspend/resume routines
sh: CMT: Introduce clocksource/clock events suspend/resume routines
sh: TMU: Introduce clocksource/clock events suspend/resume routines
timekeeping: Add suspend and resume of clock event devices
PM / Domains: Add power off/on function for system core suspend stage
PM / Domains: Introduce simplified power on routine for system resume
This patch allows setting of the show_unhandled_signals variable via
/proc/sys/debug/exception-trace. The default value is currently 1
showing unhandled user faults (undefined instructions, data aborts) and
invalid signal stack frames.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
This reverts commit 970e178985.
Nikolay Ulyanitsky reported thatthe 3.6-rc5 kernel has a 15-20%
performance drop on PostgreSQL 9.2 on his machine (running "pgbench").
Borislav Petkov was able to reproduce this, and bisected it to this
commit 970e178985 ("sched: Improve scalability via 'CPU buddies' ...")
apparently because the new single-idle-buddy model simply doesn't find
idle CPU's to reschedule on aggressively enough.
Mike Galbraith suspects that it is likely due to the user-mode spinlocks
in PostgreSQL not reacting well to preemption, but we don't really know
the details - I'll just revert the commit for now.
There are hopefully other approaches to improve scheduler scalability
without it causing these kinds of downsides.
Reported-by: Nikolay Ulyanitsky <lystor@gmail.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
As Oleg pointed out in [0] uprobe should not use the ptrace interface
for enabling/disabling single stepping.
[0] http://lkml.kernel.org/r/20120730141638.GA5306@redhat.com
Add the new "__weak arch" helpers which simply call user_*_single_step()
as a preparation. This is only needed to not break the powerpc port, we
will fold this logic into arch_uprobe_pre/post_xol() hooks later.
We should also change handle_singlestep(), _disable_step(&uprobe->arch)
should be called before put_uprobe().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The wrong MMF_HAS_UPROBES doesn't really hurt, just it triggers
the "slow" and unnecessary handle_swbp() path if the task hits
the non-uprobe breakpoint.
So this patch changes find_active_uprobe() to check every valid
vma and clear MMF_HAS_UPROBES if no uprobes were found. This is
adds the slow O(n) path, but it is only called in unlikely case
when the task hits the normal breakpoint first time after
uprobe_unregister().
Note the "not strictly accurate" comment in mmf_recalc_uprobes().
We can fix this, we only need to teach vma_has_uprobes() to return
a bit more more info, but I am not sure this worth the trouble.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new MMF_RECALC_UPROBES flag, it means that MMF_HAS_UPROBES
can be false positive after remove_breakpoint() or uprobe_munmap().
It is also set by uprobe_dup_mmap(), this is not optimal but simple.
We could add the new hook, uprobe_dup_vma(), to set MMF_HAS_UPROBES
only if the new mm actually has uprobes, but I don't think this
makes sense.
The next patch will use this flag to clear MMF_HAS_UPROBES.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Nobody plays with uprobes_tree/uprobes_treelock in interrupt context,
no need to disable irqs.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Pull perf fixes from Ingo Molnar:
"This tree includes various fixes"
Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
perf/x86: Enable Intel Cedarview Atom suppport
perf_event: Switch to internal refcount, fix race with close()
oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
perf/x86: Fix microcode revision check for SNB-PEBS
Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.
These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.
This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.
* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.
* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.
For now, start with a single warning message. We can whine louder
later on.
v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
WARNING: With this change it is impossible to load external built
controllers anymore.
In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.
By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.
A close look is necessary on the RCU part which was introduces by
following patch:
commit f845172531
Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010
Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010
cls_cgroup: Store classid in struct sock
Tis code was added to init_cgroup_cls()
/* We can't use rcu_assign_pointer because this is an int. */
smp_wmb();
net_cls_subsys_id = net_cls_subsys.subsys_id;
respectively to exit_cgroup_cls()
net_cls_subsys_id = -1;
synchronize_rcu();
and in module version of task_cls_classid()
rcu_read_lock();
id = rcu_dereference(net_cls_subsys_id);
if (id >= 0)
classid = container_of(task_subsys_state(p, id),
struct cgroup_cls_state, css)->classid;
rcu_read_unlock();
Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)
So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.
The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).
So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.
Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.
The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).
No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.
When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
The *_subsys_id will be used as index to access the subsys. Therefore
we need to care we populate the subsystem at the correct position by
using designated initialization.
With this change we are able to interleave builtin and modules in the subsys
array.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Before we are able to define all subsystem ids at compile time we need
a more fine grained control what gets defined when we include
cgroup_subsys.h. For example we define the enums for the subsystems or
to declare for struct cgroup_subsys (builtin subsystem) by including
cgroup_subsys.h and defining SUBSYS accordingly.
Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.
This is done by introducing IS_SUBSYS_ENABLED which then is defined
according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.
We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.
So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.
With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.
And with this change we get a fix for this:
kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
when CONFIG_CGROUP=y and no built in controller was enabled.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Fix kprobes/x86 to support jprobes on ftrace-based kprobes.
Because of -mfentry support of ftrace, ftrace is now put
on the beginning of function where jprobes are put.
Originally ftrace-based kprobes doesn't support jprobe
because it will change regs->ip and ftrace doesn't support
changing IP and ftrace itself doesn't conflict jprobe.
However, ftrace -mfentry support moves mcount call on the
top of functions where jprobes are put. This means that
jprobe always conflicts with ftrace-based kprobe and fails.
This patch allows ftrace-based kprobes to support jprobes
by allowing to modify regs->ip and kprobes breakpoint
handler also allows to skip singlestepping because there
is a ftrace call (not an original instruction).
Link: http://lkml.kernel.org/r/20120905143125.10329.90836.stgit@localhost.localdomain
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 56449f437 "tracing: make the trace clocks available generally",
in April 2009, made trace_clock available unconditionally, since
CONFIG_X86_DS used it too.
Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code",
in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split
out from CONFIG_TRACING for general use) has a dependency on trace_clock. So,
only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING
enabled.
Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Break out the DEBUG_SPINLOCK dependency (requires moving up
UNINLINE_SPIN_UNLOCK, as this was the only one in that block not
depending on that option).
Avoid putting values not selected into the resulting .config -
they are not useful for anything, make the output less legible,
and just consume space: Use "depends on" rather than directly
setting the default from the combined dependency values.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/504DF2AC020000780009A2DF@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano reported seeing multi-second stalls from
keyboard input on his T61 laptop when NOHZ and CPU_IDLE
were enabled on a 32bit kernel.
He bisected the problem down to commit
1e75fa8be9 ("time: Condense timekeeper.xtime into xtime_sec").
After reproducing this issue, I narrowed the problem down
to the fact that timekeeping_get_ns() returns a 64bit
nsec value that hasn't been accumulated. In some cases
this value was being then stored in timespec.tv_nsec
(which is a long).
On 32bit systems, with idle times larger then 4 seconds
(or less, depending on the value of xtime_nsec), the
returned nsec value would overflow 32bits. This limited
kept time from increasing, causing timers to not expire.
The fix is to make sure we don't directly store the
result of timekeeping_get_ns() into a tv_nsec field,
instead using a 64bit nsec value which can then be
added into the timespec via timespec_add_ns().
Reported-and-bisected-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1347405963-35715-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve Rostedt asked for the merge of a single commit, into both
the RCU and the perf/tracing tree:
| Josh made a change to the tracing code that affects both the
| work Paul McKenney and I are currently doing. At the last
| Kernel Summit back in August, Linus said when such a case
| exists, it is best to make a separate branch based off of his
| tree and place the change there. This way, the repositories
| that need to share the change can both pull them in and the
| SHA1 will match for both. Whichever branch is pulled in first
| by Linus will also pull in the necessary change for the other
| branch as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is considered good form to lock the lock you claim to be nested in.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
[ removed nest_lock arg to print_lock_nested_lock_not_held in favour
of hlock->nest_lock, also renamed the lock arg to hlock since its
a held_lock type ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5051A9E7.5040501@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Heteregeneous ARM platform uses arch_scale_freq_power function
to reflect the relative capacity of each core
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no load_balancer to be selected now. It just sets the
state of the nohz tick to stop.
So rename the function, pass the 'cpu' as a parameter and then
remove the useless call from tick_nohz_restart_sched_tick().
[ s/set_nohz_tick_stopped/nohz_balance_enter_idle/g
s/clear_nohz_tick_stopped/nohz_balance_exit_idle/g ]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1347261059-24747-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:
In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.
Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the last architecture to use this has stopped doing so (ARM,
thanks Catalin!) we can remove this complexity from the scheduler
core.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On tickless systems, one CPU runs load balance for all idle CPUs.
The cpu_load of this CPU is updated before starting the load balance
of each other idle CPUs. We should instead update the cpu_load of
the balance_cpu.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347509486-8688-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ptrace_notify() and get_signal_to_deliver() do unnecessary things
before task_work_run():
1. smp_mb__after_clear_bit() is not needed, test_and_clear_bit()
implies mb().
2. And we do not need the barrier at all, in this case we only
care about the "synchronous" works added by the task itself.
3. No need to clear TIF_NOTIFY_RESUME, and we should not assume
task_works is the only user of this flag.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191217.GA4238@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.
However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.
Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191211.GA4228@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change task_work's to use llist-like code to avoid pi_lock
in task_work_add(), this makes it useable under rq->lock.
task_work_cancel() and task_work_run() still use pi_lock
to synchronize with each other.
(This is in preparation for a deadlock fix.)
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191209.GA4221@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull workqueue fixes from Tejun Heo:
"It's later than I'd like but well the timing just didn't work out this
time.
There are three bug fixes. One from before 3.6-rc1 and two from the
new CPU hotplug code. Kudos to Lai for discovering all of them and
providing fixes.
* Atomicity bug when clearing a flag and setting another. The two
operation should have been atomic but wasn't. This bug has existed
for a long time but is unlikely to have actually happened. Fix is
safe. Marked for -stable.
* If CPU hotplug cycles happen back-to-back before workers finish the
previous cycle, the states could get out of sync and it could get
stuck. Fixed by waiting for workers to complete before finishing
hotplug cycle.
* While CPU hotplug is in progress, idle workers could be depleted
which can then lead to deadlock. I think both happening together
is highly unlikely but still better to fix it and the fix isn't too
scary.
There's another workqueue related regression which reported a few days
ago:
https://bugzilla.kernel.org/show_bug.cgi?id=47301
It's a bit of head scratcher but there is a semi-reliable reproduce
case, so I'm hoping to resolve it soonish."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix possible idle worker depletion across CPU hotplug
workqueue: restore POOL_MANAGING_WORKERS
workqueue: fix possible deadlock in idle worker rebinding
workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function
workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To simplify both normal and CPU hotplug paths, worker management is
prevented while CPU hoplug is in progress. This is achieved by CPU
hotplug holding the same exclusion mechanism used by workers to ensure
there's only one manager per pool.
If someone else seems to be performing the manager role, workers
proceed to execute work items. CPU hotplug using the same mechanism
can lead to idle worker depletion because all workers could proceed to
execute work items while CPU hotplug is in progress and CPU hotplug
itself wouldn't actually perform the worker management duty - it
doesn't guarantee that there's an idle worker left when it releases
management.
This idle worker depletion, under extreme circumstances, can break
forward-progress guarantee and thus lead to deadlock.
This patch fixes the bug by using separate mechanisms for manager
exclusion among workers and hotplug exclusion. For manager exclusion,
POOL_MANAGING_WORKERS which was restored by the previous patch is
used. pool->manager_mutex is now only used for exclusion between the
elected manager and CPU hotplug. The elected manager won't proceed
without holding pool->manager_mutex.
This ensures that the worker which won the manager position can't skip
managing while CPU hotplug is in progress. It will block on
manager_mutex and perform management after CPU hotplug is complete.
Note that hotplug may happen while waiting for manager_mutex. A
manager isn't either on idle or busy list and thus the hoplug code
can't unbind/rebind it. Make the manager handle its own un/rebinding.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch restores POOL_MANAGING_WORKERS which was replaced by
pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq
manager exclusion".
There's a subtle idle worker depletion bug across CPU hotplug events
and we need to distinguish an actual manager and CPU hotplug
preventing management. POOL_MANAGING_WORKERS will be used for the
former and manager_mutex the later.
This patch just lays POOL_MANAGING_WORKERS on top of the existing
manager_mutex and doesn't introduce any synchronization changes. The
next patch will update it.
Note that this patch fixes a non-critical anomaly where
too_many_workers() may return %true spuriously while CPU hotplug is in
progress. While the issue could schedule idle timer spuriously, it
didn't trigger any actual misbehavior.
tj: Rewrote patch description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pm_qos_get_value don't return a return code in all cases. It's sure that
anything interesting happend after BUG() but this prevent any compilation
warning.
[rjw: Chaneged the new return value to PM_QOS_DEFAULT_VALUE.]
Signed-off-by: Luis Gonzalez Fernandez <luisgf@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Currently, rebind_workers() and idle_worker_rebind() are two-way
interlocked. rebind_workers() waits for idle workers to finish
rebinding and rebound idle workers wait for rebind_workers() to finish
rebinding busy workers before proceeding.
Unfortunately, this isn't enough. The second wait from idle workers
is implemented as follows.
wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
then returns. If CPU hotplug cycle happens again before one of the
idle workers finishes the above wait_event(), rebind_workers() will
repeat the first part of the handshake - set WORKER_REBIND again and
wait for the idle worker to finish rebinding - and this leads to
deadlock because the idle worker would be waiting for WORKER_REBIND to
clear.
This is fixed by adding another interlocking step at the end -
rebind_workers() now waits for all the idle workers to finish the
above WORKER_REBIND wait before returning. This ensures that all
rebinding steps are complete on all idle workers before the next
hotplug cycle can happen.
This problem was diagnosed by Lai Jiangshan who also posted a patch to
fix the issue, upon which this patch is based.
This is the minimal fix and further patches are scheduled for the next
merge window to simplify the CPU hotplug path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
This doesn't make any functional difference and is purely to help the
next patch to be simpler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
The compiler may compile the following code into TWO write/modify
instructions.
worker->flags &= ~WORKER_UNBOUND;
worker->flags |= WORKER_REBIND;
so the other CPU may temporarily see worker->flags which doesn't have
either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
prematurely.
Fix it by using single explicit assignment via ACCESS_ONCE().
Because idle workers have another WORKER_NOT_RUNNING flag, this bug
doesn't exist for them; however, update it to use the same pattern for
consistency.
tj: Applied the change to idle workers too and updated comments and
patch description a bit.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
While debugging a warning message on PowerPC while using hardware
breakpoints, it was discovered that when perf_event_disable is invoked
through hw_breakpoint_handler function with interrupts disabled, a
subsequent IPI in the code path would trigger a WARN_ON_ONCE message in
smp_call_function_single function.
This patch calls __perf_event_disable() when interrupts are already
disabled, instead of perf_event_disable().
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Signed-off-by: K.Prasad <Prasad.Krishnan@gmail.com>
[naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task]
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain
[ Fixed build error on MIPS. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't mess with file refcounts (or keep a reference to file, for
that matter) in perf_event. Use explicit refcount of its own
instead. Deal with the race between the final reference to event
going away and new children getting created for it by use of
atomic_long_inc_not_zero() in inherit_event(); just have the
latter free what it had allocated and return NULL, that works
out just fine (children of siblings of something doomed are
created as singletons, same as if the child of leader had been
created and immediately killed).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's impossible to enter the else branch if we have set
skip_clock_update in task_yield_fair(), as yield_to_task_fair()
will directly return true after invoke task_yield_fair().
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike others, sched_migration_cost, sched_time_avg and
sched_shares_window doesn't have time unit as suffix. Add them.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345083330-19486-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Various sd->*_idx's are used for refering the rq's load average table
when selecting a cpu to run. However they can be set to any number
with sysctl knobs so that it can crash the kernel if something bad is
given. Fix it by limiting them into the actual range.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit beac4c7e4a ("sched: Remove AFFINE_WAKEUPS feature") removed
use of the flag but left the definition. Get rid of it.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix two kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.
Instead unthrottle rt runqueues before migrating tasks.
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Azat Khuzhin reported high loadavg in Linux v3.6
After checking the upstream scheduler code, I found Peter's commit:
5167e8d541 sched/nohz: Rewrite and fix load-avg computation -- again
not fully applied, missing the call to calc_load_exit_idle().
After that idle exit in sampling window will always be calculated
to non-idle, and the load will be higher than normal.
This patch adds the missing call to calc_load_exit_idle().
Signed-off-by: Charles Wang <muming.wq@taobao.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345449754-27130-1-git-send-email-muming.wq@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rabik and Paul reported two different issues related to the same few
lines of code.
Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).
Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.
The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.
The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.
So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some clock event devices, for example such that belong to PM domains,
need to be handled in a spcial way during the timekeeping suspend
and resume (which takes place in the system core, or "syscore",
stages of system power transitions) in analogy with clock sources.
Introduce .suspend() and .resume() callbacks for clock event devices
that will be executed by timekeeping_suspend/_resume(), respectively,
next the the clock sources' .suspend() and .resume() callbacks.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Introduce function pm_genpd_syscore_switch() and two wrappers around
it, pm_genpd_syscore_poweroff() and pm_genpd_syscore_poweron(),
allowing the callers to let the generic PM domains framework know
that the given device is not necessary any more and its PM domain
can be turned off (the former) or that the given device will be
required immediately, so its PM domain has to be turned on (the
latter) during the system core (syscore) stage of system suspend
(or hibernation) and resume.
These functions will be used for handling devices registered as
clock sources and clock event devices that belong to PM domains.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b14526c ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.
Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.
This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.
Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody does set_orig_insn(verify => false), and I think nobody will.
Remove this argument. IIUC set_orig_insn(verify => false) was needed
to single-step without xol area.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that we have uprobe_dup_mmap() we can fold uprobe_reset_state()
into the new hook and remove it. mmput()->uprobe_clear_state() can't
be called before dup_mmap().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new MMF_HAS_UPROBES flag. It is set by install_breakpoint()
and it is copied by dup_mmap(), uprobe_pre_sstep_notifier() checks
it to avoid the slow path if the task was never probed. Perhaps it
makes sense to check it in valid_vma(is_register => false) as well.
This needs the new dup_mmap()->uprobe_dup_mmap() hook. We can't use
uprobe_reset_state() or put MMF_HAS_UPROBES into MMF_INIT_MASK, we
need oldmm->mmap_sem to avoid the race with uprobe_register() or
mmap() from another thread.
Currently we never clear this bit, it can be false-positive after
uprobe_unregister() or uprobe_munmap() or if dup_mmap() hits the
probed VM_DONTCOPY vma. But this is fine correctness-wise and has
no effect unless the task hits the non-uprobe breakpoint.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
-EEXIST from install_breakpoint() no longer makes sense, all
callers should simply treat it as "success". Change the code
to return zero and simplify register_for_each_vma().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Once install_breakpoint() fails uprobe_mmap() "ignores" all other
uprobes and returns the error.
It was never really needed to to stop after the first error, and
in fact it was always wrong at least in -ENOTSUPP case.
Change uprobe_mmap() to ignore the errors and always return 0.
This is not what we want in the long term, but until we teach
the callers to handle the failure it would be better to remove
the pointless complications. And this doesn't look too bad, the
only "reasonable" error is ENOMEM but in this case the caller
should be oom-killed in the likely case or the system has more
serious problems.
However it makes sense to stop if fatal_signal_pending() == T.
In particular this helps if the task was oom-killed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
1. Kill dup_mmap()->uprobe_mmap(), it was only needed to calculate
new_mm->uprobes_state.count removed by the previous patch.
If the forking process has a pending uprobe (int3) in vma, it will
be copied by copy_page_range(), note that it checks vma->anon_vma
so "Don't copy ptes" is not possible after install_breakpoint()
which does anon_vma_prepare().
2. Remove is_swbp_at_addr() and "int count" in uprobe_mmap(). Again,
this was needed for uprobes_state.count.
As a side effect this fixes the bug pointed out by Srikar,
this code lacked the necessary put_uprobe().
3. uprobe_munmap() becomes a nop after the previous patch. Remove the
meaningless code but do not remove the helper, we will need it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobes_state->count is only needed to avoid the slow path in
uprobe_pre_sstep_notifier(). It is also checked in uprobe_munmap()
but ironically its only goal to decrement this counter. However,
it is very broken. Just some examples:
- uprobe_mmap() can race with uprobe_unregister() and wrongly
increment the counter if it hits the non-uprobe "int3". Note
that install_breakpoint() checks ->consumers first and returns
-EEXIST if it is NULL.
"atomic_sub() if error" in uprobe_mmap() looks obviously wrong
too.
- uprobe_munmap() can race with uprobe_register() and wrongly
decrement the counter by the same reason.
- Suppose an appication tries to increase the mmapped area via
sys_mremap(). vma_adjust() does uprobe_munmap(whole_vma) first,
this can nullify the counter temporarily and race with another
thread which can hit the bp, the application will be killed by
SIGTRAP.
- Suppose an application mmaps 2 consecutive areas in the same file
and one (or both) of these areas has uprobes. In the likely case
mmap_region()->vma_merge() suceeds. Like above, this leads to
uprobe_munmap/uprobe_mmap from vma_merge()->vma_adjust() but then
mmap_region() does another uprobe_mmap(resulting_vma) and doubles
the counter.
This patch only removes this counter and fixes the compile errors,
then we will try to cleanup the changed code and add something else
instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
by the time we get here (after we pass cleanup_ret) uprobe is always is
set. If it is NULL we leave very early in the code.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Since read_opcode() reads from the referenced page and doesnt modify
the page contents nor the page attributes, there is no need to lock
the page.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in root directory and then
re-popluate it.
What I'm doing here is, only remove files which belong to subsystems that
are to be unbinded, and only create files for newly-added subsystems.
The purpose is to have all other files untouched.
This is a preparation for cgroup xattr support.
v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer fixes from Thomas Gleixner:
"Mostly small fixes for the fallout of the timekeeping overhaul in 3.6
along with stable fixes to address an accumulation problem and missing
sanity checks for RTC readouts and user space provided values."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Avoid making adjustments if we haven't accumulated anything
time: Avoid potential shift overflow with large shift values
time: Fix casting issue in timekeeping_forward_now
time: Ensure we normalize the timekeeper in tk_xtime_add
time: Improve sanity checking of timekeeping inputs
The function graph has a test to check if the frame pointer is
corrupted, which can happen with various options of gcc with mcount.
But this is not an issue with -mfentry as -mfentry does not need nor use
frame pointers for function graph tracing.
Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86
(and hopefully soon for other archs). What this does is to have
the function profiler start at the beginning of the function
instead of after the stack is set up. As plain -pg (mcount) is
called after the stack is set up, and in some cases can have issues
with the function graph tracer. It also requires frame pointers to
be enabled.
The -mfentry now calls __fentry__ at the beginning of the function.
This allows for compiling without frame pointers and even has the
ability to access parameters if needed.
If the architecture and the compiler both support -mfentry then
use that instead.
Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.
Paul McKenney points out:
mean offline overhead is 6251/48=130.2 milliseconds.
If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds
Basically, we're never going to get those 120ms back, and the
code is pretty messy.
We get rid of:
1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.
2) The skip_smp_alternatives flag used by suspend.
3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Noticed when digging into a suspend issue in linux-next (next-20120821).
For more details see <http://marc.info/?t=134554708000002&r=1&w=2>.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8be (time:
Condense timekeeper.xtime into xtime_sec)
Use 1ULL instead to ensure we don't overflow on the shift.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8be (time: Condense
timekeeper.xtime into xtime_sec)
Cast it to u64 first.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dca (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)
After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.
Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue. If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.
This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Up to now, for delayed_works, try_to_grab_pending() couldn't be used
from IRQ handlers because IRQs may happen while
delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
This patch makes delayed_work use the new TIMER_IRQSAFE flag for
delayed_work->timer. This makes try_to_grab_pending() and thus
mod_delayed_work_on() safe to call from IRQ handlers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull audit-tree fixes from Miklos Szeredi:
"The audit subsystem maintainers (Al and Eric) are not responding to
repeated resends. Eric did ack them a while ago, but no response
since then. So I'm sending these directly to you."
* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
audit: clean up refcounting in audit-tree
audit: fix refcounting in audit-tree
audit: don't free_chunk() after fsnotify_add_mark()
It seems commit 4a9d4b024a ("switch fput to task_work_add") re-
introduced the problem addressed in 944be0b224 ("close_files(): add
scheduling point")
If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Timer internals are protected with irq-safe locks but timer execution
isn't, so a timer being dequeued for execution and its execution
aren't atomic against IRQs. This makes it impossible to wait for its
completion from IRQ handlers and difficult to shoot down a timer from
IRQ handlers.
This issue caused some issues for delayed_work interface. Because
there's no way to reliably shoot down delayed_work->timer from IRQ
handlers, __cancel_delayed_work() can't share the logic to steal the
target delayed_work with cancel_delayed_work_sync(), and can only
steal delayed_works which are on queued on timer. Similarly, the
pending mod_delayed_work() can't be used from IRQ handlers.
This patch adds a new timer flag TIMER_IRQSAFE, which makes the timer
to be executed without enabling IRQ after dequeueing such that its
dequeueing and execution are atomic against IRQ handlers.
This makes it safe to wait for the timer's completion from IRQ
handlers, for example, using del_timer_sync(). It can never be
executing on the local CPU and if executing on other CPUs it won't be
interrupted until done.
This will enable simplifying delayed_work cancel/mod interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Over time, timer initializers became messy with unnecessarily
duplicated code which are inconsistently spread across timer.h and
timer.c.
This patch cleans up timer initializers.
* timer.c::__init_timer() is renamed to do_init_timer().
* __TIMER_INITIALIZER() added. It takes @flags and all initializers
are wrappers around it.
* init_timer[_on_stack]_key() now take @flags.
* __init_timer[_on_stack]() added. They take @flags and all init
macros are wrappers around them.
* __setup_timer[_on_stack]() added. It uses __init_timer() and takes
@flags. All setup macros are wrappers around the two.
Note that this patch doesn't add missing init/setup combinations -
e.g. init_timer_deferrable_on_stack(). Adding missing ones is
trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To prepare for addition of another flag, generalize timer->base flags
handling.
* Rename from TBASE_*_FLAG to TIMER_* and make them LU constants.
* Define and use TIMER_FLAG_MASK for flags masking so that multiple
flags can be handled correctly.
* Don't dereference timer->base directly even if
!tbase_get_deferrable(). All two such places are already passed in
@base, so use it instead.
* Make sure tvec_base's alignment is large enough for timer->base
flags using BUILD_BUG_ON().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export dummy_irq_chip to modules to allow them to do things such as
irq_set_chip_and_handler(virq,
&dummy_irq_chip,
handle_level_irq);
This fixes
ERROR: "dummy_irq_chip" [drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/871ujstrp6.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export irq_set_chip_and_handler_name() to modules to allow them to
do things such as
irq_set_chip_and_handler(....);
This fixes
ERROR: "irq_set_chip_and_handler_name" \
[drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/873948trpk.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=843640
If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.
Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().
For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.
dup_mmap()->uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.
And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().
Reported-and-tested-by: William Cohen <wcohen@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQMkGhAAoJENZQFvNTUqpAqjsQAJE5iD1LFogC8o/WjvRHz0TY
Y0x+sR/XfW61KYpeq5g+UaKuFU3P44ijCoyks3y5sza97DkYgUwMpEHlLXFSM8Pp
sNOapqY57s24nq3MLrhH1V9w+cSE+m2u/Gi5fGLCQekio9gkOBwYxNGk7vpKri/n
LBRsMozBu/mZjMy20uWOb7Uk8xsAToh+TFaAtjyQ9Snn9nNJj49NUAp37uN888H/
ducMLq32HN5v/6Zd3q6IWdDWgZsHLkIa3R5FIs/GNe3Dih07gtYLmDol4ktPbTFm
yoaWpP5wbtu/62EZlJwE393vMuoeqN/96394ZZQGFafhHVxN4+rcBhXbejBs0T2b
wk/0CzntW8bbUAI/cl3SB9aui//FWOxcjG9aDQ7PsmHzPw1Q4VD0F9Mcod4p+dRX
PsA9q/tST1eAiwzWYthDtj81U7iChINcXKhoZn2xn6+0+aMH+6FFNBmCH8MR5aCU
BvrXhTJjvau/Ym/sILl4Tf4wfssTq49yMsn/YKCwLJ0hg0XlTObWfQRy2MOayXH9
NJvUE+9GSXoTEKhmr1AfTYEG9vObaXZyFwAI74xvPPwUYojCb4ZjEKmG0egW+VGk
IJKFCaJZwwVsGau4aIbFAMP12/L8Qs/Ox91ddCJ0j5TIlSGMaqW5lbV1N1crzlTT
a0GsN49NvhbFttBXrcNX
=0a2X
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF
section was the end goal, several fixes for code that handles all
architectures and cleanups are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevents and make clear what
is that the perf tools needs from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull ftrace updates from Steve Rostedt:
" This patch series extends ftrace function tracing utility to be
more dynamic for its users. It allows for data passing to the callback
functions, as well as reading regs as if a breakpoint were to trigger
at function entry.
The main goal of this patch series was to allow kprobes to use ftrace
as an optimized probe point when a probe is placed on an ftrace nop.
With lots of help from Masami Hiramatsu, and going through lots of
iterations, we finally came up with a good solution. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq(). Replace the latter with
wrappers around system[_freezable]_wq(). The wrapping goes through
inline functions so that __deprecated can be added easily.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work(). Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().
* start_flush_work() no longer takes @wait_executing as the only left
user - flush_work() - always sets it to %true.
* __cancel_work_timer() uses flush_work() instead of wait_on_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixed some checkpatch warnings.
tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.
Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix migration thread runtime bogosity
sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
sched,cgroup: Fix up task_groups list
sched: fix divide by zero at {thread_group,task}_times
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some ground initialization for the
next task to account its cputime.
These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.
Consolidate this by creating a new account_switch_vtime()
callback called in generic code right after a context switch
and that these archs must implement to flush the prev task
cputime and initialize the next task cputime related state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Extract cputime code from the giant sched/core.c and
put it in its own file. This make it easier to deal with
this particular area and de-bloat a bit more core.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Merge alpha architecture update from Michael Cree:
"The Alpha Maintainer, Matt Turner, is currently unavailable, so I have
collected up patches that have been posted to the linux-alpha mailing
list over the last couple of months, and are forwarding them to you in
the hope that you are prepared to accept them via me.
The patches by Al Viro and myself I have been running against kernels
for two months now so have had quite a bit of testing. All except one
patch were intended for the 3.5 kernel but because of Matt's
unavailability never got forwarded to you."
* emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits)
alpha: Fix fall-out from disintegrating asm/system.h
Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
alpha: fix fpu.h usage in userspace
alpha/mm/fault.c: Port OOM changes to do_page_fault
alpha: take kernel_execve() out of entry.S
alpha: take a bunch of syscalls into osf_sys.c
alpha: Use new generic strncpy_from_user() and strnlen_user()
alpha: Wire up cross memory attach syscalls
alpha: Don't export SOCK_NONBLOCK to user space.
New helper: current_thread_info(). Allows to do a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler. We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syscall_get_nr can return -1 in the case that the task is not executing
a system call.
This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.
Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com
Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.
tj: CPU up/downs care quite a bit about latency these days. This
shouldn't hurt anything and makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
Currently, in this case, we use only system_wq. This makes a possible
error situation as there is mismatch between cwq->pool and worker->pool.
To prevent this, we should use system_highpri_wq for highpri worker
to match theses. This implements it.
tj: Rephrased comment a bit.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
or highpri worker. But, we don't consider this difference in rebind_workers(),
we use just system_wq for highpri worker. It makes mismatch between
cwq->pool and worker->pool.
It doesn't make error in current implementation, but possible in the future.
Now, we introduce system_highpri_wq to use proper cwq for highpri workers
in rebind_workers(). Following patch fix this issue properly.
tj: Even apart from rebinding, having system_highpri_wq generally
makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We assign cpu id into work struct's data field in __queue_delayed_work_on().
In current implementation, when work is come in first time,
current running cpu id is assigned.
If we do __queue_delayed_work_on() with CPU A on CPU B,
__queue_work() invoked in delayed_work_timer_fn() go into
the following sub-optimal path in case of WQ_NON_REENTRANT.
gcwq = get_gcwq(cpu);
if (wq->flags & WQ_NON_REENTRANT &&
(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND.
It is sufficient to prevent to go into sub-optimal path.
tj: Slightly rephrased the comment.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we do tracing workqueue_queue_work(), it records requested cpu.
But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
requested cpu is changed as local cpu.
In case of @wq->flag & WQ_UNBOUND, above change is not occured,
therefore it is reasonable to correct it.
Use temporary local variable for storing requested cpu.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
size of pools, definition of worker_pool in gcwq doesn't use it.
Using it makes code robust and prevent future mistakes.
So change code to use this enum value.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Unexpected behavior could occur if the time is set to a value large
enough to overflow a 64bit ktime_t (which is something larger then the
year 2262).
Also unexpected behavior could occur if large negative offsets are
injected via adjtimex.
So this patch improves the sanity check timekeeping inputs by
improving the timespec_valid() check, and then makes better use of
timespec_valid() to make sure we don't set the time to an invalid
negative value or one that overflows ktime_t.
Note: This does not protect from setting the time close to overflowing
ktime_t and then letting natural accumulation cause the overflow.
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Drop the initial reference by fsnotify_init_mark early instead of
audit_tree_freeing_mark() at destroy time.
In the cases we destroy the mark before we drop the initial reference we need to
get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Refcounting of fsnotify_mark in audit tree is broken. E.g:
refcount
create_chunk
alloc_chunk 1
fsnotify_add_mark 2
untag_chunk
fsnotify_get_mark 3
fsnotify_destroy_mark
audit_tree_freeing_mark 2
fsnotify_put_mark 1
fsnotify_put_mark 0
via destroy_list
fsnotify_mark_destroy -1
This was reported by various people as triggering Oops when stopping auditd.
We could just remove the put_mark from audit_tree_freeing_mark() but that would
break freeing via inode destruction. So this patch simply omits a put_mark
after calling destroy_mark or adds a get_mark before.
The additional get_mark is necessary where there's no other put_mark after
fsnotify_destroy_mark() since it assumes that the caller is holding a reference
(or the inode is keeping the mark pinned, not the case here AFAICS).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Valentin Avram <aval13@gmail.com>
Reported-by: Peter Moody <pmoody@google.com>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref
via the destroy list and this results in use-after-free.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
There is a least one modular user so export free_pid_ns so modules can
capture and use the pid namespace on the very rare occasion when it
makes sense.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Correct a long standing omission and use struct pid in the owner
field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
This guarantees we don't have issues when pid wraparound occurs.
Use a kuid_t in the owner field of struct ip6_flowlabel when the
share type is IPV6_FL_S_USER to add user namespace support.
In /proc/net/ip6_flowlabel capture the current pid namespace when
opening the file and release the pid namespace when the file is
closed ensuring we print the pid owner value that is meaning to
the reader of the file. Similarly use from_kuid_munged to print
uid values that are meaningful to the reader of the file.
This requires exporting pid_nr_ns so that ipv6 can continue to built
as a module. Yoiks what silliness
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.
There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
the former already does smp_wmb() but the latter doesn't have any.
Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer. The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.
Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer. Currently, it chooses the local CPU
flush is running on. This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.
There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off. This patch adds delayed_work->cpu to
remember the CPU it's queued for.
Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change. As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Since power saving code was removed from sched now, the implement
code is out of service in this function, and even pollute other logical.
like, 'want_sd' never has chance to be set '0', that remove the effect
of SD_WAKE_AFFINE here.
So, clean up the obsolete code, includes SD_PREFER_LOCAL.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As we already have dst_rq in lb_env, using or changing "this_rq" do not
make sense.
This patch will replace "this_rq" with dst_rq in load_balance, and we
don't need to change "this_rq" while process LBF_SOME_PINNED any more.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a comment on top of the schedule() function to explain
to scheduler newbies how the main scheduler function is entered.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Explained-by: Ingo Molnar <mingo@kernel.org>
Explained-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It should be sched_nr_latency so fix it before it annoys me more.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make stop scheduler class do the same accounting as other classes,
Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start. The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.
%CPU PID USER CMD
99.3 45 root [migration/10]
97.7 53 root [migration/12]
97.0 57 root [migration/13]
90.1 49 root [migration/11]
89.6 65 root [migration/15]
88.7 17 root [migration/3]
80.4 37 root [migration/8]
78.1 41 root [migration/9]
44.2 13 root [migration/2]
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance. This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.3+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.
This problem can be triggered in practice on very long lived processes:
PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
#0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
#1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
#2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
#3 [ffff880472a51cd0] die at ffffffff8100f26b
#4 [ffff880472a51d00] do_trap at ffffffff814f03f4
#5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
#6 [ffff880472a51e00] divide_error at ffffffff8100be7b
[exception RIP: thread_group_times+0x56]
RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
#8 [ffff880472a51f40] sys_times at ffffffff81088524
#9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b
Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Portante reported that for large cgroup hierarchies (and or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.
His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.
It was found that:
schedule()
idle_balance()
load_balance()
local_irq_save()
double_rq_lock()
update_h_load()
walk_tg_tree(tg_load_down)
tg_load_down()
Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.
This patch does two things, it removes the work from under rq->lock
based on the good principle of race and pray which is widely employed
in the load-balancer as a whole. And secondly it throttles the
update_h_load() calculation to max once per jiffy.
I considered excluding update_h_load() for new-idle balance
all-together, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).
Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>