The sleep-time and graph-time options are only for the function graph tracer
and are not used by anything else. As tracer options are now visible when
the tracer is not activated, its better to move the function graph specific
tracer options into the function graph tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the effort to move the global trace_flags to the tracing instances, the
direct access to trace_flags must be removed from trace_printk.c
Instead, add a new trace_printk_enabled boolean that is set by a new access
function trace_printk_control(), that will enable or disable trace_printk.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a enum that denotes the last bit of the trace_flags and have a
BUILD_BUG_ON(last_bit > 32).
If we add more bits than we have in trace_flags, the kernel wont build.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are options that are unique to a specific tracer (like function and
function graph). Currently, these options are only visible in the options
directory when the tracer is enabled.
This has been a pain, especially for something like the func_stack_trace
option that if used inappropriately, could bring the system to a crawl. But
the only way to see it, is to enable the function tracer.
For example, if one had done:
# cd /sys/kernel/tracing
# echo __schedule > set_ftrace_filter
# echo 1 > options/func_stack_trace
# echo function > current_tracer
The __schedule call will be traced and a stack trace will also be recorded
there. Now when you were done, you may do...
# echo nop > current_tracer
# echo > set_ftrace_filter
But you forgot to disable the func_stack_trace. The only way to disable it
is to re-enable function tracing first. If you do not add a filter to
set_ftrace_filter and just do:
# echo function > current_tracer
Now you would be performing a stack trace on *every* function! On some
systems, that causes a live lock. Others may take a few minutes to fix your
mistake.
Having the func_stack_trace option visible allows you to check it and
disable it before enabling the funtion tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Only create the stacktrace trace option when CONFIG_STACKTRACE is
configured.
Cleaned up the ftrace_trace_stack() function call a little to allow better
encapsulation of the stacktrace trace flag.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function tracer is not compiled in, do not create the option files
for it.
Fix up both the sched_wakeup and irqsoff tracers to handle the change.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use a cute little macro trick to keep the names of the trace flags file
guaranteed to match the corresponding masks.
The macro TRACE_FLAGS is defined as a serious of enum names followed by
the string name of the file that matches it. For example:
#define TRACE_FLAGS \
C(PRINT_PARENT, "print-parent"), \
C(SYM_OFFSET, "sym-offset"), \
C(SYM_ADDR, "sym-addr"), \
C(VERBOSE, "verbose"),
Now we can define the following:
#undef C
#define C(a, b) TRACE_ITER_##a##_BIT
enum trace_iterator_bits { TRACE_FLAGS };
The above creates:
enum trace_iterator_bits {
TRACE_ITER_PRINT_PARENT_BIT,
TRACE_ITER_SYM_OFFSET_BIT,
TRACE_ITER_SYM_ADDR_BIT,
TRACE_ITER_VERBOSE_BIT,
};
Then we can redefine C as:
#undef C
#define C(a, b) TRACE_ITER_##a = (1 << TRACE_ITER_##a##_BIT)
enum trace_iterator_flags { TRACE_FLAGS };
Which creates:
enum trace_iterator_flags {
TRACE_ITER_PRINT_PARENT = (1 << TRACE_ITER_PRINT_PARENT_BIT),
TRACE_ITER_SYM_OFFSET = (1 << TRACE_ITER_SYM_OFFSET_BIT),
TRACE_ITER_SYM_ADDR = (1 << TRACE_ITER_SYM_ADDR_BIT),
TRACE_ITER_VERBOSE = (1 << TRACE_ITER_VERBOSE_BIT),
};
Then finally we can create the list of file names:
#undef C
#define C(a, b) b
static const char *trace_options[] = {
TRACE_FLAGS
NULL
};
Which creates:
static const char *trace_options[] = {
"print-parent",
"sym-offset",
"sym-addr",
"verbose",
NULL
};
The importance of this is that the strings match the bit index.
trace_options[TRACE_ITER_SYM_ADDR_BIT] == "sym-addr"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using enums with FLAG_BIT and then defining a FLAG = (1 << FLAG_BIT), is a
bit more robust as we require that there are no bits out of order or skipped
to match the file names that represent the bits.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There was a time where the function tracing would disable interrupts unless
specifically told not to, where it would only disable preemption. With the
new lockless code, the function tracing never disalbes interrupts and just
uses disabling of preemption. Remove the option "ftrace_preempt" as it does
nothing anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to facilitate making all tracer options visible even when the
tracer is not active, we need to get rid of duplicate options. Any option
that is shared between multiple tracers really should be a main option.
As the wakeup and irqsoff tracers both use the "display-graph" option, and
use it exactly the same way, move that option from the tracer options to the
main options and consolidate them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_user_ip() is used in only one location in one file. Turn it into a
static function. We could inject its code into the caller, but that would
make the code a bit too complex. Keep the code separate.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_userip_objs() is used only in one location, in one file. Instead
of having it as an external function, go one further than making it static,
but inject is code into its only user. It doesn't make the calling function
much more complex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having trace options be per instance, the trace_array
needs to be passed to the trace_buffer_unlock_commit(). The
trace_event_buffer_lock_reserve() already passes in the trace_event_file
where the trace_array can be derived from.
Also added a "__init" to the boot up test event plus function tracing
function function_test_events_call().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_trace_stack_regs() is used in only one place, and because that is
such a simple function, just move its code into the location that it was
used in (trace_buffer_unlock_commit_regs()).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes is_good_name return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1442929393-4753-2-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJV+/ucAAoJEL/70l94x66DV8YH/1KDym/1GJ+/Br/YkHZnM53l
3Q0PwSLu9cNcIL9lUuDLwGTaVj+y8ud1Hjr/uzvKwivktmUYVZhkdtnZmnanvGOM
qKB9K3nFXCPx8uqy8Dn7fOwEKcg9FmDOTTkWy13HDnXO+V4crSVVt+rPw+6FUMld
NV5tYdw9Lu7y3XrveDebPWaPtyDL7OAagzmeK47eMffxG7X9Hf1H2aT7HueRi7x/
SkLIe3gmiOWmHVJDPE9TOmFYIj19gywDFysKes1gdVJLVUIXiELMT7SrvAYnToVB
zISIEj7Zx4SINPxpf2dUn8REm7NsmJY+PffLIl/Nv+ozGggFQGFH0SMZ08p0bxw=
=tfmn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Mostly stable material, a lot of ARM fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
sched: access local runqueue directly in single_task_running
arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS'
arm64: KVM: Remove all traces of the ThumbEE registers
arm: KVM: Disable virtual timer even if the guest is not using it
arm64: KVM: Disable virtual timer even if the guest is not using it
arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources
KVM: s390: Replace incorrect atomic_or with atomic_andnot
arm: KVM: Fix incorrect device to IPA mapping
arm64: KVM: Fix user access for debug registers
KVM: vmx: fix VPID is 0000H in non-root operation
KVM: add halt_attempted_poll to VCPU stats
kvm: fix zero length mmio searching
kvm: fix double free for fast mmio eventfd
kvm: factor out core eventfd assign/deassign logic
kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
KVM: make the declaration of functions within 80 characters
KVM: arm64: add workaround for Cortex-A57 erratum #852523
KVM: fix polling for guest halt continued even if disable it
arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores
arm64: KVM: set {v,}TCR_EL2 RES1 bits
...
Pull irq updates from Thomas Gleixner:
"This is a rather large update post rc1 due to the final steps of
cleanups and API changes which had to wait for the preparatory patches
to hit your tree.
- Regression fixes for ARM GIC irqchips
- Regression fixes and lockdep anotations for renesas irq chips
- The leftovers of the cleanup and preparatory patches which have
been ignored by maintainers
- Final conversions of the newly merged users of obsolete APIs
- Final removal of obsolete APIs
- Final removal of ARM artifacts which had been introduced during the
conversion of ARM to the generic interrupt code.
- Final split of the irq_data into chip specific and common data to
reflect the needs of hierarchical irq domains.
- Treewide removal of the first argument of interrupt flow handlers,
i.e. the irq number, which is not used by the majority of handlers
and simple to retrieve from the other argument the irq descriptor.
- A few comment updates and build warning fixes"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
arm64: Remove ununsed set_irq_flags
ARM: Remove ununsed set_irq_flags
sh: Kill off set_irq_flags usage
irqchip: Kill off set_irq_flags usage
gpu/drm: Kill off set_irq_flags usage
genirq: Remove irq argument from irq flow handlers
genirq: Move field 'msi_desc' from irq_data into irq_common_data
genirq: Move field 'affinity' from irq_data into irq_common_data
genirq: Move field 'handler_data' from irq_data into irq_common_data
genirq: Move field 'node' from irq_data into irq_common_data
irqchip/gic-v3: Use IRQD_FORWARDED_TO_VCPU flag
irqchip/gic: Use IRQD_FORWARDED_TO_VCPU flag
genirq: Provide IRQD_FORWARDED_TO_VCPU status flag
genirq: Simplify irq_data_to_desc()
genirq: Remove __irq_set_handler_locked()
pinctrl/pistachio: Use irq_set_handler_locked
gpio: vf610: Use irq_set_handler_locked
powerpc/mpc8xx: Use irq_set_handler_locked()
powerpc/ipic: Use irq_set_handler_locked()
powerpc/cpm2: Use irq_set_handler_locked()
...
Commit 2ee507c472 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id. When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the currrent task is
bound to the local cpu (e.g. kernel worker).
With commit f781951299 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.
To avoid adding preemption in that cases, as it would limit the usefulness,
we change single_task_running to access directly the cpu local runqueue.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: 2ee507c472
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull timer fixes from Ingo Molnar:
"A fix for an abs()/abs64() bug that caused too slow NTP convergence on
32-bit kernels, plus a removal of an obsolete clockevents driver
facility after all users got converted during the merge window"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Remove unused set_mode() callback
time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64()
Pull scheduler fixes from Ingo Molnar:
"A migrate_tasks() locking fix, and a late-coming nohz change plus a
nohz debug check"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: 'Annotate' migrate_tasks()
nohz: Assert existing housekeepers when nohz full enabled
nohz: Affine unpinned timers to housekeepers
Pull locking fixes from Ingo Molnar:
"Spinlock performance regression fix, plus documentation fixes"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/static_keys: Fix up the static keys documentation
locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support
locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
locking/static_keys: Fix a silly typo
Most interrupt flow handlers do not use the irq argument. Those few
which use it can retrieve the irq number from the irq descriptor.
Remove the argument.
Search and replace was done with coccinelle and some extra helper
scripts around it. Thanks to Julia for her help!
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
MSI descriptors are per-irq instead of per irqchip, so move it into
struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-35-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Irq affinity mask is per-irq instead of per irqchip, so move it into
struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433303281-27688-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Handler data (handler_data) is per-irq instead of per irqchip, so move
it into struct irq_common_data.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-13-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
NUMA node information is per-irq instead of per-irqchip, so move it into
struct irq_common_data. Also use CONFIG_NUMA to guard irq_common_data.node.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433145945-789-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The internal clocksteering done for fine-grained error
correction uses a logarithmic approximation, so any time
adjtimex() adjusts the clock steering, timekeeping_freqadjust()
quickly approximates the correct clock frequency over a series
of ticks.
Unfortunately, the logic in timekeeping_freqadjust(), introduced
in commit:
dc491596f6 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
used the abs() function with a s64 error value to calculate the
size of the approximated adjustment to be made.
Per include/linux/kernel.h:
"abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".
Thus on 32-bit platforms, this resulted in the clocksteering to
take a quite dampended random walk trying to converge on the
proper frequency, which caused the adjustments to be made much
slower then intended (most easily observed when large
adjustments are made).
This patch fixes the issue by using abs64() instead.
Reported-by: Nuno Gonçalves <nunojpg@gmail.com>
Tested-by: Nuno Goncalves <nunojpg@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org> # v3.17+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge fourth patch-bomb from Andrew Morton:
- sys_membarier syscall
- seq_file interface changes
- a few misc fixups
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
mm/early_ioremap: add explicit #include of asm/early_ioremap.h
fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
selftests: enhance membarrier syscall test
selftests: add membarrier syscall test
sys_membarrier(): system-wide memory barrier (generic, x86)
MODSIGN: fix a compilation warning in extract-cert
- Build fix for the new Mediatek MT8173 cpufreq driver (Guenter Roeck).
- Generic power domains framework fixes (power on error code
path, subdomain removal) and cleanup of a deprecated API user
(Geert Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to
the new Operating Performance Points Device Tree bindings
introduced recently (Viresh Kumar).
- Suspend frequency support for the cpufreq-dt driver
(Bartlomiej Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- Additional sanity check in the cpuidle core (Xunlei Pang).
- Fix for a comment related to CPU power management (Lina Iyer).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJV80XwAAoJEILEb/54YlRxIAkP/RrT4mDO9QZkocO+XYDBB739
oi1aaSDmmVVKlFtFiyX4njS7S+U5UC8PYn5o/fFtVnP1kD9yuZkoak65U3t7bsKK
+rjun1SzJeooaLXW4ZAJKyPldi/RrsfRJ2JOaEt2vK03ppjx2mouK1VpjNj+qmYd
wwFb4q5S5Rcpph0Sx+xxhpZKoOgBmXk7nWIe5bWDxuORdx4yoxp4lus/P223wFJy
r0kJnNyY55mnY4Yx5e3L8W1rMC/Sf6mt8pyMvqRpsKdBHaEbhmGsiERSi0eCZJwr
AmL666TVgPk6zNM/yc3mFOTRZeAF3soSWF5R6+n2TswWvWFH+ksM14g9ndb7fTv3
FUuOU3jyfSUj2T4xr0qUZLaQUFB3xNrnWBNnm/CzPwjJtH1u1SK8uBSRqdZnEdGq
EeqwnwDLFeSAbG6lQVsKVVAPZHolDCnJPsvZebKdnJH09kjgUdm60BLOnif4QDgw
ULPZ6NToFZkrQSMAX1Uqm42iv/qWcHIgS5x43jAmblY3TsuByqRrI7gBCy+hbGSN
cQfeCxTdVPi/tUPYlJFmWn43dgEsnceSFKvlQRXGrZJM29BFvhVzJQjTFC8Oj79q
OP/Rv28+VeV5J531BQu+w8cvg73GJ3DDab+i91ZkIsd7ZNjRYQKXBfR9TImTSD4l
BPjvaPXwDg6EPhwjcpK/
=C7+4
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"These are mostly fixes and cleanups on top of the previous PM+ACPI
pull request (cpufreq core and drivers, cpuidle, generic power domains
framework). Some of them didn't make to that pull request and some
fix issues introduced by it.
The only really new thing is the support for suspend frequency in the
cpufreq-dt driver, but it is needed to fix an issue with Exynos
platforms.
Specifics:
- build fix for the new Mediatek MT8173 cpufreq driver (Guenter
Roeck).
- generic power domains framework fixes (power on error code path,
subdomain removal) and cleanup of a deprecated API user (Geert
Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to the
new Operating Performance Points Device Tree bindings introduced
recently (Viresh Kumar).
- suspend frequency support for the cpufreq-dt driver (Bartlomiej
Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- additional sanity check in the cpuidle core (Xunlei Pang).
- fix for a comment related to CPU power management (Lina Iyer)"
* tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_pstate: fix PCT_TO_HWP macro
intel_pstate: Fix user input of min/max to legal policy region
PM / OPP: Return suspend_opp only if it is enabled
cpufreq-dt: add suspend frequency support
cpufreq: allow cpufreq_generic_suspend() to work without suspend frequency
PM / OPP: add dev_pm_opp_get_suspend_opp() helper
staging: board: Migrate away from __pm_genpd_name_add_device()
cpufreq: Use __func__ to print function's name
cpufreq: staticize cpufreq_cpu_get_raw()
PM / Domains: Ensure subdomain is not in use before removing
cpufreq: Add ARM_MT8173_CPUFREQ dependency on THERMAL
cpuidle/coupled: Add sanity check for safe_state_index
PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
cpufreq: dt: Tolerance applies on both sides of target voltage
cpufreq: dt: Print error on failing to mark OPPs as shared
cpufreq: dt: Check OPP count before marking them shared
kernel/cpu_pm: fix cpu_cluster_pm_exit comment
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.
The existing applications of which I am aware that would be improved by
this system call are as follows:
* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)
Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().
* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.
To explain the benefit of this scheme, let's introduce two example threads:
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())
In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".
Before the change, we had, for each smp_mb() pairs:
Thread A Thread B
previous mem accesses previous mem accesses
smp_mb() smp_mb()
following mem accesses following mem accesses
After the change, these pairs become:
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).
1) Non-concurrent Thread A vs Thread B accesses:
Thread A Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
prev mem accesses
barrier()
follow mem accesses
In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.
2) Concurrent Thread A vs Thread B accesses
Thread A Thread B
prev mem accesses prev mem accesses
sys_membarrier() barrier()
follow mem accesses follow mem accesses
In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().
* Benchmarks
On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)
1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
* User-space user of this system call: Userspace RCU library
Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes
The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.
An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.
This patch adds the system call to x86 and to asm-generic.
[1] http://urcu.so
membarrier(2) man page:
MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
NAME
membarrier - issue memory barriers on a set of threads
SYNOPSIS
#include <linux/membarrier.h>
int membarrier(int cmd, int flags);
DESCRIPTION
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.
MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all pro=E2=80=90
cesses running on the system. This command returns 0.
The flags argument needs to be 0. For future extensions.
All memory accesses performed in program order from each targeted
thread is guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():
The pair ordering is detailed as (O: ordered, X: not ordered):
barrier() smp_mb() sys_membarrier()
barrier() X X O
smp_mb() X O O
sys_membarrier() O O O
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.
ERRORS
ENOSYS System call is not implemented.
EINVAL Invalid arguments.
Linux 2015-04-15 MEMBARRIER(2)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pm-cpu:
kernel/cpu_pm: fix cpu_cluster_pm_exit comment
* pm-cpuidle:
cpuidle/coupled: Add sanity check for safe_state_index
* pm-domains:
staging: board: Migrate away from __pm_genpd_name_add_device()
PM / Domains: Ensure subdomain is not in use before removing
PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
Kernel testing triggered this warning:
| WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
| Modules linked in:
| CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
| Call Trace:
| dump_stack+0x4b/0x75
| warn_slowpath_common+0x8b/0xc0
| warn_slowpath_null+0x22/0x30
| do_set_cpus_allowed+0x7e/0x80
| cpuset_cpus_allowed_fallback+0x7c/0x170
| select_fallback_rq+0x221/0x280
| migration_call+0xe3/0x250
| notifier_call_chain+0x53/0x70
| __raw_notifier_call_chain+0x1e/0x30
| cpu_notify+0x28/0x50
| take_cpu_down+0x22/0x40
| multi_cpu_stop+0xd5/0x140
| cpu_stopper_thread+0xbc/0x170
| smpboot_thread_fn+0x174/0x2f0
| kthread+0xc4/0xe0
| ret_from_kernel_thread+0x21/0x30
As Peterz pointed out:
| So the normal rules for changing task_struct::cpus_allowed are holding
| both pi_lock and rq->lock, such that holding either stabilizes the mask.
|
| This is so that wakeup can happen without rq->lock and load-balance
| without pi_lock.
|
| From this we already get the relaxation that we can omit acquiring
| rq->lock if the task is not on the rq, because in that case
| load-balancing will not apply to it.
|
| ** these are the rules currently tested in do_set_cpus_allowed() **
|
| Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
| unconditionally acquires both locks, we could get away with holding just
| rq->lock when on_rq for modification because that'd still exclude
| __set_cpus_allowed_ptr(), it would also work against
| __kthread_bind_mask() because that assumes !on_rq.
|
| That said, this is all somewhat fragile.
|
| Now, I don't think dropping rq->lock is quite as disastrous as it
| usually is because !cpu_active at this point, which means load-balance
| will not interfere, but that too is somewhat fragile.
|
| So we end up with a choice of two fragile..
This patch fixes it by following the rules for changing
task_struct::cpus_allowed with both pi_lock and rq->lock held.
Reported-by: kernel test robot <ying.huang@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
set and Linus noted that the test-and-set implementation was retarded.
One should spin on the variable with a load, not a RMW.
While there, remove 'queued' from the name, as the lock isn't queued
at all, but a simple test-and-set.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge third patch-bomb from Andrew Morton:
- even more of the rest of MM
- lib/ updates
- checkpatch updates
- small changes to a few scruffy filesystems
- kmod fixes/cleanups
- kexec updates
- a dma-mapping cleanup series from hch
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
dma-mapping: consolidate dma_set_mask
dma-mapping: consolidate dma_supported
dma-mapping: cosolidate dma_mapping_error
dma-mapping: consolidate dma_{alloc,free}_noncoherent
dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
mm: make sure all file VMAs have ->vm_ops set
mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
mm: mark most vm_operations_struct const
namei: fix warning while make xmldocs caused by namei.c
ipc: convert invalid scenarios to use WARN_ON
zlib_deflate/deftree: remove bi_reverse()
lib/decompress_unlzma: Do a NULL check for pointer
lib/decompressors: use real out buf size for gunzip with kernel
fs/affs: make root lookup from blkdev logical size
sysctl: fix int -> unsigned long assignments in INT_MIN case
kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
kexec: align crash_notes allocation to make it be inside one physical page
kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
kexec: split kexec_load syscall from kexec core code
...
Pull networking fixes from David Miller:
1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
Kadlecsik.
2) Use correct free operation on netfilter conntrack templates, from
Daniel Borkmann.
3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.
4) Fix sizeof(pointer) in mac80211, from Thierry Reding.
5) Fix cache pointer comparison in ip6mr leading to missed unlock of
mrt_lock. From Richard Laing.
6) rds_conn_lookup() needs to consider network namespace in key
comparison, from Sowmini Varadhan.
7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
Dmitriy.
8) Fix fd leaks in bpf syscall, from Daniel Borkmann.
9) Fix error recovery when installing ipv6 multipath routes, we would
delete the old route before we would know if we could fully commit
to the new set of nexthops. Fix from Roopa Prabhu.
10) Fix run-time suspend problems in r8152, from Hayes Wang.
11) In fec, don't program the MAC address into the chip when the clocks
are gated off. From Fugang Duan.
12) Fix poll behavior for netlink sockets when using rx ring mmap, from
Daniel Borkmann.
13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
driver, from Corinna Vinschen.
14) In TCP Cubic congestion control, handle idle periods better where we
are application limited, in order to keep cwnd from growing out of
control. From Eric Dumzet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
tcp_cubic: better follow cubic curve after idle period
tcp: generate CA_EVENT_TX_START on data frames
xen-netfront: respect user provided max_queues
xen-netback: respect user provided max_queues
r8169: Fix sleeping function called during get_stats64, v2
ether: add IEEE 1722 ethertype - TSN
netlink, mmap: fix edge-case leakages in nf queue zero-copy
netlink, mmap: don't walk rx ring on poll if receive queue non-empty
cxgb4: changes for new firmware 1.14.4.0
net: fec: add netif status check before set mac address
r8152: fix the runtime suspend issues
r8152: split DRIVER_VERSION
ipv6: fix ifnullfree.cocci warnings
add microchip LAN88xx phy driver
stmmac: fix check for phydev being open
net: qlcnic: delete redundant memsets
net: mv643xx_eth: use kzalloc
net: jme: use kzalloc() instead of kmalloc+memset
net: cavium: liquidio: use kzalloc in setup_glist()
net: ipv6: use common fib_default_rule_pref
...
The following
if (val < 0)
*lvalp = (unsigned long)-val;
is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern. This is
a problem if val == INT_MIN:
# echo -2147483648 >/proc/sys/dev/scsi/logging_level
# cat /proc/sys/dev/scsi/logging_level
-18446744071562067968
Cast to unsigned long before negation - that way we first sign-extend and
then negate an unsigned, which is well defined. With this:
# cat /proc/sys/dev/scsi/logging_level
-2147483648
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cc: Mikulas Patocka <mikulas@twibright.com>
Cc: Robert Xiao <nneonneo@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
accordingly the MODULES_VADDR is changed to 0xffffffffa0000000. However,
in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
And the kernel text mapping addr space is enlarged from 512M to 1G. That
means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
support is not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:
#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
So when kaslr is compiled in and enabled, the kernel text mapping addr
space and modules vaddr space need be adjusted. Otherwise makedumpfile
will collapse since the addr for some symbols is not correct.
Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
makedumpfile to help calculate MODULES_VADDR.
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
People reported that crash_notes in /proc/vmcore were corrupted and this
cause crash kdump failure. With code debugging and log we got the root
cause. This is because percpu variable crash_notes are allocated in 2
vmalloc pages. Currently percpu is based on vmalloc by default. Vmalloc
can't guarantee 2 continuous vmalloc pages are also on 2 continuous
physical pages. So when 1st kernel exports the starting address and size
of crash_notes through sysfs like below:
/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size
kdump kernel use them to get the content of crash_notes. However the 2nd
part may not be in the next neighbouring physical page as we expected if
crash_notes are allocated accross 2 vmalloc pages. That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failure or
some warnings.
In this patch change to call __alloc_percpu() to passed in the align value
by rounding crash_notes_size up to the nearest power of two. This makes
sure the crash_notes is allocated inside one physical page since
sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE. Meanwhile add
a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
crash_notes definitely will be in 2 pages. That need be avoided, and need
be reported if it's unavoidable.
[akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transforming PFN(Page Frame Number) to struct page is never failure, so we
can simplify the code logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional judgement.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two kexec load syscalls, kexec_load another and kexec_file_load.
kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
split kexec_load syscall code to kernel/kexec.c.
And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.
The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
kexec_load syscall can bypass the checking.
Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.
Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
kexec_load syscall.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.
[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The UMH_WAIT_PROC handler runs in its own thread in order to make sure
that waiting for the exec kernel thread completion won't block other
usermodehelper queued jobs.
On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue. But now the workqueue subsystem handles
that. Khelper still had the older limitation due to its singlethread
properties but we replaced it to system unbound workqueues.
Those are affine to the current node and can block up to some number of
instances.
They are a good candidate to handle UMH_WAIT_PROC assuming that we have
enough system unbound workers to handle lots of parallel usermodehelper
jobs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.
Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need about ordering as all
we do is creating kernel threads. This can be done concurrently. And
singlethread is a useless limitation as well.
The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.
The only worrysome specific is their affinity to the node of the current
CPU. It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.
This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There seem to be quite some confusions on the comments, likely due to
changes that came after them.
Now since it's very non obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to comment in detail the
reason of this layout.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Khelper is affine to all CPUs. Now since it creates the
call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
affinity.
As such explicitly forcing a wide affinity from those kernel threads
is like a no-op.
Just remove it. It's needless and it breaks CPU isolation users who
rely on workqueue affinity tuning.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>