Commit Graph

21071 Commits

Author SHA1 Message Date
Steven Rostedt (Red Hat)
5557720415 tracing: Move sleep-time and graph-time options out of the core trace_flags
The sleep-time and graph-time options are only for the function graph tracer
and are not used by anything else. As tracer options are now visible when
the tracer is not activated, its better to move the function graph specific
tracer options into the function graph tracer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30 15:22:42 -04:00
Steven Rostedt (Red Hat)
b9f9108cad tracing: Remove access to trace_flags in trace_printk.c
In the effort to move the global trace_flags to the tracing instances, the
direct access to trace_flags must be removed from trace_printk.c

Instead, add a new trace_printk_enabled boolean that is set by a new access
function trace_printk_control(), that will enable or disable trace_printk.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30 04:35:18 -04:00
Steven Rostedt (Red Hat)
b5e87c0581 tracing: Add build bug if we have more trace_flags than bits
Add a enum that denotes the last bit of the trace_flags and have a
BUILD_BUG_ON(last_bit > 32).

If we add more bits than we have in trace_flags, the kernel wont build.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30 04:35:18 -04:00
Steven Rostedt (Red Hat)
41d9c0becc tracing: Always show all tracer options in the options directory
There are options that are unique to a specific tracer (like function and
function graph). Currently, these options are only visible in the options
directory when the tracer is enabled.

This has been a pain, especially for something like the func_stack_trace
option that if used inappropriately, could bring the system to a crawl. But
the only way to see it, is to enable the function tracer.

For example, if one had done:

 # cd /sys/kernel/tracing
 # echo __schedule > set_ftrace_filter
 # echo 1 > options/func_stack_trace
 # echo function > current_tracer

The __schedule call will be traced and a stack trace will also be recorded
there. Now when you were done, you may do...

 # echo nop > current_tracer
 # echo > set_ftrace_filter

But you forgot to disable the func_stack_trace. The only way to disable it
is to re-enable function tracing first. If you do not add a filter to
set_ftrace_filter and just do:

 # echo function > current_tracer

Now you would be performing a stack trace on *every* function! On some
systems, that causes a live lock. Others may take a few minutes to fix your
mistake.

Having the func_stack_trace option visible allows you to check it and
disable it before enabling the funtion tracer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30 04:34:54 -04:00
Steven Rostedt (Red Hat)
73dddbb57b tracing: Only create stacktrace option when STACKTRACE is configured
Only create the stacktrace trace option when CONFIG_STACKTRACE is
configured.

Cleaned up the ftrace_trace_stack() function call a little to allow better
encapsulation of the stacktrace trace flag.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 15:38:55 -04:00
Steven Rostedt (Red Hat)
8179e8a15b tracing: Do not create function tracer options when not compiled in
When the function tracer is not compiled in, do not create the option files
for it.

Fix up both the sched_wakeup and irqsoff tracers to handle the change.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 15:01:34 -04:00
Steven Rostedt (Red Hat)
4ee4301c4b tracing: Only create branch tracer options when compiled in
When the branch tracer is not compiled in, do not create the option files
associated to it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 13:23:59 -04:00
Steven Rostedt (Red Hat)
729358da95 tracing: Only create function graph options when it is compiled in
Do not create fuction graph tracer options when function graph tracer is not
even compiled in.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 13:23:58 -04:00
Steven Rostedt (Red Hat)
a3418a364e tracing: Use TRACE_FLAGS macro to keep enums and strings matched
Use a cute little macro trick to keep the names of the trace flags file
guaranteed to match the corresponding masks.

The macro TRACE_FLAGS is defined as a serious of enum names followed by
the string name of the file that matches it. For example:

 #define TRACE_FLAGS						\
		C(PRINT_PARENT,		"print-parent"),	\
		C(SYM_OFFSET,		"sym-offset"),		\
		C(SYM_ADDR,		"sym-addr"),		\
		C(VERBOSE,		"verbose"),

Now we can define the following:

 #undef C
 #define C(a, b) TRACE_ITER_##a##_BIT
 enum trace_iterator_bits { TRACE_FLAGS };

The above creates:

 enum trace_iterator_bits {
	TRACE_ITER_PRINT_PARENT_BIT,
	TRACE_ITER_SYM_OFFSET_BIT,
	TRACE_ITER_SYM_ADDR_BIT,
	TRACE_ITER_VERBOSE_BIT,
 };

Then we can redefine C as:

 #undef C
 #define C(a, b) TRACE_ITER_##a = (1 << TRACE_ITER_##a##_BIT)
 enum trace_iterator_flags { TRACE_FLAGS };

Which creates:

 enum trace_iterator_flags {
	TRACE_ITER_PRINT_PARENT	= (1 << TRACE_ITER_PRINT_PARENT_BIT),
	TRACE_ITER_SYM_OFFSET	= (1 << TRACE_ITER_SYM_OFFSET_BIT),
	TRACE_ITER_SYM_ADDR	= (1 << TRACE_ITER_SYM_ADDR_BIT),
	TRACE_ITER_VERBOSE	= (1 << TRACE_ITER_VERBOSE_BIT),
 };

Then finally we can create the list of file names:

 #undef C
 #define C(a, b) b
 static const char *trace_options[] = {
	TRACE_FLAGS
	NULL
 };

Which creates:
 static const char *trace_options[] = {
	"print-parent",
	"sym-offset",
	"sym-addr",
	"verbose",
	NULL
 };

The importance of this is that the strings match the bit index.

	trace_options[TRACE_ITER_SYM_ADDR_BIT] == "sym-addr"

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 13:23:57 -04:00
Steven Rostedt (Red Hat)
ce3fed628e tracing: Use enums instead of hard coded bitmasks for TRACE_ITER flags
Using enums with FLAG_BIT and then defining a FLAG = (1 << FLAG_BIT), is a
bit more robust as we require that there are no bits out of order or skipped
to match the file names that represent the bits.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 13:23:56 -04:00
Steven Rostedt (Red Hat)
938db5f569 tracing: Remove unused tracing option "ftrace_preempt"
There was a time where the function tracing would disable interrupts unless
specifically told not to, where it would only disable preemption. With the
new lockless code, the function tracing never disalbes interrupts and just
uses disabling of preemption. Remove the option "ftrace_preempt" as it does
nothing anyway.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 13:23:54 -04:00
Steven Rostedt (Red Hat)
03905582fd tracing: Move "display-graph" option to main options
In order to facilitate making all tracer options visible even when the
tracer is not active, we need to get rid of duplicate options. Any option
that is shared between multiple tracers really should be a main option.

As the wakeup and irqsoff tracers both use the "display-graph" option, and
use it exactly the same way, move that option from the tracer options to the
main options and consolidate them.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-29 12:56:40 -04:00
Steven Rostedt (Red Hat)
ef92480a58 tracing: Turn seq_print_user_ip() into a static function
seq_print_user_ip() is used in only one location in one file. Turn it into a
static function. We could inject its code into the caller, but that would
make the code a bit too complex. Keep the code separate.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-28 10:16:12 -04:00
Steven Rostedt (Red Hat)
6b1032d53c tracing: Inject seq_print_userip_objs() into its only user
seq_print_userip_objs() is used only in one location, in one file. Instead
of having it as an external function, go one further than making it static,
but inject is code into its only user. It doesn't make the calling function
much more complex.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-28 10:11:44 -04:00
Steven Rostedt (Red Hat)
ca475e831f tracing: Make ftrace_trace_stack() static
ftrace_trace_stack() is not called outside of trace.c. Make it a static
function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-28 09:41:11 -04:00
Steven Rostedt (Red Hat)
b7f0c959ed tracing: Pass trace_array into trace_buffer_unlock_commit()
In preparation for having trace options be per instance, the trace_array
needs to be passed to the trace_buffer_unlock_commit(). The
trace_event_buffer_lock_reserve() already passes in the trace_event_file
where the trace_array can be derived from.

Also added a "__init" to the boot up test event plus function tracing
function function_test_events_call().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-25 17:38:44 -04:00
Steven Rostedt (Red Hat)
41907416bc tracing: Remove unused function trace_current_buffer_lock_reserve()
trace_current_buffer_lock_reserve() is not used by anything. Might as well
get rid of it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-25 15:37:31 -04:00
Steven Rostedt (Red Hat)
d78a461427 tracing: Remove ftrace_trace_stack_regs()
ftrace_trace_stack_regs() is used in only one place, and because that is
such a simple function, just move its code into the location that it was
used in (trace_buffer_unlock_commit_regs()).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-25 15:37:23 -04:00
Yaowei Bai
f0132c4e0d kernel/trace_probe: is_good_name can be boolean
This patch makes is_good_name return bool to improve readability
due to this particular function only using either one or zero as its
return value.

No functional change.

Link: http://lkml.kernel.org/r/1442929393-4753-2-git-send-email-bywxiaobai@163.com

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-22 13:11:30 -04:00
Linus Torvalds
3ae839454e Mostly stable material, a lot of ARM fixes.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJV+/ucAAoJEL/70l94x66DV8YH/1KDym/1GJ+/Br/YkHZnM53l
 3Q0PwSLu9cNcIL9lUuDLwGTaVj+y8ud1Hjr/uzvKwivktmUYVZhkdtnZmnanvGOM
 qKB9K3nFXCPx8uqy8Dn7fOwEKcg9FmDOTTkWy13HDnXO+V4crSVVt+rPw+6FUMld
 NV5tYdw9Lu7y3XrveDebPWaPtyDL7OAagzmeK47eMffxG7X9Hf1H2aT7HueRi7x/
 SkLIe3gmiOWmHVJDPE9TOmFYIj19gywDFysKes1gdVJLVUIXiELMT7SrvAYnToVB
 zISIEj7Zx4SINPxpf2dUn8REm7NsmJY+PffLIl/Nv+ozGggFQGFH0SMZ08p0bxw=
 =tfmn
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Mostly stable material, a lot of ARM fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  sched: access local runqueue directly in single_task_running
  arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS'
  arm64: KVM: Remove all traces of the ThumbEE registers
  arm: KVM: Disable virtual timer even if the guest is not using it
  arm64: KVM: Disable virtual timer even if the guest is not using it
  arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources
  KVM: s390: Replace incorrect atomic_or with atomic_andnot
  arm: KVM: Fix incorrect device to IPA mapping
  arm64: KVM: Fix user access for debug registers
  KVM: vmx: fix VPID is 0000H in non-root operation
  KVM: add halt_attempted_poll to VCPU stats
  kvm: fix zero length mmio searching
  kvm: fix double free for fast mmio eventfd
  kvm: factor out core eventfd assign/deassign logic
  kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
  KVM: make the declaration of functions within 80 characters
  KVM: arm64: add workaround for Cortex-A57 erratum #852523
  KVM: fix polling for guest halt continued even if disable it
  arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores
  arm64: KVM: set {v,}TCR_EL2 RES1 bits
  ...
2015-09-18 09:23:08 -07:00
Linus Torvalds
fadb97b089 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "This is a rather large update post rc1 due to the final steps of
  cleanups and API changes which had to wait for the preparatory patches
  to hit your tree.

   - Regression fixes for ARM GIC irqchips

   - Regression fixes and lockdep anotations for renesas irq chips

   - The leftovers of the cleanup and preparatory patches which have
     been ignored by maintainers

   - Final conversions of the newly merged users of obsolete APIs

   - Final removal of obsolete APIs

   - Final removal of ARM artifacts which had been introduced during the
     conversion of ARM to the generic interrupt code.

   - Final split of the irq_data into chip specific and common data to
     reflect the needs of hierarchical irq domains.

   - Treewide removal of the first argument of interrupt flow handlers,
     i.e. the irq number, which is not used by the majority of handlers
     and simple to retrieve from the other argument the irq descriptor.

   - A few comment updates and build warning fixes"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  arm64: Remove ununsed set_irq_flags
  ARM: Remove ununsed set_irq_flags
  sh: Kill off set_irq_flags usage
  irqchip: Kill off set_irq_flags usage
  gpu/drm: Kill off set_irq_flags usage
  genirq: Remove irq argument from irq flow handlers
  genirq: Move field 'msi_desc' from irq_data into irq_common_data
  genirq: Move field 'affinity' from irq_data into irq_common_data
  genirq: Move field 'handler_data' from irq_data into irq_common_data
  genirq: Move field 'node' from irq_data into irq_common_data
  irqchip/gic-v3: Use IRQD_FORWARDED_TO_VCPU flag
  irqchip/gic: Use IRQD_FORWARDED_TO_VCPU flag
  genirq: Provide IRQD_FORWARDED_TO_VCPU status flag
  genirq: Simplify irq_data_to_desc()
  genirq: Remove __irq_set_handler_locked()
  pinctrl/pistachio: Use irq_set_handler_locked
  gpio: vf610: Use irq_set_handler_locked
  powerpc/mpc8xx: Use irq_set_handler_locked()
  powerpc/ipic: Use irq_set_handler_locked()
  powerpc/cpm2: Use irq_set_handler_locked()
  ...
2015-09-18 08:11:42 -07:00
Dominik Dingel
00cc163381 sched: access local runqueue directly in single_task_running
Commit 2ee507c472 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id.  When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the currrent task is
bound to the local cpu (e.g. kernel worker).

With commit f781951299 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.

To avoid adding preemption in that cases, as it would limit the usefulness,
we change single_task_running to access directly the cpu local runqueue.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: 2ee507c472
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-18 13:47:59 +02:00
Linus Torvalds
1345df21ac Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "A fix for an abs()/abs64() bug that caused too slow NTP convergence on
  32-bit kernels, plus a removal of an obsolete clockevents driver
  facility after all users got converted during the merge window"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Remove unused set_mode() callback
  time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64()
2015-09-17 10:55:25 -07:00
Linus Torvalds
c2ea72fd86 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A migrate_tasks() locking fix, and a late-coming nohz change plus a
  nohz debug check"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: 'Annotate' migrate_tasks()
  nohz: Assert existing housekeepers when nohz full enabled
  nohz: Affine unpinned timers to housekeepers
2015-09-17 10:49:42 -07:00
Linus Torvalds
9786cff38a Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Spinlock performance regression fix, plus documentation fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/static_keys: Fix up the static keys documentation
  locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support
  locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
  locking/static_keys: Fix a silly typo
2015-09-17 08:45:23 -07:00
Thomas Gleixner
bd0b9ac405 genirq: Remove irq argument from irq flow handlers
Most interrupt flow handlers do not use the irq argument. Those few
which use it can retrieve the irq number from the irq descriptor.

Remove the argument.

Search and replace was done with coccinelle and some extra helper
scripts around it. Thanks to Julia for her help!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
2015-09-16 15:47:51 +02:00
Jiang Liu
b237721c5d genirq: Move field 'msi_desc' from irq_data into irq_common_data
MSI descriptors are per-irq instead of per irqchip, so move it into
struct irq_common_data.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-35-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16 15:46:49 +02:00
Jiang Liu
9df872faa7 genirq: Move field 'affinity' from irq_data into irq_common_data
Irq affinity mask is per-irq instead of per irqchip, so move it into
struct irq_common_data.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433303281-27688-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16 15:46:49 +02:00
Jiang Liu
af7080e040 genirq: Move field 'handler_data' from irq_data into irq_common_data
Handler data (handler_data) is per-irq instead of per irqchip, so move
it into struct irq_common_data.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1433145945-789-13-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16 15:46:49 +02:00
Jiang Liu
449e9cae58 genirq: Move field 'node' from irq_data into irq_common_data
NUMA node information is per-irq instead of per-irqchip, so move it into
struct irq_common_data. Also use CONFIG_NUMA to guard irq_common_data.node.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433145945-789-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16 15:46:49 +02:00
Viresh Kumar
eef7635a22 clockevents: Remove unused set_mode() callback
All users are migrated to the per-state callbacks, get rid of the
unused interface and the core support code.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linaro-kernel@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/fd60de14cf6d125489c031207567bb255ad946f6.1441943991.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 11:00:55 +02:00
John Stultz
2619d7e9c9 time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64()
The internal clocksteering done for fine-grained error
correction uses a logarithmic approximation, so any time
adjtimex() adjusts the clock steering, timekeeping_freqadjust()
quickly approximates the correct clock frequency over a series
of ticks.

Unfortunately, the logic in timekeeping_freqadjust(), introduced
in commit:

  dc491596f6 ("timekeeping: Rework frequency adjustments to work better w/ nohz")

used the abs() function with a s64 error value to calculate the
size of the approximated adjustment to be made.

Per include/linux/kernel.h:

  "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".

Thus on 32-bit platforms, this resulted in the clocksteering to
take a quite dampended random walk trying to converge on the
proper frequency, which caused the adjustments to be made much
slower then intended (most easily observed when large
adjustments are made).

This patch fixes the issue by using abs64() instead.

Reported-by: Nuno Gonçalves <nunojpg@gmail.com>
Tested-by: Nuno Goncalves <nunojpg@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org> # v3.17+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 10:30:47 +02:00
Linus Torvalds
01b0c014ee Merge branch 'akpm' (patches from Andrew)
Merge fourth patch-bomb from Andrew Morton:

 - sys_membarier syscall

 - seq_file interface changes

 - a few misc fixups

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
  mm/early_ioremap: add explicit #include of asm/early_ioremap.h
  fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
  selftests: enhance membarrier syscall test
  selftests: add membarrier syscall test
  sys_membarrier(): system-wide memory barrier (generic, x86)
  MODSIGN: fix a compilation warning in extract-cert
2015-09-11 19:34:09 -07:00
Linus Torvalds
fa9a67ef9d Additional power management and ACPI material for v4.3-rc1
- Build fix for the new Mediatek MT8173 cpufreq driver (Guenter Roeck).
 
  - Generic power domains framework fixes (power on error code
    path, subdomain removal) and cleanup of a deprecated API user
    (Geert Uytterhoeven, Jon Hunter, Ulf Hansson).
 
  - cpufreq-dt driver fixes including two fixes for bugs related to
    the new Operating Performance Points Device Tree bindings
    introduced recently (Viresh Kumar).
 
  - Suspend frequency support for the cpufreq-dt driver
    (Bartlomiej Zolnierkiewicz, Viresh Kumar).
 
  - cpufreq core cleanups (Viresh Kumar).
 
  - intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
 
  - Additional sanity check in the cpuidle core (Xunlei Pang).
 
  - Fix for a comment related to CPU power management (Lina Iyer).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJV80XwAAoJEILEb/54YlRxIAkP/RrT4mDO9QZkocO+XYDBB739
 oi1aaSDmmVVKlFtFiyX4njS7S+U5UC8PYn5o/fFtVnP1kD9yuZkoak65U3t7bsKK
 +rjun1SzJeooaLXW4ZAJKyPldi/RrsfRJ2JOaEt2vK03ppjx2mouK1VpjNj+qmYd
 wwFb4q5S5Rcpph0Sx+xxhpZKoOgBmXk7nWIe5bWDxuORdx4yoxp4lus/P223wFJy
 r0kJnNyY55mnY4Yx5e3L8W1rMC/Sf6mt8pyMvqRpsKdBHaEbhmGsiERSi0eCZJwr
 AmL666TVgPk6zNM/yc3mFOTRZeAF3soSWF5R6+n2TswWvWFH+ksM14g9ndb7fTv3
 FUuOU3jyfSUj2T4xr0qUZLaQUFB3xNrnWBNnm/CzPwjJtH1u1SK8uBSRqdZnEdGq
 EeqwnwDLFeSAbG6lQVsKVVAPZHolDCnJPsvZebKdnJH09kjgUdm60BLOnif4QDgw
 ULPZ6NToFZkrQSMAX1Uqm42iv/qWcHIgS5x43jAmblY3TsuByqRrI7gBCy+hbGSN
 cQfeCxTdVPi/tUPYlJFmWn43dgEsnceSFKvlQRXGrZJM29BFvhVzJQjTFC8Oj79q
 OP/Rv28+VeV5J531BQu+w8cvg73GJ3DDab+i91ZkIsd7ZNjRYQKXBfR9TImTSD4l
 BPjvaPXwDg6EPhwjcpK/
 =C7+4
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management and ACPI updates from Rafael Wysocki:
 "These are mostly fixes and cleanups on top of the previous PM+ACPI
  pull request (cpufreq core and drivers, cpuidle, generic power domains
  framework).  Some of them didn't make to that pull request and some
  fix issues introduced by it.

  The only really new thing is the support for suspend frequency in the
  cpufreq-dt driver, but it is needed to fix an issue with Exynos
  platforms.

  Specifics:

   - build fix for the new Mediatek MT8173 cpufreq driver (Guenter
     Roeck).

   - generic power domains framework fixes (power on error code path,
     subdomain removal) and cleanup of a deprecated API user (Geert
     Uytterhoeven, Jon Hunter, Ulf Hansson).

   - cpufreq-dt driver fixes including two fixes for bugs related to the
     new Operating Performance Points Device Tree bindings introduced
     recently (Viresh Kumar).

   - suspend frequency support for the cpufreq-dt driver (Bartlomiej
     Zolnierkiewicz, Viresh Kumar).

   - cpufreq core cleanups (Viresh Kumar).

   - intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).

   - additional sanity check in the cpuidle core (Xunlei Pang).

   - fix for a comment related to CPU power management (Lina Iyer)"

* tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  intel_pstate: fix PCT_TO_HWP macro
  intel_pstate: Fix user input of min/max to legal policy region
  PM / OPP: Return suspend_opp only if it is enabled
  cpufreq-dt: add suspend frequency support
  cpufreq: allow cpufreq_generic_suspend() to work without suspend frequency
  PM / OPP: add dev_pm_opp_get_suspend_opp() helper
  staging: board: Migrate away from __pm_genpd_name_add_device()
  cpufreq: Use __func__ to print function's name
  cpufreq: staticize cpufreq_cpu_get_raw()
  PM / Domains: Ensure subdomain is not in use before removing
  cpufreq: Add ARM_MT8173_CPUFREQ dependency on THERMAL
  cpuidle/coupled: Add sanity check for safe_state_index
  PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
  cpufreq: dt: Tolerance applies on both sides of target voltage
  cpufreq: dt: Print error on failing to mark OPPs as shared
  cpufreq: dt: Check OPP count before marking them shared
  kernel/cpu_pm: fix cpu_cluster_pm_exit comment
2015-09-11 19:11:06 -07:00
Mathieu Desnoyers
5b25b13ab0 sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system.  It is
implemented by calling synchronize_sched().  It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier.  For synchronization primitives
that distinguish between read-side and write-side (e.g.  userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking.  Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux.  They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query  the  set  of  supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on  the  system.
              Upon  return from system call, the caller thread is ensured that
              all running threads have passed through a state where all memory
              accesses  to  user-space  addresses  match program order between
              entry to and return from the system  call  (non-running  threads
              are de facto in such a state). This covers threads from all pro=E2=80=90
              cesses running on the system.  This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed  in  program  order  from  each  targeted
       thread is guaranteed to be ordered with respect to sys_membarrier(). If
       we use the semantic "barrier()" to represent a compiler barrier forcing
       memory  accesses  to  be performed in program order across the barrier,
       and smp_mb() to represent explicit memory barriers forcing full  memory
       ordering  across  the barrier, we have the following ordering table for
       each pair of barrier(), sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, these system calls return zero.  On error, -1 is  returned,
       and errno is set appropriately. For a given command, with flags
       argument set to 0, this system call is guaranteed to always return the
       same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                             2015-04-15                     MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 15:21:34 -07:00
Rafael J. Wysocki
4614e0cc66 Merge branches 'pm-cpu', 'pm-cpuidle' and 'pm-domains'
* pm-cpu:
  kernel/cpu_pm: fix cpu_cluster_pm_exit comment

* pm-cpuidle:
  cpuidle/coupled: Add sanity check for safe_state_index

* pm-domains:
  staging: board: Migrate away from __pm_genpd_name_add_device()
  PM / Domains: Ensure subdomain is not in use before removing
  PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
2015-09-11 15:37:36 +02:00
Wanpeng Li
5473e0cc37 sched: 'Annotate' migrate_tasks()
Kernel testing triggered this warning:

| WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
| Modules linked in:
| CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
| Call Trace:
|   dump_stack+0x4b/0x75
|   warn_slowpath_common+0x8b/0xc0
|   warn_slowpath_null+0x22/0x30
|   do_set_cpus_allowed+0x7e/0x80
|   cpuset_cpus_allowed_fallback+0x7c/0x170
|   select_fallback_rq+0x221/0x280
|   migration_call+0xe3/0x250
|   notifier_call_chain+0x53/0x70
|   __raw_notifier_call_chain+0x1e/0x30
|   cpu_notify+0x28/0x50
|   take_cpu_down+0x22/0x40
|   multi_cpu_stop+0xd5/0x140
|   cpu_stopper_thread+0xbc/0x170
|   smpboot_thread_fn+0x174/0x2f0
|   kthread+0xc4/0xe0
|   ret_from_kernel_thread+0x21/0x30

As Peterz pointed out:

| So the normal rules for changing task_struct::cpus_allowed are holding
| both pi_lock and rq->lock, such that holding either stabilizes the mask.
|
| This is so that wakeup can happen without rq->lock and load-balance
| without pi_lock.
|
| From this we already get the relaxation that we can omit acquiring
| rq->lock if the task is not on the rq, because in that case
| load-balancing will not apply to it.
|
| ** these are the rules currently tested in do_set_cpus_allowed() **
|
| Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
| unconditionally acquires both locks, we could get away with holding just
| rq->lock when on_rq for modification because that'd still exclude
| __set_cpus_allowed_ptr(), it would also work against
| __kthread_bind_mask() because that assumes !on_rq.
|
| That said, this is all somewhat fragile.
|
| Now, I don't think dropping rq->lock is quite as disastrous as it
| usually is because !cpu_active at this point, which means load-balance
| will not interfere, but that too is somewhat fragile.
|
| So we end up with a choice of two fragile..

This patch fixes it by following the rules for changing
task_struct::cpus_allowed with both pi_lock and rq->lock held.

Reported-by: kernel test robot <ying.huang@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11 07:57:50 +02:00
Peter Zijlstra
43b3f02899 locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
set and Linus noted that the test-and-set implementation was retarded.

One should spin on the variable with a load, not a RMW.

While there, remove 'queued' from the name, as the lock isn't queued
at all, but a simple test-and-set.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11 07:49:42 +02:00
Linus Torvalds
33e247c7e5 Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:

 - even more of the rest of MM

 - lib/ updates

 - checkpatch updates

 - small changes to a few scruffy filesystems

 - kmod fixes/cleanups

 - kexec updates

 - a dma-mapping cleanup series from hch

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
  dma-mapping: consolidate dma_set_mask
  dma-mapping: consolidate dma_supported
  dma-mapping: cosolidate dma_mapping_error
  dma-mapping: consolidate dma_{alloc,free}_noncoherent
  dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
  mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
  mm: make sure all file VMAs have ->vm_ops set
  mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
  mm: mark most vm_operations_struct const
  namei: fix warning while make xmldocs caused by namei.c
  ipc: convert invalid scenarios to use WARN_ON
  zlib_deflate/deftree: remove bi_reverse()
  lib/decompress_unlzma: Do a NULL check for pointer
  lib/decompressors: use real out buf size for gunzip with kernel
  fs/affs: make root lookup from blkdev logical size
  sysctl: fix int -> unsigned long assignments in INT_MIN case
  kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
  kexec: align crash_notes allocation to make it be inside one physical page
  kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
  kexec: split kexec_load syscall from kexec core code
  ...
2015-09-10 18:19:42 -07:00
Linus Torvalds
65c61bc5db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
    Kadlecsik.

 2) Use correct free operation on netfilter conntrack templates, from
    Daniel Borkmann.

 3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.

 4) Fix sizeof(pointer) in mac80211, from Thierry Reding.

 5) Fix cache pointer comparison in ip6mr leading to missed unlock of
    mrt_lock.  From Richard Laing.

 6) rds_conn_lookup() needs to consider network namespace in key
    comparison, from Sowmini Varadhan.

 7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
    Dmitriy.

 8) Fix fd leaks in bpf syscall, from Daniel Borkmann.

 9) Fix error recovery when installing ipv6 multipath routes, we would
    delete the old route before we would know if we could fully commit
    to the new set of nexthops.  Fix from Roopa Prabhu.

10) Fix run-time suspend problems in r8152, from Hayes Wang.

11) In fec, don't program the MAC address into the chip when the clocks
    are gated off.  From Fugang Duan.

12) Fix poll behavior for netlink sockets when using rx ring mmap, from
    Daniel Borkmann.

13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
    driver, from Corinna Vinschen.

14) In TCP Cubic congestion control, handle idle periods better where we
    are application limited, in order to keep cwnd from growing out of
    control.  From Eric Dumzet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  tcp_cubic: better follow cubic curve after idle period
  tcp: generate CA_EVENT_TX_START on data frames
  xen-netfront: respect user provided max_queues
  xen-netback: respect user provided max_queues
  r8169: Fix sleeping function called during get_stats64, v2
  ether: add IEEE 1722 ethertype - TSN
  netlink, mmap: fix edge-case leakages in nf queue zero-copy
  netlink, mmap: don't walk rx ring on poll if receive queue non-empty
  cxgb4: changes for new firmware 1.14.4.0
  net: fec: add netif status check before set mac address
  r8152: fix the runtime suspend issues
  r8152: split DRIVER_VERSION
  ipv6: fix ifnullfree.cocci warnings
  add microchip LAN88xx phy driver
  stmmac: fix check for phydev being open
  net: qlcnic: delete redundant memsets
  net: mv643xx_eth: use kzalloc
  net: jme: use kzalloc() instead of kmalloc+memset
  net: cavium: liquidio: use kzalloc in setup_glist()
  net: ipv6: use common fib_default_rule_pref
  ...
2015-09-10 13:53:15 -07:00
Ilya Dryomov
9a5bc726d5 sysctl: fix int -> unsigned long assignments in INT_MIN case
The following

    if (val < 0)
        *lvalp = (unsigned long)-val;

is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern.  This is
a problem if val == INT_MIN:

    # echo -2147483648 >/proc/sys/dev/scsi/logging_level
    # cat /proc/sys/dev/scsi/logging_level
    -18446744071562067968

Cast to unsigned long before negation - that way we first sign-extend and
then negate an unsigned, which is well defined.  With this:

    # cat /proc/sys/dev/scsi/logging_level
    -2147483648

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cc: Mikulas Patocka <mikulas@twibright.com>
Cc: Robert Xiao <nneonneo@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Baoquan He
1303a27c9c kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
accordingly the MODULES_VADDR is changed to 0xffffffffa0000000.  However,
in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
And the kernel text mapping addr space is enlarged from 512M to 1G.  That
means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
support is not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:

    #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)

So when kaslr is compiled in and enabled, the kernel text mapping addr
space and modules vaddr space need be adjusted.  Otherwise makedumpfile
will collapse since the addr for some symbols is not correct.

Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
makedumpfile to help calculate MODULES_VADDR.

Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Baoquan He
bbb78b8f3f kexec: align crash_notes allocation to make it be inside one physical page
People reported that crash_notes in /proc/vmcore were corrupted and this
cause crash kdump failure.  With code debugging and log we got the root
cause.  This is because percpu variable crash_notes are allocated in 2
vmalloc pages.  Currently percpu is based on vmalloc by default.  Vmalloc
can't guarantee 2 continuous vmalloc pages are also on 2 continuous
physical pages.  So when 1st kernel exports the starting address and size
of crash_notes through sysfs like below:

/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size

kdump kernel use them to get the content of crash_notes.  However the 2nd
part may not be in the next neighbouring physical page as we expected if
crash_notes are allocated accross 2 vmalloc pages.  That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failure or
some warnings.

In this patch change to call __alloc_percpu() to passed in the align value
by rounding crash_notes_size up to the nearest power of two.  This makes
sure the crash_notes is allocated inside one physical page since
sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE.  Meanwhile add
a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
crash_notes definitely will be in 2 pages.  That need be avoided, and need
be reported if it's unavoidable.

[akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Minfei Huang
04e9949b2d kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
Transforming PFN(Page Frame Number) to struct page is never failure, so we
can simplify the code logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional judgement.

Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young
2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young
a43cac0d9d kexec: split kexec_file syscall code to kexec_file.c
Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.

[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker
bb304a5c6f kmod: handle UMH_WAIT_PROC from system unbound workqueue
The UMH_WAIT_PROC handler runs in its own thread in order to make sure
that waiting for the exec kernel thread completion won't block other
usermodehelper queued jobs.

On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue.  But now the workqueue subsystem handles
that.  Khelper still had the older limitation due to its singlethread
properties but we replaced it to system unbound workqueues.

Those are affine to the current node and can block up to some number of
instances.

They are a good candidate to handle UMH_WAIT_PROC assuming that we have
enough system unbound workers to handle lots of parallel usermodehelper
jobs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker
90f023030e kmod: use system_unbound_wq instead of khelper
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper.  This workqueue has
unbound properties and thus a wide affinity inherited by all its children.

Now khelper also has special properties that we aren't much interested in:
ordered and singlethread.  There is really no need about ordering as all
we do is creating kernel threads.  This can be done concurrently.  And
singlethread is a useless limitation as well.

The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.

The only worrysome specific is their affinity to the node of the current
CPU.  It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker
b639e86bae kmod: add up-to-date explanations on the purpose of each asynchronous levels
There seem to be quite some confusions on the comments, likely due to
changes that came after them.

Now since it's very non obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to comment in detail the
reason of this layout.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker
d097c0240a kmod: remove unecessary explicit wide CPU affinity setting
Khelper is affine to all CPUs.  Now since it creates the
call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
affinity.

As such explicitly forcing a wide affinity from those kernel threads
is like a no-op.

Just remove it. It's needless and it breaks CPU isolation users who
rely on workqueue affinity tuning.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00