All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
[akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
[mhocko@suse.cz: ditch the last remainings of res_counter]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
side-step that by reading copies that pstore saved.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUheJgAAoJEKurIx+X31iB5F0P/jdpAw6cI26icGiOcRvRYvce
jLq/WbGggxZlx3rtgGpekJmcJ1NBBTLdyx4b86q4q/zstQkoJ9lqGCn63YcIMJNB
pdctmbkGyoQQXBTAzSCFs6pybMUmtYKMDiT3OJddcCm4fUjd4RQHvNP+5ESsf0lQ
9YpIS+rZOtB2/5N6/i4+Lnaffc3s5gXw/dJMxOm/laWtRFRyhf22YP18cRp5LmuV
NHqu1NoeLnar/qL6plPl73lEyZVOPRC01T7OWmmCkcLieYPGkqQlkoXp95VBKf5u
CvD167oM71OccMa0gOTlCS8a6y5KO6y8I+YAR60iANTLDh+rHZiwNj1gY4v/Z29m
2ba1xAulQrpCxqml6eVxAKaF+4HXaXVXKqjQIivJcGyfYf6BXLMvC0M3Lsv7XQdz
HKl++o0JELDEJjVW0i9Wa5CjgcqXdvuRXOoKDaKTZWff2yfUxqIN5Xl7zIV2kgVy
ZqPDBHJSmHjuzmJ6inhPkmdS2uz94PVSE7ykeaa8iCBbpdsS+FchtF2sRMvUhU23
ekHsxk0Mk/pS5EBNc6rrrM9NtKrUQMa1e/oT5G7QowksDeNpsPjx92OeUImxgh3x
+hmObN9vx6SepwVSfjI1rwrMsAknphJfPmyi/XJgkVbfRMCv2we1npvYd6hqFUMV
daekMzGOi5eqoaWB8hje
=Ezg0
-----END PGP SIGNATURE-----
Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"On a system that restricts access to dmesg, don't let people side-step
that by reading copies that pstore saved"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
syslog: Provide stub check_syslog_permissions
pstore: Honor dmesg_restrict sysctl on dmesg dumps
pstore/ram: Strip ramoops header for correct decompression
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more 2038 timer work from Thomas Gleixner:
"Two more patches for the ongoing 2038 work:
- New accessors to clock MONOTONIC and REALTIME seconds
This is a seperate branch as Arnd has follow up work depending on
this"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
Pull irq domain updates from Thomas Gleixner:
"The real interesting irq updates:
- Support for hierarchical irq domains:
For complex interrupt routing scenarios where more than one
interrupt related chip is involved we had no proper representation
in the generic interrupt infrastructure so far. That made people
implement rather ugly constructs in their nested irq chip
implementations. The main offenders are x86 and arm/gic.
To distangle that mess we have now hierarchical irqdomains which
seperate the various interrupt chips and connect them via the
hierarchical domains. That keeps the domain specific details
internal to the particular hierarchy level and removes the
criss/cross referencing of chip internals. The resulting hierarchy
for a complex x86 system will look like this:
vector mapped: 74
msi-0 mapped: 2
dmar-ir-1 mapped: 69
ioapic-1 mapped: 4
ioapic-0 mapped: 20
pci-msi-2 mapped: 45
dmar-ir-0 mapped: 3
ioapic-2 mapped: 1
pci-msi-1 mapped: 2
htirq mapped: 0
Neither ioapic nor pci-msi know about the dmar interrupt remapping
between themself and the vector domain. If interrupt remapping is
disabled ioapic and pci-msi become direct childs of the vector
domain.
In hindsight we should have done that years ago, but in hindsight
we always know better :)
- Support for generic MSI interrupt domain handling
We have more and more non PCI related MSI interrupts, so providing
a generic infrastructure for this is better than having all
affected architectures implementing their own private hacks.
- Support for PCI-MSI interrupt domain handling, based on the generic
MSI support.
This part carries the pci/msi branch from Bjorn Helgaas pci tree to
avoid a massive conflict. The PCI/MSI parts are acked by Bjorn.
I have two more branches on top of this. The full conversion of x86
to hierarchical domains and a partial conversion of arm/gic"
* 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
genirq: Move irq_chip_write_msi_msg() helper to core
PCI/MSI: Allow an msi_controller to be associated to an irq domain
PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomain
PCI/MSI: Enhance core to support hierarchy irqdomain
PCI/MSI: Move cached entry functions to irq core
genirq: Provide default callbacks for msi_domain_ops
genirq: Introduce msi_domain_alloc/free_irqs()
asm-generic: Add msi.h
genirq: Add generic msi irq domain support
genirq: Introduce callback irq_chip.irq_write_msi_msg
genirq: Work around __irq_set_handler vs stacked domains ordering issues
irqdomain: Introduce helper function irq_domain_add_hierarchy()
irqdomain: Implement a method to automatically call parent domains alloc/free
genirq: Introduce helper irq_domain_set_info() to reduce duplicated code
genirq: Split out flow handler typedefs into seperate header file
genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip
genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip
genirq: Add more helper functions to support stacked irq_chip
genirq: Introduce helper functions to support stacked irq_chip
irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF
...
Pull irq core updates from Thomas Gleixner:
"This is the first (boring) part of irq updates:
- support for big endian I/O accessors in the generic irq chip
- cleanup of brcmstb/bcm7120 drivers so they can be reused for non
ARM SoCs
- the usual pile of fixes and updates for the various ARM irq chips"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip: dw-apb-ictl: Add PM support
irqchip: dw-apb-ictl: Enable IRQ_GC_MASK_CACHE_PER_TYPE
irqchip: dw-apb-ictl: Always use use {readl|writel}_relaxed
ARM: orion: convert the irq_reg_{readl,writel} calls to the new API
irqchip: atmel-aic: Add missing entry for rm9200 irq fixups
irqchip: atmel-aic: Rename at91sam9_aic_irq_fixup for naming consistency
irqchip: atmel-aic: Add specific irq fixup function for sam9g45 and sam9rl
irqchip: atmel-aic: Add irq fixups for at91sam926x SoCs
irqchip: atmel-aic: Add irq fixup for RTT block
irqchip: brcmstb-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Decouple driver from brcmstb-l2
irqchip: bcm7120-l2: Extend driver to support 64+ bit controllers
irqchip: bcm7120-l2: Use gc->mask_cache to simplify suspend/resume functions
irqchip: bcm7120-l2: Fix missing nibble in gc->unused mask
irqchip: bcm7120-l2: Make sure all register accesses use base+offset
irqchip: bcm7120-l2, brcmstb-l2: Remove ARM Kconfig dependency
irqchip: bcm7120-l2: Eliminate bad IRQ check
irqchip: brcmstb-l2: Eliminate dependency on ARM code
genirq: Generic chip: Add big endian I/O accessors
...
Pull timer core updates from Thomas Gleixner:
"The time(r) departement provides:
- more infrastructure work on the year 2038 issue
- a few fixes in the Armada SoC timers
- the usual pile of fixlets and improvements"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: armada-370-xp: Use the reference clock on A375 SoC
watchdog: orion: Use the reference clock on Armada 375 SoC
clocksource: armada-370-xp: Add missing clock enable
time: Fix sign bug in NTP mult overflow warning
time: Remove timekeeping_inject_sleeptime()
rtc: Update suspend/resume timing to use 64bit time
rtc/lib: Provide y2038 safe rtc_tm_to_time()/rtc_time_to_tm() replacement
time: Fixup comments to reflect usage of timespec64
time: Expose get_monotonic_coarse64() for in-kernel uses
time: Expose getrawmonotonic64 for in-kernel uses
time: Provide y2038 safe mktime() replacement
time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
time: Provide y2038 safe do_settimeofday() replacement
time: Complete NTP adjustment threshold judging conditions
time: Avoid possible NTP adjustment mult overflow.
time: Rename udelay_test.c to test_udelay.c
clocksource: sirf: Remove hard-coded clock rate
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.
This instruments might_sleep() checks to catch places that nest
blocking primitives - such as mutex usage in a wait loop. Such
bugs can result in hard to debug races/hangs.
Another category of invalid nesting that this facility will detect
is the calling of blocking functions from within schedule() ->
sched_submit_work() -> blk_schedule_flush_plug().
There's some potential for false positives (if secondary blocking
primitives themselves are not ready yet for this facility), but the
kernel will warn once about such bugs per bootup, so the warning
isn't much of a nuisance.
This feature comes with a number of fixes, for problems uncovered
with it, so no messages are expected normally.
- Another round of sched/numa optimizations and refinements, for
CONFIG_NUMA_BALANCING=y.
- Another round of sched/dl fixes and refinements.
Plus various smaller fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched: Add missing rcu protection to wake_up_all_idle_cpus
sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
sched/numa: Init numa balancing fields of init_task
sched/deadline: Remove unnecessary definitions in cpudeadline.h
sched/cpupri: Remove unnecessary definitions in cpupri.h
sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
sched/fair: Fix stale overloaded status in the busiest group finding logic
sched: Move p->nr_cpus_allowed check to select_task_rq()
sched/completion: Document when to use wait_for_completion_io_*()
sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
sched/fair: Kill task_struct::numa_entry and numa_group::task_list
sched: Refactor task_struct to use numa_faults instead of numa_* pointers
sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
sched/deadline: Reschedule from switched_from_dl() after a successful pull
sched/deadline: Push task away if the deadline is equal to curr during wakeup
sched/deadline: Add deadline rq status print
sched/deadline: Fix artificial overrun introduced by yield_task_dl()
sched/rt: Clean up check_preempt_equal_prio()
sched/core: Use dl_bw_of() under rcu_read_lock_sched()
sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
...
Pull perf events update from Ingo Molnar:
"On the kernel side there's few changes, the one that stands out is
PEBS machine state sampling support on x86, by Stephane Eranian.
On the tooling side:
User visible tooling changes:
- Don't open the DWARF info multiple times, keeping instead a dwfl
handle in struct dso, greatly speeding up 'perf report' on powerpc.
(Sukadev Bhattiprolu)
- Introduce PARSE_OPT_DISABLED option flag and use it to avoid
showing undersired options in tools that provides frontends to
'perf record', like sched, kvm, etc (Namhyung Kim)
- Fallback to kallsyms when using the minimal 'ELF' loader (Arnaldo
Carvalho de Melo)
- Fix annotation with kcore (Adrian Hunter)
- Support source line numbers in annotate using a hotkey (Andi Kleen)
- Callchain improvements including:
* Enable printing the srcline in the history
* Make get_srcline fall back to sym+offset (Andi Kleen)
- TUI hist_entry browser fixes, including showing missing overhead
value for first level callchain. Detected comparing the output of
--stdio/--gui (that matched) with --tui, that had this problem.
(Namhyung Kim)
- Support handling complete branch stacks as histograms (Andi Kleen)
Tooling infrastructure changes:
- Prep work for supporting per-pkg and snapshot counters in 'perf
stat' (Jiri Olsa)
- 'perf stat' refactorings, moving stuff from it to evsel.c to use in
per-pkg/snapshot format changes (Jiri Olsa)
- Add per-pkg format file parsing (Matt Fleming)
- Clean up libelf feature support code (Namhyung Kim)
- Add gzip decompression support for kernel modules (Namhyung Kim)
- More prep patches for Intel PT, including a a thread stack and more
stuff made available via the database export mechanism (Adrian
Hunter)
- More Intel PT work, including a facility to export sample data
(comms, threads, symbol names, etc) in a database friendly way,
with an script to use this to create a postgresql database.
(Adrian Hunter)
- Make sure that thread->mg->machine points to the machine where the
thread exists (it was being set only for the kmaps kernel modules
case, do it as well for the mmaps) and use it to shorten function
signatures (Arnaldo Carvalho de Melo)
... and lots of other fixes and smaller improvements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
perf report: In branch stack mode use address history sorting
perf report: Add --branch-history option
perf callchain: Support handling complete branch stacks as histograms
perf stat: Add support for snapshot counters
perf stat: Add support for per-pkg counters
perf tools: Remove perf_evsel__read interface
perf stat: Use read_counter in read_counter_aggr
perf stat: Make read_counter work over the thread dimension
perf stat: Use perf_evsel__read_cb in read_counter
perf tools: Add snapshot format file parsing
perf tools: Add per-pkg format file parsing
perf evsel: Introduce perf_evsel__read_cb function
perf evsel: Introduce perf_counts_values__scale function
perf evsel: Introduce perf_evsel__compute_deltas function
perf tools: Allow to force redirect pr_debug to stderr.
perf tools: Fix segfault due to invalid kernel dso access
perf callchain: Make get_srcline fall back to sym+offset
perf symbols: Move bfd_demangle stubbing to its only user
perf callchain: Enable printing the srcline in the history
perf tools: Collapse first level callchain entry if it has sibling
...
Pull RCU updates from Ingo Molnar:
"These are the main changes in this cycle:
- Streamline RCU's use of per-CPU variables, shifting from "cpu"
arguments to functions to "this_"-style per-CPU variable
accessors.
- signal-handling RCU updates.
- real-time updates.
- torture-test updates.
- miscellaneous fixes.
- documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
rcu: Fix FIXME in rcu_tasks_kthread()
rcu: More info about potential deadlocks with rcu_read_unlock()
rcu: Optimize cond_resched_rcu_qs()
rcu: Add sparse check for RCU_INIT_POINTER()
documentation: memory-barriers.txt: Correct example for reorderings
documentation: Add atomic_long_t to atomic_ops.txt
documentation: Additional restriction for control dependencies
documentation: Document RCU self test boot params
rcutorture: Fix rcu_torture_cbflood() memory leak
rcutorture: Remove obsolete kversion param in kvm.sh
rcutorture: Remove stale test configurations
rcutorture: Enable RCU self test in configs
rcutorture: Add early boot self tests
torture: Run Linux-kernel binary out of results directory
cpu: Avoid puts_pending overflow
rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
rcu: Remove "cpu" argument to rcu_prepare_for_idle()
rcu: Remove "cpu" argument to rcu_needs_cpu()
rcu: Remove "cpu" argument to rcu_note_context_switch()
rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
...
Generalize id_map_mutex so it can be used for more state of a user namespace.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If you did not create the user namespace and are allowed
to write to uid_map or gid_map you should already have the necessary
privilege in the parent user namespace to establish any mapping
you want so this will not affect userspace in practice.
Limiting unprivileged uid mapping establishment to the creator of the
user namespace makes it easier to verify all credentials obtained with
the uid mapping can be obtained without the uid mapping without
privilege.
Limiting unprivileged gid mapping establishment (which is temporarily
absent) to the creator of the user namespace also ensures that the
combination of uid and gid can already be obtained without privilege.
This is part of the fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
setresuid allows the euid to be set to any of uid, euid, suid, and
fsuid. Therefor it is safe to allow an unprivileged user to map
their euid and use CAP_SETUID privileged with exactly that uid,
as no new credentials can be obtained.
I can not find a combination of existing system calls that allows setting
uid, euid, suid, and fsuid from the fsuid making the previous use
of fsuid for allowing unprivileged mappings a bug.
This is part of a fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As any gid mapping will allow and must allow for backwards
compatibility dropping groups don't allow any gid mappings to be
established without CAP_SETGID in the parent user namespace.
For a small class of applications this change breaks userspace
and removes useful functionality. This small class of applications
includes tools/testing/selftests/mount/unprivilged-remount-test.c
Most of the removed functionality will be added back with the addition
of a one way knob to disable setgroups. Once setgroups is disabled
setting the gid_map becomes as safe as setting the uid_map.
For more common applications that set the uid_map and the gid_map
with privilege this change will have no affect.
This is part of a fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.
The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.
This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently, blktrace can be started/stopped via its ioctl-based interface
(used by the userspace blktrace tool) or via its ftrace interface. The
function blk_trace_remove_queue(), called each time an "enable" tunable
of the ftrace interface transitions to zero, removes the trace from the
running list, even if no function from the sysfs interface adds it to
such a list. This leads to a null pointer dereference. This commit
changes the blk_trace_remove_queue() function so that it does not remove
the blk_trace from the running list.
v2:
- Now the patch removes the invocation of list_del() instead of
adding an useless if branch, as suggested by Namhyung Kim.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* pm-domains:
ARM: shmobile: Convert to genpd flags for PM clocks for R-mobile
ARM: shmobile: Convert to genpd flags for PM clocks for r8a7779
PM / Domains: Initial PM clock support for genpd
PM / Domains: Power on the PM domain right after attach completes
PM / Domains: Move struct pm_domain_data to pm_domain.h
PM / Domains: Extract code to power off/on a PM domain
PM / Domains: Make genpd parameter of pm_genpd_present() const
* pm-sleep:
PM / hibernate: Deletion of an unnecessary check before the function call "vfree"
PM / Hibernate: Migrate to ktime_t
* pm-tools:
tools: cpupower: fix return checks for sysfs_get_idlestate_count()
* powercap:
powercap / RAPL: fix build dependency on iosf_mbi
powercap / RAPL: add new model ids
powercap / RAPL: handle atom and core differences
powercap / RAPL: abstract per cpu type functions
* pm-clk:
PM / clock_ops: make __pm_clk_enable more generic
PM / clock_ops: Add pm_clk_add_clk()
* pm-config:
PM: Kconfig: fix unmet dependency for CPU_PM
* pm-opp:
PM / OPP replace kfree_rcu() with call_srcu() in opp_set_availability()
PM / OPP Introduce APIs to remove OPPs
PM / OPP mark OPPs as 'static' or 'dynamic'
PM / OPP don't match for existing OPPs when list is empty
PM / OPP rename 'head' as 'rcu_head' or 'srcu_head' based on its type
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.
In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.
The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.
The rescuer will be woken again in 100ms to handle another max_requests
requests.
I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests. When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).
With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance. This allows
all requests to be handled in a timely fashion.
We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.
I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread. In that context
it significantly improves performance. A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.
Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.
tj: Inverted the lock order between wq_mayday_lock and pool->lock with
a preceding patch and simplified this patch. Added comment and
updated changelog accordingly. Dongsu spotted missing get_pwq()
in the simplified code.
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, pool->lock nests inside pool->lock. There's no inherent
reason for this order. The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.
This nesting order turns out to complicate things with the planned
rescuer_thread() update. Let's invert them. This doesn't cause any
behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Locklessly doing is_idle_task(rq->curr) is only okay because of
RCU protection. The older variant of the broken code checked
rq->curr == rq->idle instead and therefore didn't need RCU.
Fixes: f6be8af1c9 ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Chuansheng Liu <chuansheng.liu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhNLZAAoJEHm+PkMAQRiGAEcH/iclYDW7k2GKemMqboy+Ohmh
+ELbQothNhlGZlS1wWdD69LBiiXkkQ+ufVYFh/hC0oy0gUdfPMt5t+bOHy6cjn6w
9zOcACtpDKnqbOwRqXZjZgNmIabk7lRjbn7GK4GQqpIaW4oO0FWcT91FFhtGSPDa
tjtmGRqDmbNsqfzr18h0WPEpUZmT6MxIdv17AYDliPB1MaaRuAv1Kss05TJrXdfL
Oucv+C0uwnybD9UWAz6pLJ3H/HR9VJFdkaJ4Y0pbCHAuxdd1+swoTpicluHlsJA1
EkK5iWQRMpcmGwKvB0unCAQljNpaJiq4/Tlmmv8JlYpMlmIiVLT0D8BZx5q05QQ=
=oGNw
-----END PGP SIGNATURE-----
Merge tag 'v3.18' into drm-next
Linux 3.18
Backmerge Linus tree into -next as we had conflicts in i915/radeon/nouveau,
and everyone was solving them individually.
* tag 'v3.18': (57 commits)
Linux 3.18
watchdog: s3c2410_wdt: Fix the mask bit offset for Exynos7
uapi: fix to export linux/vm_sockets.h
i2c: cadence: Set the hardware time-out register to maximum value
i2c: davinci: generate STP always when NACK is received
ahci: disable MSI on SAMSUNG 0xa800 SSD
context_tracking: Restore previous state in schedule_user
slab: fix nodeid bounds check for non-contiguous node IDs
lib/genalloc.c: export devm_gen_pool_create() for modules
mm: fix anon_vma_clone() error treatment
mm: fix swapoff hang after page migration and fork
fat: fix oops on corrupted vfat fs
ipc/sem.c: fully initialize sem_array before making it visible
drivers/input/evdev.c: don't kfree() a vmalloc address
cxgb4: Fill in supported link mode for SFP modules
xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
mm/vmpressure.c: fix race in vmpressure_work_fn()
mm: frontswap: invalidate expired data on a dup-store failure
mm: do not overwrite reserved pages counter at show_mem()
drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
...
Conflicts:
drivers/gpu/drm/i915/intel_display.c
drivers/gpu/drm/nouveau/nouveau_drm.c
drivers/gpu/drm/radeon/radeon_cs.c
No point to expose this to the world. The only legitimate user is the
core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
introduce program type BPF_PROG_TYPE_SOCKET_FILTER that is used
for attaching programs to sockets where ctx == skb.
add verifier checks for ABS/IND instructions which can only be seen
in socket filters, therefore the check:
if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER)
verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rule is simple. Don't allow anything that wouldn't be allowed
without unprivileged mappings.
It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.
This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.
The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it. So allowing a uid or gid mapping to be
established without privielge that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated. Violated expectations about the
behavior of the OS is a long way to say a security issue.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged. Add a common function so that
they may all share the same permission checking code.
This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.
A user namespace security fix will update this new helper, shortly.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We've lost the +1 required for correct timeouts in
commit 5ed0bdf21a
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Jul 16 21:05:06 2014 +0000
drm: i915: Use nsec based interfaces
Use ktime_get_raw_ns() and get rid of the back and forth timespec
conversions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: John Stultz <john.stultz@linaro.org>
So fix this up by reinstating our handrolled _timeout function. While
at it bother with handling MAX_JIFFIES.
v2: Convert to usecs (we don't care about the accuracy anyway) first
to avoid overflow issues Dave Gordon spotted.
v3: Drop the explicit MAX_JIFFY_OFFSET check, usecs_to_jiffies should
take care of that already. It might be a bit too enthusiastic about it
though.
v4: Chris has a much nicer color, so use his implementation.
This requires to export nsec_to_jiffies from time.c.
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82749
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for now - just move corresponding ->proc_inum instances over there
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience. There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly. Replace it with the local variable.
This patch causes no functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context. This causes RCU
warnings and possible failures.
This is intended to be a minimal fix suitable for 3.18.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so quite a few
depend on CONFIG_PM or even may be dropped entirely in some cases.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in the PM core code.
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The initial reason for this patch is that I noticed that:
if (len > TRACE_BUF_SIZE)
is off by one. In this code, if len == TRACE_BUF_SIZE, then it means we
have truncated the last character off the output string. If we truncate
two or more characters then we exit without printing.
After some discussion, we decided that printing truncated data is better
than not printing at all so we should just use vscnprintf() and remove
the test entirely. Also I have updated memcpy() to copy the NUL char
instead of setting the NUL in a separate step.
Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, function graph tracer prints "!" or "+" just before
function execution time to signal a function overhead, depending
on the time. And some tracers tracing latency also print "!" or
"+" just after time to signal overhead, depending on the interval
between events. Even it is usually enough to do that, we sometimes
need to signal for bigger execution time than 100 micro seconds.
For example, I used function graph tracer to detect if there is
any case that exit_mm() takes too much time. I did following steps
in /sys/kernel/debug/tracing. It was easier to detect very large
excution time with patched kernel than with original kernel.
$ echo exit_mm > set_graph_function
$ echo function_graph > current_tracer
$ echo > trace
$ cat trace_pipe > $LOGFILE
... (do something and terminate logging)
$ grep "\\$" $LOGFILE
3) $ 22082032 us | } /* kernel_map_pages */
3) $ 22082040 us | } /* free_pages_prepare */
3) $ 22082113 us | } /* free_hot_cold_page */
3) $ 22083455 us | } /* free_hot_cold_page_list */
3) $ 22083895 us | } /* release_pages */
3) $ 22177873 us | } /* free_pages_and_swap_cache */
3) $ 22178929 us | } /* unmap_single_vma */
3) $ 22198885 us | } /* unmap_vmas */
3) $ 22206949 us | } /* exit_mmap */
3) $ 22207659 us | } /* mmput */
3) $ 22207793 us | } /* exit_mm */
And then, it was easy to find out that a schedule-out occured by
sub_preempt_count() within kernel_map_pages().
To detect very large function exection time caused by either problematic
function implementation or scheduling issues, this patch can be useful.
Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support to allow not "!" for and (&&) and (||). That is:
!(field1 == X && field2 == Y)
Where the value of the full clause will be notted.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ted noticed that he could not filter on an event for a bit being cleared.
That's because the filtering logic only tests event fields with a limited
number of comparisons which, for bit logic, only include "&", which can
test if a bit is set, but there's no good way to see if a bit is clear.
This adds a way to do: !(field & 2048)
Which returns true if the bit is not set, and false otherwise.
Note, currently !(field1 == 10 && field2 == 15) is not supported.
That is, the 'not' only works for direct comparisons, not for the
AND and OR logic.
Link: http://lkml.kernel.org/r/20141202021912.GA29096@thunk.org
Link: http://lkml.kernel.org/r/20141202120430.71979060@gandalf.local.home
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In commit 6067dc5a8c ("time: Avoid possible NTP adjustment
mult overflow") a new check was added to watch for adjustments
that could cause a mult overflow.
Unfortunately the check compares a signed with unsigned value
and ignored the case where the adjustment was negative, which
causes spurious warn-ons on some systems (and seems like it
would result in problematic time adjustments there as well, due
to the early return).
Thus this patch adds a check to make sure the adjustment is
positive before we check for an overflow, and resovles the issue
in my testing.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Debugged-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1416890145-30048-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.
Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris bisected a NULL pointer deference in task_sched_runtime() to
commit 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.
Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine. He assumed that this is a new exit
race, but that does not make any sense when looking at that commit.
What's interesting is that, the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.
While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which have been the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime()
do_task_stat(()
thread_group_cputime_adjusted()
thread_group_cputime()
task_cputime()
task_sched_runtime()
if (task_current(rq, p) && task_on_rq_queued(p)) {
update_rq_clock(rq);
up->sched_class->update_curr(rq);
}
If the stats are read for a stomp machine task, aka 'migration/N' and
that task is current on its cpu, this will happily call the NULL pointer
of stop_task->update_curr. Ooops.
Chris observation that this happens faster when he runs the fork bomb
makes sense as the fork bomb will kick migration threads more often so
the probability to hit the issue will increase.
Add the missing update_curr callbacks to the scheduler classes stop_task
and idle_task. While idle tasks cannot be monitored via /proc we have
other means to hit the idle case.
Fixes: 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
Reported-by: Chris Mason <clm@fb.com>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Required to support non PCI based MSI.
[ tglx: Extracted from Jiangs patch series ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement the basic functions for MSI interrupt support with
hierarchical interrupt domains.
[ tglx: Extracted and combined from several patches ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the introduction of stacked domains, we have the issue that,
depending on where in the stack this is called, __irq_set_handler
will succeed or fail: If this is called from the inner irqchip,
__irq_set_handler() will fail, as it will look at the outer domain
as the (desc->irq_data.chip == &no_irq_chip) test fails (we haven't
set the top level yet).
This patch implements the following: "If there is at least one
valid irqchip in the domain, it will probably sort itself out".
This is clearly not ideal, but it is far less confusing then
crashing because the top-level domain is not up yet.
[ tglx: Added comment and a protection against chained interrupts in
that context ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1416048553-29289-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce helper function irq_domain_add_hierarchy(), which creates
a linear irqdomain if parameter 'size' is not zero, otherwise creates
a tree irqdomain.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-5-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a flags to irq_domain.flags to control whether the irqdomain core
should automatically call parent irqdomain's alloc/free callbacks. It
help to reduce hierarchy irqdomains users' code size.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add IRQ_SET_MASK_OK_DONE in addition to IRQ_SET_MASK_OK and
IRQ_SET_MASK_OK_NOCOPY to support stacked irqchip. IRQ_SET_MASK_OK_DONE
is the same as IRQ_SET_MASK_OK to irq core. To stacked irqchip, it means
that ascendant irqchips have done all the work and no more handling
needed in descendant irqchips.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add callback irq_compose_msi_msg to struct irq_chip, which will be used
to support stacked irqchip.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now we already support hierarchy irq_data, so introduce several helpers
to support stacked irq_chips.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It is possible to call irq_create_of_mapping to create/translate the
same IRQ from DT for multiple times. Perform irq_find_mapping check
and set_type for hierarchy irqdomain in irq_create_of_mapping() to
avoid duplicate these functionality in all outer most irqdomain.
Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We plan to use hierarchy irqdomain to suppport CPU vector assignment,
interrupt remapping controller, IO-APIC controller, MSI interrupt
and hypertransport interrupt etc on x86 platforms. So extend irqdomain
interfaces to support hierarchy irqdomain.
There are already many clients of current irqdomain interfaces.
To minimize the changes, we choose to introduce new version 2 interfaces
to support hierarchy instead of extending existing irqdomain interfaces.
According to Thomas's suggestion, the most important design decision is
to build hierarchy struct irq_data to support hierarchy irqdomain, so
hierarchy irqdomain related data could be saved in struct irq_data.
With support of hierarchy irq_data, we could also support stacked
irq_chips. This is most useful in case of set_affinity().
The new hierarchy irqdomain introduces following interfaces:
1) irq_domain_alloc_irqs()/irq_domain_free_irqs(): allocate/release IRQ
and related resources.
2) __irq_domain_alloc_irqs(): a special version to support legacy IRQs.
3) irq_domain_activate_irq()/irq_domain_deactivate_irq(): program
interrupt controllers to activate/deactivate interrupt.
There are also several help functions to ease irqdomain implemenations:
1) irq_domain_get_irq_data(): get irq_data associated with a specific
irqdomain.
2) irq_domain_set_hwirq_and_chip(): save irqdomain specific data into
irq_data.
3) irq_domain_alloc_irqs_parent()/irq_domain_free_irqs_parent(): invoke
parent irqdomain's alloc/free callbacks.
We also changed irq_startup()/irq_shutdown() to invoke
irq_domain_activate_irq()/irq_domain_deactivate_irq() to program
interrupt controller when start/stop interrupts.
[ tglx: Folded parts of the later patch series in ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: two NUMA fixes, two cputime fixes and an RCU/lockdep fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
sched/cputime: Fix cpu_timer_sample_group() double accounting
sched/numa: Avoid selecting oneself as swap target
sched/numa: Fix out of bounds read in sched_init_numa()
sched: Remove lockdep check in sched_move_task()
Pull perf fixes from Ingo Molnar:
"Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
build dependencies fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
perf: Fix corruption of sibling list with hotplug
perf/x86: Fix embarrasing typo
Fix up a few comments that weren't updated when the
functions were converted to use timespec64 structures.
Acked-by: Arnd Bergmann <arnd.bergmann@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Adds a timespec64 based get_monotonic_coarse64() implementation
that can be used as we convert internal users of
get_monotonic_coarse away from using timespecs.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Adds a timespec64 based getrawmonotonic64() implementation
that can be used as we convert internal users of
getrawmonotonic away from using timespecs.
Signed-off-by: John Stultz <john.stultz@linaro.org>
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds safe mktime64() using time64_t.
After this patch, mktime() is deprecated and all its call sites
will be fixed using mktime64(), after that it can be removed.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds timekeeping_inject_sleeptime64() using timespec64.
After this patch, timekeeping_inject_sleeptime() is deprecated
and all its call sites will be fixed using the new interface,
after that it can be removed.
NOTE: timekeeping_inject_sleeptime() is safe actually, but we
want to eliminate timespec eventually, so comes this patch.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The kernel uses 32-bit signed value(time_t) for seconds elapsed
1970-01-01:00:00:00, thus it will overflow at 2038-01-19 03:14:08
on 32-bit systems. This is widely known as the y2038 problem.
As part of addressing "y2038 problem" for in-kernel uses, this patch
adds safe do_settimeofday64() using timespec64.
After this patch, do_settimeofday() is deprecated and all its call
sites will be fixed using do_settimeofday64(), after that it can be
removed.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The clocksource mult-adjustment threshold is [mult-maxadj, mult+maxadj],
timekeeping_adjust() only deals with the upper threshold, but misses the
lower threshold.
This patch adds the lower threshold judging condition.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Minor fix for > 80 char line]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Ideally, __clocksource_updatefreq_scale, selects the largest shift
value possible for a clocksource. This results in the mult memember of
struct clocksource being particularly large, although not so large
that NTP would adjust the clock to cause it to overflow.
That said, nothing actually prohibits an overflow from occuring, its
just that it "shouldn't" occur.
So while very unlikely, and so far never observed, the value of
(cs->mult+cs->maxadj) may have a chance to reach very near 0xFFFFFFFF,
so there is a possibility it may overflow when doing NTP positive
adjustment
See the following detail: When NTP slewes the clock, kernel goes
through update_wall_time()->...->timekeeping_apply_adjustment():
tk->tkr.mult += mult_adj;
Since there is no guard against it, its possible tk->tkr.mult may
overflow during this operation.
This patch avoids any possible mult overflow by judging the overflow
case before adding mult_adj to mult, also adds the WARNING message
when capturing such case.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Kees requested that this test module be renamed for consistency sake,
so this patch renames the udelay_test.c file (recently added to
tip/timers/core for 3.17) to test_udelay.c
Cc: Kees Cook <keescook@chromium.org>
Cc: Greg KH <greg@kroah.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linux-Next <linux-next@vger.kernel.org>
Cc: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
ftrace users who may modify regs->ip to change the execution
path. If two or more users modify the regs->ip on the same
function entry, one of them will be broken. So they must add
IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
flag to have notrace hash, and the ftrace_ops must have a
filter hash (so that the ftrace_ops can hook only specific
entries), because it strongly depends on the address and
must be allowed for only few selected functions.
Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ fixed up some of the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To avoid include hell, the per_cpu variable printk_func was declared
in percpu.h. But it is only defined if printk is defined.
As users of printk may also use the printk_func variable, it needs to
be defined even if CONFIG_PRINTK is not.
Also add a printk.h include in percpu.h just to be safe.
Link: http://lkml.kernel.org/r/20141121183215.01ba539c@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix up a few typos in comments and convert an int into a bool in
update_traceon_count().
Link: http://lkml.kernel.org/r/546DD445.5080108@hitachi.com
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Being able to divert printk to call another function besides the normal
logging is useful for such things like NMI handling. If some functions
are to be called from NMI that does printk() it is possible to lock up
the box if the nmi handler triggers when another printk is happening.
One example of this use is to perform a stack trace on all CPUs via NMI.
But if the NMI is to do the printk() it can cause the system to lock up.
By allowing the printk to be diverted to another function that can safely
record the printk output and then print it when it in a safe context
then NMIs will be safe to call these functions like show_regs().
Link: http://lkml.kernel.org/p/20140619213952.209176403@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.
The seq_buf utility is similar to the seq_file utility, but instead of
writing sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.
It is currently used by the tracing facility for such things like trace
events to convert its binary saved data in the ring buffer into an
ASCII human readable context to be displayed in /sys/kernel/debug/trace.
It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.
Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function bstr_printf() from lib/vsprnintf.c is only available if
CONFIG_BINARY_PRINTF is defined. This is due to the only user currently
being the tracing infrastructure, which needs to select this config
when tracing is configured. Until there is another user of the binary
printf formats, this will continue to be the case.
Since seq_buf.c is now lives in lib/ and is compiled even without
tracing, it must encompass its use of bstr_printf() which is used
by seq_buf_printf(). This too is only used by the tracing infrastructure
and is still encapsulated by the CONFIG_BINARY_PRINTF.
Link: http://lkml.kernel.org/r/20141104160222.969013383@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add two helper functions; seq_buf_get_buf() and seq_buf_commit() that
are used by seq_buf_path(). This makes the code similar to the
seq_file: seq_path() function, and will help to be able to consolidate
the functions between seq_file and trace_seq.
Link: http://lkml.kernel.org/r/20141104160222.644881406@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.977571447@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently seq_buf is full when all but one byte of the buffer is
filled. Change it so that the seq_buf is full when all of the
buffer is filled.
Some of the functions would fill the buffer completely and report
everything was fine. This was inconsistent with the max of size - 1.
Changing this to be max of size makes all functions consistent.
Link: http://lkml.kernel.org/r/20141104160222.502133196@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.811957882@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a seq_buf_can_fit() helper function that removes the possible mistakes
of comparing the seq_buf length plus added data compared to the size of
the buffer.
Link: http://lkml.kernel.org/r/20141118164025.GL23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To be really paranoid about writing out of bound data in
trace_printk_seq(), add another check of len compared to size.
Link: http://lkml.kernel.org/r/20141119144004.GB2332@dhcp128.suse.cz
Suggested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the seq_buf->len will soon be +1 size when there's an overflow, we
must use trace_seq_used() or seq_buf_used() methods to get the real
length. This will prevent buffer overflow issues if just the len
of the seq_buf descriptor is used to copy memory.
Link: http://lkml.kernel.org/r/20141114121911.09ba3d38@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing_fill_pipe_page() logic is a little confusing with the
use of count saving the seq.len and reusing it.
Instead of subtracting a number that is calculated from the saved
value of the seq.len from seq.len, just save the seq.len at the start
and if we need to reset it, just assign it again.
When the seq_buf overflow is len == size + 1, the current logic will
break. Changing it to use a saved length for resetting back to the
original value is more robust and will work when we change the way
seq_buf sets the overflow.
Link: http://lkml.kernel.org/r/20141118161546.GJ23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rewrite seq_buf_path() like it is done in seq_path() and allow
it to accept any escape character instead of just "\n".
Making seq_buf_path() like seq_path() will help prevent problems
when converting seq_file to use the seq_buf logic.
Link: http://lkml.kernel.org/r/20141104160222.048795666@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.338523371@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.
Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/5468F875.7080907@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- fix NULL pointer dereference:
kernel/bpf/arraymap.c:41 array_map_alloc() error: potential null dereference 'array'. (kzalloc returns null)
kernel/bpf/arraymap.c:41 array_map_alloc() error: we previously assumed 'array' could be null (see line 40)
- integer overflow check was missing in arraymap
(hashmap checks for overflow via kmalloc_array())
- arraymap can round_up(value_size, 8) to zero. check was missing.
- hashmap was missing zero size check as well, since roundup_pow_of_two() can
truncate into zero
- found a typo in the arraymap comment and unnecessary empty line
Fix all of these issues and make both overflow checks explicit U32 in size.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the trace_seq of ftrace_raw_output_prep() is full this function
returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero.
The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero!
The thing is, the caller of ftrace_raw_output_prep() expects a
success to be zero. Change that to expect it to be
TRACE_TYPE_HANDLED.
Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz
Reminded-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_printf() and friends are used to store strings into a buffer
that can be passed around from function to function. If the trace_seq buffer
fills up, it will not print any more. The return values were somewhat
inconsistant and using trace_seq_has_overflowed() was a better way to know
if the write to the trace_seq buffer succeeded or not.
Now that all users have removed reading the return value of the printf()
type functions, they can safely return void and keep future users of them
from reading the inconsistent values as well.
Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will not be returning values
soon and will be void functions. To know if they succeeded or not, the
functions trace_seq_has_overflowed() and trace_handle_return() should be
used instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon no longer have
return values. Using trace_seq_has_overflowed() and trace_handle_return()
should be used instead.
Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org
Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon not have a return
value and will only be a void function. Use trace_seq_has_overflowed()
instead to know if the trace_seq operations succeeded or not.
Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The return values for trace_seq_printf() and friends are going to be
removed and they will become void functions. The mmio tracer checked
their return and even did so incorrectly.
Some of the funtions which returned the values were never checked
themselves. Removing all the checks simplifies the code.
Use trace_seq_has_overflowed() and trace_handle_return() where
necessary instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of checking the return value of trace_seq_printf() and friends
for overflowing of the buffer, use the trace_seq_has_overflowed() helper
function.
This cleans up the code quite a bit and also takes us a step closer to
changing the return values of trace_seq_printf() and friends to void.
Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of doing individual checks all over the place that makes the code
very messy. Just check trace_seq_has_overflowed() at the end or in
strategic places.
This makes the code much cleaner and also helps with getting closer
to removing the return values of trace_seq_printf() and friends.
Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch tracer should not be checking the trace_seq_printf() return value
as that will soon be void. There's a new trace_handle_return() helper function
that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed
and TRACE_TYPE_HANDLED otherwise.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.
Just return !trace_seq_has_overflowed() instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Checking the return code of every trace_seq_printf() operation and having
to return early if it overflowed makes the code messy.
Using the new trace_seq_has_overflowed() and trace_handle_return() functions
allows us to clean up the code.
In the future, trace_seq_printf() and friends will be turning into void
functions and not returning a value. The trace_seq_has_overflowed() is to
be used instead. This cleanup allows that change to take place.
Cc: Jens Axboe <axboe@fb.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a trace_seq_has_overflowed() which returns true if the trace_seq
had too much written into it allows us to simplify the code.
Instead of checking the return value of every call to trace_seq_printf()
and friends, they can all be called normally, and at the end we can
return !trace_seq_has_overflowed() instead.
Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq
overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function
was created called trace_handle_return() which takes a trace_seq and
returns these enums. Using this helper function also simplifies the
code.
This change also makes it possible to remove the return values of
trace_seq_printf() and friends. They should instead just be
void functions.
Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In trace_seq_bitmask() it calls bitmap_scnprintf() not from the current
position of the trace_seq buffer (s->buffer + s->len), but instead from
the beginning of the buffer (s->buffer).
Luckily, the only user of this "ipi_raise tracepoint" uses it as the
first parameter, and as such, the start of the temp buffer in
include/trace/ftrace.h (see __get_bitmask()).
Reported-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Stack traces that happen from function tracing check if the address
on the stack is a __kernel_text_address(). That is, is the address
kernel code. This calls core_kernel_text() which returns true
if the address is part of the builtin kernel code. It also calls
is_module_text_address() which returns true if the address belongs
to module code.
But what is missing is ftrace dynamically allocated trampolines.
These trampolines are allocated for individual ftrace_ops that
call the ftrace_ops callback functions directly. But if they do a
stack trace, the code checking the stack wont detect them as they
are neither core kernel code nor module address space.
Adding another field to ftrace_ops that also stores the size of
the trampoline assigned to it we can create a new function called
is_ftrace_trampoline() that returns true if the address is a
dynamically allocate ftrace trampoline. Note, it ignores trampolines
that are not dynamically allocated as they will return true with
the core_kernel_text() function.
Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.
[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read
system calls to allow user space applications to access device PCI I/O
memory pages on s390x platform.
[ Martin Schwidefsky: some code beautification ]
Signed-off-by: Alexey Ishchuk <aishchuk@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The function probe counting for traceon and traceoff suffered a race
condition where if the probe was executing on two or more CPUs at the
same time, it could decrement the counter by more than one when
disabling (or enabling) the tracer only once.
The way the traceon and traceoff probes are suppose to work is that
they disable (or enable) tracing once per count. If a user were to
echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
schedule function was called, it would disable tracing. But the count
should only be decremented once (to 2). Then if the user enabled tracing
again (via tracing_on file), the next call to schedule would disable
tracing again and the count would be decremented to 1.
But if multiple CPUS called schedule at the same time, it is possible
that the count would be decremented more than once because of the
simple "count--" used.
By reading the count into a local variable and using memory barriers
we can guarantee that the count would only be decremented once per
disable (or enable).
The stack trace probe had a similar race, but here the stack trace will
decrement for each time it is called. But this had the read-modify-
write race, where it could stack trace more than the number of times
that was specified. This case we use a cmpxchg to stack trace only the
number of times specified.
The dump probes can still use the old "update_count()" function as
they only run once, and that is controlled by the dump logic
itself.
Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The number of and dependencies between high-level power management
Kconfig options make life much harder than necessary. Several
conbinations of them have to be tested and supported, even though
some of those combinations are very rarely used in practice (if
they are used in practice at all). Moreover, the fact that we
have separate independent Kconfig options for runtime PM and
system suspend is a serious obstacle for integration between
the two frameworks.
To overcome these difficulties, always select PM_RUNTIME if PM_SLEEP
is set. Among other things, this will allow system suspend callbacks
provided by bus types and device drivers to rely on the runtime PM
framework regardless of the kernel configuration.
Enthusiastically-acked-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
proper types and function helpers are ready. Use them in verifier testsuite.
Remove temporary stubs
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix errno of BPF_MAP_LOOKUP_ELEM command as bpf manpage
described it in commit b4fc1a460f30("Merge branch 'bpf-next'"):
-----
BPF_MAP_LOOKUP_ELEM
int bpf_lookup_elem(int fd, void *key, void *value)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
};
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
bpf() syscall looks up an element with given key in a map fd.
If element is found it returns zero and stores element's value
into value. If element is not found it returns -1 and sets
errno to ENOENT.
and further down in manpage:
ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
element with given key was not found.
-----
In general all BPF commands return ENOENT when map element is not found
(including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
flags == BPF_MAP_UPDATE_ONLY)
Subsequent patch adds a testsuite to check return values for all of
these combinations.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_ARRAY and its implementation
- optimized for fastest possible lookup()
. in the future verifier/JIT may recognize lookup() with constant key
and optimize it into constant pointer. Can optimize non-constant
key into direct pointer arithmetic as well, since pointers and
value_size are constant for the life of the eBPF program.
In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
while preserving concurrent access to this map from user space
- two main use cases for array type:
. 'global' eBPF variables: array of 1 element with key=0 and value is a
collection of 'global' variables which programs can use to keep the state
between events
. aggregation of tracing events into fixed set of buckets
- all array elements pre-allocated and zero initialized at init time
- key as an index in array and can only be 4 byte
- map_delete_elem() returns EINVAL, since elements cannot be deleted
- map_update_elem() replaces elements in an non-atomic way
(for atomic updates hashtable type should be used instead)
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_HASH and its implementation
- maps are created/destroyed by userspace. Both userspace and eBPF programs
can lookup/update/delete elements from the map
- eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism
for concurrent updates
- key/value are opaque range of bytes (aligned to 8 bytes)
- user space provides 3 configuration attributes via BPF syscall:
key_size, value_size, max_entries
- map takes care of allocating/freeing key/value pairs
- map_update_elem() must fail to insert new element when max_entries
limit is reached to make sure that eBPF programs cannot exhaust memory
- map_update_elem() replaces elements in an atomic way
- optimized for speed of lookup() which can be called multiple times from
eBPF program which itself is triggered by high volume of events
. in the future JIT compiler may recognize lookup() call and optimize it
further, since key_size is constant for life of eBPF program
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
either update existing map element or create a new one.
Initially the plan was to add a new command to handle the case of
'create new element if it didn't exist', but 'flags' style looks
cleaner and overall diff is much smaller (more code reused), so add 'flags'
attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
#define BPF_ANY 0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element if it didn't exist */
#define BPF_EXIST 2 /* update existing element */
bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
if element already exists.
bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
if element doesn't exist.
Userspace will call it as:
int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
.flags = flags;
};
return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
First two bits of 'flags' are used to encode style of bpf_update_elem() command.
Bits 2-63 are reserved for future use.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement cgroup_get_e_css() which finds and gets the effective css
for the specified cgroup and subsystem combination. This function
always returns a valid pinned css. This will be used by cgroup
writeback support.
While at it, add comment to cgroup_e_css() to explain why that
function is different from cgroup_get_e_css() and has to test
cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Add a new cgroup_subsys operatoin ->css_e_css_changed(). This is
invoked if any of the effective csses seen from the css's cgroup may
have changed. This will be used to implement cgroup writeback
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Add a new cgroup subsys callback css_released(). This is called when
the reference count of the css (cgroup_subsys_state) reaches zero
before RCU scheduling free.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared
asynchronously. If cgroup_subtree_control_write() is requested to
enable the subsystem again before the entry is cleared, it has to wait
for the previous offlining to finish and clear the @cgrp->subsys[]
entry before trying to enable the subsystem again.
This is currently done while verifying the input enable / disable
parameters. This used to be correct but f63070d350 ("cgroup: make
interface files visible iff enabled on cgroup->subtree_control")
breaks it. The commit is one of the commits implementing subsystem
dependency.
Through subsystem dependency, some subsystems may be enabled and
disabled implicitly in addition to the explicitly requested ones. The
actual subsystems to be enabled and disabled are determined during
@css_enable/disable calculation. The current offline wait logic skips
the ones which are already implicitly enabled and then waits for
subsystems in @enable; however, this misses the subsystems which may
be implicitly enabled through dependency from @enable. If such
implicitly subsystem hasn't yet finished offlining yet, the function
ends up trying to create a css when its @cgrp->subsys[] slot is
already occupied triggering BUG_ON() in init_and_link_css().
Fix it by moving the wait logic after @css_enable is calculated and
waiting for all the subsystems in @css_enable. This fixes the above
bug as the mask contains all subsystems which are to be enabled
including the ones enabled through dependencies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f63070d350 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control")
Acked-by: Zefan Li <lizefan@huawei.com>
Make cgroup_subtree_control_write() first calculate new
subtree_control (new_sc), child_subsys_mask (new_ss) and
css_enable/disable masks before applying them to the cgroup. Also,
store the original subtree_control (old_sc) and child_subsys_mask
(old_ss) and use them to restore the orignal state after failure.
This patch shouldn't cause any behavior changes. This prepares for a
fix for a bug in the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_refresh_child_subsys_mask() calculates and updates the
effective @cgrp->child_subsys_maks according to the current
@cgrp->subtree_control. Separate out the calculation part into
cgroup_calc_child_subsys_mask(). This will be used to fix a bug in
the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
If BL_SWITCHER is enabled but SUSPEND and CPU_IDLE is not enabled
we are getting following config warning.
warning: (BL_SWITCHER) selects CPU_PM which has unmet direct
dependencies (SUSPEND || CPU_IDLE)
It has been noticed that CPU_PM dependencies in this file are not really
required so let's remove these dependencies from CPU_PM.
Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The vfree() function performs also input parameter validation. Thus the test
around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-popualated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds new fields about bound violation into siginfo
structure. si_lower and si_upper are respectively lower bound
and upper bound when bound violation is caused.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The version field defined in the audit status structure was found to have
limitations in terms of its expressibility of features supported. This is
distict from the get/set features call to be able to command those features
that are present.
Converting this field from a version number to a feature bitmap will allow
distributions to selectively backport and support certain features and will
allow upstream to be able to deprecate features in the future. It will allow
userspace clients to first query the kernel for which features are actually
present and supported. Currently, EINVAL is returned rather than EOPNOTSUP,
which isn't helpful in determining if there was an error in the command, or if
it simply isn't supported yet. Past features are not represented by this
bitmap, but their use may be converted to EOPNOTSUP if needed in the future.
Since "version" is too generic to convert with a #define, use a union in the
struct status, introducing the member "feature_bitmap" unionized with
"version".
Convert existing AUDIT_VERSION_* macros over to AUDIT_FEATURE_BITMAP*
counterparts, leaving the former for backwards compatibility.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: minor whitespace tweaks]
Signed-off-by: Paul Moore <pmoore@redhat.com>
This patch reorders fields in the perf_sample_data struct in order to
minimize the number of cachelines touched in perf_sample_data_init().
It also removes some intializations which are redundant with the code
in kernel/events/core.c
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: jolsa@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable capture of interrupted machine state for each sample.
Registers to sample are passed per event in the sample_regs_intr bitmask.
To sample interrupt machine state, the PERF_SAMPLE_INTR_REGS must be passed in
sample_type.
The list of available registers is arch dependent and provided by asm/perf_regs.h
Registers are laid out as u64 in the order of the bit order of sample_intr_regs.
This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend
the perf_event_attr struct with a new u64 field.
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Actually, cpudl_set() and cpudl_init() can never be used without
CONFIG_SMP.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Actually, cpupri_set() and cpupri_init() can never be used without
CONFIG_SMP.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do not call dequeue_pushable_dl_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.
Actually the patch is the same behavior as commit 311e800e16 ("sched,
rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded > imbalance > other, and busiest group is picked according
to this order.
sgs->group_capacity_factor is used to check if the group is overloaded.
When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor will be set lower than one in order to
move all the excess tasks away.
However, group overloaded status is not updated when
sgs->group_capacity_factor is set to lower than one, which leads to us
missing to find the busiest group.
This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.
Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As discussed in [1], accounting IO is meant for blkio only. Document that
so driver authors won't use them for device io.
[1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit d670ec1317 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.
Reproducer/tester can be found further below, it can be compiled and ran by:
gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done
This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".
Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.
KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .
This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.
Full reproducer (tst-cpuclock2.c):
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>
/* Parameters for the Linux kernel ABI for CPU clocks. */
#define CPUCLOCK_SCHED 2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
static pthread_barrier_t barrier;
/* Help advance the clock. */
static void *chew_cpu(void *arg)
{
pthread_barrier_wait(&barrier);
while (1) ;
return NULL;
}
/* Don't use the glibc wrapper. */
static int do_nanosleep(int flags, const struct timespec *req)
{
clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}
static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
return after_i - before_i;
}
int main(void)
{
int result = 0;
pthread_t th;
pthread_barrier_init(&barrier, NULL, 2);
if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
perror("pthread_create");
return 1;
}
pthread_barrier_wait(&barrier);
/* The test. */
struct timespec before, after, sleeptimeabs;
int64_t sleepdiff, diffabs;
const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
/* The relative nanosleep. Not sure why this is needed, but its presence
seems to make it easier to reproduce the problem. */
if (do_nanosleep(0, &sleeptime) != 0) {
perror("clock_nanosleep");
return 1;
}
/* Get the current time. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
perror("clock_gettime[2]");
return 1;
}
/* Compute the absolute sleep time based on the current time. */
uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
sleeptimeabs.tv_nsec = nsec % 1000000000;
/* Sleep for the computed time. */
if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
perror("absolute clock_nanosleep");
return 1;
}
/* Get the time after the sleep. */
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
perror("clock_gettime[3]");
return 1;
}
/* The time after sleep should always be equal to or after the absolute sleep
time passed to clock_nanosleep. */
sleepdiff = tsdiff(&sleeptimeabs, &after);
if (sleepdiff < 0) {
printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
result = 1;
printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
}
/* The difference between the timestamps taken before and after the
clock_nanosleep call should be equal to or more than the duration of the
sleep. */
diffabs = tsdiff(&before, &after);
if (diffabs < sleeptime.tv_nsec) {
printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
result = 1;
}
pthread_cancel(th);
return result;
}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:
cpu_timer_sample_group()
thread_group_cputimer()
thread_group_cputime()
times->sum_exec_runtime += task_sched_runtime();
*sample = cputime.sum_exec_runtime + task_delta_exec();
Which would make the sample run ahead, making the sleep short.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>