Future fixes will use sysfs files that contain cpumask output. The code
needs to know the length of the cpumask in order to determine which cpus
are set in a cpumask. Currently topo.max_cpu_num is the maximum cpu
number. It can be increased the the maximum value of cpus represented in
cpumasks.
Set max_num_cpus to the length of a cpumask.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>
There's a use case during test to only print specific round of iterations
if --num_iterations is specified, for example, with this patch applied:
turbostat -i 5 -n 4
will capture 4 samples with 5 seconds interval.
[lenb: renamed to --num_iterations from --iterations]
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
All MSRs related to turbostat are same as Kabylake.
Even though SDM claims that core C3 residency can be read from MSR 0x662,
the read on this MSR fails on CNL platform. Hence disabled C3 MSR read
and display.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The SNB_C1_AUTO_UNDEMOTE definition should have been deleted once
it was copied into msr-index.h. One copy of the truth is better --
particularly when Matt needs to fix it:-)
Signed-off-by: Len Brown <len.brown@intel.com>
According to the Intel Software Developers' Manual, Vol. 4, Order No.
335592, these macros have been reversed since they were added.
Fixes: 889facbee3 ("tools/power turbostat: v3.0: monitor Watts and Temperature")
Signed-off-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Like the "C1" and "C1%" column, the new POLL and POLL% columns
show invocations and residency% during the measurement interval.
While it didn't seem important to track in the past,
we've recently found some Linux cpuidle bugs related to POLL%.
Signed-off-by: Len Brown <len.brown@intel.com>
The column header for PC10 residency is "Pk%pc10"
This is missing the 'g' that others have, eg Pkg%pc6,
to allow tab-delimited columns to fit into 8-columns.
However, --hide Pk%pc10 did not work, it was still looking for the 'g'.
This was confusing, because --list shows the correct "Pk%pc10"
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Linux 4.15 exports the ACPI Low Power Idle Table's
counters in /sys/devices/system/cpu/cpuidle/
low_power_idle_cpu_residency_us
Show this in the "CPU%LPI" column.
Today this reflects the "North Complex"
residency in PC10, so expect it to
closely follow "Pk%pc10".
low_power_idle_system_residency_us
Show this in the "SYS%LPI" column.
Today, this reflects the North is in PC10,
plus the PCH is sufficiently quiescent
to save additional power via the "S0ix"
system state, as measured by the
PCH SLP_S0 counter.
Signed-off-by: Len Brown <len.brown@intel.com>
rpm-lint flagged these as being executable:
kernel-tools.x86_64: W: spurious-executable-perm /usr/share/man/man8/turbostat.8.gz
kernel-tools.x86_64: W: spurious-executable-perm /usr/share/man/man8/x86_energy_perf_policy.8.gz
Fix this
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>
When the user reuests to collect and show columns
that are not present on every row (eg. for every CPU)
turbostat still prints an (empty) line for every CPU.
Update so no blank lines are printed.
old:
# turbostat --quiet --show Pkg%pc6
Pkg%pc6
9.12
9.12
Pkg%pc6
9.12
9.12
new:
# turbostat --quiet --show Pkg%pc6
Pkg%pc6
9.12
9.12
Pkg%pc6
9.12
9.12
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Improve readability a little bit by changing this output:
MSR_PKG_CST_CONFIG_CONTROL: 0x00008407 (locked: pkg-cstate-limit=7: unlimited, automatic-c-state-conversion=off)
with this output:
MSR_PKG_CST_CONFIG_CONTROL: 0x00008407 (locked, pkg-cstate-limit=7 (unlimited), automatic-c-state-conversion=off)
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
BDX and SKX have a bit that tells them to PROMOTE shallow
C-states requests to MWAIT(C6). It is generally a BIOS bug
if this bit is set. As we have encountered that BIOS bug,
let's print this bit in turbostat debug output.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Some SKX use a 24 MHz crystal, so do not hard code 25 MHz.
Also, SKX crystal is not exact, because SKX uses an EMI reduction
circuit that costs a fraction of a percent.
Signed-off-by: Len Brown <len.brown@intel.com>
MSR_IA32_MISC_ENABLE[18] is the MWAIT ENABLE bit, not DISABLE bit...
so
MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST No-MWAIT PREFETCH TURBO)
should print as:
MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
Signed-off-by: Len Brown <len.brown@intel.com>
The recent patch that implements table printing on a keypress introduced a
regression - turbostat prints the table almost continuously if it is run from a
daemon program.
The problem is also easy to reproduce like this:
echo | turbostat
The reason is that we cannot assume that stdin is always a TTY. It can be many
things.
This patch adds fixes the problem by limiting the new keypress functionality to
TTYs only. If stdin is not a TTY, we just sleep for the full interval time.
While on it, clean-up 'do_sleep()' to return no value, as callers do not expect
that anyway.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
In turbostat interval mode, a newline typed on standard input
will now conclude the current interval. Data will immediately
be collected and printed for that interval, and the next interval
will be started.
This is similar to the recently added SIGUSR1 feature.
But that is for use by programs, while this is for interactive use.
Signed-off-by: Len Brown <len.brown@intel.com>
Interval-mode turbostat now catches and discards SIGUSR1.
Thus, SIGUSR1 can be used to tell turbostat to cut short
the current measurement interval. Turbostat will then start
the next measurement interval using the regular interval length.
This can be used to give turbostat variable intervals.
Invoke turbostat with --interval LARGE_NUMBER_SEC
and have a program that has permission to send it a SIGUSR1
always before LARGE_NUMBER_SEC expires.
It may also be useful to use "--enable Time_Of_Day_Seconds"
to observe the actual interval length.
Signed-off-by: Len Brown <len.brown@intel.com>
Add a Time_Of_Day_Seconds column showing when measurement
for each row was completed. Units are [sec.subsec] since Epoch,
as reported by gettimeofday(2).
While useful to correlate turbostat output with other tools,
this built-in column is disabled, by default.
Add the "--enable" option to enable such disabled-by-default
built-in columns:
"--enable Time_Of_Day_Seconds"
"--enable usec"
"--enable all", will enable all disabled-by-defauilt built-in counters.
When "--debug" is used, all disabled-by-default columns are enabled,
unless explicitly skipped using "--hide"
Signed-off-by: Len Brown <len.brown@intel.com>
Turbostat neglects to display all package C-states for some Skylake Xeon BIOS configurations.
This is due to a typo in the table decoding MSR_PKG_CST_CONFIG_CONTROL (0x000000e2)
Here we fix that typo, according to Intel SDM, vol 4, Table 2-41 -
"MSRs Supported by Intel® Xeon® Processor Scalable Family with DisplayFamily_DisplayModel 06_55H".
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kasan: fix memory hotplug during boot
kasan: free allocated shadow memory on MEM_CANCEL_ONLINE
checkpatch: fix macro argument precedence test
init/main.c: include <linux/mem_encrypt.h>
kernel/sys.c: fix potential Spectre v1 issue
mm/memory_hotplug: fix leftover use of struct page during hotplug
proc: fix smaps and meminfo alignment
mm: do not warn on offline nodes unless the specific node is explicitly requested
mm, memory_hotplug: make has_unmovable_pages more robust
mm/kasan: don't vfree() nonexistent vm_area
MAINTAINERS: change hugetlbfs maintainer and update files
ipc/shm: fix shmat() nil address after round-down when remapping
Revert "ipc/shm: Fix shmat mmap nil-page protection"
idr: fix invalid ptr dereference on item delete
ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"
mm: fix nr_rotate_swap leak in swapon() error case
Pull networking fixes from David Miller:
"Let's begin the holiday weekend with some networking fixes:
1) Whoops need to restrict cfg80211 wiphy names even more to 64
bytes. From Eric Biggers.
2) Fix flags being ignored when using kernel_connect() with SCTP,
from Xin Long.
3) Use after free in DCCP, from Alexey Kodanev.
4) Need to check rhltable_init() return value in ipmr code, from Eric
Dumazet.
5) XDP handling fixes in virtio_net from Jason Wang.
6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu.
7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack
Morgenstein.
8) Prevent out-of-bounds speculation using indexes in BPF, from
Daniel Borkmann.
9) Fix regression added by AF_PACKET link layer cure, from Willem de
Bruijn.
10) Correct ENIC dma mask, from Govindarajulu Varadarajan.
11) Missing config options for PMTU tests, from Stefano Brivio"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
ibmvnic: Fix partial success login retries
selftests/net: Add missing config options for PMTU tests
mlx4_core: allocate ICM memory in page size chunks
enic: set DMA mask to 47 bit
ppp: remove the PPPIOCDETACH ioctl
ipv4: remove warning in ip_recv_error
net : sched: cls_api: deal with egdev path only if needed
vhost: synchronize IOTLB message with dev cleanup
packet: fix reserve calculation
net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
bpf: properly enforce index mask to prevent out-of-bounds speculation
net/mlx4: Fix irq-unsafe spinlock usage
net: phy: broadcom: Fix bcm_write_exp()
net: phy: broadcom: Fix auxiliary control register reads
net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
ibmvnic: Only do H_EOI for mobility events
tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
virtio-net: fix leaking page for gso packet during mergeable XDP
...
If the radix tree underlying the IDR happens to be full and we attempt
to remove an id which is larger than any id in the IDR, we will call
__radix_tree_delete() with an uninitialised 'slot' pointer, at which
point anything could happen. This was easiest to hit with a single
entry at id 0 and attempting to remove a non-0 id, but it could have
happened with 64 entries and attempting to remove an id >= 64.
Roman said:
The syzcaller test boils down to opening /dev/kvm, creating an
eventfd, and calling a couple of KVM ioctls. None of this requires
superuser. And the result is dereferencing an uninitialized pointer
which is likely a crash. The specific path caught by syzbot is via
KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
other user-triggerable paths, so cc:stable is probably justified.
Matthew added:
We have around 250 calls to idr_remove() in the kernel today. Many of
them pass an ID which is embedded in the object they're removing, so
they're safe. Picking a few likely candidates:
drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
drivers/atm/nicstar.c could be taken down by a handcrafted packet
Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
Fixes: 0a835c4f09 ("Reimplement IDR and IDA using the radix tree")
Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com>
Debugged-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf 2018-05-24
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a bug in the original fix to prevent out of bounds speculation when
multiple tail call maps from different branches or calls end up at the
same tail call helper invocation, from Daniel.
2) Two selftest fixes, one in reuseport_bpf_numa where test is skipped in
case of missing numa support and another one to update kernel config to
properly support xdp_meta.sh test, from Anders.
...
Would be great if you have a chance to merge net into net-next after that.
The verifier fix would be needed later as a dependency in bpf-next for
upcomig work there. When you do the merge there's a trivial conflict on
BPF side with 849fa50662 ("bpf/verifier: refine retval R0 state for
bpf_get_stack helper"): Resolution is to keep both functions, the
do_refine_retval_range() and record_func_map().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
PMTU tests in pmtu.sh need support for VTI, VTI6 and dummy
interfaces: add them to config file.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: d1f1b9cbf3 ("selftests: net: Introduce first PMTU test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reuseport_bpf_numa test case fails there's no numa support. The
test shouldn't fail if there's no support it should be skipped.
Fixes: 3c2c3c16aa ("reuseport, bpf: add test case for bpf_get_numa_node_id")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
Pull networking fixes from David Miller:
1) Fix refcounting bug for connections in on-packet scheduling mode of
IPVS, from Julian Anastasov.
2) Set network header properly in AF_PACKET's packet_snd, from Willem
de Bruijn.
3) Fix regressions in 3c59x by converting to generic DMA API. It was
relying upon the hack that the PCI DMA interfaces would accept NULL
for EISA devices. From Christoph Hellwig.
4) Remove RDMA devices before unregistering netdev in QEDE driver, from
Michal Kalderon.
5) Use after free in TUN driver ptr_ring usage, from Jason Wang.
6) Properly check for missing netlink attributes in SMC_PNETID
requests, from Eric Biggers.
7) Set DMA mask before performaing any DMA operations in vmxnet3
driver, from Regis Duchesne.
8) Fix mlx5 build with SMP=n, from Saeed Mahameed.
9) Classifier fixes in bcm_sf2 driver from Florian Fainelli.
10) Tuntap use after free during release, from Jason Wang.
11) Don't use stack memory in scatterlists in tls code, from Matt
Mullins.
12) Not fully initialized flow key object in ipv4 routing code, from
David Ahern.
13) Various packet headroom bug fixes in ip6_gre driver, from Petr
Machata.
14) Remove queues from XPS maps using correct index, from Amritha
Nambiar.
15) Fix use after free in sock_diag, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits)
net: ip6_gre: fix tunnel metadata device sharing.
cxgb4: fix offset in collecting TX rate limit info
net: sched: red: avoid hashing NULL child
sock_diag: fix use-after-free read in __sk_free
sh_eth: Change platform check to CONFIG_ARCH_RENESAS
net: dsa: Do not register devlink for unused ports
net: Fix a bug in removing queues from XPS map
bpf: fix truncated jump targets on heavy expansions
bpf: parse and verdict prog attach may race with bpf map update
bpf: sockmap update rollback on error can incorrectly dec prog refcnt
net: test tailroom before appending to linear skb
net: ip6_gre: Fix ip6erspan hlen calculation
net: ip6_gre: Split up ip6gre_changelink()
net: ip6_gre: Split up ip6gre_newlink()
net: ip6_gre: Split up ip6gre_tnl_change()
net: ip6_gre: Split up ip6gre_tnl_link_config()
net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()
net: ip6_gre: Request headroom in __gre6_xmit()
selftests/bpf: check return value of fopen in test_verifier.c
erspan: fix invalid erspan version.
...
Pull x86 fixes from Thomas Gleixner:
"An unfortunately larger set of fixes, but a large portion is
selftests:
- Fix the missing clusterid initializaiton for x2apic cluster
management which caused boot failures due to IPIs being sent to the
wrong cluster
- Drop TX_COMPAT when a 64bit executable is exec()'ed from a compat
task
- Wrap access to __supported_pte_mask in __startup_64() where clang
compile fails due to a non PC relative access being generated.
- Two fixes for 5 level paging fallout in the decompressor:
- Handle GOT correctly for paging_prepare() and
cleanup_trampoline()
- Fix the page table handling in cleanup_trampoline() to avoid
page table corruption.
- Stop special casing protection key 0 as this is inconsistent with
the manpage and also inconsistent with the allocation map handling.
- Override the protection key wen moving away from PROT_EXEC to
prevent inaccessible memory.
- Fix and update the protection key selftests to address breakage and
to cover the above issue
- Add a MOV SS self test"
[ Part of the x86 fixes were in the earlier core pull due to dependencies ]
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm: Drop TS_COMPAT on 64-bit exec() syscall
x86/apic/x2apic: Initialize cluster ID properly
x86/boot/compressed/64: Fix moving page table out of trampoline memory
x86/boot/compressed/64: Set up GOT for paging_prepare() and cleanup_trampoline()
x86/pkeys: Do not special case protection key 0
x86/pkeys/selftests: Add a test for pkey 0
x86/pkeys/selftests: Save off 'prot' for allocations
x86/pkeys/selftests: Fix pointer math
x86/pkeys: Override pkey when moving away from PROT_EXEC
x86/pkeys/selftests: Fix pkey exhaustion test off-by-one
x86/pkeys/selftests: Add PROT_EXEC test
x86/pkeys/selftests: Factor out "instruction page"
x86/pkeys/selftests: Allow faults on unknown keys
x86/pkeys/selftests: Avoid printf-in-signal deadlocks
x86/pkeys/selftests: Remove dead debugging code, fix dprint_in_signal
x86/pkeys/selftests: Stop using assert()
x86/pkeys/selftests: Give better unexpected fault error messages
x86/selftests: Add mov_to_ss test
x86/mpx/selftests: Adjust the self-test to fresh distros that export the MPX ABI
x86/pkeys/selftests: Adjust the self-test to fresh distros that export the pkeys ABI
...
Pull perf tooling fixes from Thomas Gleixner:
- fix segfault when processing unknown threads in cs-etm
- fix "perf test inet_pton" on s390 failing due to missing inline
- display all available events on 'perf annotate --stdio'
- add missing newline when parsing an empty BPF program
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Add missing newline when parsing empty BPF proggie
perf cs-etm: Remove redundant space
perf cs-etm: Support unknown_thread in cs_etm_auxtrace
perf annotate: Display all available events on --stdio
perf test: "probe libc's inet_pton" fails on s390 due to missing inline
Pull core fixes from Thomas Gleixner:
- Unbreak the BPF compilation which got broken by the unconditional
requirement of asm-goto, which is not supported by clang.
- Prevent probing on exception masking instructions in uprobes and
kprobes to avoid the issues of the delayed exceptions instead of
having an ugly workaround.
- Prevent a double free_page() in the error path of do_kexec_load()
- A set of objtool updates addressing various issues mostly related to
switch tables and the noreturn detection for recursive sibling calls
- Header sync for tools.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Detect RIP-relative switch table references, part 2
objtool: Detect RIP-relative switch table references
objtool: Support GCC 8 switch tables
objtool: Support GCC 8's cold subfunctions
objtool: Fix "noreturn" detection for recursive sibling calls
objtool, kprobes/x86: Sync the latest <asm/insn.h> header with tools/objtool/arch/x86/include/asm/insn.h
x86/cpufeature: Guard asm_volatile_goto usage for BPF compilation
uprobes/x86: Prohibit probing on MOV SS instruction
kprobes/x86: Prohibit probing on exception masking instructions
x86/kexec: Avoid double free_page() upon do_kexec_load() failure
With the following commit:
fd35c88b74 ("objtool: Support GCC 8 switch tables")
I added a "can't find switch jump table" warning, to stop covering up
silent failures if add_switch_table() can't find anything.
That warning found yet another bug in the objtool switch table detection
logic. For cases 1 and 2 (as described in the comments of
find_switch_table()), the find_symbol_containing() check doesn't adjust
the offset for RIP-relative switch jumps.
Incidentally, this bug was already fixed for case 3 with:
6f5ec2993b ("objtool: Detect RIP-relative switch table references")
However, that commit missed the fix for cases 1 and 2.
The different cases are now starting to look more and more alike. So
fix the bug by consolidating them into a single case, by checking the
original dynamic jump instruction in the case 3 loop.
This also simplifies the code and makes it more robust against future
switch table detection issues -- of which I'm sure there will be many...
Switch table detection has been the most fragile area of objtool, by
far. I long for the day when we'll have a GCC plugin for annotating
switch tables. Linus asked me to delay such a plugin due to the
flakiness of the plugin infrastructure in older versions of GCC, so this
rickety code is what we're stuck with for now. At least the code is now
a little simpler than it was.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f400541613d45689086329432f3095119ffbc328.1526674218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a test which shows a race in the multi-order iteration code. This
test reliably hits the race in under a second on my machine, and is the
result of a real bug report against kernel a production v4.15 based
kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit
when using order 9 PMD DAX radix tree entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember that an order 2
entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
In the radix tree test suite this will be caught by the address
sanitizer:
==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
#0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
#1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
#2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
#3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
#4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the lifetime of "struct item" entries in the radix tree are
not controlled by RCU, but are instead deleted inline as they are
removed from the tree.
In the following patches we add a test which has threads iterating over
items pulled from the tree and verifying them in an
rcu_read_lock()/rcu_read_unlock() section. This means that though an
item has been removed from the tree it could still be being worked on by
other threads until the RCU grace period expires. So, we need to
actually free the "struct item" structures at the end of the grace
period, just as we do with "struct radix_tree_node" items.
Link: http://lkml.kernel.org/r/20180503192430.7582-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pulled from a patch from Matthew Wilcox entitled "xarray: Add definition
of struct xarray":
> From: Matthew Wilcox <mawilcox@microsoft.com>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
https://patchwork.kernel.org/patch/10341249/
These defines fix this compilation error:
In file included from ./linux/radix-tree.h:6:0,
from ./linux/../../../../include/linux/idr.h:15,
from ./linux/idr.h:1,
from idr.c:4:
./linux/../../../../include/linux/idr.h: In function `idr_init_base':
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function `spin_lock_init'; did you mean `spinlock_t'? [-Wimplicit-function-declaration]
spin_lock_init(&(root)->xa_lock); \
^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro `INIT_RADIX_TREE'
INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
^~~~~~~~~~~~~~~
by providing a spin_lock_init() wrapper for the v4.17-rc* version of the
radix tree test suite.
Link: http://lkml.kernel.org/r/20180503192430.7582-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c6ce3e2fe3 ("radix tree test suite: Add config option for map
shift") introduced a phony makefile target called 'mapshift' that ends
up generating the file generated/map-shift.h. This phony target was
then added as a dependency of the top level 'targets' build target,
which is what is run when you go to tools/testing/radix-tree and just
type 'make'.
Unfortunately, this phony target doesn't actually work as a dependency,
so you end up getting:
$ make
make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. Stop.
make: *** Waiting for unfinished jobs....
Fix this by making the file generated/map-shift.h our real makefile
target, and add this a dependency of the top level build target.
Link: http://lkml.kernel.org/r/20180503192430.7582-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running bpf's selftest test_xdp_meta.sh it fails:
./test_xdp_meta.sh
Error: Specified qdisc not found.
selftests: test_xdp_meta [FAILED]
Need to enable CONFIG_NET_SCH_INGRESS and CONFIG_NET_CLS_ACT to get the
test to pass.
Fixes: 22c8852624 ("bpf: improve selftests and add tests for meta pointer")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 0a67487403 ("selftests/bpf: Only run tests if !bpf_disabled")
forgot to check return value of fopen.
This caused some confusion, when running test_verifier (from
tools/testing/selftests/bpf/) on an older kernel (< v4.4) as it will
simply seqfault.
This fix avoids the segfault and prints an error, but allow program to
continue. Given the sysctl was introduced in 1be7f75d16 ("bpf:
enable non-root eBPF programs"), we know that the running kernel
cannot support unpriv, thus continue with unpriv_disabled = true.
Fixes: 0a67487403 ("selftests/bpf: Only run tests if !bpf_disabled")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* x86 fixes: PCID, UMIP, locking
* Improved support for recent Windows version that have a 2048 Hz
APIC timer.
* Rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
* Better behaved selftests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJa/bkTAAoJEL/70l94x66Dzf8IAJ1GqtXi0CNbq8MvU4QIqw0L
HLIRoe/QgkTeTUa2fwirEuu5I+/wUyPvy5sAIsn/F5eiZM7nciLm+fYzw6F2uPIm
lSCqKpVwmh8dPl1SBaqPnTcB1HPVwcCgc2SF9Ph7yZCUwFUtoeUuPj8v6Qy6y21g
jfobHFZa3MrFgi7kPxOXSrC1qxuNJL9yLB5mwCvCK/K7jj2nrGJkLLDuzgReCqvz
isOdpof3hz8whXDQG5cTtybBgE9veym4YqJY8R5ANXBKqbFlhaNF1T3xXrdPMISZ
7bsGgkhYEOqeQsPrFwzAIiFxe2DogFwkn1BcvJ1B+duXrayt5CBnDPRB6Yxg00M=
=H0d0
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- ARM/ARM64 locking fixes
- x86 fixes: PCID, UMIP, locking
- improved support for recent Windows version that have a 2048 Hz APIC
timer
- rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
- better behaved selftests
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME
KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity
KVM: arm/arm64: Properly protect VGIC locks from IRQs
KVM: X86: Lower the default timer frequency limit to 200us
KVM: vmx: update sec exec controls for UMIP iff emulating UMIP
kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
KVM: selftests: exit with 0 status code when tests cannot be run
KVM: hyperv: idr_find needs RCU protection
x86: Delay skip of emulated hypercall instruction
KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
Typically a switch table can be found by detecting a .rodata access
followed an indirect jump:
1969: 4a 8b 0c e5 00 00 00 mov 0x0(,%r12,8),%rcx
1970: 00
196d: R_X86_64_32S .rodata+0x438
1971: e9 00 00 00 00 jmpq 1976 <dispc_runtime_suspend+0xb6a>
1972: R_X86_64_PC32 __x86_indirect_thunk_rcx-0x4
Randy Dunlap reported a case (seen with GCC 4.8) where the .rodata
access uses RIP-relative addressing:
19bd: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 19c4 <dispc_runtime_suspend+0xbb8>
19c0: R_X86_64_PC32 .rodata+0x45c
19c4: e9 00 00 00 00 jmpq 19c9 <dispc_runtime_suspend+0xbbd>
19c5: R_X86_64_PC32 __x86_indirect_thunk_rdi-0x4
In this case the relocation addend needs to be adjusted accordingly in
order to find the location of the switch table.
The fix is for case 3 (as described in the comments), but also make the
existing case 1 & 2 checks more precise by only adjusting the addend for
R_X86_64_PC32 relocations.
This fixes the following warnings:
drivers/video/fbdev/omap2/omapfb/dss/dispc.o: warning: objtool: dispc_runtime_suspend()+0xbb8: sibling call from callable instruction with modified stack frame
drivers/video/fbdev/omap2/omapfb/dss/dispc.o: warning: objtool: dispc_runtime_resume()+0xcc5: sibling call from callable instruction with modified stack frame
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b6098294fd67afb69af8c47c9883d7a68bf0f8ea.1526305958.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Protection key 0 is the default key for all memory and will
not normally come back from pkey_alloc(). But, you might
still want pass it to mprotect_pkey().
This check ensures that you can use pkey 0.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171356.9E40B254@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This makes it possible to to tell what 'prot' a given allocation
is supposed to have. That way, if we want to change just the
pkey, we know what 'prot' to pass to mprotect_pkey().
Also, keep a record of the most recent allocation so the tests
can easily find it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171354.AA23E228@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We dump out the entire area of the siginfo where the si_pkey_ptr is
supposed to be. But, we do some math on the poitner, which is a u32.
We intended to do byte math, not u32 math on the pointer.
Cast it over to a u8* so it works.
Also, move this block of code to below th si_code check. It doesn't
hurt anything, but the si_pkey field is gibberish for other signal
types.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171352.9BE09819@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In our "exhaust all pkeys" test, we make sure that there
is the expected number available. Turns out that the
test did not cover the execute-only key, but discussed
it anyway. It did *not* discuss the test-allocated
key.
Now that we have a test for the mprotect(PROT_EXEC) case,
this off-by-one issue showed itself. Correct the off-by-
one and add the explanation for the case we missed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171350.E1656B95@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Under the covers, implement executable-only memory with
protection keys when userspace calls mprotect(PROT_EXEC).
But, we did not have a selftest for that. Now we do.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171348.9EEE4BEF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We currently have an execute-only test, but it is for
the explicit mprotect_pkey() interface. We will soon
add a test for the implicit mprotect(PROT_EXEC)
enterface. We need this code in both tests.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171347.C64AB733@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The exec-only pkey is allocated inside the kernel and userspace
is not told what it is. So, allow PK faults to occur that have
an unknown key.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171345.7FC7DA00@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
printf() and friends are unusable in signal handlers. They deadlock.
The pkey selftest does not do any normal printing in signal handlers,
only extra debugging. So, just print the format string so we get
*some* output when debugging.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171344.C53FD2F3@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>