We have now fixed the known bugs in STRICT_KERNEL_RWX for Book3S
64-bit Hash and Radix MMUs, see preceding commits, so allow the
option to be selected again.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-6-mpe@ellerman.id.au
When we enabled STRICT_KERNEL_RWX we received some reports of boot
failures when using the Hash MMU and running under phyp. The crashes
are intermittent, and often exhibit as a completely unresponsive
system, or possibly an oops.
One example, which was caught in xmon:
[ 14.068327][ T1] devtmpfs: mounted
[ 14.069302][ T1] Freeing unused kernel memory: 5568K
[ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
[ 14.142063][ T1] Run /sbin/init as init process
[ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
lr: c0000000001862d4: update_rq_clock+0x44/0x110
sp: c00000000c747880
msr: 8000000040001031
current = 0xc00000000c60d380
paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
pid = 347, comm = kworker/2:1
...
enter ? for help
[c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
[c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
[c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
[c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
[c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
[c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
[c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
[c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
[c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
[c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
[c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
This shows that CPU 2, which was idle, woke up and then appears to
randomly take an instruction fault on a completely valid area of
kernel text.
The cause turns out to be the call to hash__mark_rodata_ro(), late in
boot. Due to the way we layout text and rodata, that function actually
changes the permissions for all of text and rodata to read-only plus
execute.
To do the permission change we use a hypervisor call, H_PROTECT. On
phyp that appears to be implemented by briefly removing the mapping of
the kernel text, before putting it back with the updated permissions.
If any other CPU is executing during that window, it will see spurious
faults on the kernel text and/or data, leading to crashes.
To fix it we use stop machine to collect all other CPUs, and then have
them drop into real mode (MMU off), while we change the mapping. That
way they are unaffected by the mapping temporarily disappearing.
We don't see this bug on KVM because KVM always use VPM=1, where
faults are directed to the hypervisor, and the fault will be
serialised vs the h_protect() by HPTE_V_HVLOCK.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-5-mpe@ellerman.id.au
Pull the loop calling hpte_updateboltedpp() out of
hash__change_memory_range() into a helper function. We need it to be a
separate function for the next patch.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-4-mpe@ellerman.id.au
In hash__mark_rodata_ro() we pass the raw PP_RXXX value to
hash__change_memory_range(). That has the effect of setting the key to
zero, because PP_RXXX contains no key value.
Fix it by using htab_convert_pte_flags(), which knows how to convert a
pgprot into a pp value, including the key.
Fixes: d94b827e89 ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Link: https://lore.kernel.org/r/20210331003845.216246-3-mpe@ellerman.id.au
The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes
the key in bits 9-13, but currently we always set those bits to zero.
In the past that hasn't been a problem because we always used key 0
for the kernel, and updateboltedpp() is only used for kernel mappings.
However since commit d94b827e89 ("powerpc/book3s64/kuap: Use Key 3
for kernel mapping with hash translation") we are now inadvertently
changing the key (to zero) when we call plpar_pte_protect().
That hasn't broken anything because updateboltedpp() is only used for
STRICT_KERNEL_RWX, which is currently disabled on 64s due to other
bugs.
But we want to fix that, so first we need to pass the key correctly to
plpar_pte_protect(). We can't pass our newpp value directly in, we
have to convert it into the form expected by the hcall.
The hcall we're using here is H_PROTECT, which is specified in section
14.5.4.1.6 of LoPAPR v1.1.
It takes a `flags` parameter, and the description for flags says:
* flags: AVPN, pp0, pp1, pp2, key0-key4, n, and for the CMO
option: CMO Option flags as defined in Table 189‚
If you then go to the start of the parent section, 14.5.4.1, on page
405, it says:
Register Linkage (For hcall() tokens 0x04 - 0x18)
* On Call
* R3 function call token
* R4 flags (see Table 178‚ “Page Frame Table Access flags field
definition‚” on page 401)
Then you have to go to section 14.5.3, and on page 394 there is a list
of hcalls and their tokens (table 176), and there you can see that
H_PROTECT == 0x18.
Finally you can look at table 178, on page 401, where it specifies the
layout of the bits for the key:
Bit Function
-----------------
50-54 | key0-key4
Those are big-endian bit numbers, converting to normal bit numbers you
get bits 9-13, or 0x3e00.
In the kernel we have:
#define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
#define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
So the LO bits of newpp are already in the right place, and the HI
bits need to be shifted down by 48.
Fixes: d94b827e89 ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-2-mpe@ellerman.id.au
In the past we had a fallback definition for _PAGE_KERNEL_ROX, but we
removed that in commit d82fd29c5a ("powerpc/mm: Distribute platform
specific PAGE and PMD flags and definitions") and added definitions
for each MMU family.
However we missed adding a definition for 64s, which was not really a
bug because it's currently not used.
But we'd like to use PAGE_KERNEL_ROX in a future patch so add a
definition now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-1-mpe@ellerman.id.au
When adding a PTE a ptesync is needed to order the update of the PTE
with subsequent accesses otherwise a spurious fault may be raised.
radix__set_pte_at() does not do this for performance gains. For
non-kernel memory this is not an issue as any faults of this kind are
corrected by the page fault handler. For kernel memory these faults
are not handled. The current solution is that there is a ptesync in
flush_cache_vmap() which should be called when mapping from the
vmalloc region.
However, map_kernel_page() does not call flush_cache_vmap(). This is
troublesome in particular for code patching with Strict RWX on radix.
In do_patch_instruction() the page frame that contains the instruction
to be patched is mapped and then immediately patched. With no ordering
or synchronization between setting up the PTE and writing to the page
it is possible for faults.
As the code patching is done using __put_user_asm_goto() the resulting
fault is obscured - but using a normal store instead it can be seen:
BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
Faulting instruction address: 0xc00000000008bd74
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: nop_module(PO+) [last unloaded: nop_module]
CPU: 4 PID: 757 Comm: sh Tainted: P O 5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
NIP: c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
REGS: c000000016f634a0 TRAP: 0300 Tainted: P O (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44002884 XER: 00000000
CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
This results in the kind of issue reported here:
https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
Chris Riedl suggested a reliable way to reproduce the issue:
$ mount -t debugfs none /sys/kernel/debug
$ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done) &
Turning ftrace on and off does a large amount of code patching which
in usually less then 5min will crash giving a trace like:
ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
------------[ ftrace bug ]------------
ftrace failed to modify
[<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
actual: 11:3b:47:4b
Setting ftrace call site to call ftrace function
ftrace record flags: 80000001
(1)
expected tramp: c00000000006c96c
------------[ cut here ]------------
WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
Modules linked in: nop_module(PO-) [last unloaded: nop_module]
CPU: 4 PID: 809 Comm: sh Tainted: P O 5.10.0-rc5-01360-gf878ccaf250a #1
NIP: c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
REGS: c000000004c8b760 TRAP: 0700 Tainted: P O (5.10.0-rc5-01360-gf878ccaf250a)
MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28008848 XER: 20040000
CFAR: c0000000001a9c98 IRQMASK: 0
GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
NIP ftrace_bug+0x28c/0x2e8
LR ftrace_bug+0x288/0x2e8
Call Trace:
ftrace_bug+0x288/0x2e8 (unreliable)
ftrace_modify_all_code+0x168/0x210
arch_ftrace_update_code+0x18/0x30
ftrace_run_update_code+0x44/0xc0
ftrace_startup+0xf8/0x1c0
register_ftrace_function+0x4c/0xc0
function_trace_init+0x80/0xb0
tracing_set_tracer+0x2a4/0x4f0
tracing_set_trace_write+0xd4/0x130
vfs_write+0xf0/0x330
ksys_write+0x84/0x140
system_call_exception+0x14c/0x230
system_call_common+0xf0/0x27c
To fix this when updating kernel memory PTEs using ptesync.
Fixes: f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Tidy up change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210208032957.1232102-1-jniethe5@gmail.com
Various spelling/typo fixes.
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Convert powerpc to relative jump labels.
Before the patch, pseries_defconfig vmlinux.o has:
9074 __jump_table 0003f2a0 0000000000000000 0000000000000000 01321fa8 2**0
With the patch, the same config gets:
9074 __jump_table 0002a0e0 0000000000000000 0000000000000000 01321fb4 2**0
Size is 258720 without the patch, 172256 with the patch.
That's a 33% size reduction.
Largely copied from commit c296146c05 ("arm64/kernel: jump_label:
Switch to relative references")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/828348da7868eda953ce023994404dfc49603b64.1616514473.git.christophe.leroy@csgroup.eu
Implement Extended Berkeley Packet Filter on Powerpc 32
Test result with test_bpf module:
test_bpf: Summary: 378 PASSED, 0 FAILED, [354/366 JIT'ed]
Registers mapping:
[BPF_REG_0] = r11-r12
/* function arguments */
[BPF_REG_1] = r3-r4
[BPF_REG_2] = r5-r6
[BPF_REG_3] = r7-r8
[BPF_REG_4] = r9-r10
[BPF_REG_5] = r21-r22 (Args 9 and 10 come in via the stack)
/* non volatile registers */
[BPF_REG_6] = r23-r24
[BPF_REG_7] = r25-r26
[BPF_REG_8] = r27-r28
[BPF_REG_9] = r29-r30
/* frame pointer aka BPF_REG_10 */
[BPF_REG_FP] = r17-r18
/* eBPF jit internal registers */
[BPF_REG_AX] = r19-r20
[TMP_REG] = r31
As PPC32 doesn't have a redzone in the stack, a stack frame must always
be set in order to host at least the tail count counter.
The stack frame remains for tail calls, it is set by the first callee
and freed by the last callee.
r0 is used as temporary register as much as possible. It is referenced
directly in the code in order to avoid misusing it, because some
instructions interpret it as value 0 instead of register r0
(ex: addi, addis, stw, lwz, ...)
The following operations are not implemented:
case BPF_ALU64 | BPF_DIV | BPF_X: /* dst /= src */
case BPF_ALU64 | BPF_MOD | BPF_X: /* dst %= src */
case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src */
The following operations are only implemented for power of two constants:
case BPF_ALU64 | BPF_MOD | BPF_K: /* dst %= imm */
case BPF_ALU64 | BPF_DIV | BPF_K: /* dst /= imm */
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/61d8b149176ddf99e7d5cef0b6dc1598583ca202.1616430991.git.christophe.leroy@csgroup.eu
The following opcodes will be needed for the implementation
of eBPF for PPC32. Add them in asm/ppc-opcode.h
PPC_RAW_ADDE
PPC_RAW_ADDZE
PPC_RAW_ADDME
PPC_RAW_MFLR
PPC_RAW_ADDIC
PPC_RAW_ADDIC_DOT
PPC_RAW_SUBFC
PPC_RAW_SUBFE
PPC_RAW_SUBFIC
PPC_RAW_SUBFZE
PPC_RAW_ANDIS
PPC_RAW_NOR
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f7bd573a368edd78006f8a5af508c726e7ce1ed2.1616430991.git.christophe.leroy@csgroup.eu
In preparation of using user_access_begin/end in restore_user_regs(),
move the access_ok() inside the function.
It makes no difference as the behaviour on a failed access_ok() is
the same as on failed restore_user_regs().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c106eb2f37c3040f1fd38b40e50c670feb7cb835.1616151715.git.christophe.leroy@csgroup.eu
In the same spirit as commit f1cf4f93de ("powerpc/signal32: Remove
ifdefery in middle of if/else")
MSR_TM_ACTIVE() is always defined and returns always 0 when
CONFIG_PPC_TRANSACTIONAL_MEM is not selected, so the awful
ifdefery in the middle of an if/else can be removed.
Make 'msr_hi' a 'long long' to avoid build failure on PPC32
due to the 32 bits left shift.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a4b48b2f0be1ef13fc8e57452b7f8350da28d521.1616151715.git.christophe.leroy@csgroup.eu
Convention is to prefix functions with __unsafe_ instead of
suffixing it with _unsafe.
Rename save_user_regs_unsafe() and save_general_regs_unsafe()
accordingly, that is respectively __unsafe_save_general_regs() and
__unsafe_save_user_regs().
Suggested-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8cef43607e5b35a7fd0829dec812d88beb570df2.1616151715.git.christophe.leroy@csgroup.eu
Similarly to commit 5cf773fc8f37 ("powerpc/uaccess: Also perform
64 bits copies in unsafe_copy_to_user() on ppc32")
ppc32 has an efficiant 64 bits unsafe_get_user(), so also use it in
order to unroll loops more.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/308e65d9237a14e8c0e3b22919fcf0b5e5592608.1616151715.git.christophe.leroy@csgroup.eu
clang 11 and future GCC are supporting asm goto with outputs.
Use it to implement get_user in order to get better generated code.
Note that clang requires to set x in the default branch of
__get_user_size_goto() otherwise is compliant about x not being
initialised :puzzled:
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/403745b5aaa1b315bb4e8e46c1ba949e77eecec0.1615398265.git.christophe.leroy@csgroup.eu
Make get_user() do the access_ok() check then call __get_user().
Make put_user() do the access_ok() check then call __put_user().
Then embed __get_user_size() and __put_user_size() in
__get_user() and __put_user().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eebc554f6a81f570c46ea3551000ff5b886e4faa.1615398265.git.christophe.leroy@csgroup.eu
__get_user_bad() and __put_user_bad() are functions that are
declared but not defined, in order to make the link fail in
case they are called.
Nowadays, we have BUILD_BUG() and BUILD_BUG_ON() for that, and
they have the advantage to break the build earlier as it breaks
it at compile time instead of link time.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d7d839e994f49fae4ff7b70fac72bd951272436b.1615398265.git.christophe.leroy@csgroup.eu
Commit d02f6b7dab ("powerpc/uaccess: Evaluate macro arguments once,
before user access is allowed") changed the __chk_user_ptr()
argument from the passed ptr pointer to the locally
declared __gu_addr. But __gu_addr is locally defined as __user
so the check is pointless.
During kernel build __chk_user_ptr() voids and is only evaluated
during sparse checks so it should have been armless to leave the
original pointer check there.
Nevertheless, this check is indeed redundant with the assignment
above which casts the ptr pointer to the local __user __gu_addr.
In case of mismatch, sparse will detect it there, so the
__check_user_ptr() is not needed anywhere else than in access_ok().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/69f17d75046733b891ab2e668dbf464787cdf598.1615398265.git.christophe.leroy@csgroup.eu
__unsafe_put_user_goto() is just an intermediate layer to
__put_user_size_goto() without added value other than doing
the __user pointer type checking.
Do the __user pointer type checking in __put_user_size_goto()
and remove __unsafe_put_user_goto().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b6552149209aebd887a6977272b06a41256bdb9f.1615398265.git.christophe.leroy@csgroup.eu
Commit 6bfd93c32a ("powerpc: Fix incorrect might_sleep in
__get_user/__put_user on kernel addresses") added a check to not call
might_sleep() on kernel addresses. This was to enable the use of
__get_user() in the alignment exception handler for any address.
Then commit 95156f0051 ("lockdep, mm: fix might_fault() annotation")
added a check of the address space in might_fault(), based on
set_fs() logic. But this didn't solve the powerpc alignment exception
case as it didn't call set_fs(KERNEL_DS).
Nowadays, set_fs() is gone, previous patch fixed the alignment
exception handler and __get_user/__put_user are not supposed to be
used anymore to read kernel memory.
Therefore the is_kernel_addr() check has become useless and can be
removed.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e0a980a4dc7a2551183dd5cb30f46eafdbee390c.1615398265.git.christophe.leroy@csgroup.eu
In the old days, when we didn't have kernel userspace access
protection and had set_fs(), it was wise to use __get_user()
and friends to read kernel memory.
Nowadays, get_user() is granting userspace access and is exclusively
for userspace access.
In alignment exception handler, use probe_kernel_read_inst()
instead of __get_user_instr() for reading instructions in kernel.
This will allow to remove the is_kernel_addr() check in
__get/put_user() in a following patch.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d9ecbce00178484e66ca7adec2ff210058037704.1615398265.git.christophe.leroy@csgroup.eu
Powerpc is the only architecture having _inatomic variants of
__get_user() and __put_user() accessors. They were introduced
by commit e68c825bb0 ("[POWERPC] Add inatomic versions of __get_user
and __put_user").
Those variants expand to the _nosleep macros instead of expanding
to the _nocheck macros. The only difference between the _nocheck
and the _nosleep macros is the call to might_fault().
Since commit 662bbcb274 ("mm, sched: Allow uaccess in atomic with
pagefault_disable()"), __get/put_user() can be used in atomic parts
of the code, therefore __get/put_user_inatomic() have become useless.
Remove __get_user_inatomic() and __put_user_inatomic().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1e5c895669e8d54a7810b62dc61eb111f33c2c37.1615398265.git.christophe.leroy@csgroup.eu
This patch converts emulate_spe() to using user_access_begin
logic.
Since commit 662bbcb274 ("mm, sched: Allow uaccess in atomic with
pagefault_disable()"), might_fault() doesn't fire when called from
sections where pagefaults are disabled, which must be the case
when using _inatomic variants of __get_user and __put_user. So
the might_fault() in user_access_begin() is not a problem.
There was a verification of user_mode() together with the access_ok(),
but there is a second verification of user_mode() just after, that
leads to immediate return. The access_ok() is now part of the
user_access_begin which is called after that other user_mode()
verification, so no need to check user_mode() again.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c95a648fdf75992c9d88f3c73cc23e7537fcf2ad.1615555354.git.christophe.leroy@csgroup.eu
This reverts commit 675bceb097 ("powerpc/mm: Remove DEBUG_VM_PGTABLE support on powerpc")
All the related issues are fixed as of commit:
f14312e1ed ("mm/debug_vm_pgtable: avoid doing memory allocation with pgtable_t mapped.")
Hence re-enable it.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210318034855.74513-1-aneesh.kumar@linux.ibm.com
The vio bus is a fake bus, which we use on pseries LPARs (guests) to
discover devices provided by the hypervisor. There's no need or sense
in creating the vio bus on bare metal systems.
Which is why commit 4336b93378 ("powerpc/pseries: Make vio and
ibmebus initcalls pseries specific") made the initialisation of the
vio bus only happen in LPARs.
However as a result of that commit we now see errors at boot on bare
metal systems:
Driver 'hvc_console' was unable to register with bus_type 'vio' because the bus was not initialized.
Driver 'tpm_ibmvtpm' was unable to register with bus_type 'vio' because the bus was not initialized.
This happens because those drivers are built-in, and are calling
vio_register_driver(). It in turn calls driver_register() with a
reference to vio_bus_type, but we haven't registered vio_bus_type with
the driver core.
Fix it by also guarding vio_register_driver() with a check to see if
we are on pseries.
Fixes: 4336b93378 ("powerpc/pseries: Make vio and ibmebus initcalls pseries specific")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Link: https://lore.kernel.org/r/20210316010938.525657-1-mpe@ellerman.id.au
When compiling the powerpc with the SMP disabled, it shows the issue:
arch/powerpc/kernel/watchdog.c: In function ‘watchdog_smp_panic’:
arch/powerpc/kernel/watchdog.c:177:4: error: implicit declaration of function ‘smp_send_nmi_ipi’; did you mean ‘smp_send_stop’? [-Werror=implicit-function-declaration]
177 | smp_send_nmi_ipi(c, wd_lockup_ipi, 1000000);
| ^~~~~~~~~~~~~~~~
| smp_send_stop
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:273: arch/powerpc/kernel/watchdog.o] Error 1
make[1]: *** [scripts/Makefile.build:534: arch/powerpc/kernel] Error 2
make: *** [Makefile:1980: arch/powerpc] Error 2
make: *** Waiting for unfinished jobs....
We found that powerpc used ipi to implement hardlockup watchdog, so the
HAVE_HARDLOCKUP_DETECTOR_ARCH should depend on the SMP.
Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Huang <chenhuang5@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210327094900.938555-1-chenhuang5@huawei.com
One of the reasons that dlpar_cpu_offline can fail is when attempting to
offline the last online CPU of the kernel. This can be observed in a
pseries QEMU guest that has hotplugged CPUs. If the user offlines all
other CPUs of the guest, and a hotplugged CPU is now the last online
CPU, trying to reclaim it will fail.
The current error message in this situation returns rc with -EBUSY and a
generic explanation, e.g.:
pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
EBUSY can be caused by other conditions, such as cpu_hotplug_disable
being true. Throwing a more specific error message for this case,
instead of just "Failed to offline CPU", makes it clearer that the error
is in fact a known error situation instead of other generic/unknown
cause.
This patch adds a 'last online' check in dlpar_cpu_offline() to catch
the 'last online CPU' offline error, eturning a more informative error
message:
pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210323205056.52768-2-danielhb413@gmail.com
call_do_irq() and call_do_softirq() are simple enough to be
worth inlining.
Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.
This is inspired from S390 arch. Several other arches do more or
less the same. The way sparc arch does seems odd thought.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210320122227.345427-1-mpe@ellerman.id.au
Sparse warns:
warning: symbol 'rfi_flush' was not declared.
warning: symbol 'entry_flush' was not declared.
warning: symbol 'uaccess_flush' was not declared.
Define 'entry_flush' and 'uaccess_flush' as static because they are
not referenced outside the file. Include asm/security_features.h in
which 'rfi_flush' is declared.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: He Ying <heying24@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210316041148.29694-1-heying24@huawei.com
arch/powerpc/kernel/iommu.c:76:2-16: WARNING: NULL check before some freeing functions is not needed.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Fixes: 691602aab9 ("powerpc/iommu/debug: Add debugfs entries for IOMMU tables")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210318234441.GA63469@f8e20a472e81