On systems with large amount of memory, loading kdump kernel through
kexec_file_load syscall may fail with the below error:
"Failed to update fdt with linux,drconf-usable-memory property"
This happens because the size estimation for kdump kernel's FDT does
not account for the additional space needed to setup usable memory
properties. Fix it by accounting for the space needed to include
linux,usable-memory & linux,drconf-usable-memory properties while
estimating kdump kernel's FDT size.
Fixes: 6ecd0163d3 ("powerpc/kexec_file: Add appropriate regions for memory reserve map")
Cc: stable@vger.kernel.org # v5.9+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/161243826811.119001.14083048209224609814.stgit@hbathini
THP config results in compound pages. Make sure the kernel enables
the PageCompound() check with CONFIG_HUGETLB_PAGE disabled and
CONFIG_TRANSPARENT_HUGEPAGE enabled.
This makes sure we correctly flush the icache with THP pages.
flush_dcache_icache_page only matter for platforms that don't support
COHERENT_ICACHE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210203045812.234439-1-aneesh.kumar@linux.ibm.com
SLB faults should not be taken while the PACA save areas are live, all
memory accesses should be fetches from the kernel text, and access to
PACA and the current stack, before C code is called or any other
accesses are made.
All of these have pinned SLBs so will not take a SLB fault. Therefore
EXSLB is not be required.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210208063406.331655-1-npiggin@gmail.com
The amount of code executed with enabled user space access (unlocked
KUAP) should be minimal. However with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() calls into various
parts of the kernel, and may even end up replaying interrupts which in
turn may access user space and forget to restore the KUAP state.
The problem places are:
1. strncpy_from_user (and similar) which unlock KUAP and call
unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.
2. __unsafe_put_user_goto() which is called with unlocked KUAP.
eg:
WARNING: CPU: 30 PID: 1 at arch/powerpc/include/asm/book3s/64/kup.h:324 arch_local_irq_restore+0x160/0x190
NIP arch_local_irq_restore+0x160/0x190
LR lock_is_held_type+0x140/0x200
Call Trace:
0xc00000007f392ff8 (unreliable)
___might_sleep+0x180/0x320
__might_fault+0x50/0xe0
filldir64+0x2d0/0x5d0
call_filldir+0xc8/0x180
ext4_readdir+0x948/0xb40
iterate_dir+0x1ec/0x240
sys_getdents64+0x80/0x290
system_call_exception+0x160/0x280
system_call_common+0xf0/0x27c
Change __get_user_nocheck() to look at `do_allow` to decide whether to
skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.
Drop might_fault() in __unsafe_put_user_goto() as it is only called
from unsafe_put_user(), which already has KUAP unlocked.
Since keeping might_fault() is still desirable for debugging, add
calls to it in user_[read|write]_access_begin(). That also allows us
to drop the is_kernel_addr() test, because there should be no code
using user_[read|write]_access_begin() in order to access a kernel
address.
Fixes: de78a9c42a ("powerpc: Add a framework for Kernel Userspace Access Protection")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Combine with related patch from myself, merge change logs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210204121612.32721-1-aik@ozlabs.ru
Currently unsafe_put_user() expands to __put_user_goto(), which
expands to __put_user_nocheck_goto().
There are no other uses of __put_user_nocheck_goto(), and although
there are some other uses of __put_user_goto() those could just use
unsafe_put_user().
Every layer of indirection introduces the possibility that some code
is calling that layer, and makes keeping track of the required
semantics at each point more complicated.
So drop __put_user_goto(), and rename __put_user_nocheck_goto() to
__unsafe_put_user_goto(). The "nocheck" is implied by "unsafe".
Replace the few uses of __put_user_goto() with unsafe_put_user().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210208135717.2618798-1-mpe@ellerman.id.au
The allyesconfig ppc64 kernel fails to link with relocations unable to
fit after commit 3a96570ffc ("powerpc: convert interrupt handlers to
use wrappers"), which is due to the interrupt handler functions being
put into the .noinstr.text section, which the linker script places on
the opposite side of the main .text section from the interrupt entry
asm code which calls the handlers.
This results in a lot of linker stubs that overwhelm the 252-byte sized
space we allow for them, or in the case of BE a .opd relocation link
error for some reason.
It's not required to put interrupt handlers in the .noinstr section,
previously they used NOKPROBE_SYMBOL, so take them out and replace
with a NOKPROBE_SYMBOL in the wrapper macro. Remove the explicit
NOKPROBE_SYMBOL macros in the interrupt handler functions. This makes
a number of interrupt handlers nokprobe that were not prior to the
interrupt wrappers commit, but since that commit they were made
nokprobe due to being in .noinstr.text, so this fix does not change
that.
The fixes tag is different to the commit that first exposes the problem
because it is where the wrapper macros were introduced.
Fixes: 8d41fc618a ("powerpc: interrupt handler wrapper functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Slightly fix up comment wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210211063636.236420-1-npiggin@gmail.com
Function names should tell what the function does, not how.
mfsrin() and mtsrin() are read/writing segment registers.
They are called that way because they are using mfsrin and mtsrin
instructions, but it doesn't matter for the caller.
In preparation of following patch, change their name to mfsr() and mtsr()
in order to make it obvious they manipulate segment registers without
messing up with how they do it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f92d99f4349391b77766745900231aa880a0efb5.1612612022.git.christophe.leroy@csgroup.eu
barrier_nospec() in uaccess helpers is there to protect against
speculative accesses around access_ok().
When using user_access_begin() sequences together with
unsafe_get_user() like macros, barrier_nospec() is called for
every single read although we know the access_ok() is done
onece.
Since all user accesses must be granted by a call to either
allow_read_from_user() or allow_read_write_user() which will
always happen after the access_ok() check, move the barrier_nospec()
there.
Reported-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c72f014730823b413528e90ab6c4d3bcb79f8497.1612692067.git.christophe.leroy@csgroup.eu
Similarly to the x86 commit b13b1d2d86 ("x86/mm: In the PTE swapout
page reclaim case clear the accessed bit instead of flushing the TLB"),
implement ptep_clear_flush_young that does not actually flush the TLB
in the case the referenced bit is cleared.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-8-npiggin@gmail.com
Currently Monitor Mode Control Registers and Sampling registers are
part of extended regs. Patch adds support to include Performance Monitor
Counter Registers (PMC1 to PMC6 ) as part of extended registers.
PMCs are saved in the perf interrupt handler as part of
per-cpu array 'pmcs' in struct cpu_hw_events. While capturing
the register values for extended regs, fetch these saved PMC values.
Simplified the PERF_REG_PMU_MASK_300/31 definition to include PMU
SPRs MMCR0 to PMC6. Exclude the unsupported SPRs (MMCR3, SIER2, SIER3)
from extended mask value for CPU_FTR_ARCH_300 in the new definition.
PERF_REG_EXTENDED_MAX is used to check if any index beyond the extended
registers is requested in the sample. Have one PERF_REG_EXTENDED_MAX
for CPU_FTR_ARCH_300/CPU_FTR_ARCH_31 since perf_reg_validate function
already checks the extended mask for the presence of any unsupported
register.
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1612335337-1888-3-git-send-email-atrajeev@linux.vnet.ibm.com
This removes arch_supports_pkeys(), arch_usable_pkeys() and
thread_pkey_regs_*() which are remnants from the following:
commit 06bb53b338 ("powerpc: store and restore the pkey state across context switches")
commit 2cd4bd192e ("powerpc/pkeys: Fix handling of pkey state across fork()")
commit cf43d3b264 ("powerpc: Enable pkey subsystem")
arch_supports_pkeys() and arch_usable_pkeys() were unused
since their introduction while thread_pkey_regs_*() became
unused after the introduction of the following:
commit d5fa30e699 ("powerpc/book3s64/pkeys: Reset userspace AMR correctly on exec")
commit 48a8ab4eeb ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210202150050.75335-1-sandipan@linux.ibm.com
Saving and restoring soft-mask state can now be done in C using the
interrupt handler wrapper functions.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-41-npiggin@gmail.com
This moves the common NMI entry and exit code into the interrupt handler
wrappers.
This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and
also MCE interrupts on 64e, by adding missing parts of the NMI entry to
them.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-40-npiggin@gmail.com
The interrupt handler wrapper functions are not the ideal place to
maintain context tracking because after they return, the low level exit
code must then determine if there are interrupts to replay, or if the
task should be preempted, etc. Those paths (e.g., schedule_user) include
their own exception_enter/exit pairs to fix this up but it's a bit hacky
(see schedule_user() comments).
Ideally context tracking will go to user mode only when there are no
more interrupts or context switches or other exit processing work to
handle.
64e can not do this because it does not use the C interrupt exit code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-36-npiggin@gmail.com
Previously context tracking was not done for asynchronous interrupts,
(those that run in interrupt context), and if those would cause a
reschedule when they exit, then scheduling functions (schedule_user,
preempt_schedule_irq) call exception_enter/exit to fix this up and
exit user context.
This is a hack we would like to get away from, so do context tracking
for asynchronous interrupts too.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-34-npiggin@gmail.com
This moves exception_enter/exit calls to wrapper functions for
synchronous interrupts. More interrupt handlers are covered by
this than previously.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-33-npiggin@gmail.com
This moves the 64s/hash context tracking from hash_page_mm() to
__do_hash_fault(), so it's no longer called by OCXL / SPU
accelerators, which was certainly the wrong thing to be doing,
because those callers are not low level interrupt handlers, so
should have entered a kernel context tracking already.
Then remain in kernel context for the duration of the fault,
rather than enter/exit for the hash fault then enter/exit for
the page fault, which is pointless.
Even still, calling exception_enter/exit in __do_hash_fault seems
questionable because that's touching per-cpu variables, tracing,
etc., which might have been interrupted by this hash fault or
themselves cause hash faults. But maybe I miss something because
hash_page_mm very deliberately calls trace_hash_fault too, for
example. So for now go with it, it's no worse than before, in this
regard.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-32-npiggin@gmail.com
Add context tracking to the system call handler explicitly, and remove
_TIF_NOHZ.
This improves system call performance when nohz_full is enabled. On a
POWER9, gettid scv system call cost on a nohz_full CPU improves from
1129 cycles to 1004 cycles and on a housekeeping CPU from 550 cycles
to 430 cycles.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-31-npiggin@gmail.com
Simple helper for synchronous interrupt handlers (i.e., process-context)
to enable interrupts if it was taken in an interrupts-enabled context.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-30-npiggin@gmail.com
Add wrapper functions (derived from x86 macros) for interrupt handler
functions. This allows interrupt entry code to be written in C.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-27-npiggin@gmail.com
As explained by commit daf00ae71d ("powerpc/traps: restore
recoverability of machine_check interrupts"), die() can't be called from
within nmi_enter to nicely kill a process context that was interrupted.
nmi_exit must be called first.
This adds a function die_mce which takes care of this for machine check
handlers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-24-npiggin@gmail.com
This is currently the same as unknown_exception, but it will diverge
after interrupt wrappers are added and code moved out of asm into the
wrappers (e.g., async handlers will check FINISH_NAP).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-22-npiggin@gmail.com
Interrupt handler prototypes are going to be rearranged in a
future patch, so tidy this out of the way first.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-21-npiggin@gmail.com
This function acts like an interrupt handler so it needs to follow
the standard interrupt handler function signature which will be
introduced in a future change.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-13-npiggin@gmail.com
Similar to the previous patch this makes interrupt handler function
types more regular so they can be wrapped with the next patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-12-npiggin@gmail.com
Similar to the previous patch this makes interrupt handler function
types more regular so they can be wrapped with the next patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-9-npiggin@gmail.com
Make mm fault handlers all just take the pt_regs * argument and load
DAR/DSISR from that. Make those that return a value return long.
This is done to make the function signatures match other handlers, which
will help with a future patch to add wrappers. Explicit arguments could
be added for performance but that would require more wrapper macro
variants.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-7-npiggin@gmail.com
The fault handling still has some complex logic particularly around
hash table handling, in asm. Implement most of this in C.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-6-npiggin@gmail.com
On many powerpc platforms the discovery and initalisation of
pci_controllers (PHBs) happens inside of setup_arch(). This is very early
in boot (pre-initcalls) and means that we're initialising the PHB long
before many basic kernel services (slab allocator, debugfs, a real ioremap)
are available.
On PowerNV this causes an additional problem since we map the PHB registers
with ioremap(). As of commit d538aadc27 ("powerpc/ioremap: warn on early
use of ioremap()") a warning is printed because we're using the "incorrect"
API to setup and MMIO mapping in searly boot. The kernel does provide
early_ioremap(), but that is not intended to create long-lived MMIO
mappings and a seperate warning is printed by generic code if
early_ioremap() mappings are "leaked."
This is all fixable with dumb hacks like using early_ioremap() to setup
the initial mapping then replacing it with a real ioremap later on in
boot, but it does raise the question: Why the hell are we setting up the
PHB's this early in boot?
The old and wise claim it's due to "hysterical rasins." Aside from amused
grapes there doesn't appear to be any real reason to maintain the current
behaviour. Already most of the newer embedded platforms perform PHB
discovery in an arch_initcall and between the end of setup_arch() and the
start of initcalls none of the generic kernel code does anything PCI
related. On powerpc scanning PHBs occurs in a subsys_initcall so it should
be possible to move the PHB discovery to a core, postcore or arch initcall.
This patch adds the ppc_md.discover_phbs hook and a core_initcall stub that
calls it. The core_initcalls are the earliest to be called so this will
any possibly issues with dependency between initcalls. This isn't just an
academic issue either since on pseries and PowerNV EEH init occurs in an
arch_initcall and depends on the pci_controllers being available, similarly
the creation of pci_dns occurs at core_initcall_sync (i.e. between core and
postcore initcalls). These problems need to be addressed seperately.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Make discover_phbs() static]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103043523.916109-1-oohall@gmail.com
./arch/powerpc/include/asm/paravirt.h:83:44: error: implicit declaration
of function 'smp_processor_id'; did you mean 'raw_smp_processor_id'?
smp_processor_id is defined in linux/smp.h but it is not included.
The build error happens only when the patch is applied to 5.3 kernel but
it only works by chance in mainline.
Fixes: ca3f969dcb ("powerpc/paravirt: Use is_kvm_guest() in vcpu_is_preempted()")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210120132838.15589-1-msuchanek@suse.de
Access to per-cpu variables requires translation to be enabled on
pseries machine running in hash mmu mode, Since part of MCE handler
runs in realmode and part of MCE handling code is shared between ppc
architectures pseries and powernv, it becomes difficult to manage
these variables differently on different architectures, So have
these variables in paca instead of having them as per-cpu variables
to avoid complications.
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210128104143.70668-2-ganeshgr@linux.ibm.com
Maximum recursive depth of MCE is 4, Considering the maximum depth
allowed reduce the size of event to 10 from 100. This saves us ~19kB
of memory and has no fatal consequences.
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210128104143.70668-1-ganeshgr@linux.ibm.com
soft_nmi_interrupt() usage requires PPC_WATCHDOG to be configured.
Check the CONFIG definition to declare the prototype.
It fixes this W=1 compile error :
../arch/powerpc/kernel/watchdog.c:250:6: error: no previous prototype for ‘soft_nmi_interrupt’ [-Werror=missing-prototypes]
250 | void soft_nmi_interrupt(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-18-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/slb.c:380:6: error: no previous prototype for ‘preload_new_slb_context’ [-Werror=missing-prototypes]
380 | void preload_new_slb_context(unsigned long start, unsigned long sp)
| ^~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-15-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/hash_utils.c:1867:6: error: no previous prototype for ‘hpte_insert_repeating’ [-Werror=missing-prototypes]
1867 | long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
| ^~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-14-clg@kaod.org