2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-17 10:04:14 +08:00
Commit Graph

20843 Commits

Author SHA1 Message Date
Hari Bathini
742a265acc powerpc/fadump: register kernel metadata address with opal
OPAL allows registering address with it in the first kernel and
retrieving it after MPIPL. Setup kernel metadata and register its
address with OPAL to use it for processing the crash dump.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821345011.5656.13567765019032928471.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
6abec12c65 powerpc/fadump: improve fadump_reserve_mem()
Some code clean-up like using minimal assignments and updating printk
messages. Also, add an 'error_out' label for handling error cleanup
at one place.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821343485.5656.10202857091553646948.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
41df592872 powerpc/fadump: add fadump support on powernv
Add basic callback functions for FADump on PowerNV platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821342072.5656.4346362203141486452.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
6f5f193e84 powerpc/opal: add MPIPL interface definitions
MPIPL is Memory Preserving IPL supported from POWER9. This enables the
kernel to reset the system with memory 'preserved'. Also, it supports
copying memory from a source address to some destination address during
MPIPL boot. Add MPIPL interface definitions here to leverage these f/w
features in adding FADump support for PowerNV platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821340710.5656.10071829040515662624.stgit@hbathini.in.ibm.com
2019-09-14 00:04:43 +10:00
Hari Bathini
f35120115b pseries/fadump: move out platform specific support from generic code
Move code that supports processing the crash'ed kernel's memory
preserved by firmware to platform specific callback functions.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821337690.5656.13050665924800177744.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
8255da95e5 powerpc/fadump: release all the memory above boot memory size
Except for Reserved dump area (see Documentation/powerpc/firmware-
assisted-dump.rst) which is permanent reserved, all memory above boot
memory size, where boot memory size is the memory required for the
kernel to boot successfully when booted with restricted memory (memory
for capture kernel), is released when the dump is invalidated. Make
this a bit more explicit in the code.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821336092.5656.1079046285366041687.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
109f25cc5f powerpc/fadump: add source info while displaying region contents
Improve how fadump_region contents are displayed by adding source
information of memory regions that are to be dumped by f/w.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821334740.5656.5897097884010195405.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
41a65d1618 pseries/fadump: define RTAS register/un-register callback functions
Move platform specific register/un-register code, the RTAS calls, to
register/un-register callback functions. This would also mean moving
code that initializes and prints the platform specific FADump data.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821332856.5656.16380417702046411631.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
d3833a7010 powerpc/fadump: introduce callbacks for platform specific operations
Introduce callback functions for platform specific operations like
register, unregister, invalidate & such. Also, define place-holders
for the same on pSeries platform.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821330286.5656.15538934400074110770.stgit@hbathini.in.ibm.com
2019-09-14 00:04:42 +10:00
Hari Bathini
0226e55275 powerpc/fadump: move rtas specific definitions to platform code
Currently, FADump is only supported on pSeries but that is going to
change soon with FADump support being added on PowerNV platform. So,
move rtas specific definitions to platform code to allow FADump
to have multiple platforms support.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821328494.5656.16219929140866195511.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
72aa651795 powerpc/fadump: use helper functions to reserve/release cpu notes buffer
Use helper functions to simplify memory allocation, pinning down and
freeing the memory used for CPU notes buffer.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821323555.5656.2486038022572739622.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
7f0ad11d3f powerpc/fadump: declare helper functions in internal header file
Declare helper functions, that can be reused by multiple platforms,
in the internal header file.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821320487.5656.2660730464212209984.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
961cf26a98 powerpc/fadump: add helper functions
Add helper functions to setup & free CPU notes buffer and to find if a
given memory area is contiguous. Also, use boolean as return type for
the function that finds if boot memory area is contiguous. While at
it, save the virtual address of CPU notes buffer instead of physical
address as virtual address is used often.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821318971.5656.9281936950510635858.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Hari Bathini
ca986d7fa7 powerpc/fadump: move internal macros/definitions to a new header
Though asm/fadump.h is meant to be used by other components dealing
with FADump, it also has macros/definitions internal to FADump code.
Move them to a new header file used within FADump code. This also
makes way for refactoring platform specific FADump code.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821313134.5656.6597770626574392140.stgit@hbathini.in.ibm.com
2019-09-14 00:04:41 +10:00
Masahiro Yamada
1fdfa4c6af powerpc: improve prom_init_check rule
This slightly improves the prom_init_check rule.

[1] Avoid needless check

Currently, prom_init_check.sh is invoked every time you run 'make'
even if you have changed nothing in prom_init.c. With this commit,
the script is re-run only when prom_init.o is recompiled.

[2] Beautify the build log

Currently, the O= build shows the absolute path to the script:

  CALL    /abs/path/to/source/of/linux/arch/powerpc/kernel/prom_init_check.sh

With this commit, it is always a relative path to the timestamp file:

  PROMCHK arch/powerpc/kernel/prom_init_check

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190912074037.13813-1-yamada.masahiro@socionext.com
2019-09-14 00:04:41 +10:00
Michael Ellerman
caff52dc0b powerpc/kvm: Add ifdefs around template code
Some of the templates used for KVM patching are only used on certain
platforms, but currently they are always built-in, fix that.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-4-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Michael Ellerman
731dade128 powerpc/kvm: Explicitly mark kvm guest code as __init
All the code in kvm.c can be marked __init. Most of it is already
inlined into the initcall, but not all. So instead of relying on the
inlining, mark it all as __init. This saves ~280 bytes of text for my
configuration.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-3-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Michael Ellerman
dac39f7885 powerpc/64s: Remove overlaps_kvm_tmp()
kvm_tmp is now in .text and so doesn't need a special overlap check.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-2-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Michael Ellerman
0cb0837f9d powerpc/kvm: Move kvm_tmp into .text, shrink to 64K
In some configurations of KVM, guests binary patch themselves to
avoid/reduce trapping into the hypervisor. For some instructions this
requires replacing one instruction with a sequence of instructions.

For those cases we need to write the sequence of instructions
somewhere and then patch the location of the original instruction to
branch to the sequence. That requires that the location of the
sequence be within 32MB of the original instruction.

The current solution for this is that we create a 1MB array in BSS,
write sequences into there, and then free the remainder of the array.

This has a few problems:
 - it confuses kmemleak.
 - it confuses lockdep.
 - it requires mapping kvm_tmp executable, which can cause adjacent
   areas to also be mapped executable if we're using 16M pages for the
   linear mapping.
 - the 32MB limit can be exceeded if the kernel is big enough,
   especially with STRICT_KERNEL_RWX enabled, which then prevents the
   patching from working at all.

We can fix all those problems by making kvm_tmp just a region of
regular .text. However currently it's 1MB in size, and we don't want
to waste 1MB of text. In practice however I only see ~30KB of kvm_tmp
being used even for an allyes_config. So shrink kvm_tmp to 64K, which
ought to be enough for everyone, and move it into .text.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190911115746.12433-1-mpe@ellerman.id.au
2019-09-14 00:04:40 +10:00
Michael Ellerman
79cb687913 powerpc/powernv: Fix build with IOMMU_API=n
The builds breaks when IOMMU_API=n, eg. skiroot_defconfig:

  arch/powerpc/platforms/powernv/npu-dma.c:96:28: error: 'get_gpu_pci_dev_and_pe' defined but not used
  arch/powerpc/platforms/powernv/npu-dma.c:126:13: error: 'pnv_npu_set_window' defined but not used

Fixes: b4d37a7b69 ("powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-09-14 00:03:00 +10:00
Michael Ellerman
1b7f3b6c43 powerpc/eeh: Fix build with STACKTRACE=n
The build breaks when STACKTRACE=n, eg. skiroot_defconfig:

  arch/powerpc/kernel/eeh_event.c:124:23: error: implicit declaration of function 'stack_trace_save'

Fix it with some ifdefs for now.

Fixes: 25baf3d816 ("powerpc/eeh: Defer printing stack trace")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-09-14 00:01:14 +10:00
Greg Kurz
6ccb4ac2bf powerpc/xive: Fix bogus error code returned by OPAL
There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call
to return the 32-bit value 0xffffffff when OPAL has run out of IRQs.
Unfortunatelty, OPAL return values are signed 64-bit entities and
errors are supposed to be negative. If that happens, the linux code
confusingly treats 0xffffffff as a valid IRQ number and panics at some
point.

A fix was recently merged in skiboot:

e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")

but we need a workaround anyway to support older skiboots already
in the field.

Internally convert 0xffffffff to OPAL_RESOURCE which is the usual error
returned upon resource exhaustion.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/156821713818.1985334.14123187368108582810.stgit@bahia.lan
2019-09-12 09:27:05 +10:00
Nathan Lynch
92c94dfb69 powerpc/pseries: correctly track irq state in default idle
prep_irq_for_idle() is intended to be called before entering
H_CEDE (and it is used by the pseries cpuidle driver). However the
default pseries idle routine does not call it, leading to mismanaged
lazy irq state when the cpuidle driver isn't in use. Manifestations of
this include:

* Dropped IPIs in the time immediately after a cpu comes
  online (before it has installed the cpuidle handler), making the
  online operation block indefinitely waiting for the new cpu to
  respond.

* Hitting this WARN_ON in arch_local_irq_restore():
	/*
	 * We should already be hard disabled here. We had bugs
	 * where that wasn't the case so let's dbl check it and
	 * warn if we are wrong. Only do that when IRQ tracing
	 * is enabled as mfmsr() can be costly.
	 */
	if (WARN_ON_ONCE(mfmsr() & MSR_EE))
		__hard_irq_disable();

Call prep_irq_for_idle() from pseries_lpar_idle() and honor its
result.

Fixes: 363edbe261 ("powerpc: Default arch idle could cede processor on pseries")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190910225244.25056-1-nathanl@linux.ibm.com
2019-09-12 09:27:04 +10:00
Ravi Bangoria
bc01bdf6c5 powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions
If watchpoint exception is generated by larx/stcx instructions, the
reservation created by larx gets lost while handling exception, and
thus stcx instruction always fails. Generally these instructions are
used in a while(1) loop, for example spinlocks. And because stcx
never succeeds, it loops forever and ultimately hangs the system.

Note that ptrace anyway works in one-shot mode and thus for ptrace
we don't change the behaviour. It's up to ptrace user to take care
of this.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190910131513.30499-1-ravi.bangoria@linux.ibm.com
2019-09-12 09:27:00 +10:00
Vasant Hegde
587164cd59 powerpc/powernv: Add new opal message type
We have OPAL_MSG_PRD message type to pass prd related messages from
OPAL to `opal-prd`. It can handle messages upto 64 bytes. We have a
requirement to send bigger than 64 bytes of data from OPAL to
`opal-prd`. Lets add new message type (OPAL_MSG_PRD2) to pass bigger
data.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Make the error string clear that it's the PRD2 event that failed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190826065701.8853-2-hegdevasant@linux.vnet.ibm.com
2019-09-12 09:27:00 +10:00
Vasant Hegde
2be1d5d147 powerpc/powernv: Enhance opal message read interface
Use "opal-msg-size" device tree property to allocate memory for
"opal_msg".

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: s/uint32_t/u32/ and mark opal_msg_size as __ro_after_init]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190826065701.8853-1-hegdevasant@linux.vnet.ibm.com
2019-09-12 09:27:00 +10:00
Christoph Hellwig
b4d37a7b69 powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function
Neither pnv_npu_try_dma_set_bypass() nor the pnv_npu_dma_set_32() and
pnv_npu_dma_set_bypass() helpers called by it are used anywhere in the
kernel tree, so remove them.

mpe: They're unused since 2d6ad41b2c ("powerpc/powernv: use the
generic iommu bypass code") removed the last usage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903165147.11099-1-hch@lst.de
2019-09-12 09:27:00 +10:00
Santosh Sivaraj
20055a8bfa powerpc/memcpy: Fix stack corruption for smaller sizes
For sizes lesser than 128 bytes, the code branches out early without saving
the stack frame, which when restored later drops frame of the caller.

Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903214359.23887-1-santosh@fossix.org
2019-09-12 09:27:00 +10:00
Segher Boessenkool
aa497d4352 powerpc: Add attributes for setjmp/longjmp
The setjmp function should be declared as "returns_twice", or bad
things can happen[1]. This does not actually change generated code in
my testing.

The longjmp function should be declared as "noreturn", so that the
compiler can optimise calls to it better. This makes the generated
code a little shorter.

1: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-returns_005ftwice-function-attribute

Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c02ce4a573f3bac907e2c70957a2d1275f910013.1567605586.git.segher@kernel.crashing.org
2019-09-12 09:26:59 +10:00
Paolo Bonzini
32d1d15c52 KVM/arm updates for 5.4
- New ITS translation cache
 - Allow up to 512 CPUs to be supported with GICv3 (for real this time)
 - Now call kvm_arch_vcpu_blocking early in the blocking sequence
 - Tidy-up device mappings in S2 when DIC is available
 - Clean icache invalidation on VMID rollover
 - General cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl12QlAPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDXxUQAMd+GlOlmTXqEiuKudVApTkl6WIebfh0vkn6
 /1j8yNgJqRtZEY/YqE/XhAaqz1tx88VtzqSrNG4Pmrl9rDHMD9mDuk+w5UvEN2vy
 D5/nEe/wnzyVpuROBlHhsRbCRkT/6dNpnDnydwxCUqQPhfsAHnTNx6IygVzH9BHS
 D/1+KLI1imW8YziSSf6SGlIKJtk0eo5qo/aT6/mhb+e18Dobax3miItZL4mAqFPd
 tCV8fvOLb/phdSmOZuD/3XF9JOodk2ycvF9MW9Rp/FxDx9HULCXPv/3KnoHg9ca5
 QSGz1Chj0C2avaQJ4GbHZnZZjdvL2TmVxMpixocc/VZCqlO3ifRKf91t/rq4cElG
 HxLE9AX6kqW6UK66RHUQiHxjqRG8ynz8xEmlhwd7YhCLmtmJSXLTrmc2ABf64+BT
 RaexRa3h6D19fLBcMN5gpP8I48XaRpfxg6E/jCw5ZEr/8zhzLajFnE89ftgRR04f
 bSXOnj0kAhrBZ6jRTEata1MrFAt58wiaulxTxgMlnj1hHpqA3b+x6woRECAEVOlc
 6JJuzReJSBuCJL/rVtXGF31mXNnqUo+oTcDpQSle/fDtQ/44+xlYj6V/ZeFIRHAz
 nwUw9DHyZ/JMSwPNsqtdzCnLths1rNw34A7VgdVWiqiPYEcGGUnMzkRrXKMYjjJn
 LD4+Rh/e
 =0dD/
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm updates for 5.4

- New ITS translation cache
- Allow up to 512 CPUs to be supported with GICv3 (for real this time)
- Now call kvm_arch_vcpu_blocking early in the blocking sequence
- Tidy-up device mappings in S2 when DIC is available
- Clean icache invalidation on VMID rollover
- General cleanup
2019-09-10 19:09:14 +02:00
Paolo Bonzini
8146856b0a PPC KVM update for 5.4
- Some prep for extending the uses of the rmap array
 - Various minor fixes
 - Commits from the powerpc topic/ppc-kvm branch, which fix a problem
   with interrupts arriving after free_irq, causing host hangs and crashes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJdZwd7AAoJEJ2a6ncsY3GffDQH/2q+c2z56ZO2lzfk4Hy9piWn
 Z9PR9n72Z6TiMyVCl7CtLCyI+lRy3QVZnol14ugQNX4aFJiiwDGRHJF0wNxjeok4
 4DAIqBc60qD2dkp1LwtUM1YsLsr/n3tdrGU1b0VrHGoGTVhJDpbjhJsblXZ1ujGr
 KxQ1Uf4XsW5T7kovHuzj+FFlbB5nbEX5cBIU68maBGZSCl355wCOW35rKVITTIIv
 +VKkO2aNbk6bRmZmOi2v1D65eQa2+TKe/o48TneJv1WhL4h4hDyHdmVeWRNoAI6C
 ve8mwCAVs7IITjCJ1qcGnI8NzVxMlXgwVir7sQ1aslRLZfeRAm5FOIPNEz1ADXs=
 =3oLd
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM update for 5.4

- Some prep for extending the uses of the rmap array
- Various minor fixes
- Commits from the powerpc topic/ppc-kvm branch, which fix a problem
  with interrupts arriving after free_irq, causing host hangs and crashes.
2019-09-10 16:51:17 +02:00
Christoph Hellwig
7b86ac3371 pagewalk: separate function pointers from iterator data
The mm_walk structure currently mixed data and code.  Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Christoph Hellwig
a520110e4a mm: split out a new pagewalk.h header from mm.h
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Sven Schnelle
175fca3bf9 kexec: add KEXEC_ELF
Right now powerpc provides an implementation to read elf files
with the kexec_file_load() syscall. Make that available as a public
kexec interface so it can be re-used on other architectures.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Linus Torvalds
13da6ac106 powerpc fixes for 5.3 #5
One fix for a boot hang on some Freescale machines when PREEMPT is enabled.
 
 Two CVE fixes for bugs in our handling of FP registers and transactional memory,
 both of which can result in corrupted FP state, or FP state leaking between
 processes.
 
 Thanks to:
   Chris Packham, Christophe Leroy, Gustavo Romero, Michael Neuling.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl1x06oTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgCZzD/90EyaWJVS8WPZopoIdnuOfB/F7EZFY
 Lhgd640S1p4o8BUZaQ1T19JOzp6HlO38myOptBufY0BsIJW0M2GwngnBPzSPW8r7
 ImTTf5cU0CDe2m3OJdfBrVpnGmUsmoWxwrsFJZ9wbsXhCwbbUzOUuxD/B9wBIGi/
 sPpTlaYZBhu3cKs9EWPKAODJhtEf55Q1c62gftfj8Y5u8uxQGinYInCghAUr+3Zv
 uCw1CSxOV7yGxfgc1sbOptidOiG4Pljw4EDCUFLpjWTYgPVERASbPHs3C4xuAHGq
 IYuNDUJbwrxMU9BKLFzvL4MKWa5XtzLE34oY8SuyyVAbIQTszgCn2rIwlJXH88PO
 UtId9accmS+dy2lRI+90dC0qeTgUUIZXS1NF0cl5YNRN0TlMyjHL2/sRxCZF2svF
 EaGNjTQLAsfX0ccO9xQr8+KBSfFURMEkO8QQAR0lzJmIgbvSuzfjlZpbcYd2Nqfe
 EiYU4GeAQSn14vi0ZMdRWxc1rki9pPhGkrUwToDALsiEedRB03olM955uecf7fra
 S8MzHFBYh8Apd/lsAj53uAbL2rIHDJ5+6/eezYp7bRbo6FlvWDs9kmYTX3p3ixq1
 Q4gDHfbwnWxxhjUBri5QNZF9YHgkyGPURGpIbdXk9R4Hc7ihQWwDBcSrueca51Ug
 m97SLF5/+yWx0A==
 =C+wa
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for a boot hang on some Freescale machines when PREEMPT is
  enabled.

  Two CVE fixes for bugs in our handling of FP registers and
  transactional memory, both of which can result in corrupted FP state,
  or FP state leaking between processes.

  Thanks to: Chris Packham, Christophe Leroy, Gustavo Romero, Michael
  Neuling"

* tag 'powerpc-5.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts
  powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction
  powerpc/64e: Drop stale call to smp_processor_id() which hangs SMP startup
2019-09-06 08:54:45 -07:00
Jordan Niethe
67c87892e2 powerpc: Remove empty comment
Commit 2874c5fd28 ("treewide: Replace GPLv2 boilerplate/reference with
SPDX - rule 152") left an empty comment in machdep.h, as the boilerplate
was the only text in the comment. Remove the empty comment.

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813051212.6387-1-jniethe5@gmail.com
2019-09-05 14:22:41 +10:00
Madhavan Srinivasan
41ba17f20e powerpc/imc: Dont create debugfs files for cpu-less nodes
Commit <684d984038aa> ('powerpc/powernv: Add debugfs interface for
imc-mode and imc') added debugfs interface for the nest imc pmu
devices to support changing of different ucode modes. Primarily adding
this capability for debug. But when doing so, the code did not
consider the case of cpu-less nodes. So when reading the _cmd_ or
_mode_ file of a cpu-less node will create this crash.

  Faulting instruction address: 0xc0000000000d0d58
  Oops: Kernel access of bad area, sig: 11 [#1]
  ...
  CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19
  NIP:  c0000000000d0d58 LR: c00000000049aa18 CTR:c0000000000d0d50
  REGS: c00020194548f9e0 TRAP: 0300   Not tainted  (5.2.0-rc6-next-20190627+)
  MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR:28022822  XER: 00000000
  CFAR: c00000000049aa14 DAR: 000000000003fc08 DSISR:40000000 IRQMASK: 0
  ...
  NIP imc_mem_get+0x8/0x20
  LR  simple_attr_read+0x118/0x170
  Call Trace:
    simple_attr_read+0x70/0x170 (unreliable)
    debugfs_attr_read+0x6c/0xb0
    __vfs_read+0x3c/0x70
     vfs_read+0xbc/0x1a0
    ksys_read+0x7c/0x140
    system_call+0x5c/0x70

Patch fixes the issue with a more robust check for vbase to NULL.

Before patch, ls output for the debugfs imc directory

  # ls /sys/kernel/debug/powerpc/imc/
  imc_cmd_0    imc_cmd_251  imc_cmd_253  imc_cmd_255  imc_mode_0    imc_mode_251  imc_mode_253  imc_mode_255
  imc_cmd_250  imc_cmd_252  imc_cmd_254  imc_cmd_8    imc_mode_250  imc_mode_252  imc_mode_254  imc_mode_8

After patch, ls output for the debugfs imc directory

  # ls /sys/kernel/debug/powerpc/imc/
  imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8

Actual bug here is that, we have two loops with potentially different
loop counts. That is, in imc_get_mem_addr_nest(), loop count is
obtained from the dt entries. But in case of export_imc_mode_and_cmd(),
loop was based on for_each_nid() count. Patch fixes the loop count in
latter based on the struct mem_info. Ideally it would be better to
have array size in struct imc_pmu.

Fixes: 684d984038 ('powerpc/powernv: Add debugfs interface for imc-mode and imc')
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190827101635.6942-1-maddy@linux.vnet.ibm.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
2275d7b575 powerpc/64s/radix: introduce options to disable use of the tlbie instruction
Introduce two options to control the use of the tlbie instruction. A
boot time option which completely disables the kernel using the
instruction, this is currently incompatible with HASH MMU, KVM, and
coherent accelerators.

And a debugfs option can be switched at runtime and avoids using tlbie
for invalidating CPU TLBs for normal process and kernel address
mappings. Coherent accelerators are still managed with tlbie, as will
KVM partition scope translations.

Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
basic implementation which does not attempt to make any optimisation
beyond the tlbie implementation.

This is useful for performance testing among other things. For example
in certain situations on large systems, using IPIs may be faster than
tlbie as they can be directed rather than broadcast. Later we may also
take advantage of the IPIs to do more interesting things such as trim
the mm cpumask more aggressively.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
7d805accbe powerpc/64s: remove unnecessary translation cache flushes at boot
The various translation structure invalidations performed in early boot
when the MMU is off are not required, because everything is invalidated
immediately before a CPU first enables its MMU (see early_init_mmu
and early_init_mmu_secondary).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-6-npiggin@gmail.com
2019-09-05 14:22:41 +10:00
Nicholas Piggin
7e71c428a6 powerpc/64s/pseries: radix flush translations before MMU is enabled at boot
Radix guests are responsible for managing their own translation caches,
so make them match bare metal radix and hash, and make each CPU flush
all its translations right before enabling its MMU.

Radix guests may not flush partition scope translations, so in
tlbiel_all, make these flushes conditional on CPU_FTR_HVMODE. Process
scope translations are the only type visible to the guest.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-5-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
fd13daea5f powerpc/64s: make mmu_partition_table_set_entry TLB flush optional
No functional change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-4-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
99161de3a2 powerpc/64s/radix: tidy up TLB flushing code
There should be no functional changes.

- Use calls to existing radix_tlb.c functions in flush_partition.

- Rename radix__flush_tlb_lpid to radix__flush_all_lpid and similar,
  because they flush everything, matching flush_all_mm rather than
  flush_tlb_mm for the lpid.

- Remove some unused radix_tlb.c flush primitives.

Signed-off: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-3-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Nicholas Piggin
ed6546bdc6 powerpc/64s: remove register_process_table callback
This callback is only required because the partition table init comes
before process table allocation on powernv (aka bare metal aka native).

Change the order to allocate the process table first, and remove the
callback.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-2-npiggin@gmail.com
2019-09-05 14:22:40 +10:00
Oliver O'Halloran
bd6461cc7b powerpc/eeh: Add a eeh_dev_break debugfs interface
Add an interface to debugfs for generating an EEH event on a given device.
This works by disabling memory accesses to and from the device by setting
the PCI_COMMAND register (or the VF Memory Space Enable on the parent PF).

This is a somewhat portable alternative to using the platform specific
error injection mechanisms since those tend to be either hard to use, or
straight up broken. For pseries the interfaces also requires the use of
/dev/mem which is probably going to go away in a post-LOCKDOWN world
(and it's a horrific hack to begin with) so moving to a kernel-provided
interface makes sense and provides a sane, cross-platform interface for
userspace so we can write more generic testing scripts.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-14-oohall@gmail.com
2019-09-05 14:22:39 +10:00
Oliver O'Halloran
22cda7c168 powerpc/eeh: Add debugfs interface to run an EEH check
Detecting an frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
response. When performing EEH testing using the EEH error injection feature
available on some platforms there is no simple way to kick-off the kernel's
recovery process since any accesses from userspace (usually /dev/mem) will
bypass the MMIO helpers in the kernel which check if a 0xFF response is due
to an EEH freeze or not.

If a device contains a 0xFF byte in it's config space it's possible to
trigger the recovery process via config space read from userspace, but this
is not a reliable method. If a driver is bound to the device an in use it
will frequently trigger the MMIO check, but this is also inconsistent.

To solve these problems this patch adds a debugfs file called
"eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
eeh_dev_check_failure() on it. This is the same check that's done when the
kernel gets a 0xFF result from an config or MMIO read with the added
benifit that it can be reliably triggered from userspace.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-13-oohall@gmail.com
2019-09-05 14:22:39 +10:00
Oliver O'Halloran
aeff27c121 powerpc/eeh: Set attention indicator while recovering
I am the RAS team. Hear me roar.

Roar.

On a more serious note, being able to locate failed devices can be helpful.
Set the attention indicator if the slot supports it once we've determined
the device is present and only clear it if the device is fully recovered.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-12-oohall@gmail.com
2019-09-05 14:22:39 +10:00
Oliver O'Halloran
a839bd87a2 pci-hotplug/pnv_php: Add support for IODA3 Power9 PHBs
Currently we check that an IODA2 compatible PHB is upstream of this slot.
This is mainly to avoid pnv_php creating slots for the various "virtual
PHBs" that we create for NVLink. There's no real need for this restriction
so allow it on IODA3.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-10-oohall@gmail.com
2019-09-05 14:22:39 +10:00
Oliver O'Halloran
98fd32cde5 powernv/eeh: Use generic code to handle hot resets
When we reset PCI devices managed by a hotplug driver the reset may
generate spurious hotplug events that cause the PCI device we're resetting
to be torn down accidently. This is a problem for EEH (when the driver is
EEH aware) since we want to leave the OS PCI device state intact so that
the device can be re-set without losing any resources (network, disks,
etc) provided by the driver.

Generic PCI code provides the pci_bus_error_reset() function to handle
resetting a PCI Device (or bus) by using the reset method provided by the
hotplug slot driver. We can use this function if the EEH core has
requested a hot reset (common case) without tripping over the hotplug
driver.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-8-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
5055453335 powerpc/eeh: Remove stale CAPI comment
Support for switching CAPI cards into and out of CAPI mode was removed a
while ago. Drop the comment since it's no longer relevant.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-7-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
25baf3d816 powerpc/eeh: Defer printing stack trace
Currently we print a stack trace in the event handler to help with
debugging EEH issues. In the case of suprise hot-unplug this is unneeded,
so we want to prevent printing the stack trace unless we know it's due to
an actual device error. To accomplish this, we can save a stack trace at
the point of detection and only print it once the EEH recovery handler has
determined the freeze was due to an actual error.

Since the whole point of this is to prevent spurious EEH output we also
move a few prints out of the detection thread, or mark them as pr_debug
so anyone interested can get output from the eeh_check_dev_failure()
if they want.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-6-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
b104af5a76 powerpc/eeh: Check slot presence state in eeh_handle_normal_event()
When a device is surprise removed while undergoing IO we will probably
get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
is detected we send a recovery event to the EEH worker thread which will
notify drivers, and perform recovery as needed.

In the event of a hot-remove we don't want recovery to occur since there
isn't a device to recover. The recovery process is fairly long due to
the number of wait states (required by PCIe) which causes problems when
devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).

To determine if we need to skip the recovery process we can use the
get_adapter_state() operation of the hotplug_slot to determine if the
slot contains a device or not, and if the slot is empty we can skip
recovery entirely.

One thing to note is that the slot being EEH frozen does not prevent the
hotplug driver from working. We don't have the EEH recovery thread
remove any of the devices since it's assumed that the hotplug driver
will handle tearing down the slot state.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-5-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
38ddc01147 powerpc/eeh: Make permanently failed devices non-actionable
If a device is torn down by a hotplug slot driver it's marked as removed
and marked as permaantly failed. There's no point in trying to recover a
permernantly failed device so it should be considered un-actionable.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-4-oohall@gmail.com
2019-09-05 14:22:38 +10:00
Oliver O'Halloran
5ef753ae43 powerpc/eeh: Fix race when freeing PDNs
When hot-adding devices we rely on the hotplug driver to create pci_dn's
for the devices under the hotplug slot. Converse, when hot-removing the
driver will remove the pci_dn's that it created. This is a problem because
the pci_dev is still live until it's refcount drops to zero. This can
happen if the driver is slow to tear down it's internal state. Ideally, the
driver would not attempt to perform any config accesses to the device once
it's been marked as removed, but sometimes it happens. As a result, we
might attempt to access the pci_dn for a device that has been torn down and
the kernel may crash as a result.

To fix this, don't free the pci_dn unless the corresponding pci_dev has
been released.  If the pci_dev is still live, then we mark the pci_dn with
a flag that indicates the pci_dev's release function should free it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-3-oohall@gmail.com
2019-09-05 14:22:37 +10:00
Oliver O'Halloran
799abe283e powerpc/eeh: Clean up EEH PEs after recovery finishes
When the last device in an eeh_pe is removed the eeh_pe structure itself
(and any empty parents) are freed since they are no longer needed. This
results in a crash when a hotplug driver is involved since the following
may occur:

1. Device is suprise removed.
2. Driver performs an MMIO, which fails and queues and eeh_event.
3. Hotplug driver receives a hotplug interrupt and removes any
   pci_devs that were under the slot.
4. pci_dev is torn down and the eeh_pe is freed.
5. The EEH event handler thread processes the eeh_event and crashes
   since the eeh_pe pointer in the eeh_event structure is no
   longer valid.

Crashing is generally considered poor form. Instead of doing that use
the fact PEs are marked as EEH_PE_INVALID to keep them around until the
end of the recovery cycle, at which point we can safely prune any empty
PEs.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-2-oohall@gmail.com
2019-09-05 14:22:37 +10:00
Masahiro Yamada
858805b336 kbuild: add $(BASH) to run scripts with bash-extension
CONFIG_SHELL falls back to sh when bash is not installed on the system,
but nobody is testing such a case since bash is usually installed.
So, shell scripts invoked by CONFIG_SHELL are only tested with bash.

It makes it difficult to test whether the hashbang #!/bin/sh is real.
For example, #!/bin/sh in arch/powerpc/kernel/prom_init_check.sh is
false. (I fixed it up)

Besides, some shell scripts invoked by CONFIG_SHELL use bash-extension
and #!/bin/bash is specified as the hashbang, while CONFIG_SHELL may
not always be set to bash.

Probably, the right thing to do is to introduce BASH, which is bash by
default, and always set CONFIG_SHELL to sh. Replace $(CONFIG_SHELL)
with $(BASH) for bash scripts.

If somebody tries to add bash-extension to a #!/bin/sh script, it will
be caught in testing because /bin/sh is a symlink to dash on some major
distributions.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-09-04 22:54:13 +09:00
Gustavo Romero
a8318c13e7 powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts
When in userspace and MSR FP=0 the hardware FP state is unrelated to
the current process. This is extended for transactions where if tbegin
is run with FP=0, the hardware checkpoint FP state will also be
unrelated to the current process. Due to this, we need to ensure this
hardware checkpoint is updated with the correct state before we enable
FP for this process.

Unfortunately we get this wrong when returning to a process from a
hardware interrupt. A process that starts a transaction with FP=0 can
take an interrupt. When the kernel returns back to that process, we
change to FP=1 but with hardware checkpoint FP state not updated. If
this transaction is then rolled back, the FP registers now contain the
wrong state.

The process looks like this:
   Userspace:                      Kernel

               Start userspace
                with MSR FP=0 TM=1
                  < -----
   ...
   tbegin
   bne
               Hardware interrupt
                   ---- >
                                    <do_IRQ...>
                                    ....
                                    ret_from_except
                                      restore_math()
				        /* sees FP=0 */
                                        restore_fp()
                                          tm_active_with_fp()
					    /* sees FP=1 (Incorrect) */
                                          load_fp_state()
                                        FP = 0 -> 1
                  < -----
               Return to userspace
                 with MSR TM=1 FP=1
                 with junk in the FP TM checkpoint
   TM rollback
   reads FP junk

When returning from the hardware exception, tm_active_with_fp() is
incorrectly making restore_fp() call load_fp_state() which is setting
FP=1.

The fix is to remove tm_active_with_fp().

tm_active_with_fp() is attempting to handle the case where FP state
has been changed inside a transaction. In this case the checkpointed
and transactional FP state is different and hence we must restore the
FP state (ie. we can't do lazy FP restore inside a transaction that's
used FP). It's safe to remove tm_active_with_fp() as this case is
handled by restore_tm_state(). restore_tm_state() detects if FP has
been using inside a transaction and will set load_fp and call
restore_math() to ensure the FP state (checkpoint and transaction) is
restored.

This is a data integrity problem for the current process as the FP
registers are corrupted. It's also a security problem as the FP
registers from one process may be leaked to another.

Similarly for VMX.

A simple testcase to replicate this will be posted to
tools/testing/selftests/powerpc/tm/tm-poison.c

This fixes CVE-2019-15031.

Fixes: a7771176b4 ("powerpc: Don't enable FP/Altivec if not checkpointed")
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190904045529.23002-2-gromero@linux.vnet.ibm.com
2019-09-04 22:31:13 +10:00
Gustavo Romero
8205d5d98e powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction
When we take an FP unavailable exception in a transaction we have to
account for the hardware FP TM checkpointed registers being
incorrect. In this case for this process we know the current and
checkpointed FP registers must be the same (since FP wasn't used
inside the transaction) hence in the thread_struct we copy the current
FP registers to the checkpointed ones.

This copy is done in tm_reclaim_thread(). We use thread->ckpt_regs.msr
to determine if FP was on when in userspace. thread->ckpt_regs.msr
represents the state of the MSR when exiting userspace. This is setup
by check_if_tm_restore_required().

Unfortunatley there is an optimisation in giveup_all() which returns
early if tsk->thread.regs->msr (via local variable `usermsr`) has
FP=VEC=VSX=SPE=0. This optimisation means that
check_if_tm_restore_required() is not called and hence
thread->ckpt_regs.msr is not updated and will contain an old value.

This can happen if due to load_fp=255 we start a userspace process
with MSR FP=1 and then we are context switched out. In this case
thread->ckpt_regs.msr will contain FP=1. If that same process is then
context switched in and load_fp overflows, MSR will have FP=0. If that
process now enters a transaction and does an FP instruction, the FP
unavailable will not update thread->ckpt_regs.msr (the bug) and MSR
FP=1 will be retained in thread->ckpt_regs.msr.  tm_reclaim_thread()
will then not perform the required memcpy and the checkpointed FP regs
in the thread struct will contain the wrong values.

The code path for this happening is:

       Userspace:                      Kernel
                   Start userspace
                    with MSR FP/VEC/VSX/SPE=0 TM=1
                      < -----
       ...
       tbegin
       bne
       fp instruction
                   FP unavailable
                       ---- >
                                        fp_unavailable_tm()
					  tm_reclaim_current()
					    tm_reclaim_thread()
					      giveup_all()
					        return early since FP/VMX/VSX=0
						/* ckpt MSR not updated (Incorrect) */
					      tm_reclaim()
					        /* thread_struct ckpt FP regs contain junk (OK) */
                                              /* Sees ckpt MSR FP=1 (Incorrect) */
					      no memcpy() performed
					        /* thread_struct ckpt FP regs not fixed (Incorrect) */
					  tm_recheckpoint()
					     /* Put junk in hardware checkpoint FP regs */
                                         ....
                      < -----
                   Return to userspace
                     with MSR TM=1 FP=1
                     with junk in the FP TM checkpoint
       TM rollback
       reads FP junk

This is a data integrity problem for the current process as the FP
registers are corrupted. It's also a security problem as the FP
registers from one process may be leaked to another.

This patch moves up check_if_tm_restore_required() in giveup_all() to
ensure thread->ckpt_regs.msr is updated correctly.

A simple testcase to replicate this will be posted to
tools/testing/selftests/powerpc/tm/tm-poison.c

Similarly for VMX.

This fixes CVE-2019-15030.

Fixes: f48e91e87e ("powerpc/tm: Fix FP and VMX register corruption")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190904045529.23002-1-gromero@linux.vnet.ibm.com
2019-09-04 22:31:13 +10:00
Christoph Hellwig
249baa5479 dma-mapping: provide a better default ->get_required_mask
Most dma_map_ops instances are IOMMUs that work perfectly fine in 32-bits
of IOVA space, and the generic direct mapping code already provides its
own routines that is intelligent based on the amount of memory actually
present.  Wire up the dma-direct routine for the ARM direct mapping code
as well, and otherwise default to the constant 32-bit mask.  This way
we only need to override it for the occasional odd IOMMU that requires
64-bit IOVA support, or IOMMU drivers that are more efficient if they
can fall back to the direct mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
f9f3232a7d dma-mapping: explicitly wire up ->mmap and ->get_sgtable
While the default ->mmap and ->get_sgtable implementations work for the
majority of our dma_map_ops impementations they are inherently safe
for others that don't use the page allocator or CMA and/or use their
own way of remapping not covered by the common code.  So remove the
defaults if these methods are not wired up, but instead wire up the
default implementations for all safe instances.

Fixes: e1c7e32453 ("dma-mapping: always provide the dma_map_ops based implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:18 +02:00
Greg Kroah-Hartman
7a81146204 Merge 5.3-rc7 into usb-next
We need the usb fixes in here for testing

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-02 19:31:18 +02:00
Will Deacon
ac12cf85d6 Merge branches 'for-next/52-bit-kva', 'for-next/cpu-topology', 'for-next/error-injection', 'for-next/perf', 'for-next/psci-cpuidle', 'for-next/rng', 'for-next/smpboot', 'for-next/tbi' and 'for-next/tlbi' into for-next/core
* for-next/52-bit-kva: (25 commits)
  Support for 52-bit virtual addressing in kernel space

* for-next/cpu-topology: (9 commits)
  Move CPU topology parsing into core code and add support for ACPI 6.3

* for-next/error-injection: (2 commits)
  Support for function error injection via kprobes

* for-next/perf: (8 commits)
  Support for i.MX8 DDR PMU and proper SMMUv3 group validation

* for-next/psci-cpuidle: (7 commits)
  Move PSCI idle code into a new CPUidle driver

* for-next/rng: (4 commits)
  Support for 'rng-seed' property being passed in the devicetree

* for-next/smpboot: (3 commits)
  Reduce fragility of secondary CPU bringup in debug configurations

* for-next/tbi: (10 commits)
  Introduce new syscall ABI with relaxed requirements for pointer tags

* for-next/tlbi: (6 commits)
  Handle spurious page faults arising from kernel space
2019-08-30 12:46:12 +01:00
Nicholas Piggin
9b123d1ea2 powerpc/64s/exception: reduce page fault unnecessary loads
This avoids 3 loads in the radix page fault case, 1 load in the
hash fault case, and 2 loads in the hash miss page fault case.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-37-npiggin@gmail.com
2019-08-30 11:14:59 +10:00
Nicholas Piggin
05f97d94dd powerpc/64s/exception: Remove pointless KVM handler name bifurcation
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-36-npiggin@gmail.com
2019-08-30 11:14:59 +10:00
Nicholas Piggin
1b3599829a powerpc/64s/exception: program check handler do not branch into a macro
It is clever, but the small code saving is not worth the spaghetti of
jumping to a label in an expanded macro, particularly when the label
is just a number rather than a descriptive name.

So expand the INT_COMMON macro twice, once for the stack and no stack
cases, and branch to those. The slight code size increase is worth
the improved clarity of branches for this non-performance critical
code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-35-npiggin@gmail.com
2019-08-30 11:14:58 +10:00
Nicholas Piggin
c7c5cbb42d powerpc/64s/exception: move interrupt entry code above the common handler
This better reflects the order in which the code is executed.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-34-npiggin@gmail.com
2019-08-30 11:14:58 +10:00
Nicholas Piggin
d1a8471888 powerpc/64s/exception: INT_COMMON add DAR, DSISR, reconcile options
Move DAR and DSISR saving to pt_regs into INT_COMMON. Also add an
option to expand RECONCILE_IRQ_STATE.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-33-npiggin@gmail.com
2019-08-30 11:14:58 +10:00
Nicholas Piggin
8c9fb5d4f3 powerpc/64s/exception: Expand EXCEPTION_PROLOG_COMMON_1 and 2 into caller
No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-32-npiggin@gmail.com
2019-08-30 11:14:58 +10:00
Nicholas Piggin
5d5e0edfd5 powerpc/64s/exception: Expand EXCEPTION_COMMON macro into caller
No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-31-npiggin@gmail.com
2019-08-30 11:14:58 +10:00
Nicholas Piggin
bcbceed40a powerpc/64s/exception: Add INT_COMMON gas macro to generate common exception code
No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-30-npiggin@gmail.com
2019-08-30 11:14:57 +10:00
Nicholas Piggin
9a9c739aa8 powerpc/64s/exception: Merge EXCEPTION_PROLOG_COMMON_2/3
Merge EXCEPTION_PROLOG_COMMON_3 into EXCEPTION_PROLOG_COMMON_2.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-29-npiggin@gmail.com
2019-08-30 11:14:57 +10:00
Nicholas Piggin
7027d53d1a powerpc/64s/exception: KVM_HANDLER reorder arguments to match other macros
Also change argument name (n -> vec) to match others.

No generated code change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-28-npiggin@gmail.com
2019-08-30 11:14:57 +10:00
Nicholas Piggin
141fed2669 powerpc/64s/exception: Add INT_KVM_HANDLER gas macro
Replace the 4 variants of cpp macros with one gas macro.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-27-npiggin@gmail.com
2019-08-30 11:14:57 +10:00
Nicholas Piggin
4515c5fa41 powerpc/64s/exception: INT_HANDLER support HDAR/HDSISR and use it in HDSI
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-26-npiggin@gmail.com
2019-08-30 11:14:57 +10:00
Nicholas Piggin
52b989231c powerpc/64s/exception: Add the virt variant of the denorm interrupt handler
All other virt handlers have the prolog code in the virt vector rather
than branch to the real vector. Follow this pattern in the denorm virt
handler.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-25-npiggin@gmail.com
2019-08-30 11:14:56 +10:00
Nicholas Piggin
d29768e13c powerpc/64s/exception: remove EXCEPTION_PROLOG_0/1, rename _2
EXCEPTION_PROLOG_0 and _1 have only a single caller, so expand them
into it.

Rename EXCEPTION_PROLOG_2_REAL to INT_SAVE_SRR_AND_JUMP and
EXCEPTION_PROLOG_2_VIRT to INT_VIRT_SAVE_SRR_AND_JUMP, which are
more descriptive.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-24-npiggin@gmail.com
2019-08-30 11:14:56 +10:00
Michael Ellerman
9b40f62b8a powerpc/64s/exceptions: Use keyword params to shorten arg lists
The argument lists for the INT_HANDLER macro are getting a bit
unwieldy. Use keyword parameters with default values to shorten them.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190830011426.16810-1-mpe@ellerman.id.au
2019-08-30 11:14:51 +10:00
Nicholas Piggin
7299417c82 powerpc/64s/exception: Replace PROLOG macros and EXC helpers with a gas macro
This creates a single macro that generates the exception prolog code,
with variants specified by arguments, rather than assorted nested
macros for different variants.

The increasing length of macro argument list is not nice to read or
modify, but this is a temporary condition that will be improved in
later changes.

No generated code change except BUG line number constants and label
names.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-23-npiggin@gmail.com
2019-08-30 10:32:37 +10:00
Nicholas Piggin
5ff79a5ea6 powerpc/64s/exception: remove 0xb00 handler
This vector is not used by any supported processor, and has been
implemented as an unknown exception going back to 2.6. There is
nothing special about 0xb00, so remove it like other unused
vectors.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-22-npiggin@gmail.com
2019-08-30 10:32:37 +10:00
Nicholas Piggin
9a7a0773d7 powerpc/64s/exception: Fix performance monitor virt handler
The perf virt handler uses EXCEPTION_PROLOG_2_REAL rather than _VIRT.
In practice this is okay because the _REAL variant is usable by virt
mode interrupts, but should be fixed (and is a performance win).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-21-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
def0db4f9d powerpc/64s/exception: Add EXC_HV_OR_STD, which selects HSRR if HVMODE
Add EXC_HV_OR_STD and use it to consolidate the 0x500 external
interrupt.

Executed code is unchanged.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-20-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
a243281195 powerpc/64s/exception: move head-64.h exception code to exception-64s.S
The head-64.h code should deal only with the head code sections
and offset calculations.

No generated code change except BUG line number constants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-19-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
c31f7134dc powerpc/64s/exception: Fix DAR load for handle_page_fault error case
This buglet goes back to before the 64/32 arch merge, but it does not
seem to have had practical consequences because bad_page_fault does
not use the 2nd argument, but rather regs->dar/nip.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-18-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
b3fe35261e powerpc/64s/exception: machine check improve labels and comments
Short forward and backward branches can be given number labels,
but larger significant divergences in code path a more readable
if they're given descriptive names.

Also adjusts a comment to account for guest delivery.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-17-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
fce16d4822 powerpc/64s/exception: untangle early machine check handler branch
machine_check_early_common now branches to machine_check_handle_early
which is its only caller.

Move interleaving code out of the way, and remove the branch.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-16-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
b7d9ccec30 powerpc/64s/exception: machine check move unrecoverable handling out of line
Similarly to the previous change, all callers of the unrecoverable
handler run relocated so can reach it with a direct branch. This makes
it easy to move out of line, which makes the "normal" path less
cluttered and easier to follow.

MSR[ME] manipulation still requires the rfi, so that is moved out of
line to its own function.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-15-npiggin@gmail.com
2019-08-30 10:32:36 +10:00
Nicholas Piggin
296e753fb4 powerpc/64s/exception: simplify machine check early path
machine_check_handle_early_common can reach machine_check_handle_early
directly now that it runs at the relocated address, so just branch
directly.

The rfi sequence is required to enable MSR[ME] but that step is moved
into a helper function, making the code easier to follow.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-14-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
abd1f4ca2b powerpc/64s/exception: machine check move tramp code
Following convention, move the tramp code (unrelocated) above the
common handlers (relocated).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-13-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
c8eb54dbc8 powerpc/64s/exception: machine check restructure to reuse common macros
Follow the pattern of sreset and HMI handlers more closely: use
EXCEPTION_PROLOG_COMMON_1 rather than open-coding it, and run the
handler at the relocated location.

This helps later simplification and code sharing.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-12-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
272f636445 powerpc/64s/exception: machine check pseries should skip the late handler for kernel MCEs
The powernv machine check handler copes with taking a MCE from one of
three contexts, guest, kernel, and user. In each case the early
handler runs first on a special stack, then:

- The guest case branches to the KVM interrupt handler (via standard
  interrupt macros).
- The user case will run the "late" handler which is like a normal
  interrupt that runs in virtual mode and uses the regular kernel
  stack.
- The kernel case queues the event and schedules it for processing
  with irq work.

The last case is important, it must not enable virtual memory because
the MMU state may not be set up to deal with that (e.g., SLB might be
clear), it must not use the regular kernel stack for similar reasons
(e.g., might be in OPAL with OPAL stack in r1), and the kernel does
not expect anything to touch its stack if interrupts are disabled.

The pseries handler does not do this queueing, but instead it always
runs the late handler for host MCEs, which has some of the same
problems.

Now that pseries is using machine_check_events, change it to do the
same as powernv and queue events for kernel MCEs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-11-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
9ca766f989 powerpc/64s/pseries: machine check convert to use common event code
The common machine_check_event data structures and queues are mostly
platform independent, with powernv decoding SRR1/DSISR/etc., into
machine_check_event objects.

This patch converts pseries to use this infrastructure by decoding
fwnmi/rtas data into machine_check_event objects.

This allows queueing to be used by a subsequent change to delay the
virtual mode handling of machine checks that occur in kernel space
where it is unsafe to switch immediately to virtual mode, similarly
to powernv.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix implicit fallthrough warnings in mce_handle_error()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-10-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
7290f3b3d3 powerpc/64s/powernv: machine check dump SLB contents
Re-use the code introduced in pseries to save and dump the contents
of the SLB in the case of an SLB involved machine check exception.

This patch also avoids allocating the SLB save array on pseries radix.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-9-npiggin@gmail.com
2019-08-30 10:32:35 +10:00
Nicholas Piggin
0b66370c61 powerpc/64s/exception: machine check use correct cfar for late handler
Bare metal machine checks run an "early" handler in real mode before
running the main handler which reports the event.

The main handler runs exactly as a normal interrupt handler, after the
"windup" which sets registers back as they were at interrupt entry.
CFAR does not get restored by the windup code, so that will be wrong
when the handler is run.

Restore the CFAR to the saved value before running the late handler.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-8-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
fa2760eca5 powerpc/64s/exception: machine check remove machine_check_pSeries_0 branch
This label has only one caller, so unwind the branch and move it
inline. The location of the comment is adjusted to match similar
one in system reset.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-7-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
b5c27f7c56 powerpc/64s/exception: machine check pseries should always run the early handler
Now that pseries with fwnmi registered runs the early machine check
handler, there is no good reason to special case the non-fwnmi case
and skip the early handler. Reducing the code and number of paths is
a top priority for asm code, it's better to handle this in C where
possible (and the pseries early handler is a no-op if fwnmi is not
registered).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-6-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
fe9d482b1d powerpc/64s/exception: machine check adjust RFI target
The host kernel delivery case for powernv does RFI_TO_USER_OR_KERNEL,
but should just use RFI_TO_KERNEL which makes it clear this is not a
user case.

This is not a bug because RFI_TO_USER_OR_KERNEL deals with kernel
returns just fine.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-5-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
19dbe673e6 powerpc/64s/exception: machine check fix KVM guest test
The machine_check_handle_early hypervisor guest test is skipped if
!HVMODE or MSR[HV]=0, which is wrong for PR or nested hypervisors
that could be running a guest in this state.

Test HSTATE_IN_GUEST up front and use that to branch out to the KVM
handler, then MSR[PR] alone can test for this kernel's userspace.
This matches all other interrupt handling.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-4-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
1039f62431 powerpc/64s/exception: machine check remove bitrotted comment
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-3-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Nicholas Piggin
0be9f7fd5d powerpc/64s/exception: machine check fwnmi remove HV case
fwnmi does not trigger in HV mode, so remove always-true feature test.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802105709.27696-2-npiggin@gmail.com
2019-08-30 10:32:34 +10:00
Ryan Grimm
bf75a8db72 powerpc/configs: Enable secure guest support in pseries and ppc64 defconfigs
Enables running as a secure guest in platforms with an Ultravisor.

Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-17-bauerman@linux.ibm.com
2019-08-30 09:56:30 +10:00
Anshuman Khandual
2efbc58f15 powerpc/pseries/svm: Force SWIOTLB for secure guests
SWIOTLB checks range of incoming CPU addresses to be bounced and sees if
the device can access it through its DMA window without requiring bouncing.
In such cases it just chooses to skip bouncing. But for cases like secure
guests on powerpc platform all addresses need to be bounced into the shared
pool of memory because the host cannot access it otherwise. Hence the need
to do the bouncing is not related to device's DMA window and use of bounce
buffers is forced by setting swiotlb_force.

Also, connect the shared memory conversion functions into the
ARCH_HAS_MEM_ENCRYPT hooks and call swiotlb_update_mem_attributes() to
convert SWIOTLB's memory pool to shared memory.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[ bauerman: Use ARCH_HAS_MEM_ENCRYPT hooks to share swiotlb memory pool. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-15-bauerman@linux.ibm.com
2019-08-30 09:55:41 +10:00
Thiago Jung Bauermann
edea902c1c powerpc/pseries/iommu: Don't use dma_iommu_ops on secure guests
Secure guest memory is inacessible to devices so regular DMA isn't
possible.

In that case set devices' dma_map_ops to NULL so that the generic
DMA code path will use SWIOTLB to bounce buffers for DMA.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-14-bauerman@linux.ibm.com
2019-08-30 09:55:41 +10:00
Sukadev Bhattiprolu
4edaac512c powerpc/pseries/svm: Disable doorbells in SVM guests
Normally, the HV emulates some instructions like MSGSNDP, MSGCLRP
from a KVM guest. To emulate the instructions, it must first read
the instruction from the guest's memory and decode its parameters.

However for a secure guest (aka SVM), the page containing the
instruction is in secure memory and the HV cannot access directly.
It would need the Ultravisor (UV) to facilitate accessing the
instruction and parameters but the UV currently does not have
the support for such accesses.

Until the UV has such support, disable doorbells in SVMs. This might
incur a performance hit but that is yet to be quantified.

With this patch applied (needed only in SVMs not needed for HV) we
are able to launch SVM guests with multi-core support. Eg:

	qemu -smp sockets=2,cores=2,threads=2.

Fix suggested by Benjamin Herrenschmidt. Thanks to input from
Paul Mackerras, Ram Pai and Michael Anderson.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-13-bauerman@linux.ibm.com
2019-08-30 09:55:41 +10:00
Ryan Grimm
734560ac39 powerpc/pseries/svm: Export guest SVM status to user space via sysfs
User space might want to know it's running in a secure VM.  It can't do
a mfmsr because mfmsr is a privileged instruction.

The solution here is to create a cpu attribute:

/sys/devices/system/cpu/svm

which will read 0 or 1 based on the S bit of the current CPU.

Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-12-bauerman@linux.ibm.com
2019-08-30 09:55:41 +10:00
Ram Pai
256ba2c168 powerpc/pseries/svm: Unshare all pages before kexecing a new kernel
A new kernel deserves a clean slate. Any pages shared with the hypervisor
is unshared before invoking the new kernel. However there are exceptions.
If the new kernel is invoked to dump the current kernel, or if there is a
explicit request to preserve the state of the current kernel, unsharing
of pages is skipped.

NOTE: While testing crashkernel, make sure at least 256M is reserved for
crashkernel. Otherwise SWIOTLB allocation will fail and crash kernel will
fail to boot.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-11-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Anshuman Khandual
d5394c059d powerpc/pseries/svm: Use shared memory for Debug Trace Log (DTL)
Secure guests need to share the DTL buffers with the hypervisor. To that
end, use a kmem_cache constructor which converts the underlying buddy
allocated SLUB cache pages into shared memory.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-10-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Anshuman Khandual
bd104e6db6 powerpc/pseries/svm: Use shared memory for LPPACA structures
LPPACA structures need to be shared with the host. Hence they need to be in
shared memory. Instead of allocating individual chunks of memory for a
given structure from memblock, a contiguous chunk of memory is allocated
and then converted into shared memory. Subsequent allocation requests will
come from the contiguous chunk which will be always shared memory for all
structures.

While we are able to use a kmem_cache constructor for the Debug Trace Log,
LPPACAs are allocated very early in the boot process (before SLUB is
available) so we need to use a simpler scheme here.

Introduce helper is_svm_platform() which uses the S bit of the MSR to tell
whether we're running as a secure guest.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-9-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Thiago Jung Bauermann
e311a92da1 powerpc/pseries: Add and use LPPACA_SIZE constant
Helps document what the hard-coded number means.

Also take the opportunity to fix an #endif comment.

Suggested-by: Alexey Kardashevskiy <aik@linux.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-8-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Sukadev Bhattiprolu
7f70c3815a powerpc: Introduce the MSR_S bit
Protected Execution Facility (PEF) is an architectural change for
POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
PEF adds a new higher privileged mode, called Ultravisor mode, to
POWER architecture.

The hardware changes include the following:

  * There is a new bit in the MSR that determines whether the current
    process is running in secure mode, MSR(S) bit 41. MSR(S)=1, process
    is in secure mode, MSR(s)=0 process is in normal mode.

  * The MSR(S) bit can only be set by the Ultravisor.

  * HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
    to return to a SVM it must use an ultracall. It can determine if
    the VM it is returning to is secure.

  * The privilege of a process is now determined by three MSR bits,
    MSR(S, HV, PR). In each of the tables below the modes are listed
    from least privilege to highest privilege. The higher privilege
    modes can access all the resources of the lower privilege modes.

    **Secure Mode MSR Settings**

       +---+---+---+---------------+
       | S | HV| PR|Privilege      |
       +===+===+===+===============+
       | 1 | 0 | 1 | Problem       |
       +---+---+---+---------------+
       | 1 | 0 | 0 | Privileged(OS)|
       +---+---+---+---------------+
       | 1 | 1 | 0 | Ultravisor    |
       +---+---+---+---------------+
       | 1 | 1 | 1 | Reserved      |
       +---+---+---+---------------+

    **Normal Mode MSR Settings**

       +---+---+---+---------------+
       | S | HV| PR|Privilege      |
       +===+===+===+===============+
       | 0 | 0 | 1 | Problem       |
       +---+---+---+---------------+
       | 0 | 0 | 0 | Privileged(OS)|
       +---+---+---+---------------+
       | 0 | 1 | 0 | Hypervisor    |
       +---+---+---+---------------+
       | 0 | 1 | 1 | Problem (HV)  |
       +---+---+---+---------------+

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ cclaudio: Update the commit message ]
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-7-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Ram Pai
f7777e008c powerpc/pseries/svm: Add helpers for UV_SHARE_PAGE and UV_UNSHARE_PAGE
These functions are used when the guest wants to grant the hypervisor
access to certain pages.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-6-bauerman@linux.ibm.com
2019-08-30 09:55:40 +10:00
Ram Pai
6a9c930bd7 powerpc/prom_init: Add the ESM call to prom_init
Make the Enter-Secure-Mode (ESM) ultravisor call to switch the VM to secure
mode. Pass kernel base address and FDT address so that the Ultravisor is
able to verify the integrity of the VM using information from the ESM blob.

Add "svm=" command line option to turn on switching to secure mode.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ andmike: Generate an RTAS os-term hcall when the ESM ucall fails. ]
Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
[ bauerman: Cleaned up the code a bit. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-5-bauerman@linux.ibm.com
2019-08-30 09:54:35 +10:00
Benjamin Herrenschmidt
528229d210 powerpc: Add support for adding an ESM blob to the zImage wrapper
For secure VMs, the signing tool will create a ticket called the "ESM blob"
for the Enter Secure Mode ultravisor call with the signatures of the kernel
and initrd among other things.

This adds support to the wrapper script for adding that blob via the "-e"
option to the zImage.pseries.

It also adds code to the zImage wrapper itself to retrieve and if necessary
relocate the blob, and pass its address to Linux via the device-tree, to be
later consumed by prom_init.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[ bauerman: Minor adjustments to some comments. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-4-bauerman@linux.ibm.com
2019-08-30 09:53:29 +10:00
Thiago Jung Bauermann
136bc0397a powerpc/pseries: Introduce option to build secure virtual machines
Introduce CONFIG_PPC_SVM to control support for secure guests and include
Ultravisor-related helpers when it is selected

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-3-bauerman@linux.ibm.com
2019-08-30 09:53:28 +10:00
Michael Ellerman
9044adca78 Merge branch 'topic/ppc-kvm' into next
Merge our ppc-kvm topic branch to bring in the Ultravisor support
patches.
2019-08-30 09:52:57 +10:00
Claudio Carvalho
68e0aa8ec5 powerpc/powernv: Add ultravisor message log interface
The ultravisor (UV) provides an in-memory console which follows the
OPAL in-memory console structure.

This patch extends the OPAL msglog code to initialize the UV memory
console and provide the "/sys/firmware/ultravisor/msglog" interface
for userspace to view the UV message log.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190828130521.26764-2-mpe@ellerman.id.au
2019-08-30 09:40:16 +10:00
Claudio Carvalho
dea45ea777 powerpc/powernv/opal-msglog: Refactor memcons code
This patch refactors the code in opal-msglog that operates on the OPAL
memory console in order to make it cleaner and also allow the reuse of
the new memcons_* functions.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190828130521.26764-1-mpe@ellerman.id.au
2019-08-30 09:40:16 +10:00
Sukadev Bhattiprolu
6c85b7bc63 powerpc/kvm: Use UV_RETURN ucall to return to ultravisor
When an SVM makes an hypercall or incurs some other exception, the
Ultravisor usually forwards (a.k.a. reflects) the exceptions to the
Hypervisor. After processing the exception, Hypervisor uses the
UV_RETURN ultracall to return control back to the SVM.

The expected register state on entry to this ultracall is:

* Non-volatile registers are restored to their original values.
* If returning from an hypercall, register R0 contains the return value
  (unlike other ultracalls) and, registers R4 through R12 contain any
  output values of the hypercall.
* R3 contains the ultracall number, i.e UV_RETURN.
* If returning with a synthesized interrupt, R2 contains the
  synthesized interrupt number.

Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-8-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Claudio Carvalho
512a5a6452 powerpc/powernv: Access LDBAR only if ultravisor disabled
LDBAR is a per-thread SPR populated and used by the thread-imc pmu
driver to dump the data counter into memory. It contains memory along
with few other configuration bits. LDBAR is populated and enabled only
when any of the thread imc pmu events are monitored.

In ultravisor enabled systems, LDBAR becomes ultravisor privileged and
an attempt to write to it will cause a Hypervisor Emulation Assistance
interrupt.

In ultravisor enabled systems, the ultravisor is responsible to maintain
the LDBAR (e.g. save and restore it).

This restricts LDBAR access to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Reviewed-by: Ryan Grimm <grimm@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-7-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Claudio Carvalho
5223134029 powerpc/mm: Write to PTCR only if ultravisor disabled
In ultravisor enabled systems, PTCR becomes ultravisor privileged only
for writing and an attempt to write to it will cause a Hypervisor
Emulation Assitance interrupt.

This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
writing to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-6-cclaudio@linux.ibm.com
2019-08-30 09:40:16 +10:00
Michael Anderson
139a1d2842 powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE
When Ultravisor (UV) is enabled, the partition table is stored in secure
memory and can only be accessed via the UV. The Hypervisor (HV) however
maintains a copy of the partition table in normal memory to allow Nest MMU
translations to occur (for normal VMs). The HV copy includes partition
table entries (PATE)s for secure VMs which would currently be unused
(Nest MMU translations cannot access secure memory) but they would be
needed as we add functionality.

This patch adds the UV_WRITE_PATE ucall which is used to update the PATE
for a VM (both normal and secure) when Ultravisor is enabled.

Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ cclaudio: Write the PATE in HV's table before doing that in UV's ]
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Reviewed-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-5-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
bb04ffe85e powerpc/powernv: Introduce FW_FEATURE_ULTRAVISOR
In PEF enabled systems, some of the resources which were previously
hypervisor privileged are now ultravisor privileged and controlled by
the ultravisor firmware.

This adds FW_FEATURE_ULTRAVISOR to indicate if PEF is enabled.

The host kernel can use FW_FEATURE_ULTRAVISOR, for instance, to skip
accessing resources (e.g. PTCR and LDBAR) in case PEF is enabled.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
[ andmike: Device node name to "ibm,ultravisor" ]
Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-4-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
a49dddbdb0 powerpc/kernel: Add ucall_norets() ultravisor call handler
The ultracalls (ucalls for short) allow the Secure Virtual Machines
(SVM)s and hypervisor to request services from the ultravisor such as
accessing a register or memory region that can only be accessed when
running in ultravisor-privileged mode.

This patch adds the ucall_norets() ultravisor call handler.

The specific service needed from an ucall is specified in register
R3 (the first parameter to the ucall). Other parameters to the
ucall, if any, are specified in registers R4 through R12.

Return value of all ucalls is in register R3. Other output values
from the ucall, if any, are returned in registers R4 through R12.

Each ucall returns specific error codes, applicable in the context
of the ucall. However, like with the PowerPC Architecture Platform
Reference (PAPR), if no specific error code is defined for a particular
situation, then the ucall will fallback to an erroneous
parameter-position based code. i.e U_PARAMETER, U_P2, U_P3 etc depending
on the ucall parameter that may have caused the error.

Every host kernel (powernv) needs to be able to do ucalls in case it
ends up being run in a machine with ultravisor enabled. Otherwise, the
kernel may crash early in boot trying to access ultravisor resources,
for instance, trying to set the partition table entry 0. Secure guests
also need to be able to do ucalls and its kernel may not have
CONFIG_PPC_POWERNV=y. For that reason, the ucall.S file is placed under
arch/powerpc/kernel.

If ultravisor is not enabled, the ucalls will be redirected to the
hypervisor which must handle/fail the call.

Thanks to inputs from Ram Pai and Michael Anderson.

Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190822034838.27876-3-cclaudio@linux.ibm.com
2019-08-30 09:40:15 +10:00
Claudio Carvalho
70ed86f4de powerpc: Add PowerPC Capabilities ELF note
Add the PowerPC name and the PPC_ELFNOTE_CAPABILITIES type in the
kernel binary ELF note. This type is a bitmap that can be used to
advertise kernel capabilities to userland.

This patch also defines PPCCAP_ULTRAVISOR_BIT as being the bit zero.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
[ maxiwell: Define the 'PowerPC' type in the elfnote.h ]
Signed-off-by: Maxiwell S. Garcia <maxiwell@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829155021.2915-2-maxiwell@linux.ibm.com
2019-08-30 09:40:15 +10:00
Alexey Kardashevskiy
a102f139aa powerpc/powernv/ioda: Remove obsolete iommu_table_ops::exchange callbacks
As now we have xchg_no_kill/tce_kill, these are not used anymore so
remove them.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-6-aik@ozlabs.ru
2019-08-30 09:40:15 +10:00
Alexey Kardashevskiy
021b786811 powerpc/pseries/iommu: Switch to xchg_no_kill
This is the last implementation of iommu_table_ops::exchange() which
we are about to remove.

This implements xchg_no_kill() for pseries. Since it is paravirtual
platform, the hypervisor does TCE invalidations and we do not have
to deal with it here, hence no tce_kill() hook.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-5-aik@ozlabs.ru
2019-08-30 09:40:14 +10:00
Alexey Kardashevskiy
01b7d128b5 KVM: PPC: Book3S: Invalidate multiple TCEs at once
Invalidating a TCE cache entry for each updated TCE is quite expensive.
This makes use of the new iommu_table_ops::xchg_no_kill()/tce_kill()
callbacks to bring down the time spent in mapping a huge guest DMA window;
roughly 20s to 10s for each guest's 100GB of DMA space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-3-aik@ozlabs.ru
2019-08-30 09:40:14 +10:00
Alexey Kardashevskiy
35872480da powerpc/powernv/ioda: Split out TCE invalidation from TCE updates
At the moment updates in a TCE table are made by iommu_table_ops::exchange
which update one TCE and invalidates an entry in the PHB/NPU TCE cache
via set of registers called "TCE Kill" (hence the naming).
Writing a TCE is a simple xchg() but invalidating the TCE cache is
a relatively expensive OPAL call. Mapping a 100GB guest with PCI+NPU
passed through devices takes about 20s.

Thankfully we can do better. Since such big mappings happen at the boot
time and when memory is plugged/onlined (i.e. not often), these requests
come in 512 pages so we call call OPAL 512 times less which brings 20s
from the above to less than 10s. Also, since TCE caches can be flushed
entirely, calling OPAL for 512 TCEs helps skiboot [1] to decide whether
to flush the entire cache or not.

This implements 2 new iommu_table_ops callbacks:
- xchg_no_kill() to update a single TCE with no TCE invalidation;
- tce_kill() to invalidate multiple TCEs.
This uses the same xchg_no_kill() callback for IODA1/2.

This implements 2 new wrappers on top of the new callbacks similar to
the existing iommu_tce_xchg().

This does not use the new callbacks yet, the next patches will;
so this should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829085252.72370-2-aik@ozlabs.ru
2019-08-30 09:40:14 +10:00
Alexey Kardashevskiy
4f916593be KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling
H_PUT_TCE_INDIRECT handlers receive a page with up to 512 TCEs from
a guest. Although we verify correctness of TCEs before we do anything
with the existing tables, there is a small window when a check in
kvmppc_tce_validate might pass and right after that the guest alters
the page with TCEs which can cause early exit from the handler and
leave srcu_read_lock(&vcpu->kvm->srcu) (virtual mode) or lock_rmap(rmap)
(real mode) locked.

This fixes the bug by jumping to the common exit code with an appropriate
unlock.

Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190826045520.92153-1-aik@ozlabs.ru
2019-08-30 09:40:14 +10:00
Alexey Kardashevskiy
bc605cd79e powerpc/of/pci: Rewrite pci_parse_of_flags
The existing code uses bunch of hardcoded values from the PCI Bus
Binding to IEEE Std 1275 spec; and it does so in quite non-obvious
way.

This defines fields from the cell#0 of the "reg" property of a PCI
device and uses them for parsing.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[mpe: Unsplit some 80/81 char lines, space the code with some newlines]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190829084417.71873-1-aik@ozlabs.ru
2019-08-29 20:24:05 +10:00
Christoph Hellwig
f2902a2fb4 powerpc: use the generic dma coherent remap allocator
This switches to using common code for the DMA allocations, including
potential use of the CMA allocator if configured.

Switching to the generic code enables DMA allocations from atomic
context, which is required by the DMA API documentation, and also
adds various other minor features drivers start relying upon.  It
also makes sure we have on tested code base for all architectures
that require uncached pte bits for coherent DMA allocations.

Another advantage is that consistent memory allocations now share
the general vmalloc pool instead of needing an explicit careout
from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> # tested on 8xx
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814132230.31874-2-hch@lst.de
2019-08-28 23:19:34 +10:00
Nicholas Piggin
555e28179d powerpc/64: remove support for kernel-mode syscalls
There is support for the kernel to execute the 'sc 0' instruction and
make a system call to itself. This is a relic that is unused in the
tree, therefore untested. It's also highly questionable for modules to
be doing this.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190827033010.28090-3-npiggin@gmail.com
2019-08-28 23:19:34 +10:00
Nicholas Piggin
facd04a904 powerpc: convert to copy_thread_tls
Commit 3033f14ab7 ("clone: support passing tls argument via C rather
than pt_regs magic") introduced the HAVE_COPY_THREAD_TLS option. Use it
to avoid a subtle assumption about the argument ordering of clone type
syscalls.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190827033010.28090-2-npiggin@gmail.com
2019-08-28 23:19:34 +10:00
Christophe Leroy
c7bf1252d5 powerpc/32: don't use CPU_FTR_COHERENT_ICACHE
Only 601 and E200 have CPU_FTR_COHERENT_ICACHE.

Just use #ifdefs instead of feature fixup.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5f3e92ccd64d06477b27626f6007a9da3b8da157.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:34 +10:00
Christophe Leroy
e0291f1dec powerpc/32: drop CPU_FTR_UNIFIED_ID_CACHE
Only 601 and e200 have unified I/D cache.

Drop the feature and use CONFIG_PPC_BOOK3S_601 and CONFIG_E200.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b5902144266d2f4eed1ffea53915bd0245841e02.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
39097b9c6d powerpc/32s: use CONFIG_PPC_BOOK3S_601 instead of reading PVR
Use CONFIG_PPC_BOOK3S_601 instead of reading PVR to know if
it is a 601 or not.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/909c26db9facd7fe454695b303f952e019dd9eda.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
88fb309409 powerpc/32s: drop CPU_FTR_USE_RTC feature
CPU_FTR_USE_RTC feature only applies to powerpc601.

Drop this feature and replace it with tests on CONFIG_PPC_BOOK3S_601.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/170411e2360861f4a95c21faad43519a08bc4040.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
12c3f1fd87 powerpc/32s: get rid of CPU_FTR_601 feature
Now that 601 is exclusive from other 6xx, CPU_FTR_601 and
associated fixups are useless.

Drop this feature and use #ifdefs instead.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ecdb7194a17dbfa01865df6a82979533adc2c70b.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
f7a0bf7d90 powerpc/32s: add an option to exclusively select powerpc 601
Powerpc 601 is rather old powerpc which as some important
limitations compared to other book3s/32 powerpcs:
- No Timebase.
- Common BATs for instruction and data.
- No execution protection in segment registers.
- No RI bit in MSR
- ...

It is starting to be difficult and cumbersome to maintain
kernels that are compatible both with 601 and other 6xx cores.

Create a compiletime option to exclusively select either powerpc 601
or other 6xx.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d644eaf7dff8cc149260066802af230bdf34fded.1566834712.git.christophe.leroy@c-s.fr
2019-08-28 23:19:33 +10:00
Christophe Leroy
3bbd234373 powerpc/8xx: set STACK_END_MAGIC earlier on the init_stack
Today, the STACK_END_MAGIC is set on init_stack in start_kernel().

To avoid a false 'Thread overran stack, or stack corrupted' message
on early Oopses, setup STACK_END_MAGIC as soon as possible.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/54f67bb7ac486c1350f2fa8905cd279f94b9dfb1.1566382841.git.christophe.leroy@c-s.fr
2019-08-28 11:31:18 +10:00
Christophe Leroy
a045657412 powerpc/8xx: drop unused self-modifying code alternative to FixupDAR.
The code which fixups the DAR on TLB errors for dbcX instructions
has a self-modifying code alternative that has never been used.

Drop it.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b095e12c82fcba1ac4c09fc3b85d969f36614746.1566417610.git.christophe.leroy@c-s.fr
2019-08-28 11:31:18 +10:00
Christophe Leroy
63ce271b5e powerpc/prom: convert PROM_BUG() to standard trap
Prior to commit 1bd98d7fbaf5 ("ppc64: Update BUG handling based on
ppc32"), BUG() family was using BUG_ILLEGAL_INSTRUCTION which
was an invalid instruction opcode to trap into program check
exception.

That commit converted them to using standard trap instructions,
but prom/prom_init and their PROM_BUG() macro were left over.
head_64.S and exception-64s.S were left aside as well.

Convert them to using the standard BUG infrastructure.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/cdaf4bbbb64c288a077845846f04b12683f8875a.1566817807.git.christophe.leroy@c-s.fr
2019-08-28 11:31:18 +10:00
Radim Krčmář
c91ff72142 Merge tag 'kvm-ppc-fixes-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
KVM/PPC fix for 5.3

- Fix bug which could leave locks locked in the host on return
  to a guest.
2019-08-27 16:02:48 +02:00
Christopher M. Riedl
405efc5980 powerpc/spinlocks: Fix oops in __spin_yield() on bare metal
Booting w/ppc64le_defconfig + CONFIG_PREEMPT on bare metal results in
the oops below due to calling into __spin_yield() when not running in
an SPLPAR, which means lppaca pointers are NULL.

We fixed a similar case previously in commit a6201da34f ("powerpc:
Fix oops due to bad access of lppaca on bare metal"), by adding SPLPAR
checks in lppaca_shared_proc(). However when PREEMPT is enabled we can
call __spin_yield() directly from arch_spin_yield().

To fix it add spin_yield() and rw_yield() which check that
shared-processor LPAR is enabled before calling the SPLPAR-only
implementation of each.

  BUG: Kernel NULL pointer dereference at 0x00000100
  Faulting instruction address: 0xc000000000097f88
  Oops: Kernel access of bad area, sig: 7 [#1]
  LE PAGE_SIZE=64K MMU=Radix MMU=Hash PREEMPT SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.2.0-rc6-00491-g249155c20f9b #28
  NIP:  c000000000097f88 LR: c000000000c07a88 CTR: c00000000015ca10
  REGS: c0000000727079f0 TRAP: 0300   Not tainted  (5.2.0-rc6-00491-g249155c20f9b)
  MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 84000424  XER: 20040000
  CFAR: c000000000c07a84 DAR: 0000000000000100 DSISR: 00080000 IRQMASK: 1
  GPR00: c000000000c07a88 c000000072707c80 c000000001546300 c00000007be38a80
  GPR04: c0000000726f0c00 0000000000000002 c00000007279c980 0000000000000100
  GPR08: c000000001581b78 0000000080000001 0000000000000008 c00000007279c9b0
  GPR12: 0000000000000000 c000000001730000 c000000000142558 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: c00000007be38a80 c000000000c002f4 0000000000000000 0000000000000000
  GPR28: c000000072221a00 c0000000726c2600 c00000007be38a80 c00000007be38a80
  NIP [c000000000097f88] __spin_yield+0x48/0xa0
  LR [c000000000c07a88] __raw_spin_lock+0xb8/0xc0
  Call Trace:
  [c000000072707c80] [c000000072221a00] 0xc000000072221a00 (unreliable)
  [c000000072707cb0] [c000000000bffb0c] __schedule+0xbc/0x850
  [c000000072707d70] [c000000000c002f4] schedule+0x54/0x130
  [c000000072707da0] [c0000000001427dc] kthreadd+0x28c/0x2b0
  [c000000072707e20] [c00000000000c1cc] ret_from_kernel_thread+0x5c/0x70
  Instruction dump:
  4d9e0020 552a043e 210a07ff 79080fe0 0b080000 3d020004 3908b878 794a1f24
  e8e80000 7ce7502a e8e70000 38e70100 <7ca03c2c> 70a70001 78a50020 4d820020
  ---[ end trace 474d6b2b8fc5cb7e ]---

Fixes: 499dcd4137 ("powerpc/64s: Allocate LPPACAs individually")
Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
[mpe: Reword change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813031314.1828-4-cmr@informatik.wtf
2019-08-27 21:34:34 +10:00
Paul Mackerras
ff42df49e7 KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9
On POWER9, when userspace reads the value of the DPDES register on a
vCPU, it is possible for 0 to be returned although there is a doorbell
interrupt pending for the vCPU.  This can lead to a doorbell interrupt
being lost across migration.  If the guest kernel uses doorbell
interrupts for IPIs, then it could malfunction because of the lost
interrupt.

This happens because a newly-generated doorbell interrupt is signalled
by setting vcpu->arch.doorbell_request to 1; the DPDES value in
vcpu->arch.vcore->dpdes is not updated, because it can only be updated
when holding the vcpu mutex, in order to avoid races.

To fix this, we OR in vcpu->arch.doorbell_request when reading the
DPDES value.

Cc: stable@vger.kernel.org # v4.13+
Fixes: 579006944e ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2019-08-27 14:08:22 +10:00
Paul Mackerras
d28eafc5a6 KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores
When we are running multiple vcores on the same physical core, they
could be from different VMs and so it is possible that one of the
VMs could have its arch.mmu_ready flag cleared (for example by a
concurrent HPT resize) when we go to run it on a physical core.
We currently check the arch.mmu_ready flag for the primary vcore
but not the flags for the other vcores that will be run alongside
it.  This adds that check, and also a check when we select the
secondary vcores from the preempted vcores list.

Cc: stable@vger.kernel.org # v4.14+
Fixes: 38c53af853 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-27 14:08:10 +10:00
Christopher M. Riedl
31391ff7ea powerpc/spinlocks: Rename SPLPAR-only spinlocks
The __rw_yield and __spin_yield locks only pertain to SPLPAR mode.
Rename them to make this relationship obvious.

Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813031314.1828-3-cmr@informatik.wtf
2019-08-27 13:03:36 +10:00
Christopher M. Riedl
d57b78353a powerpc/spinlocks: Refactor SHARED_PROCESSOR
Determining if a processor is in shared processor mode is not a constant
so don't hide it behind a #define.

Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813031314.1828-2-cmr@informatik.wtf
2019-08-27 13:03:36 +10:00
Christophe Leroy
d7fb5b18a5 powerpc/64: optimise LOAD_REG_IMMEDIATE_SYM()
Optimise LOAD_REG_IMMEDIATE_SYM() using a temporary register to
parallelise operations.

It reduces the path from 5 to 3 instructions.

Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bad41ed02531bb0382420cbab50a0d7153b71767.1566311636.git.christophe.leroy@c-s.fr
2019-08-27 13:03:36 +10:00
Christophe Leroy
ba18025fb0 powerpc/32: replace LOAD_MSR_KERNEL() by LOAD_REG_IMMEDIATE()
LOAD_MSR_KERNEL() and LOAD_REG_IMMEDIATE() are doing the same thing
in the same way. Drop LOAD_MSR_KERNEL()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8f04a6df0bc8949517fd8236d50c15008ccf9231.1566311636.git.christophe.leroy@c-s.fr
2019-08-27 13:03:36 +10:00
Christophe Leroy
c691b4b83b powerpc: rewrite LOAD_REG_IMMEDIATE() as an intelligent macro
Today LOAD_REG_IMMEDIATE() is a basic #define which loads all
parts on a value into a register, including the parts that are NUL.

This means always 2 instructions on PPC32 and always 5 instructions
on PPC64. And those instructions cannot run in parallele as they are
updating the same register.

Ex: LOAD_REG_IMMEDIATE(r1,THREAD_SIZE) in head_64.S results in:

3c 20 00 00     lis     r1,0
60 21 00 00     ori     r1,r1,0
78 21 07 c6     rldicr  r1,r1,32,31
64 21 00 00     oris    r1,r1,0
60 21 40 00     ori     r1,r1,16384

Rewrite LOAD_REG_IMMEDIATE() with GAS macro in order to skip
the parts that are NUL.

Rename existing LOAD_REG_IMMEDIATE() as LOAD_REG_IMMEDIATE_SYM()
and use that one for loading value of symbols which are not known
at compile time.

Now LOAD_REG_IMMEDIATE(r1,THREAD_SIZE) in head_64.S results in:

38 20 40 00     li      r1,16384

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d60ce8dd3a383c7adbfc322bf1d53d81724a6000.1566311636.git.christophe.leroy@c-s.fr
2019-08-27 13:03:36 +10:00
Christophe Leroy
163918fc57 powerpc/mm: split out early ioremap path.
ioremap does things differently depending on whether
SLAB is available or not at different levels.

Try to separate the early path from the beginning.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3acd2dbe04b04f111475e7a59f2b6f2ab9b95ab6.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
4a45b7460c powerpc/mm: refactor ioremap vm area setup.
PPC32 and PPC64 are doing the same once SLAB is available.
Create a do_ioremap() function that calls get_vm_area and
do the mapping.

For PPC64, we add the 4K PFN hack sanity check to __ioremap_caller()
in order to avoid using __ioremap_at(). Other checks in __ioremap_at()
are irrelevant for __ioremap_caller().

On PPC64, VM area is allocated in the range [ioremap_bot ; IOREMAP_END]
On PPC32, VM area is allocated in the range [VMALLOC_START ; VMALLOC_END]

Lets define IOREMAP_START is ioremap_bot for PPC64, and alias
IOREMAP_START/END to VMALLOC_START/END on PPC32

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/42e7e36ad32e0fdf76692426cc642799c9f689b8.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
191e42063a powerpc/mm: refactor ioremap_range() and use ioremap_page_range()
book3s64's ioremap_range() is almost same as fallback ioremap_range(),
except that it calls radix__ioremap_range() when radix is enabled.

radix__ioremap_range() is also very similar to the other ones, expect
that it calls ioremap_page_range when slab is available.

PPC32 __ioremap_caller() have a loop doing the same thing as
ioremap_range() so use it on PPC32 as well.

Lets keep only one version of ioremap_range() which calls
ioremap_page_range() on all platforms when slab is available.

At the same time, drop the nid parameter which is not used.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4b1dca7096b01823b101be7338983578641547f1.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
f381d5711f powerpc/mm: Move ioremap functions out of pgtable_32/64.c
Create ioremap_32.c and ioremap_64.c and move respective ioremap
functions out of pgtable_32.c and pgtable_64.c

In the meantime, fix a few comments and changes a printk() to
pr_warn(). Also fix a few oversplitted lines.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b5c8b02ccefd4ede64c61b53cf64fb5dacb35740.1566309263.git.christophe.leroy@c-s.fr
2019-08-27 13:03:35 +10:00
Christophe Leroy
7cd9b317b6 powerpc/mm: make ioremap_bot common to all
Drop multiple definitions of ioremap_bot and make one common to
all subarches.

Only CONFIG_PPC_BOOK3E_64 had a global static init value for
ioremap_bot. Now ioremap_bot is set in early_init_mmu_global().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/920eebfd9f36f14c79d1755847f5bf7c83703bdd.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
edfe1a5679 powerpc/mm: move ioremap_prot() into ioremap.c
Both ioremap_prot() are idenfical, move them into ioremap.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0b3eb0e0f1490a99fd6c983e166fb8946233f151.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
4634c375db powerpc/mm: move common 32/64 bits ioremap functions into ioremap.c
ioremap(), ioremap_wc() and ioremap_coherent() are now identical on
PPC32 and PPC64 as iowa_is_active() will always return false on
PPC32. Move them into a new common location called ioremap.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6223803ce024d6ab4dfaa919f44098aed5b4bc33.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
14b4d97669 powerpc/mm: rework io-workaround invocation.
ppc_md.ioremap() is only used for I/O workaround on CELL platform,
so indirect function call can be avoided.

This patch reworks the io-workaround and ioremap() functions to
use the global 'io_workaround_inited' flag for the activation
of io-workaround.

When CONFIG_PPC_IO_WORKAROUNDS or CONFIG_PPC_INDIRECT_MMIO are not
selected, the I/O workaround ioremap() voids and the global flag is
not used.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5fa3ef069fbd0f152512afaae19e7a60161454cf.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:34 +10:00
Christophe Leroy
492643e81e powerpc/mm: drop function __ioremap()
__ioremap() is not used anymore, drop it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ccc439f481a0884e00a6be1bab44bab2a4477fea.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Christophe Leroy
8aee077292 powerpc/mm: drop ppc_md.iounmap() and __iounmap()
ppc_md.iounmap() is never set, drop it.

Once ppc_md.iounmap() is gone, iounmap() remains the only user of
__iounmap() and iounmap() does nothing else than calling __iounmap().
So drop iounmap() and make __iounmap() the new iounmap().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d73ba92bb7a387cc58cc34666d7f5158a45851b0.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Christophe Leroy
6f57e6631d powerpc/ps3: replace __ioremap() by ioremap_prot()
__ioremap() is similar to ioremap_prot() except that ioremap_prot()
does a few sanity changes in addition.

The flags used by PS3 are not impacted by those changes so for
PS3 both functions are equivalent.

At the same time, drop parts of the comment that have been invalid
since commit e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO")

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/36bff5d875ff562889c5e12dab63e5d7c5d1fbd8.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Christoph Hellwig
f0f8d7ae39 powerpc: remove the ppc44x ocm.c file
The on chip memory allocator is entirely unused in the kernel tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7b1668941ad1041d08b19167030868de5840b153.1566309262.git.christophe.leroy@c-s.fr
2019-08-27 13:03:33 +10:00
Christophe Leroy
b4645ffc49 powerpc/64: don't select ARCH_HAS_SCALED_CPUTIME on book3E
Book3E doesn't have SPRN_SPURR/SPRN_PURR.

Activating ARCH_HAS_SCALED_CPUTIME is just wasting CPU time.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://github.com/linuxppc/issues/issues/171
Link: https://lore.kernel.org/r/a8b567c569aa521a7cf1beb061d43d79070e850c.1566492229.git.christophe.leroy@c-s.fr
2019-08-27 13:03:32 +10:00
Christopher M. Riedl
d8f0e0b073 powerpc/64s: support nospectre_v2 cmdline option
Add support for disabling the kernel implemented spectre v2 mitigation
(count cache flush on context switch) via the nospectre_v2 and
mitigations=off cmdline options.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190524024647.381-1-cmr@informatik.wtf
2019-08-27 13:03:32 +10:00
Paul Mackerras
2ad7a27dea KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions
There are some POWER9 machines where the OPAL firmware does not support
the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
The impact of this is that a guest using XIVE natively will not be able
to be migrated successfully.  On the source side, the get_attr operation
on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
will fail; on the destination side, the set_attr operation for the same
attribute will fail.

This adds tests for the existence of the OPAL get/set queue state
functions, and if they are not supported, the XIVE-native KVM device
is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
Userspace can then either provide a software emulation of XIVE, or
else tell the guest that it does not have a XIVE controller available
to it.

Cc: stable@vger.kernel.org # v5.2+
Fixes: 3fab2d1058 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-27 11:45:49 +10:00
Alexey Kardashevskiy
ddfd151f3d KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling
H_PUT_TCE_INDIRECT handlers receive a page with up to 512 TCEs from
a guest. Although we verify correctness of TCEs before we do anything
with the existing tables, there is a small window when a check in
kvmppc_tce_validate might pass and right after that the guest alters
the page of TCEs, causing an early exit from the handler and leaving
srcu_read_lock(&vcpu->kvm->srcu) (virtual mode) or lock_rmap(rmap)
(real mode) locked.

This fixes the bug by jumping to the common exit code with an appropriate
unlock.

Cc: stable@vger.kernel.org # v4.11+
Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-27 10:59:30 +10:00
Suraj Jitindar Singh
d22deab696 KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot
The rmap array in the guest memslot is an array of size number of guest
pages, allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example for a hpt guest it is used to store a lock bit,
rc bits, a present bit and the index of a hpt entry in the guest hpt
which maps this page. For a radix guest which is running nested guests
it is used to store a pointer to a linked list of nested rmap entries
which store the nested guest physical address which maps this guest
address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, and the potential
for this to expand to more in the future, define a type field (being the
top 8 bits of the rmap entry) to be used to define the type of the rmap
entry which is currently present and define two values for this field
for the two current uses of the rmap array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set as is expected for a pointer.
Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-23 15:57:24 +10:00
Paul Menzel
ff7240ccf0 KVM: PPC: Book3S: Mark expected switch fall-through
Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
it as an expected fall-through.

    arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
    arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
          pte->may_write = true;
          ~~~~~~~~~~~~~~~^~~~~~
    arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
         case 3:
         ^~~~

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-23 15:57:24 +10:00
Paul Mackerras
75bf465f0b Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in fixes for the XIVE interrupt controller which touch both
generic powerpc and PPC KVM code.  To avoid merge conflicts, these
commits will go upstream via the powerpc tree as well as the KVM tree.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-08-23 14:08:04 +10:00
Ingo Molnar
6c06b66e95 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

 - A few more RCU flavor consolidation cleanups.

 - Miscellaneous fixes.

 - Updates to RCU's list-traversal macros improving lockdep usability.

 - Torture-test updates.

 - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
   incoming callbacks during grace-period waits.

 - Forward-progress improvements for no-CBs CPUs: Use ->cblist
   structure to take advantage of others' grace periods.

 - Also added a small commit that avoids needlessly inflicting
   scheduler-clock ticks on callback-offloaded CPUs.

 - Forward-progress improvements for no-CBs CPUs: Reduce contention
   on ->nocb_lock guarding ->cblist.

 - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
   list to further reduce contention on ->nocb_lock guarding ->cblist.

 - LKMM updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-22 20:52:04 +02:00
Christoph Hellwig
cdfee56232 driver core: initialize a default DMA mask for platform device
We still treat devices without a DMA mask as defaulting to 32-bits for
both mask, but a few releases ago we've started warning about such
cases, as they require special cases to work around this sloppyness.
Add a dma_mask field to struct platform_device so that we can initialize
the dma_mask pointer in struct device and initialize both masks to
32-bits by default, replacing similar functionality in m68k and
powerpc.  The arch_setup_pdev_archdata hooks is now unused and removed.

Note that the code looks a little odd with the various conditionals
because we have to support platform_device structures that are
statically allocated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20190816062435.881-7-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-22 09:41:55 -07:00
Daniel Axtens
7e04a46d84 powerpc/configs: Disable /dev/port in skiroot defconfig
While reviewing lockdown patches, I discovered that we still enable
/dev/port (CONFIG_DEVPORT) in skiroot.

/dev/port is used for old x86 style IO accesses. It's set up in
drivers/char/mem.c, and is only created if arch_has_dev_port() returns
true. Per arch/powerpc/include/asm/io.h, on PPC64 with PCI, this is
only true if there's a legacy ISA bridge.

Even if a system has a legacy ISA bridge installed, we have no
business accessing it in skiroot.

Deselect CONFIG_DEVPORT for skiroot.

Signed-off-by: Daniel Axtens <dja@axtens.net>
[mpe: Incorporate emailed comments into the change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190627053008.29315-1-dja@axtens.net
2019-08-22 23:12:48 +10:00
Sam Bobroff
27d4396ed5 powerpc/eeh: Slightly simplify eeh_add_to_parent_pe()
Simplify some needlessly complicated boolean logic in
eeh_add_to_parent_pe().

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/09259a50308f10aa764695912bc87dc1d1cf654c.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:47 +10:00
Sam Bobroff
cef50c67c1 powerpc/eeh: Remove unused return path from eeh_pe_dev_traverse()
There are no users of the early-out return value from
eeh_pe_dev_traverse(), so remove it.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c648070f5b28fe8ca1880b48e64b267959ffd369.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:47 +10:00
Sam Bobroff
2e25505147 powerpc/eeh: Fix crash when edev->pdev changes
If a PCI device is removed during eeh_pe_report_edev(), between the
calls to device_lock() and device_unlock(), edev->pdev will change and
cause a crash as the wrong mutex is released.

To correct this, hold the PCI rescan/remove lock while taking a copy
of edev->pdev and performing a get_device() on it.  Use this value to
release the mutex, but also pass it through to the device driver's EEH
handlers so that they always see the same device.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3c590579a0faa24d20c826dcd26c739eb4d454e6.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:47 +10:00
Sam Bobroff
1ff8f36fc7 powerpc/eeh: Convert log messages to eeh_edev_* macros
Convert existing messages, where appropriate, to use the eeh_edev_*
logging macros.

The only effect should be minor adjustments to the log messages, apart
from:

- A new message in pseries_eeh_probe() "Probing device" to match the
powernv case.
- The "Probing device" message in pnv_eeh_probe() is now generated
slightly later, which will mean that it is no longer emitted for
devices that aren't probed due to the initial checks.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ce505a0a7a4a5b0367f0f40f8b26e7c0a9cf4cb7.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:47 +10:00
Sam Bobroff
b093f2cbed powerpc/eeh: Introduce EEH edev logging macros
Now that struct eeh_dev includes the BDFN of it's PCI device, make use
of it to replace eeh_edev_info() with a set of dev_dbg()-style macros
that only need a struct edev.

With the BDFN available without the struct pci_dev, eeh_pci_name() is
now unnecessary, so remove it.

While only the "info" level function is used here, the others will be
used in followup work.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f90ae9a53d762be7b0ccbad79e62b5a1b4f4996e.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:46 +10:00
Oliver O'Halloran
7c33a994d3 powerpc/eeh: Add bdfn field to eeh_dev
Preparation for removing pci_dn from the powernv EEH code. The only
thing we really use pci_dn for is to get the bdfn of the device for
config space accesses, so adding that information to eeh_dev reduces
the need to carry around the pci_dn.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[SB: Re-wrapped commit message, fixed whitespace damage.]
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e458eb69a1f591d8a120782f23a8506b15d3c654.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:46 +10:00
Sam Bobroff
c44e4ccada powerpc/eeh: Refactor around eeh_probe_devices()
Now that EEH support for all devices (on PowerNV and pSeries) is
provided by the pcibios bus add device hooks, eeh_probe_devices() and
eeh_addr_cache_build() are redundant and can be removed.

Move the EEH enabled message into it's own function so that it can be
called from multiple places.

Note that previously on pSeries, useless EEH sysfs files were created
for some devices that did not have EEH support and this change
prevents them from being created.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/33b0a6339d5ac88693de092d6fba984f2a5add66.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:46 +10:00
Sam Bobroff
b905f8cdca powerpc/eeh: EEH for pSeries hot plug
On PowerNV and pSeries, devices currently acquire EEH support from
several different places: Boot-time devices from eeh_probe_devices()
and eeh_addr_cache_build(), Virtual Function devices from the pcibios
bus add device hooks and hot plugged devices from pci_hp_add_devices()
(with other platforms using other methods as well).  Unfortunately,
pSeries machines currently discover hot plugged devices using
pci_rescan_bus(), not pci_hp_add_devices(), and so those devices do
not receive EEH support.

Rather than adding another case for pci_rescan_bus(), this change
widens the scope of the pcibios bus add device hooks so that they can
handle all devices. As a side effect this also supports devices
discovered after manually rescanning via /sys/bus/pci/rescan.

Note that on PowerNV, this change allows the EEH subsystem to become
enabled after boot as long as it has not been forced off, which was
not previously possible (it was already possible on pSeries).

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/72ae8ae9c54097158894a52de23690448de38ea9.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:12:40 +10:00
Sam Bobroff
685a0bc00a powerpc/eeh: Initialize EEH address cache earlier
The EEH address cache is currently initialized and populated by a
single function: eeh_addr_cache_build().  While the initial population
of the cache can only be done once resources are allocated,
initialization (just setting up a spinlock) could be done much
earlier.

So move the initialization step into a separate function and call it
from a core_initcall (rather than a subsys initcall).

This will allow future work to make use of the cache during boot time
PCI scanning.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0557206741bffee76cdfff042f65321f6f7a5b41.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:11:48 +10:00
Sam Bobroff
617082a481 powerpc/eeh: Improve debug messages around device addition
Also remove useless comment.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/59db84f4bf94718a12f206bc923ac797d47e4cc1.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:11:48 +10:00
Sam Bobroff
aa06e3d60e powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the
use of driver callbacks in drivers that have been bound part way
through the recovery process. This is necessary to prevent later stage
handlers from being called when the earlier stage handlers haven't,
which can be confusing for drivers.

However, the flag is set for all devices that are added after boot
time and only cleared at the end of the EEH recovery process. This
results in hot plugged devices erroneously having the flag set during
the first recovery after they are added (causing their driver's
handlers to be incorrectly ignored).

To remedy this, clear the flag at the beginning of recovery
processing. The flag is still cleared at the end of recovery
processing, although it is no longer really necessary.

Also clear the flag during eeh_handle_special_event(), for the same
reasons.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:11:48 +10:00
Sam Bobroff
3f068aae7a powerpc/64: Adjust order in pcibios_init()
The pcibios_init() function for PowerPC 64 currently calls
pci_bus_add_devices() before pcibios_resource_survey(). This means
that at boot time, when the pcibios_bus_add_device() hooks are called
by pci_bus_add_devices(), device resources have not been allocated and
they are unable to perform EEH setup, so a separate pass is needed.

This patch adjusts that order so that it will become possible to
consolidate the EEH setup work into a single location.

The only functional change is to execute pcibios_resource_survey()
(excepting ppc_md.pcibios_fixup(), see below) before
pci_bus_add_devices() instead of after it.

Because pcibios_scan_phb() and pci_bus_add_devices() are called
together in a loop, this must be broken into one loop for each call.
Then the call to pcibios_resource_survey() is moved up in between
them. This changes the ordering but because pcibios_resource_survey()
also calls ppc_md.pcibios_fixup(), that call is extracted out into
pcibios_init() to where pcibios_resource_survey() was, so that it is
not moved.

The only other caller of pcibios_resource_survey() is the PowerPC 32
version of pcibios_init(), and therefore, that is modified to call
ppc_md.pcibios_fixup() right after pcibios_resource_survey() so that
there is no functional change there at all.

The re-arrangement will cause very few side-effects because at this
stage in the boot, pci_bus_add_devices() does very little:
- pci_create_sysfs_dev_files() does nothing (no sysfs yet)
- pci_proc_attach_device() does nothing (no proc yet)
- device_attach() does nothing (no drivers yet)
This leaves only the pci_final_fixup calls, D3 support, and marking
the device as added. Of those, only the pci_final_fixup calls have the
potential to be affected by resource allocation.

The only pci_final_fixup handlers that touch resources seem to be one
for x86 (pci_amd_enable_64bit_bar()), and a PowerPC 32 platform driver
(quirk_final_uli1575()), neither of which use this pcibios_init()
function. Even if they did, it would almost certainly be a bug, under
the current ordering, to rely on or make changes to resources before
they were allocated.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4506b0489eabd0921a3587d90bd44c7683f3472d.1565930772.git.sbobroff@linux.ibm.com
2019-08-22 23:11:48 +10:00
Masahiro Yamada
56347074c5 powerpc: remove meaningless KBUILD_ARFLAGS addition
The KBUILD_ARFLAGS addition in arch/powerpc/Makefile has never worked
in a useful way because it is always overridden by the following code
in the top Makefile:

  # use the deterministic mode of AR if available
  KBUILD_ARFLAGS := $(call ar-option,D)

The code in the top Makefile was added in 2011, by commit 40df759e2b
("kbuild: Fix build with binutils <= 2.19").

The KBUILD_ARFLAGS addition for ppc has always been dead code from the
beginning.

Nobody has reported a problem since 43c9127d94 ("powerpc: Add option
to use thin archives"), so this code was unneeded.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190713032106.8509-1-yamada.masahiro@socionext.com
2019-08-22 23:11:48 +10:00
Sean Christopherson
12b58f4ed2 KVM: Assert that struct kvm_vcpu is always as offset zero
KVM implementations that wrap struct kvm_vcpu with a vendor specific
struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
will instead overlap random chunks of the vendor specific struct.
E.g. padding a large number of bytes before struct kvm_vcpu triggers
a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22 10:09:27 +02:00
Masahiro Yamada
2ff2b7ec65 kbuild: add CONFIG_ASM_MODVERSIONS
Add CONFIG_ASM_MODVERSIONS. This allows to remove one if-conditional
nesting in scripts/Makefile.build.

scripts/Makefile.build is run every time Kbuild descends into a
sub-directory. So, I want to avoid $(wildcard ...) evaluation
where possible although computing $(wildcard ...) is so cheap that
it may not make measurable performance difference.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
2019-08-22 01:14:11 +09:00
Santosh Sivaraj
42ac26d253 powerpc: add machine check safe copy_to_user
Use  memcpy_mcsafe() implementation to define copy_to_user_mcsafe()

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-8-santosh@fossix.org
2019-08-21 22:23:49 +10:00
Balbir Singh
4d4a273854 powerpc/memcpy: Add memcpy_mcsafe for pmem
The pmem infrastructure uses memcpy_mcsafe in the pmem layer so as to
convert machine check exceptions into a return value on failure in case
a machine check exception is encountered during the memcpy. The return
value is the number of bytes remaining to be copied.

This patch largely borrows from the copyuser_power7 logic and does not add
the VMX optimizations, largely to keep the patch simple. If needed those
optimizations can be folded in.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[arbab@linux.ibm.com: Added symbol export]
Co-developed-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-7-santosh@fossix.org
2019-08-21 22:23:48 +10:00
Balbir Singh
895e3dceeb powerpc/mce: Handle UE event for memcpy_mcsafe
If we take a UE on one of the instructions with a fixup entry, set nip
to continue execution at the fixup entry. Stop processing the event
further or print it.

Co-developed-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-6-santosh@fossix.org
2019-08-21 22:23:48 +10:00
Reza Arbab
1a1715f516 powerpc/mce: Make machine_check_ue_event() static
The function doesn't get used outside this file, so make it static.

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-4-santosh@fossix.org
2019-08-21 22:23:47 +10:00
Balbir Singh
99ead78afd powerpc/mce: Fix MCE handling for huge pages
The current code would fail on huge pages addresses, since the shift would
be incorrect. Use the correct page shift value returned by
__find_linux_pte() to get the correct physical address. The code is more
generic and can handle both regular and compound pages.

Fixes: ba41e1e1cc ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[arbab@linux.ibm.com: Fixup pseries_do_memory_failure()]
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org
2019-08-21 22:23:47 +10:00
Santosh Sivaraj
b5bda6263c powerpc/mce: Schedule work from irq_work
schedule_work() cannot be called from MCE exception context as MCE can
interrupt even in interrupt disabled context.

Fixes: 733e4a4c44 ("powerpc/mce: hookup memory_failure for UE errors")
Cc: stable@vger.kernel.org # v4.15+
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-2-santosh@fossix.org
2019-08-21 22:23:03 +10:00
Masahiro Yamada
10df063855 kbuild: rebuild modules when module linker scripts are updated
Currently, the timestamp of module linker scripts are not checked.
Add them to the dependency of modules so they are correctly rebuilt.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-08-21 21:05:21 +09:00
Nathan Lynch
ccfb5bd71d powerpc/pseries/mobility: use cond_resched when updating device tree
After a partition migration, pseries_devicetree_update() processes
changes to the device tree communicated from the platform to
Linux. This is a relatively heavyweight operation, with multiple
device tree searches, memory allocations, and conversations with
partition firmware.

There's a few levels of nested loops which are bounded only by
decisions made by the platform, outside of Linux's control, and indeed
we have seen RCU stalls on large systems while executing this call
graph. Use cond_resched() in these loops so that the cpu is yielded
when needed.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802192926.19277-4-nathanl@linux.ibm.com
2019-08-20 21:22:28 +10:00
Nathan Lynch
10e4850d7c powerpc/rtas: allow rescheduling while changing cpu states
rtas_cpu_state_change_mask() potentially operates on scores of cpus,
so explicitly allow rescheduling in the loop body.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802192926.19277-3-nathanl@linux.ibm.com
2019-08-20 21:22:27 +10:00
Nathan Lynch
a6717c01dd powerpc/rtas: use device model APIs and serialization during LPM
The LPAR migration implementation and userspace-initiated cpu hotplug
can interleave their executions like so:

1. Set cpu 7 offline via sysfs.

2. Begin a partition migration, whose implementation requires the OS
   to ensure all present cpus are online; cpu 7 is onlined:

     rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up

   This sets cpu 7 online in all respects except for the cpu's
   corresponding struct device; dev->offline remains true.

3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is
   already online and returns success. The driver core (device_online)
   sets dev->offline = false.

4. The migration completes and restores cpu 7 to offline state:

     rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down

This leaves cpu7 in a state where the driver core considers the cpu
device online, but in all other respects it is offline and
unused. Attempts to online the cpu via sysfs appear to succeed but the
driver core actually does not pass the request to the lower-level
cpuhp support code. This makes the cpu unusable until the cpu device
is manually set offline and then online again via sysfs.

Instead of directly calling cpu_up/cpu_down, the migration code should
use the higher-level device core APIs to maintain consistent state and
serialize operations.

Fixes: 120496ac2d ("powerpc: Bring all threads online prior to migration/hibernation")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802192926.19277-2-nathanl@linux.ibm.com
2019-08-20 21:22:27 +10:00
Christophe Leroy
415480dce2 powerpc/603: Fix handling of the DIRTY flag
If a page is already mapped RW without the DIRTY flag, the DIRTY
flag is never set and a TLB store miss exception is taken forever.

This is easily reproduced with the following app:

void main(void)
{
	volatile char *ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	*ptr = *ptr;
}

When DIRTY flag is not set, bail out of TLB miss handler and take
a minor page fault which will set the DIRTY flag.

Fixes: f8b58c64ea ("powerpc/603: let's handle PAGE_DIRTY directly")
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Doug Crawford <doug.crawford@intelight-its.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/80432f71194d7ee75b2f5043ecf1501cf1cca1f3.1566196646.git.christophe.leroy@c-s.fr
2019-08-20 21:22:21 +10:00
Nicholas Piggin
6bb25170d7 powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
pfn_pte is never given a pte above the addressable physical memory
limit, so the masking is redundant. In case of a software bug, it
is not obviously better to silently truncate the pfn than to corrupt
the pte (either one will result in memory corruption or crashes),
so there is no reason to add this to the fast path.

Add VM_BUG_ON to catch cases where the pfn is invalid. These would
catch the create_section_mapping bug fixed by a previous commit.

  [16885.256466] ------------[ cut here ]------------
  [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612!
  cpu 0x0: Vector: 700 (Program Check) at [c0000000ee0a36d0]
      pc: c000000000080738: __map_kernel_page+0x248/0x6f0
      lr: c000000000080ac0: __map_kernel_page+0x5d0/0x6f0
      sp: c0000000ee0a3960
     msr: 9000000000029033
    current = 0xc0000000ec63b400
    paca    = 0xc0000000017f0000   irqmask: 0x03   irq_happened: 0x01
      pid   = 85, comm = sh
  kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612!
  Linux version 5.3.0-rc1-00001-g0fe93e5f3394
  enter ? for help
  [c0000000ee0a3a00] c000000000d37378 create_physical_mapping+0x260/0x360
  [c0000000ee0a3b10] c000000000d370bc create_section_mapping+0x1c/0x3c
  [c0000000ee0a3b30] c000000000071f54 arch_add_memory+0x74/0x130

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-5-npiggin@gmail.com
2019-08-20 21:22:20 +10:00
Nicholas Piggin
4dd7554a64 powerpc/64: Add VIRTUAL_BUG_ON checks for __va and __pa addresses
Ensure __va is given a physical address below PAGE_OFFSET, and __pa is
given a virtual address above PAGE_OFFSET.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-4-npiggin@gmail.com
2019-08-20 21:22:20 +10:00
Nicholas Piggin
10c4bd7cd2 powerpc/perf: fix imc allocation failure handling
The alloc_pages_node return value should be tested for failure
before being passed to page_address.

Tested-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-3-npiggin@gmail.com
2019-08-20 21:22:20 +10:00
Nicholas Piggin
31f210cf42 powerpc/64s/radix: Fix memory hot-unplug page table split
create_physical_mapping expects physical addresses, but splitting
these mapping on hot unplug is supplying virtual (effective)
addresses.

Fixes: 4dd5f8a99e ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-2-npiggin@gmail.com
2019-08-20 21:22:19 +10:00
Nicholas Piggin
8f51e39294 powerpc/64s/radix: Fix memory hotplug section page table creation
create_physical_mapping expects physical addresses, but creating and
splitting these mappings after boot is supplying virtual (effective)
addresses. This can be irritated by booting with mem= to limit memory
then probing an unused physical memory range:

  echo <addr> > /sys/devices/system/memory/probe

This mostly works by accident, firstly because __va(__va(x)) == __va(x)
so the virtual address does not get corrupted. Secondly because pfn_pte
masks out the upper bits of the pfn beyond the physical address limit,
so a pfn constructed with a 0xc000000000000000 virtual linear address
will be masked back to the correct physical address in the pte.

Fixes: 6cc27341b2 ("powerpc/mm: add radix__create_section_mapping()")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190724084638.24982-1-npiggin@gmail.com
2019-08-20 21:22:18 +10:00
Nicholas Piggin
e354d7dc81 powerpc/64: allow compiler to cache 'current'
current may be cached by the compiler, so remove the volatile asm
restriction. This results in better generated code, as well as being
smaller and fewer dependent loads, it can avoid store-hit-load flushes
like this one that shows up in irq_exit():

    preempt_count_sub(HARDIRQ_OFFSET);
    if (!in_interrupt() && ...)

Which ends up as:

    ((struct thread_info *)current)->preempt_count -= HARDIRQ_OFFSET;
    if (((struct thread_info *)current)->preempt_count ...

Evaluating current twice presently means it has to be loaded twice, and
here gcc happens to pick a different register each time, then
preempt_count is accessed via that base register:

    1058:       ld      r10,2392(r13)     <-- current
    105c:       lwz     r9,0(r10)         <-- preempt_count
    1060:       addis   r9,r9,-1
    1064:       stw     r9,0(r10)         <-- preempt_count
    1068:       ld      r9,2392(r13)      <-- current
    106c:       lwz     r9,0(r9)          <-- preempt_count
    1070:       rlwinm. r9,r9,0,11,23
    1074:       bne     1090 <irq_exit+0x60>

This can frustrate store-hit-load detection heuristics and cause
flushes. Allowing the compiler to cache current in a reigster with this
patch results in the same base register being used for all accesses,
which is more likely to be detected as an alias:

    1058:       ld      r31,2392(r13)
    ...
    1070:       lwz     r9,0(r31)
    1074:       addis   r9,r9,-1
    1078:       stw     r9,0(r31)
    107c:       lwz     r9,0(r31)
    1080:       rlwinm. r9,r9,0,11,23
    1084:       bne     10a0 <irq_exit+0x60>

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190612140317.24490-1-npiggin@gmail.com
2019-08-20 21:22:15 +10:00
Christophe Leroy
7ab0b7cb89 powerpc/32: Add warning on misaligned copy_page() or clear_page()
copy_page() and clear_page() expect page aligned destination, and
use dcbz instruction to clear entire cache lines based on the
assumption that the destination is cache aligned.

As shown during analysis of a bug in BTRFS filesystem, a misaligned
copy_page() can create bugs that are difficult to locate (see Link).

Add an explicit WARNING when copy_page() or clear_page() are called
with misaligned destination.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Link: https://lore.kernel.org/r/c6cea38f90480268d439ca44a645647e260fff09.1565941808.git.christophe.leroy@c-s.fr
2019-08-20 21:22:15 +10:00
Christophe Leroy
f204338f8e powerpc/mm: ppc 603 doesn't need update_mmu_cache()
On powerpc 603, there is no hash table so get out of
update_mmu_cache() early.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6133e0f115d955fac4061536dab0fa7480a1c433.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
f49f4e2b68 powerpc/mm: Simplify update_mmu_cache() on BOOK3S32
On BOOK3S32, hash_preload() neither use is_exec nor trap,
so drop those parameters and simplify update_mmu_cached().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/35f143c6fe29f9fd25c7f3cd4448ae401029ce3c.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
e5a1edb9fe powerpc/mm: move update_mmu_cache() into book3s hash utils.
update_mmu_cache() is only for BOOK3S, and can be simplified for
BOOK3S32.

Move it out of mem.c into respective BOOK3S32 and BOOK3S64 files
containing hash utils.

BOOK3S64 version of hash_preload() is only used locally, declare it
static.

Remove the radix_enabled() stuff in BOOK3S32 version.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/107aaf43583a5f5d09e0d4e84c4c4390ecfcd512.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
4c1616ef03 powerpc/mm: move FSL_BOOK3 version of update_mmu_cache()
Move FSL_BOOK3E version of update_mmu_cache() at the same
place as book3e_hugetlb_preload() as update_mmu_cache() is
the only user of book3e_hugetlb_preload().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4d69fdc86df9c74adc71a60331a86f6afb8b5e9e.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
d964211791 powerpc/mm: define empty update_mmu_cache() as static inline
Only BOOK3S and FSL_BOOK3E have a usefull update_mmu_cache().

For the others, just define it static inline.

In the meantime, simplify the FSL_BOOK3E related ifdef as
book3e_hugetlb_preload() only exists when CONFIG_PPC_FSL_BOOK3E
is selected.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/668aba4db6b9af6d8a151174e11a4289f1a6bbcd.1565933217.git.christophe.leroy@c-s.fr
2019-08-20 21:22:14 +10:00
Christophe Leroy
ad628a34ec powerpc/mm: don't display empty early ioremap area
On the 8xx, the layout displayed at boot is:

[    0.000000] Memory: 121856K/131072K available (5728K kernel code, 592K rwdata, 1248K rodata, 560K init, 448K bss, 9216K reserved, 0K cma-reserved)
[    0.000000] Kernel virtual memory layout:
[    0.000000]   * 0xffefc000..0xffffc000  : fixmap
[    0.000000]   * 0xffefc000..0xffefc000  : early ioremap
[    0.000000]   * 0xc9000000..0xffefc000  : vmalloc & ioremap
[    0.000000] SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=1, Nodes=1

Remove display of an empty early ioremap.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f6267226038cb25a839b567319e240576e3f8565.1565793287.git.christophe.leroy@c-s.fr
2019-08-20 21:22:13 +10:00
Christophe Leroy
9d6d712fbf powerpc/32s: Fix boot failure with DEBUG_PAGEALLOC without KASAN.
When KASAN is selected, the definitive hash table has to be
set up later, but there is already an early temporary one.

When KASAN is not selected, there is no early hash table,
so the setup of the definitive hash table cannot be delayed.

Fixes: 72f208c6a8 ("powerpc/32s: move hash code patching out of MMU_init_hw()")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Jonathan Neuschafer <j.neuschaefer@gmx.net>
Tested-by: Jonathan Neuschafer <j.neuschaefer@gmx.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b7860c5e1e784d6b96ba67edf47dd6cbc2e78ab6.1565776892.git.christophe.leroy@c-s.fr
2019-08-20 21:22:13 +10:00
Christophe Leroy
38a0d0cdb4 powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
We see warnings such as:
  kernel/futex.c: In function 'do_futex':
  kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized]
     return oldval == cmparg;
                   ^
  kernel/futex.c:1651:6: note: 'oldval' was declared here
    int oldval, ret;
        ^

This is because arch_futex_atomic_op_inuser() only sets *oval if ret
is 0 and GCC doesn't see that it will only use it when ret is 0.

Anyway, the non-zero ret path is an error path that won't suffer from
setting *oval, and as *oval is a local var in futex_atomic_op_inuser()
it will have no impact.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: reword change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.leroy@c-s.fr
2019-08-20 21:22:12 +10:00
Christophe Leroy
663c0c9496 powerpc/kasan: Fix shadow area set up for modules.
When loading modules, from time to time an Oops is encountered during
the init of shadow area for globals. This is due to the last page not
always being mapped depending on the exact distance between the start
and the end of the shadow area and the alignment with the page
addresses.

Fix this by aligning the starting address with the page address.

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4f887e9b77d0d725cbb52035c7ece485c1c5fc14.1565361881.git.christophe.leroy@c-s.fr
2019-08-20 21:22:11 +10:00
Christophe Leroy
45ff3c5595 powerpc/kasan: Fix parallel loading of modules.
Parallel loading of modules may lead to bad setup of shadow page table
entries.

First, lets align modules so that two modules never share the same
shadow page.

Second, ensure that two modules cannot allocate two page tables for
the same PMD entry at the same time. This is done by using
init_mm.page_table_lock in the same way as __pte_alloc_kernel()

Fixes: 2edb16efc8 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c97284f912128cbc3f2fe09d68e90e65fb3e6026.1565361876.git.christophe.leroy@c-s.fr
2019-08-20 21:22:11 +10:00
Christophe Leroy
658d029df0 powerpc/hw_breakpoint: move instruction stepping out of hw_breakpoint_handler()
On 8xx, breakpoints stop after executing the instruction, so
stepping/emulation is not needed. Move it into a sub-function and
remove the #ifdefs.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f8cdc3f1c66ad3c43ebc568abcc6c39ed4676284.1561737231.git.christophe.leroy@c-s.fr
2019-08-20 21:22:10 +10:00
Christophe Leroy
65e701b2d2 powerpc/ptdump: drop non vital #ifdefs
hashpagetable.c is only compiled when CONFIG_PPC_BOOK3S_64 is
defined, so drop the test and its 'else' branch.

Use IS_ENABLED(CONFIG_PPC_PSERIES) instead of #ifdef, this allows the
code to be checked at any build. It is still optimised out by GCC.

Use IS_ENABLED(CONFIG_PPC_64K_PAGES) instead of #ifdef.

Use IS_ENABLED(CONFIG_SPARSEMEN_VMEMMAP) instead of #ifdef.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c8998ed32e4e3954b56a8dacecfe43319a2a0483.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
f3a2ac0589 powerpc/ptdump: get out of note_prot_wx() when CONFIG_PPC_DEBUG_WX is not selected.
When CONFIG_PPC_DEBUG_WX, note_prot_wx() is useless.

Get out of it early and inconditionnally in that case,
so that GCC can kick all the code out.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ff6c8f631bd4ce3a10e0cc241eb569816187bc20.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
8224235272 powerpc/ptdump: drop dummy KERN_VIRT_START on PPC32
PPC32 doesn't have KERN_VIRT_START. Make PAGE_OFFSET the
default starting address for the dump, and drop the dummy
definition of KERN_VIRT_START. Only use KERN_VIRT_START for
non radix PPC64.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/334632b1df4775b0ccf3bdc8d6b201d14e3daedd.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:09 +10:00
Christophe Leroy
e033829d2a powerpc/ptdump: fix walk_pagetables() address mismatch
walk_pagetables() always walk the entire pgdir from address 0
but considers PAGE_OFFSET or KERN_VIRT_START as the starting
address of the walk, resulting in a possible mismatch in the
displayed addresses.

Ex: on PPC32, when KERN_VIRT_START was locally defined as
PAGE_OFFSET, ptdump displayed 0x80000000
instead of 0xc0000000 for the first kernel page,
because 0xc0000000 + 0xc0000000 = 0x80000000

Start the walk at st->start_address instead of starting at 0.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5aa2ac513295f594cce8ddb1c649f61947bd063d.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:08 +10:00
Christophe Leroy
7c7a532ba3 powerpc/ptdump: Fix addresses display on PPC32
Commit 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
wrongly changed KERN_VIRT_START from 0 to PAGE_OFFSET, leading to a
shift in the displayed addresses.

Lets revert that change to resync walk_pagetables()'s addr val and
pgd_t pointer for PPC32.

Fixes: 453d87f6a8 ("powerpc/mm: Warn if W+X pages found on boot")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eb4d626514e22f85814830012642329018ef6af9.1565786091.git.christophe.leroy@c-s.fr
2019-08-20 21:22:08 +10:00
Michael Ellerman
117acf5c29 powerpc/Makefile: Always pass --synthetic to nm if supported
Back in 2004 we added logic to arch/ppc64/Makefile to pass
the --synthetic option to nm, if it was supported by nm.

Then in 2005 when arch/ppc64 and arch/ppc were merged, the logic to
add --synthetic was moved inside an #ifdef CONFIG_PPC64 block within
arch/powerpc/Makefile, and has remained there since.

That was fine, though crufty, until recently when a change to
init/Kconfig added a config time check that uses $(NM). On powerpc
that leads to an infinite loop because Kconfig uses $(NM) to calculate
some values, then the powerpc Makefile changes $(NM), which Kconfig
notices and restarts.

The original commit that added --synthetic simply said:
  On new toolchains we need to use nm --synthetic or we miss code
  symbols.

And the nm man page says that the --synthetic option causes nm to:
  Include synthetic symbols in the output. These are special symbols
  created by the linker for various purposes.

So it seems safe to always pass --synthetic if nm supports it, ie. on
32-bit and 64-bit, it just means 32-bit kernels might have more
symbols reported (and in practice I see no extra symbols). Making it
unconditional avoids the #ifdef CONFIG_PPC64, which in turn avoids the
infinite loop.

Debugged-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-20 10:01:18 +01:00
Cédric Le Goater
39f14e79b1 powerpc/xmon: Add a dump of all XIVE interrupts
Modify the xmon 'dxi' command to query all interrupts if no IRQ number
is specified.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814154754.23682-4-clg@kaod.org
2019-08-19 13:20:24 +10:00
Cédric Le Goater
b4868ff55d powerpc/xive: Fix dump of XIVE interrupt under pseries
The xmon 'dxi' command calls OPAL to query the XIVE configuration of a
interrupt. This can only be done on baremetal (PowerNV) and it will
crash a pseries machine.

Introduce a new XIVE get_irq_config() operation which implements a
different query depending on the platform, PowerNV or pseries, and
modify xmon to use a top level wrapper.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814154754.23682-3-clg@kaod.org
2019-08-19 13:20:24 +10:00
Cédric Le Goater
c3e0dbd7f7 powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in
the OPAL logs and also outputs some of the fields of the internal XIVE
structures in Linux. The OPAL calls can only be done on baremetal
(PowerNV) and they crash a pseries machine. Fix by checking the
hypervisor feature of the CPU.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814154754.23682-2-clg@kaod.org
2019-08-19 13:20:24 +10:00
Alexey Kardashevskiy
201ed7f327 powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages
At the moment we create a small window only for 32bit devices, the window
maps 0..2GB of the PCI space only. For other devices we either use
a sketchy bypass or hardware bypass but the former can only work if
the amount of RAM is no bigger than the device's DMA mask and the latter
requires devices to support at least 59bit DMA.

This extends the default DMA window to the maximum size possible to allow
a wider DMA mask than just 32bit. The default window size is now limited
by the the iommu_table::it_map allocation bitmap which is a contiguous
array, 1 bit per an IOMMU page.

This increases the default IOMMU page size from hard coded 4K to
the system page size to allow wider DMA masks.

This increases the level number to not exceed the max order allocation
limit per TCE level. By the same time, this keeps minimal levels number
as 2 in order to save memory.

As the extended window now overlaps the 32bit MMIO region, this adds
an area reservation to iommu_init_table().

After this change the default window size is 0x80000000000==1<<43 so
devices limited to DMA mask smaller than the amount of system RAM can
still use more than just 2GB of memory for DMA.

This is an optimization and not a bug fix for DMA API usage.

With the on-demand allocation of indirect TCE table levels enabled and
2 levels, the first TCE level size is just
1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190718051139.74787-5-aik@ozlabs.ru
2019-08-19 13:20:23 +10:00
Alexey Kardashevskiy
c37c792dec powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
We allocate only the first level of multilevel TCE tables for KVM
already (alloc_userspace_copy==true), and the rest is allocated on demand.
This is not enabled though for bare metal.

This removes the KVM limitation (implicit, via the alloc_userspace_copy
parameter) and always allocates just the first level. The on-demand
allocation of missing levels is already implemented.

As from now on DMA map might happen with disabled interrupts, this
allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1].
In practice just a single page is allocated there so chances for failure
are quite low.

To save time when creating a new clean table, this skips non-allocated
indirect TCE entries in pnv_tce_free just like we already do in
the VFIO IOMMU TCE driver.

This changes the default level number from 1 to 2 to reduce the amount
of memory required for the default 32bit DMA window at the boot time.
The default window size is up to 2GB which requires 4MB of TCEs which is
unlikely to be used entirely or at all as most devices these days are
64bit capable so by switching to 2 levels by default we save 4032KB of
RAM per a device.

While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace
can trigger this path via VFIO, see the failure and try creating a table
again with different parameters which might succeed.

[1]:
===
BUG: sleeping function called from invalid context at mm/page_alloc.c:4596
in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1
2 locks held by scsi_eh_1/1038:
 #0: 000000005efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80
 #1: 0000000006cf56a6 (&(&host->lock)->rlock){....}, at: ata_exec_internal_sg+0xb0/0x5c0
irq event stamp: 500
hardirqs last  enabled at (499): [<c000000000cb8a74>] _raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (500): [<c000000000cb85c4>] _raw_spin_lock_irqsave+0x44/0x120
softirqs last  enabled at (0): [<c000000000101120>] copy_process.isra.4.part.5+0x640/0x1a80
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634
Call Trace:
[c000003d064cef50] [c000000000c8e6c4] dump_stack+0xe8/0x164 (unreliable)
[c000003d064cefa0] [c00000000014ed78] ___might_sleep+0x2f8/0x310
[c000003d064cf020] [c0000000003ca084] __alloc_pages_nodemask+0x2a4/0x1560
[c000003d064cf220] [c0000000000c2530] pnv_alloc_tce_level.isra.0+0x90/0x130
[c000003d064cf290] [c0000000000c2888] pnv_tce+0x128/0x3b0
[c000003d064cf360] [c0000000000c2c00] pnv_tce_build+0xb0/0xf0
[c000003d064cf3c0] [c0000000000bbd9c] pnv_ioda2_tce_build+0x3c/0xb0
[c000003d064cf400] [c00000000004cfe0] ppc_iommu_map_sg+0x210/0x550
[c000003d064cf510] [c00000000004b7a4] dma_iommu_map_sg+0x74/0xb0
[c000003d064cf530] [c000000000863944] ata_qc_issue+0x134/0x470
[c000003d064cf5b0] [c000000000863ec4] ata_exec_internal_sg+0x244/0x5c0
[c000003d064cf700] [c0000000008642d0] ata_exec_internal+0x90/0xe0
[c000003d064cf780] [c0000000008650ac] ata_dev_read_id+0x2ec/0x640
[c000003d064cf8d0] [c000000000878e28] ata_eh_recover+0x948/0x16d0
[c000003d064cfa10] [c00000000087d760] sata_pmp_error_handler+0x480/0xbf0
[c000003d064cfbc0] [c000000000884624] ahci_error_handler+0x74/0xe0
[c000003d064cfbf0] [c000000000879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0
[c000003d064cfca0] [c00000000087a544] ata_scsi_error+0xb4/0x100
[c000003d064cfd00] [c000000000802450] scsi_error_handler+0x120/0x510
[c000003d064cfdb0] [c000000000140c48] kthread+0x1b8/0x1c0
[c000003d064cfe20] [c00000000000bd8c] ret_from_kernel_thread+0x5c/0x70
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
irq event stamp: 2305

========================================================
hardirqs last  enabled at (2305): [<c00000000000e4c8>] fast_exc_return_irq+0x28/0x34
hardirqs last disabled at (2303): [<c000000000cb9fd0>] __do_softirq+0x4a0/0x654
WARNING: possible irq lock inversion dependency detected
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G        W
softirqs last  enabled at (2304): [<c000000000cba054>] __do_softirq+0x524/0x654
softirqs last disabled at (2297): [<c00000000010f278>] irq_exit+0x128/0x180
--------------------------------------------------------
swapper/0/0 just changed the state of lock:
0000000006cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120
but this lock took another, HARDIRQ-unsafe lock in the past:
 (fs_reclaim){+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               local_irq_disable();
                               lock(&(&host->lock)->rlock);
                               lock(fs_reclaim);
  <Interrupt>
    lock(&(&host->lock)->rlock);

 *** DEADLOCK ***

no locks held by swapper/0/0.

the shortest dependencies between 2nd lock and 1st lock:
 -> (fs_reclaim){+.+.} ops: 167579 {
    HARDIRQ-ON-W at:
                      lock_acquire+0xf8/0x2a0
                      fs_reclaim_acquire.part.23+0x44/0x60
                      kmem_cache_alloc_node_trace+0x80/0x590
                      alloc_desc+0x64/0x270
                      __irq_alloc_descs+0x2e4/0x3a0
                      irq_domain_alloc_descs+0xb0/0x150
                      irq_create_mapping+0x168/0x2c0
                      xics_smp_probe+0x2c/0x98
                      pnv_smp_probe+0x40/0x9c
                      smp_prepare_cpus+0x524/0x6c4
                      kernel_init_freeable+0x1b4/0x650
                      kernel_init+0x2c/0x148
                      ret_from_kernel_thread+0x5c/0x70
    SOFTIRQ-ON-W at:
                      lock_acquire+0xf8/0x2a0
                      fs_reclaim_acquire.part.23+0x44/0x60
                      kmem_cache_alloc_node_trace+0x80/0x590
                      alloc_desc+0x64/0x270
                      __irq_alloc_descs+0x2e4/0x3a0
                      irq_domain_alloc_descs+0xb0/0x150
                      irq_create_mapping+0x168/0x2c0
                      xics_smp_probe+0x2c/0x98
                      pnv_smp_probe+0x40/0x9c
                      smp_prepare_cpus+0x524/0x6c4
                      kernel_init_freeable+0x1b4/0x650
                      kernel_init+0x2c/0x148
                      ret_from_kernel_thread+0x5c/0x70
    INITIAL USE at:
                     lock_acquire+0xf8/0x2a0
                     fs_reclaim_acquire.part.23+0x44/0x60
                     kmem_cache_alloc_node_trace+0x80/0x590
                     alloc_desc+0x64/0x270
                     __irq_alloc_descs+0x2e4/0x3a0
                     irq_domain_alloc_descs+0xb0/0x150
                     irq_create_mapping+0x168/0x2c0
                     xics_smp_probe+0x2c/0x98
                     pnv_smp_probe+0x40/0x9c
                     smp_prepare_cpus+0x524/0x6c4
                     kernel_init_freeable+0x1b4/0x650
                     kernel_init+0x2c/0x148
                     ret_from_kernel_thread+0x5c/0x70
  }
===

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190718051139.74787-4-aik@ozlabs.ru
2019-08-19 13:20:23 +10:00
Alexey Kardashevskiy
4f7e0babbc powerpc/iommu: Allow bypass-only for DMA
POWER8 and newer support a bypass mode which maps all host memory to
PCI buses so an IOMMU table is not always required. However if we fail to
create such a table, the DMA setup fails and the kernel does not boot.

This skips the 32bit DMA setup check if the bypass is selected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190718051139.74787-3-aik@ozlabs.ru
2019-08-19 13:20:23 +10:00
Alexey Kardashevskiy
56090a3902 powerpc/powernv/ioda: Fix race in TCE level allocation
pnv_tce() returns a pointer to a TCE entry and originally a TCE table
would be pre-allocated. For the default case of 2GB window the table
needs only a single level and that is fine. However if more levels are
requested, it is possible to get a race when 2 threads want a pointer
to a TCE entry from the same page of TCEs.

This adds cmpxchg to handle the race. Note that once TCE is non-zero,
it cannot become zero again.

Fixes: a68bd1267b ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
CC: stable@vger.kernel.org # v4.19+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190718051139.74787-2-aik@ozlabs.ru
2019-08-19 13:20:23 +10:00
Gautham R. Shenoy
c784be435d powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()
The calls to arch_add_memory()/arch_remove_memory() are always made
with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
On pSeries, arch_add_memory()/arch_remove_memory() eventually call
resize_hpt() which in turn calls stop_machine() which acquires the
read-side cpu_hotplug_lock again, thereby resulting in the recursive
acquisition of this lock.

In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
lockup during a memory hotplug operation because cpus_read_lock() is a
per-cpu rwsem read, which, in the fast-path (in the absence of the
writer, which in our case is a CPU-hotplug operation) simply
increments the read_count on the semaphore. Thus a recursive read in
the fast-path doesn't cause any problems.

However, we can hit this problem in practice if there is a concurrent
CPU-Hotplug operation in progress which is waiting to acquire the
write-side of the lock. This will cause the second recursive read to
block until the writer finishes. While the writer is blocked since the
first read holds the lock. Thus both the reader as well as the writers
fail to make any progress thereby blocking both CPU-Hotplug as well as
Memory Hotplug operations.

Memory-Hotplug				CPU-Hotplug
CPU 0					CPU 1
------                                  ------

1. down_read(cpu_hotplug_lock.rw_sem)
   [memory_hotplug_begin]
					2. down_write(cpu_hotplug_lock.rw_sem)
					[cpu_up/cpu_down]
3. down_read(cpu_hotplug_lock.rw_sem)
   [stop_machine()]

Lockdep complains as follows in these code-paths.

 swapper/0/1 is trying to acquire lock:
 (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60

but task is already holding lock:
(____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(cpu_hotplug_lock.rw_sem);
   lock(cpu_hotplug_lock.rw_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by swapper/0/1:
  #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
  #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
  #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0

stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
 Call Trace:
   dump_stack+0xe8/0x164 (unreliable)
   __lock_acquire+0x1110/0x1c70
   lock_acquire+0x240/0x290
   cpus_read_lock+0x64/0xf0
   stop_machine+0x2c/0x60
   pseries_lpar_resize_hpt+0x19c/0x2c0
   resize_hpt_for_hotplug+0x70/0xd0
   arch_add_memory+0x58/0xfc
   devm_memremap_pages+0x5e8/0x8f0
   pmem_attach_disk+0x764/0x830
   nvdimm_bus_probe+0x118/0x240
   really_probe+0x230/0x4b0
   driver_probe_device+0x16c/0x1e0
   __driver_attach+0x148/0x1b0
   bus_for_each_dev+0x90/0x130
   driver_attach+0x34/0x50
   bus_add_driver+0x1a8/0x360
   driver_register+0x108/0x170
   __nd_driver_register+0xd0/0xf0
   nd_pmem_driver_init+0x34/0x48
   do_one_initcall+0x1e0/0x45c
   kernel_init_freeable+0x540/0x64c
   kernel_init+0x2c/0x160
   ret_from_kernel_thread+0x5c/0x68

Fix this issue by
  1) Requiring all the calls to pseries_lpar_resize_hpt() be made
     with cpu_hotplug_lock held.

  2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
     as a consequence of 1)

  3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
     with cpu_hotplug_lock held.

Fixes: dbcf929c00 ("powerpc/pseries: Add support for hash table resizing")
Cc: stable@vger.kernel.org # v4.11+
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
2019-08-19 13:20:23 +10:00
Michael Ellerman
1a47908e0f Merge branch 'topic/ppc-kvm' into next
Merge our ppc-kvm topic branch. This contains several fixes for the XIVE
interrupt controller that we are sharing with the KVM tree.
2019-08-19 13:19:43 +10:00
Michael Ellerman
4215fa2d7d Merge branch 'fixes' into next
Merge in our fixes branch, which brings in clone3() as well as some
implicit fallthrough fixes we want in next.
2019-08-19 12:43:13 +10:00
Paul Mackerras
da15c03b04 powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race
Testing has revealed the existence of a race condition where a XIVE
interrupt being shut down can be in one of the XIVE interrupt queues
(of which there are up to 8 per CPU, one for each priority) at the
point where free_irq() is called.  If this happens, can return an
interrupt number which has been shut down.  This can lead to various
symptoms:

- irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
  function gets called, resulting in the CPU's elevated interrupt
  priority (numerically lowered CPPR) never gets reset.  That then
  means that the CPU stops processing interrupts, causing device
  timeouts and other errors in various device drivers.

- The irq descriptor or related data structures can be in the process
  of being freed as the interrupt code is using them.  This typically
  leads to crashes due to bad pointer dereferences.

This race is basically what commit 62e0468650 ("genirq: Add optional
hardware synchronization for shutdown", 2019-06-28) is intended to
fix, given a get_irqchip_state() method for the interrupt controller
being used.  It works by polling the interrupt controller when an
interrupt is being freed until the controller says it is not pending.

With XIVE, the PQ bits of the interrupt source indicate the state of
the interrupt source, and in particular the P bit goes from 0 to 1 at
the point where the hardware writes an entry into the interrupt queue
that this interrupt is directed towards.  Normally, the code will then
process the interrupt and do an end-of-interrupt (EOI) operation which
will reset PQ to 00 (assuming another interrupt hasn't been generated
in the meantime).  However, there are situations where the code resets
P even though a queue entry exists (for example, by setting PQ to 01,
which disables the interrupt source), and also situations where the
code leaves P at 1 after removing the queue entry (for example, this
is done for escalation interrupts so they cannot fire again until
they are explicitly re-enabled).

The code already has a 'saved_p' flag for the interrupt source which
indicates that a queue entry exists, although it isn't maintained
consistently.  This patch adds a 'stale_p' flag to indicate that
P has been left at 1 after processing a queue entry, and adds code
to set and clear saved_p and stale_p as necessary to maintain a
consistent indication of whether a queue entry may or may not exist.

With this, we can implement xive_get_irqchip_state() by looking at
stale_p, saved_p and the ESB PQ bits for the interrupt.

There is some additional code to handle escalation interrupts
properly; because they are enabled and disabled in KVM assembly code,
which does not have access to the xive_irq_data struct for the
escalation interrupt.  Hence, stale_p may be incorrect when the
escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
with some careful attention to barriers in order to ensure the correct
result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().

Finally, this adds code to make noise on the console (pr_crit and
WARN_ON(1)) if we find an interrupt queue entry for an interrupt
which does not have a descriptor.  While this won't catch the race
reliably, if it does get triggered it will be an indication that
the race is occurring and needs to be debugged.

Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
2019-08-16 14:16:59 +10:00
Paul Mackerras
8d4ba9c931 KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device
At present, when running a guest on POWER9 using HV KVM but not using
an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
is run with the kernel_irqchip=off option, the guest entry code goes
ahead and tries to load the guest context into the XIVE hardware, even
though no context has been set up.

To fix this, we check that the "CAM word" is non-zero before pushing
it to the hardware.  The CAM word is initialized to a non-zero value
in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
2019-08-16 14:16:08 +10:00
Paul Mackerras
959c5d5134 KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts
Escalation interrupts are interrupts sent to the host by the XIVE
hardware when it has an interrupt to deliver to a guest VCPU but that
VCPU is not running anywhere in the system.  Hence we disable the
escalation interrupt for the VCPU being run when we enter the guest
and re-enable it when the guest does an H_CEDE hypercall indicating
it is idle.

It is possible that an escalation interrupt gets generated just as we
are entering the guest.  In that case the escalation interrupt may be
using a queue entry in one of the interrupt queues, and that queue
entry may not have been processed when the guest exits with an H_CEDE.
The existing entry code detects this situation and does not clear the
vcpu->arch.xive_esc_on flag as an indication that there is a pending
queue entry (if the queue entry gets processed, xive_esc_irq() will
clear the flag).  There is a comment in the code saying that if the
flag is still set on H_CEDE, we have to abort the cede rather than
re-enabling the escalation interrupt, lest we end up with two
occurrences of the escalation interrupt in the interrupt queue.

However, the exit code doesn't do that; it aborts the cede in the sense
that vcpu->arch.ceded gets cleared, but it still enables the escalation
interrupt by setting the source's PQ bits to 00.  Instead we need to
set the PQ bits to 10, indicating that an interrupt has been triggered.
We also need to avoid setting vcpu->arch.xive_esc_on in this case
(i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
xive_esc_irq() will run at some point and clear it, and if we race with
that we may end up with an incorrect result (i.e. xive_esc_on set when
the escalation interrupt has just been handled).

It is extremely unlikely that having two queue entries would cause
observable problems; theoretically it could cause queue overflow, but
the CPU would have to have thousands of interrupts targetted to it for
that to be possible.  However, this fix will also make it possible to
determine accurately whether there is an unhandled escalation
interrupt in the queue, which will be needed by the following patch.

Fixes: 9b9b13a6d1 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
2019-08-16 14:16:04 +10:00
Cédric Le Goater
237aed48c6 KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP
When a vCPU is brought done, the XIVE VP (Virtual Processor) is first
disabled and then the event notification queues are freed. When freeing
the queues, we check for possible escalation interrupts and free them
also.

But when a XIVE VP is disabled, the underlying XIVE ENDs also are
disabled in OPAL. When an END (Event Notification Descriptor) is
disabled, its ESB pages (ESn and ESe) are disabled and loads return all
1s. Which means that any access on the ESB page of the escalation
interrupt will return invalid values.

When an interrupt is freed, the shutdown handler computes a 'saved_p'
field from the value returned by a load in xive_do_source_set_mask().
This value is incorrect for escalation interrupts for the reason
described above.

This has no impact on Linux/KVM today because we don't make use of it
but we will introduce in future changes a xive_get_irqchip_state()
handler. This handler will use the 'saved_p' field to return the state
of an interrupt and 'saved_p' being incorrect, softlockup will occur.

Fix the vCPU cleanup sequence by first freeing the escalation interrupts
if any, then disable the XIVE VP and last free the queues.

Fixes: 90c73795af ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
2019-08-16 14:03:27 +10:00
Nicholas Piggin
2b87a2553a powerpc/64s: Make boot look nice(r)
Radix boot looks like this:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 hash-mmu: ppc64_pft_size    = 0x0
 hash-mmu: kernel vmalloc start   = 0xc008000000000000
 hash-mmu: kernel IO start        = 0xc00a000000000000
 hash-mmu: kernel vmemmap start   = 0xc00c000000000000
 -----------------------------------------------------

Fix:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 vmalloc start     = 0xc008000000000000
 IO start          = 0xc00a000000000000
 vmemmap start     = 0xc00c000000000000
 -----------------------------------------------------

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190516020437.11783-1-npiggin@gmail.com
2019-08-15 22:52:55 +10:00
Christophe JAILLET
fd3806562f powerpc/xive: Add a check for memory allocation failure
The result of this kzalloc is not checked. Add a check and corresponding
error handling code.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/cc53462734dfeaf15b6bad0e626b483de18656b4.1564647619.git.christophe.jaillet@wanadoo.fr
2019-08-15 22:52:55 +10:00
Christophe JAILLET
b214a8f2ea powerpc/xive: Use GFP_KERNEL instead of GFP_ATOMIC in 'xive_irq_bitmap_add()'
There is no need to use GFP_ATOMIC here. GFP_KERNEL should be enough.
GFP_KERNEL is also already used for another allocation just a few lines
below.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/85d5d247ce753befd6aa63c473f7823de6520ccd.1564647619.git.christophe.jaillet@wanadoo.fr
2019-08-15 22:52:55 +10:00
Linus Torvalds
e83b009c5c dma-mapping fixes for 5.3-rc
- fix the handling of the bus_dma_mask in dma_get_required_mask, which
    caused a regression in this merge window (Lucas Stach)
  - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)
  - fix dma_mmap_coherent to not cause page attribute mismatches on
    coherent architectures like x86 (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1UFhILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOjexAAjPKLo4WGBGO1nd0btwXcI9A7jQTQlXrokmorDVzx
 5++GmTUBeEgvUJath5D3qpQTRZXo9Wb9oGMdS5U6bWJB+SbWtErM304t905TJoDM
 Cs7xcB1ZQeG/5OrQ+qGPgQCo6WO1dOl9FpaIptjNm4dn+OYhyO/YA+dgrJDwgkiA
 140RYUWa+Zhq3df4YqP4M4EnezLN1c4uE80wUxVQKDcq59sxCJek0QT0pUAMbdmQ
 /cUd2XSU113o1llmIRUh0Oj6VSEhWKHb+bdb8JfGndLzxvDcXZKl60tikWe6xpy2
 Ue0kkHRk6OPVRIxWkRjt8D+mlrCyNqN6HWx6eBmVnRKHxZ4ia2hYOFuYN9FFLLK+
 kCUlu5P/HUabBedKIxk4rbWITUqcRSviPD2WdnH2RWblvXNSDoSAufYuJ/9IGSoL
 P6a43DVKFesVF/MxeH9Ko8bnxMUO9Zn97GHcQIUplRwaqrnrCEPlvLVf/teswSQG
 C13rTnouZ0FA4z/uV96G6HfGIj87MLe/RovmLCMTeiSKrDpbcO7szP037Km73M+V
 UBmatoYCioVLxBjw3NkxCRc9UpDPdRUu31uVHrAarh4tutUASEWLrb6s9vFlGyED
 zis9IHWtIAYP3VfFtkXdZ7oDlqC/3KdEErHZuT+z4PK3Wj/QtQVfQ8SB79xFMneD
 V2E=
 =Jzmo
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-14 10:31:11 -07:00
Christophe Leroy
b9ee5e04fd powerpc/64e: Drop stale call to smp_processor_id() which hangs SMP startup
Commit ebb9d30a6a ("powerpc/mm: any thread in one core can be the
first to setup TLB1") removed the need to know the cpu_id in
early_init_this_mmu(), but the call to smp_processor_id() which was
marked __maybe_used remained.

Since commit ed1cd6deb0 ("powerpc: Activate CONFIG_THREAD_INFO_IN_TASK")
thread_info cannot be reached before MMU is properly set up.

Drop this stale call to smp_processor_id() which makes SMP hang when
CONFIG_PREEMPT is set.

Fixes: ebb9d30a6a ("powerpc/mm: any thread in one core can be the first to setup TLB1")
Fixes: ed1cd6deb0 ("powerpc: Activate CONFIG_THREAD_INFO_IN_TASK")
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bef479514f4c08329fa649f67735df8918bc0976.1565268248.git.christophe.leroy@c-s.fr
2019-08-12 22:56:57 +10:00
Christoph Hellwig
33dcb37cef dma-mapping: fix page attributes for dma_mmap_*
All the way back to introducing dma_common_mmap we've defaulted to mark
the pages as uncached.  But this is wrong for DMA coherent devices.
Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
flag is only treated special on the alloc side for non-coherent devices.

Introduce a new dma_pgprot helper that deals with the check for coherent
devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
and we thus ensure no aliasing of page attributes happens, which makes
the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
remaining ones.

Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
we'll phase it out soon.

Fixes: 64ccc9c033 ("common: dma-mapping: add support for generic dma_mmap_* calls")
Reported-by: Shawn Anastasio <shawn@anastas.io>
Reported-by: Gavin Li <git@thegavinli.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10 19:52:45 +02:00
Linus Torvalds
23df57afe8 powerpc fixes for 5.3 #4
Just one fix, a revert of a commit that was meant to be a minor improvement to
 some inline asm, but ended up having no real benefit with GCC and broke booting
 32-bit machines when using Clang.
 
 Thanks to:
   Arnd Bergmann, Christophe Leroy, Nathan Chancellor, Nick Desaulniers, Segher
   Boessenkool.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl1Oi1ATHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDtbD/4wRJ0otvftdDX/1gurM4C9C9SUkmJA
 Om8YYaeyQ429mpQm6Hl8UJcnmeOZS0xOggOT4wNjUmyBYnc6UeFn8WBiCngdpPzp
 0ISVUOXh8iJWippllOdWYVLioirJJO4XEyKkUMbhMbwCfmaTI2axaxoo/woSTBWt
 1TuZybDTa1hB5jrJ60aHA4vUxxa2UH58MZP1UOME581mAy77N2RDzC5lBZcK2ob7
 mlCQn0HgLTuM/KZIRyZ7DpWehFIS0tFfbkB6PCcti9+dNxyK56/fzcp8U4cUg5iu
 w/ESFrtVL13MR0n8XkJ1gfvvh78l3l0jaDGrcGifkUTIJoDHaOVOtTG/0jFjF/TN
 e22IQ8kNJcqspfFu2Kazby16d97hKqUgIgYKheBGX9bIeWuQzrEWDxgTqa3Exr0v
 TX3V9LDQjSSNJFZaIrJU3Oa8xxErQKaNKtgNuUK7I3JUjr50UynzXaJFLdh+VNzg
 6uKtaO51CZMflFlqQ3qdhiPfh2mUCL2W7cGSMJ1ftduN2BZmezsYSwdrBQ53tYQ4
 M5n59vA4hy+8HxRd9lhrdsas2a21OhcDxU3Leq+OOBSWsvHSa6MoNrqqONeN7FS1
 +GqQP5NUefV57MSXojTnpPSRxoK5VgK1SMXkjhgoqYul2GLz8UdRzTl9U94UAAXY
 TM1s3o3/dGS7mQ==
 =cRJi
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
 "Just one fix, a revert of a commit that was meant to be a minor
  improvement to some inline asm, but ended up having no real benefit
  with GCC and broke booting 32-bit machines when using Clang.

  Thanks to: Arnd Bergmann, Christophe Leroy, Nathan Chancellor, Nick
  Desaulniers, Segher Boessenkool"

* tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  Revert "powerpc: slightly improve cache helpers"
2019-08-10 10:17:19 -07:00
Linus Torvalds
7f20fd2337 Bugfixes (arm and x86) and cleanups.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdTfRfAAoJEL/70l94x66DcN0IAIwyaU2+kwP0jd2miQuKxgwl
 WU4u7dZCoQC6meWEVmrSJIVMBONRubmZ9iCqT7807YP8YZSQpOth51FMbULUWuy1
 VW1eaRwqidX0EAihDhg2ZbBZ8H6RQ9Fn0aiEEh44dAZZAwGSVnO3PRKvQEJ15xjk
 q+OQ4hrxtoorwLj+myejmq3YenTFTCMMJfYwwvlCl+J1FfrLZi5k3X5Gjk+j8Ixd
 8CL8/6u5Lu6MCgfYVvxvo8/bUPiATBdF1sWJMMALwXTrDiSy4tQRD0NvZP1HM8G1
 hy0XnhgtsS9rWNLtAFOj+r/XhP9V5lOOGX8yBcj0XQQr+DC9MG6MCL+pXXOaMcA=
 =ZZh8
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Bugfixes (arm and x86) and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: kvm: Adding config fragments
  KVM: selftests: Update gitignore file for latest changes
  kvm: remove unnecessary PageReserved check
  KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable
  KVM: arm: Don't write junk to CP15 registers on reset
  KVM: arm64: Don't write junk to sysregs on reset
  KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block
  x86: kvm: remove useless calls to kvm_para_available
  KVM: no need to check return value of debugfs_create functions
  KVM: remove kvm_arch_has_vcpu_debugfs()
  KVM: Fix leak vCPU's VMCS value into other pCPU
  KVM: Check preempted_in_kernel for involuntary preemption
  KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire
  arm64: KVM: hyp: debug-sr: Mark expected switch fall-through
  KVM: arm64: Update kvm_arm_exception_class and esr_class_str for new EC
  KVM: arm: vgic-v3: Mark expected switch fall-through
  arm64: KVM: regmap: Fix unexpected switch fall-through
  KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter index
2019-08-09 15:46:29 -07:00
Paolo Bonzini
0e1c438c44 KVM/arm fixes for 5.3
- A bunch of switch/case fall-through annotation, fixing one actual bug
 - Fix PMU reset bug
 - Add missing exception class debug strings
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl1Bzw8PHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDlXYP/ixqJzqpJetTrvpiUpmLjhp4YwjjOxqyeQvo
 bWy/EFz8bSWbTZlwAAstFDVmtGenuwaiOakChvV8GH6USYqRsYdvc/sJu0evQplJ
 JQtOzGhyv1NuM0s9wYBcstAH+YAW+gBK5YFnowreheuidK/1lo3C/EnR2DxCtNal
 gpV3qQt8qfw3ysGlpC/fDjjOYw4lDkFa6CSx9uk3/587fPBqHANRY/i87nJxmhhX
 lGeCJcOrY3cy1HhbedFwxVt4Q/ZbHf0UhTfgwvsBYw7BaWmB1ymoEOoktQcUWoKb
 LL0rBe+OxNQgRnJpn3fMEHiCAmXaI9qE4dohFOl1J3dQvCElcV/jWjkXDD1+KgzW
 S2XZGB6yxet93Fh1x6xv4i6ATJvmZeTIDUXi9KkjcDiycB9YMCDYY2ejTbQv5VUP
 V0DghGGDd3d8sY7dEjxwBakuJ6nqKixSouQaNsWuBTm7tVpEVS8yW+hqWs/IVI5b
 48SDbxaNpKvx7sAyhuWAjCFbZeIm0hd//JN3JoxazF9i9PKuqnZLbNv/ME6hmzj+
 LrETwaAbjsw5Au+ST+OdT2UiauiBm9C6Kg62qagHrKJviuK941+3hjH8aj/e0pYk
 a0DQxumiyofXPQ0pVe8ZfqlPptONz+EKyAsrOm8AjLJ+bBdRUNHLcZKYj7em7YiE
 pANc8/T+
 =kcDj
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm fixes for 5.3

- A bunch of switch/case fall-through annotation, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings
2019-08-09 16:53:39 +02:00
Denis Efremov
b8074aa246 PCI: Convert pci_resource_to_user() to a weak function
Convert pci_resource_to_user() to a weak function so the existing
architecture-specific implementations will automatically override the
generic one.  This allows us to remove HAVE_ARCH_PCI_RESOURCE_TO_USER
definitions and avoid the conditional compilation for this single function.

Link: https://lore.kernel.org/r/20190729101401.28068-1-efremov@linux.com
Link: https://lore.kernel.org/r/20190729101401.28068-2-efremov@linux.com
Link: https://lore.kernel.org/r/20190729101401.28068-3-efremov@linux.com
Link: https://lore.kernel.org/r/20190729101401.28068-4-efremov@linux.com
Link: https://lore.kernel.org/r/20190729101401.28068-5-efremov@linux.com
Link: https://lore.kernel.org/r/20190729101401.28068-6-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
[bhelgaas: squash into one commit]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Paul Burton <paul.burton@mips.com>	# MIPS
2019-08-08 15:12:07 -05:00
Leo Yan
45880f7b7b error-injection: Consolidate override function definition
The function override_function_with_return() is defined separately for
each architecture and every architecture's definition is almost same
with each other.  E.g. x86 and powerpc both define function in its own
asm/error-injection.h header and override_function_with_return() has
the same definition, the only difference is that x86 defines an extra
function just_return_func() but it is specific for x86 and is only used
by x86's override_function_with_return(), so don't need to export this
function.

This patch consolidates override_function_with_return() definition into
asm-generic/error-injection.h header, thus all architectures can use the
common definition.  As result, the architecture specific headers are
removed; the include/linux/error-injection.h header also changes to
include asm-generic/error-injection.h header rather than architecture
header, furthermore, it includes linux/compiler.h for successful
compilation.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-07 13:52:43 +01:00
Paolo Bonzini
741cbbae07 KVM: remove kvm_arch_has_vcpu_debugfs()
There is no need for this function as all arches have to implement
kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
let us actually simplify the code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:48 +02:00
Wanpeng Li
17e433b543 KVM: Fix leak vCPU's VMCS value into other pCPU
After commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:

 INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
 Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05 12:55:47 +02:00
Leonardo Bras
9616dda1aa powerpc/pseries/hotplug-memory.c: Replace nested ifs by switch-case
I noticed these nested ifs can be easily replaced by switch-cases,
which can improve readability.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190801225251.17864-1-leonardo@linux.ibm.com
2019-08-05 18:53:04 +10:00
Jordan Niethe
1ebe0dcce1 powerpc/xive: Update comment referencing magic loads from an ESB
The comment above xive_esb_read() references magic loads from an ESB as
described xive.h. This has been inaccurate since commit 12c1f339cd
("powerpc/xive: Move definition of ESB bits") which moved the
description. Update the comment to reference the new location of the
description in xive-regs.h

Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Acked-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802000835.26191-1-jniethe5@gmail.com
2019-08-05 18:53:04 +10:00