Symbol offsets to the KASLR base do not match symbol address in
the vmlinux image. That is the result of setting the KASLR base
to the beginning of .text section as result of an optimization.
Revert that optimization and allocate virtual memory for the
whole kernel image including __START_KERNEL bytes as per the
linker script. That allows keeping the semantics of the KASLR
base offset in sync with other architectures.
Rename __START_KERNEL to TEXT_OFFSET, since it represents the
offset of the .text section within the kernel image, rather than
a virtual address.
Still skip mapping TEXT_OFFSET bytes to save memory on pgtables
and provoke exceptions in case an attempt to access this area is
made, as no kernel symbol may reside there.
In case CONFIG_KASAN is enabled the location counter might exceed
the value of TEXT_OFFSET, while the decompressor linker script
forcefully resets it to TEXT_OFFSET, which leads to a sections
overlap link failure. Use MAX() expression to avoid that.
Reported-by: Omar Sandoval <osandov@osandov.com>
Closes: https://lore.kernel.org/linux-s390/ZnS8dycxhtXBZVky@telecaster.dhcp.thefacebook.com/
Fixes: 56b1069c40 ("s390/boot: Rework deployment of the kernel image")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
qsi() executes the instruction qsi (query sample information)
and stores the result of the query in a sample information block
pointed to by the function argument. The instruction does not
change the condition code register. The return code is always
zero. No need to check for errors. Remove now unreferenced
macros PMC_FAILURE and RS_INIT_FAILURE_QSI.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
sf_disable() returns the condition code of instruction lsctl (load
sampling controls). However the parameter to lsctl() in
sf_disable() is a sample control block containing
all zeroes. This invocation of lsctl() does not fail and returns
always zero even when there is no authorization for sampling
on the machine. In short, sampling can be always turned off.
Ignore the return code of sf_disable() and change the function
return to void.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The identity mapping position in virtual memory is randomized
together with the kernel mapping. That position can never
overlap with the lowcore even when the lowcore is relocated.
Prevent overlapping with the lowcore to allow independent
positioning of the identity mapping. With the current value
of the alternative lowcore address of 0x70000 the overlap
could happen in case the identity mapping is placed at zero.
This is a prerequisite for uncoupling of randomization base
of kernel image and identity mapping in virtual memory.
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Refactor the list of constant variables into a macro.
This should make it easier to add more constants in the future.
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Save some bytes and move early_pgm_check_handler() to init text
section.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add missing warning handling to the early program check handler. This
way a warning is printed to the console as soon as the early console
is setup, and the kernel continues to boot.
Before this change a disabled wait psw was loaded instead and the
machine was silently stopped without giving an idea about what
happened.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add the missing pieces so the early program check handler also works
with a relocated lowcore. Right now the result of an early program
check in case of a relocated lowcore would be a program check loop.
Fixes: 8f1e70adb1 ("s390/boot: Add cmdline option to relocate lowcore")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Have all program check handlers in one file to make future changes easy.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
All functions but setup_pmc_cpu() use a local variable named
cpuhw to refer to struct cpu_hw_sf.
In setup_pmc_cpu() rename variable cpusf to cpuhw. This makes
the naming scheme consistent with all other functions.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Define and initialize a variable in one place.
Remove space between cast and variable.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The macros PERF_CPUM_CF_MAX_CTR and PERF_EVENT_CPUM_CF_DIAG
are used in only one source file arch/s390/kernel/perf_cpum_cf.c.
Move these defines from the header file
arch/s390/include/asm/perf_event.h to the only user.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Some defines in common header file arch/s390/include/asm/perf_event.h
are only used in one source file arch/s390/kernel/perf_cpum_sf.c.
Move these defines from header to source file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename macro SAMPLE_FREQ_MODE to SAMPL_FREQ_MODE to make its
prefix consistent with all other macro starting with prefix
SAMPL_XXX.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Member hw_perf_event::reg.reg is set but never used, so remove it.
Defines REG_NONE and REG_OVERFLOW are not referenced anymore.
The initialization to zero takes place in function
perf_event_alloc() where
...
event = kmem_cache_alloc_node(perf_event_cache,
GFP_KERNEL | __GFP_ZERO, node);
...
makes sure memory allocated for the event is zero'ed.
This is done in the kernel's common code in kernel/events/core.c
The struct perf_event contains member hw_perf_event as in
struct perf_event {
....
struct hw_perf_event hw;
....
};
This contained sub-structure is also initialized to zero.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove the unused and empty arch/s390/kernel/alternative.h header
file which was added by mistake.
Fixes: 5ade5be4ed ("s390: Add infrastructure to patch lowcore accesses")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With the recent rewrite of the fpu code exception handling for the
lfpc instruction within load_fpu_state() was erroneously removed.
Add it again to prevent that loading invalid floating point register
values cause an unhandled specification exception.
Fixes: 8c09871a95 ("s390/fpu: limit save and restore to used registers")
Cc: stable@vger.kernel.org
Reported-by: Aristeu Rozanski <aris@redhat.com>
Tested-by: Aristeu Rozanski <aris@redhat.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Fix KMSAN build breakage caused by the conflict between s390 and
mm-stable trees
- Add KMSAN page markers for ptdump
- Add runtime constant support
- Fix __pa/__va for modules under non-GPL licenses by exporting necessary
vm_layout struct with EXPORT_SYMBOL to prevent linkage problems
- Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
Counter Facility code when the counter set size is zero
- Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
its functionality by default
- Support allocation of multiple MSI interrupts per device and improve
logging of architecture-specific limitations
- Add support for lowcore relocation as a debugging feature to catch
all null ptr dereferences in the kernel address space, improving
detection beyond the current implementation's limited write access
protection
- Clean up and rework CPU alternatives to allow for callbacks and early
patching for the lowcore relocation
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmajrGUACgkQjYWKoQLX
FBgZjwf/X8suh1Gm2qO47hdGURusOKmEa6GjYaihKzOi2I5yAWVXGsAYX7QtXI4X
fxbKyuGJIvq7LOrIojN1JOCPGkDRztgMddqDJI7WljuRiw6dcd4L5tbaPgCKEv3Q
AQcoq+Aeg1L5xnuNFPdQXl6+Fy2lTFqJCkUl+uW05pGAn2R212dYG3HB41TpwOtJ
Sv2R5+yD9TQCKnHyuCQqaGf7d6SQTcVeBj8zrqVmcyduNK+BYYMOwlJ/UTRzeZEX
3DmQg/TdAkxXf0jZ+vrNILEfHlIvwDAhFjdoAXXL0TX4lx2cHLx9AiqNxYhUprsG
0gutc/nLq2FxhqofoJ0z9TCdb1Ef7w==
=skva
-----END PGP SIGNATURE-----
Merge tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix KMSAN build breakage caused by the conflict between s390 and
mm-stable trees
- Add KMSAN page markers for ptdump
- Add runtime constant support
- Fix __pa/__va for modules under non-GPL licenses by exporting
necessary vm_layout struct with EXPORT_SYMBOL to prevent linkage
problems
- Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
Counter Facility code when the counter set size is zero
- Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
its functionality by default
- Support allocation of multiple MSI interrupts per device and improve
logging of architecture-specific limitations
- Add support for lowcore relocation as a debugging feature to catch
all null ptr dereferences in the kernel address space, improving
detection beyond the current implementation's limited write access
protection
- Clean up and rework CPU alternatives to allow for callbacks and early
patching for the lowcore relocation
* tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390: Remove protvirt and kvm config guards for uv code
s390/boot: Add cmdline option to relocate lowcore
s390/kdump: Make kdump ready for lowcore relocation
s390/entry: Make system_call() ready for lowcore relocation
s390/entry: Make ret_from_fork() ready for lowcore relocation
s390/entry: Make __switch_to() ready for lowcore relocation
s390/entry: Make restart_int_handler() ready for lowcore relocation
s390/entry: Make mchk_int_handler() ready for lowcore relocation
s390/entry: Make int handlers ready for lowcore relocation
s390/entry: Make pgm_check_handler() ready for lowcore relocation
s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro
s390/entry: Add base register to SIEEXIT macro
s390/entry: Add base register to MBEAR macro
s390/entry: Make __sie64a() ready for lowcore relocation
s390/head64: Make startup code ready for lowcore relocation
s390: Add infrastructure to patch lowcore accesses
s390/atomic_ops: Disable flag outputs constraint for GCC versions below 14.2.0
s390/entry: Move SIE indicator flag to thread info
s390/nmi: Simplify ptregs setup
s390/alternatives: Remove alternative facility list
...
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Removing the CONFIG_PROTECTED_VIRTUALIZATION_GUEST ifdefs and config
option as well as CONFIG_KVM ifdefs in uv files.
Having this configurable has been more of a pain than a help.
It's time to remove the ifdefs and the config option.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in store_status()
and __do_machine_kdump().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in system_call().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in ret_from_fork().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in __switch_to().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in restart_int_handler().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in mcck_int_handler().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in the ext/io interrupt
handlers.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in pgm_check_handler().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to CHECK_VMAP_STACK and CHECK_STACK. No
functional change, because %r0 is passed to the macro.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to SIEEXIT. No functional change, because
%r0 is passed to the macro.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to MBEAR. No functional change, because
%r0 is passed to the macro.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in __sie64a().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of having lowcore at different address than zero,
add the base register to all lowcore accesses in startup_continue().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The s390 architecture defines two special per-CPU data pages
called the "prefix area". In s390-linux terminology this is usually
called "lowcore". This memory area contains system configuration
data like old/new PSW's for system call/interrupt/machine check
handlers and lots of other data. It is normally mapped to logical
address 0. This area can only be accessed when in supervisor mode.
This means that kernel code can dereference NULL pointers, because
accesses to address 0 are allowed. Parts of lowcore can be write
protected, but read accesses and write accesses outside of the write
protected areas are not caught.
To remove this limitation for debugging and testing, remap lowcore to
another address and define a function get_lowcore() which simply
returns the address where lowcore is mapped at. This would normally
introduce a pointer dereference (=memory read). As lowcore is used
for several very often used variables, add code to patch this function
during runtime, so we avoid the memory reads.
For C code get_lowcore() has to be used, for assembly code it is
the GET_LC macro. When using this macro/function a reference is added
to alternative patching. All these locations will be patched to the
actual lowcore location when the kernel is booted or a module is loaded.
To make debugging/bisecting problems easier, this patch adds all the
infrastructure but the lowcore address is still hardwired to 0. This
way the code can be converted on a per function basis, and the
functionality is enabled in a patch after all the functions have
been converted.
Note that this requires at least z16 because the old lpsw instruction
only allowed a 12 bit displacement. z16 introduced lpswey which allows
20 bits (signed), so the lowcore can effectively be mapped from
address 0 - 0x7e000. To use 0x7e000 as address, a 6 byte lgfi
instruction would have to be used in the alternative. To save two
bytes, llilh can be used, but this only allows to set bits 16-31 of
the address. In order to use the llilh instruction, use 0x70000 as
alternative lowcore address. This is still large enough to catch
NULL pointer dereferences into large arrays.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
CIF_SIE indicates if a thread is running in SIE context. This is the
state of a thread and not the CPU. Therefore move this indicator to
thread info.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The low level machine check handler code fills the ptregs structure
partially with the register contents present at machine check handler
entry and partially with contents from the machine check save area.
In case of a machine check the contents of all general purpose
registers are saved by the CPU to the machine check save area.
Therefore simplify the code and fill the ptregs structure by only
using the machine check save area as source.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The alternative and the normal facility list are always identical. Remove
the alternative facility list, which allows to simplify the alternatives
code.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The nospec implementation is deeply integrated into the alternatives
code: only for nospec an alternative facility list is implemented and
used by the alternative code, while it is modified by nospec specific
needs.
Push down the nospec alternative handling into the nospec by
introducing a new alternative type and a specific nospec callback to
decide if alternatives should be applied.
Also introduce a new global nobp variable which together with facility
82 can be used to decide if nobp is enabled or not.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add the required code to patch alternatives early in the decompressor.
This is required for the upcoming lowcore relocation changes, where
alternatives for facility 193 need to get patched before lowcore
alternatives.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rework alternatives to allow for callbacks. With this every
alternative entry has additional data encoded:
- When (aka context) an alternative is supposed to be applied
- The type of an alternative, which allows for type specific handling
and callbacks
- Extra type specific payload (patch information), which can be passed
to callbacks in order to decide if an alternative should be applied
or not
With this only the "late" context is implemented, which means there is
no change to the previous behaviour. All code is just converted to the
more generic new infrastructure.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move all text sync functions from alternative.c to processor.c. This
way there is only minimal code left in alternative.c left, which is a
prerequisite to use the C file within boot code as well.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The two alternative header files must stay in sync. This is easier to
achieve within one header file. Therefore merge both of them and have
only one file, like most other architectures.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The alternative code is using the words facility and feature for the
same. Rename facility to more generic feature everywhere to have
consistent naming.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current Kernel doesn't boot without alternative patching on
z16 machines. To avoid such bugs in the future, remove the option
disable alternative patching.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
To allow testing flags for offline CPUs, move the CIF flags
to struct pcpu. To avoid having to calculate the array index
for each access, add a pointer to the pcpu member for the current
cpu to lowcore.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of moving the CIF flags from lowcore to pcpu_devices,
convert the pcpu_devices array to use the percpu infrastructure.
This is required because using the pcpu_devices array as it is would
introduce a performance penalty due to the fact that CPU flags for
multiple CPUs would end up in the same cacheline.
Note that a pointer to the pcpu struct of the IPL CPU is still required.
This is because a restart interrupt can be triggered on an offline CPU.
s390 stores the percpu offset in lowcore, but offline CPUs have no
lowcore area allocated. So percpu data cannot be used from an offline
CPU and we need to get the pcpu pointer for the IPL cpu from somewhere
else.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current smp code allows to trigger a restart interrupt on CPUs
offline in linux. To allow using the percpu infrastructure instead
of the pcpu_devices array, switch to the ipl cpu which is always
online before calling do_restart().
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Event CF_DIAG reads out complete counter sets using stcctm
instruction. This is done at event start time when the process
starts execution and at event stop time when the process is
removed from the CPU. During removal the difference of each
counter in the counter sets is calculated and saved as raw data
in the ring buffer. This works fine unless the number of counters
in a counter set is zero. This may happen for the extended counter
set. This set is machine specific and the size of the counter
set can be zero even when extended counter set is authorized for
read access.
This case is not handled. cfdiag_diffctr() checks authorization
of the extended counter set. If true the functions assumes
the extended counter set has been saved in a data buffer. However
this is not the case, cfdiag_getctrset() does not save a counter
set with counter set size of zero. This mismatch causes an endless
loop in the counter set readout during event stop handling.
The calculation of the difference of the counters in each counter
now verifies the size of the counter set is non-zero. A counter set
with size zero is skipped.
Fixes: a029a4eab3 ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The struct vm_layout contains fields used in __pa/__va calculations. Such
fundamental things have to be exported with EXPORT_SYMBOL to avoid
breakages of out-of-tree modules under non-GPL licenses.
Fixes: 7de0446f0b ("s390/boot: Make identity mapping base address explicit")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Implement the runtime constant infrastructure for s390, allowing the
dcache d_hash() function to be generated using as a constant for hash
table address followed by shift by a constant of the hash index.
This is the s390 variant of commit 94a2bc0f61 ("arm64: add 'runtime
constant' support") and commit e3c92e8171 ("runtime constants: add
x86 architecture support").
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
walkers") is known to cause a performance regression
(https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
Yu has a fix which I'll send along later via the hotfixes branch.
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability of
cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of the
zeropage in MAP_SHARED mappings. I don't see any runtime effects here -
more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Is anyone reading this stuff? If so, email me!
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large folio
userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's self
testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are
"mm: memcg: separate legacy cgroup v1 code and put under config
option" and
"mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of excessive
correctable memory errors. In order to permit userspace to monitor and
handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate
folio" teaches the kernel to appropriately handle migration from
poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
refcount increments. So these paes can first be moved aside if they
reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
for much faster reading of vma information. The series is "query VMAs
from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang
improves the kernel's presentation of developer information related to
multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and not
very useful feature from slab fault injection.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
=z0Lf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during failures,
and adding timeout handling for Store Data requests to address potential
firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during topology
updates, as they are unused and also not present on other architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from gaccess.c
to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and swap"
wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of zero
page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore() function
instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmaYGegACgkQjYWKoQLX
FBjCwwf/aRYoLIXCa9/nHGWFiUjZm6xBgVwZh55bXjfNG9TI2J9UZSsYlOFGUJKl
gvD2Ym+LqAejK8R4EUHkfD6ftaKMQuIxNDoedxhwuSpfOQ2mZ5teu0MxTh8QcUAx
4Y2w5XEeCuqE3SuoZ4SJa58K4rGl4cFpPsKNa8ofdzH1ZLFNe8Wqzis4kh0htqLb
FtPj6nsgfzQ5kg14rVkGxCa4CqoFxonXgsA6nH6xZLbxKUInyq8uV44UBQ+aJq5v
dsdzZ5XuAJHN2FpBuuOYQYZYw3XIy/kka7o4EjffORi5SGCRMWO4Zt0P6HXaNkh6
xV8EEO8myeo7rV8dnrk1V4yGjGJmfA==
=3IGY
-----END PGP SIGNATURE-----
Merge tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU
bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during
failures, and adding timeout handling for Store Data requests to
address potential firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during
topology updates, as they are unused and also not present on other
architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output
constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from
gaccess.c to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and
swap" wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of
zero page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore()
function instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
* tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()
s390/kvm: Move bitfields for dat tables
s390/entry: Pass the asce as parameter to sie64a()
s390/sthyi: Use cached data when diag is busy
s390/sthyi: Move diag operations
s390/hypfs_diag: Diag204 busy loop
s390/diag: Add busy-indication-facility requirements
s390/diag: Diag204 add busy return errno
s390/diag: Return errno's from diag204
s390/sclp: Diag204 busy indication facility detection
s390/atomic_ops: Make use of flag output constraint
s390/atomic_ops: Improve __atomic_set() for small values
s390/atomic_ops: Use symbolic names
s390/smp: Switch to GENERIC_CPU_DEVICES
s390/hwcaps: Add documentation for HWCAP flags
s390/pgtable: Make crdte() and cspg() return a value
s390/topology: Remove CPU KOBJ_CHANGE uevents
s390/sclp: Add timeout to Store Data requests
s390/sclp: Prevent release of buffer in I/O
s390/sclp: Suppress unnecessary Store Data warning
...
Pass the guest ASCE explicitly as parameter, instead of having sie64a()
take it from lowcore.
This removes hidden state from lowcore, and makes things look cleaner.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When sthyi is being emulated, data from diag204 is used. If diag204
returns busy, previously cached sthyi info block is returned to the
caller and cache expiry is set to expired.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move diag204 related operations to their own functions for better error
handling and better readability.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When diag204-busy-indication facility is installed, diag204 can return
'8' which means device is busy and no operation is done. Add check for
return codes of diag204 call. Return error codes according to diag204
return codes.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Return different errno's from diag204 to allow users to handle them
accordingly. Instead of returning -1 regardless of the failing
condition, return -EINVAL on invalid memory address and -EOPNOTSUPP when
diag instruction fails.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Instead of setting up non-boot CPUs early in architecture code,
only setup the cpu present mask and let the generic code handle
cpu bringup.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The unwind code can read uninitialized frames. Furthermore, even in the
good case, KMSAN does not emit shadow for backchains. Therefore disable
it for the unwinding functions.
Link: https://lkml.kernel.org/r/20240621113706.315500-37-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
s390 uses assembly code to initialize ftrace_regs and call
kprobe_ftrace_handler(). Therefore, from the KMSAN's point of view,
ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes
KMSAN warnings when running the ftrace testsuite.
Fix by trusting the assembly code and always unpoisoning ftrace_regs in
kprobe_ftrace_handler().
Link: https://lkml.kernel.org/r/20240621113706.315500-30-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diagnose 224 stores 4k bytes, which currently cannot be deduced from the
inline assembly constraints. This leads to KMSAN false positives.
Fix the constraints by using a 4k-sized struct instead of a raw pointer.
While at it, prettify them too.
Link: https://lkml.kernel.org/r/20240621113706.315500-29-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
s390 generates KOBJ_CHANGE uevents on CPUs whenever a topology update
occurs. These uevents currently have no users and they are also not
present on other architectures. As they are not necessary, remove
these extra uevents.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The mmap2() syscall has never been used on 64-bit s390x and should
have been removed as part of 5a79859ae0 ("s390: remove 31 bit
support").
Remove it now.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Using sys_io_pgetevents() as the entry point for compat mode tasks
works almost correctly, but misses the sign extension for the min_nr
and nr arguments.
This was addressed on parisc by switching to
compat_sys_io_pgetevents_time64() in commit 6431e92fc8 ("parisc:
io_pgetevents_time64() needs compat syscall in 32-bit compat mode"),
as well as by using more sophisticated system call wrappers on x86 and
s390. However, arm64, mips, powerpc, sparc and riscv still have the
same bug.
Change all of them over to use compat_sys_io_pgetevents_time64()
like parisc already does. This was clearly the intention when the
function was originally added, but it got hooked up incorrectly in
the tables.
Cc: stable@vger.kernel.org
Fixes: 48166e6ea4 ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Assign the output from get_lowcore() to a local variable,
so the code is easier to read.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Replace all S390_lowcore usages in arch/s390/ by get_lowcore().
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The PMU for PAI NNPA counters enforces the following restriction:
- No per-task context for PAI sampling event NNPA_ALL
- No multiple system-wide PAI sampling event NNPA_ALL
Both restrictions are removed. One or more per-task sampling events
are supported. Also one or more system-wide sampling events are
supported.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The PMU for PAI NNPA counters enforces the following restriction:
- No per-task context for PAI NNPA counters.
This restriction is removed. One or more per-task/system-wide counting
events can now be active at the same time while one system wide
sampling event is active.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The PMU for PAI NNPA counters enforces the following restriction:
- No system wide counting while system wide sampling is active.
This restriction is removed. One or more system wide counting events
can now be active at the same time while at most one system wide
sampling event is active.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The PMU for PAI crypto counters enforces the following restrictions:
- No per-task context for PAI crypto sampling event CRYPTO_ALL
- No multiple system-wide PAI crypto sampling event CRYPTO_ALL
Both restrictions are removed. One or more per-task sampling events
are supported. Also one or more system-wide sampling events are
supported.
Example for per-task context of sampling event CRYPTO_ALL:
# perf record -e pai_crypto/CRYPTO_ALL/ -- true
Example for system-wide context of sampling event CRYPTO_ALL:
# perf record -e pai_crypto/CRYPTO_ALL/ -a -- sleep 4
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The PMU for PAI crypto counters enforces the following restriction:
- No per-task context for PAI crypto counters events.
This restriction is removed. One or more per-task/system-wide counting
events can now be active at the same time while at most one system
wide sampling event is active.
Example for per-task context of a PAI crypto counter event:
# perf stat -e pai_crypto/KM_AES_128/ -- true
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The PMU for PAI crypto counters enforces the following restriction:
- No system wide counting while system wide sampling is active.
This restriction is removed. One or more system wide counting events
can now be active at the same time while at most one system wide
sampling event is active.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Let's also implement HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, so we can convert
arch_make_page_accessible() to be a simple wrapper around
arch_make_folio_accessible(). Unfortunately, we cannot do that in the
header.
There are only two arch_make_page_accessible() calls remaining in gup.c.
We can now drop HAVE_ARCH_MAKE_PAGE_ACCESSIBLE completely form core-MM.
We'll handle that separately, once the s390x part landed.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-10-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Let's do the same as we did for uv_destroy_(folio|pte)() and
have the following variants:
(1) uv_convert_from_secure(): "low level" helper that operates on paddr
and does not mess with folios.
(2) uv_convert_from_secure_folio(): Consumes a folio to which we hold a
reference.
(3) uv_convert_from_secure_pte(): Consumes a PTE that holds a reference
through the mapping.
Unfortunately we need uv_convert_from_secure_pte(), because pfn_folio()
and friends are not available in pgtable.h.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-9-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Let's have the following variants for destroying pages:
(1) uv_destroy(): Like uv_pin_shared() and uv_convert_from_secure(),
"low level" helper that operates on paddr and doesn't mess with folios.
(2) uv_destroy_folio(): Consumes a folio to which we hold a reference.
(3) uv_destroy_pte(): Consumes a PTE that holds a reference through the
mapping.
Unfortunately we need uv_destroy_pte(), because pfn_folio() and
friends are not available in pgtable.h.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-8-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
It's not used outside of uv.c, so let's make it a static function.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-7-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
We removed the usage of PG_arch_1 for page tables in commit
a51324c430 ("s390/cmma: rework no-dat handling").
Let's update the comment in UV to reflect that.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-6-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Now that make_folio_secure() may only set PG_arch_1 for small folios,
let's convert relevant remaining UV code to only work on (small) folios
and simply reject large folios early. This way, we'll never end up
touching PG_arch_1 on tail pages of a large folio in UV code.
The folio_get()/folio_put() for functions that are documented to already
hold a folio reference look weird; likely they are required to make
concurrent gmap_make_secure() back off because the caller might only hold
an implicit reference due to the page mapping. So leave that alone for now.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-5-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
While s390x makes sure to never have PMD-mapped THP in processes that use
KVM -- by remapping them using PTEs in
thp_split_walk_pmd_entry()->split_huge_pmd() -- there is still the
possibility of having PTE-mapped THPs (large folios) mapped into guest
memory.
This would happen if user space allocates memory before calling
KVM_CREATE_VM (which would call s390_enable_sie()). With upstream QEMU,
this currently doesn't happen, because guest memory is setup and
conditionally preallocated after KVM_CREATE_VM.
Could it happen with shmem/file-backed memory when another process
allocated memory in the pagecache? Likely, although currently not a
common setup.
Trying to split any PTE-mapped large folios sounds like the right and
future-proof thing to do here. So let's call split_folio() and handle the
return values accordingly.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-4-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Let's factor out handling of LRU cache draining and convert the if-else
chain to a switch-case.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-3-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
folio_wait_writeback() requires that no spinlocks are held and that
a folio reference is held, as documented. After we dropped the PTL, the
folio could get freed concurrently. So grab a temporary reference.
Fixes: 214d9bbcd3 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-2-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The virtual memory information stored in os_info area is
required for creation of the kernel image PT_LOAD program
header for kernels since commit a2ec5bec56dd ("s390/mm:
uncouple physical vs virtual address spaces").
By contrast, if such information in os_info is absent the
PT_LOAD program header should not be created.
Currently the proper PT_LOAD program header is created for
kernels that contain the virtual memory information, but
for kernels without one an invalid header of zero size is
created. That in turn leads to stand-alone dump failures.
Use OS_INFO_KASLR_OFFSET variable to check whether os_info
is present or not (same as crash and makedumpfile tools do)
and based on that create or do not create the kernel image
PT_LOAD program header.
Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Patch series "Introduce mseal", v10.
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits. Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1]. The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it. The memory
must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects the
VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime. A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4]. Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is an syscall on 64 bit CPU, and with following signature:
int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable, the lifetime of those mappings are tied to the lifetime
of the process.
Chrome wants to seal two large address space reservations that are managed
by different allocators. The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions). The lifetime of those mappings are not
tied to the lifetime of the process, therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory.
For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security
risk. For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow. Checking write-permission before the discard
operation allows us to control when the operation is valid. In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.
Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases.
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables. To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup. Once this work is completed, all
applications will be able to automatically benefit from these new
protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests are developed.
[8]
The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.
The tests have roughly below sequence:
for (i = 0; i < 1000, i++)
create 1000 mappings (1 page per VMA)
start the sampling
for (j = 0; j < 1000, j++)
mprotect one mapping
stop and save the sample
delete 1000 mappings
calculates all samples.
Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the result, for 6.8 kernel, sealing check adds
20-40 nano seconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30
CPU cycles, there is even decrease in some cases.
It might be interesting to compare 5.10 and 6.8 kernel
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8
munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.
When I discuss the mm performance with Brian Makin, an engineer who worked
on performance, it was brought to my attention that such performance
benchmarks, which measuring millions of mm syscall in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also this is tested using a single HW and ChromeOS, the data
from another HW or distribution might be different. It might be best to
take this data with a grain of salt.
This patch (of 5):
Wire up mseal syscall for all architectures.
Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Switch read and write software bits for PUDs
- Add missing hardware bits for PUDs and PMDs
- Generate unwind information for C modules to fix GDB unwind
error for vDSO functions
- Create .build-id links for unstripped vDSO files to enable
vDSO debugging with symbols
- Use standard stack frame layout for vDSO generated stack frames
to manually walk stack frames without DWARF information
- Rework perf_callchain_user() and arch_stack_walk_user() functions
to reduce code duplication
- Skip first stack frame when walking user stack
- Add basic checks to identify invalid instruction pointers when
walking stack frames
- Introduce and use struct stack_frame_vdso_wrapper within vDSO user
wrapper code to automatically generate an asm-offset define. Also
use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to
document that the code works with user space stack
- Clear the backchain of the extra stack frame added by the vDSO user
wrapper code. This allows the user stack walker to detect and skip
the non-standard stack frame. Without this an incorrect instruction
pointer would be added to stack traces.
- Rewrite psw_idle() function in C to ease maintenance and further
enhancements
- Remove get_vtimer() function and use get_cpu_timer() instead
- Mark psw variable in __load_psw_mask() as __unitialized to avoid
superfluous clearing of PSW
- Remove obsolete and superfluous comment about removed TIF_FPU flag
- Replace memzero_explicit() and kfree() with kfree_sensitive() to
fix warnings reported by Coccinelle
- Wipe sensitive data and all copies of protected- or secure-keys
from stack when an IOCTL fails
- Both do_airq_interrupt() and do_io_interrupt() functions set
CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code
- Provide iucv_alloc_device() and iucv_release_device() helpers,
which can be used to deduplicate more or less identical IUCV
device allocation and release code in four different drivers
- Make use of iucv_alloc_device() and iucv_release_device()
helpers to get rid of quite some code and also remove a
cast to an incompatible function (clang W=1)
- There is no user of iucv_root outside of the core IUCV code left.
Therefore remove the EXPORT_SYMBOL
- __apply_alternatives() contains a runtime check which verifies
that the size of the to be patched code area is even. Convert
this to a compile time check
- Increase size of buffers for sending z/VM CP DIAGNOSE X'008'
commands from 128 to 240
- Do not accept z/VM CP DIAGNOSE X'008' commands longer than
maximally allowed
- Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead
of IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe
reIPL block on 'scp_data' sysfs attribute update
- Initialize the correct fields of the NVMe dump block, which
were confused with FCP fields
- Refactor macros for 'scp_data' (re-)IPL sysfs attribute to
reduce code duplication
- Introduce 'scp_data' sysfs attribute for dump IPL to allow tools
such as dumpconf passing additional kernel command line parameters
to a stand-alone dumper
- Rework the CPACF query functions to use the correct RRE or RRF
instruction formats and set instruction register fields correctly
- Instead of calling BUG() at runtime force a link error during
compile when a unsupported opcode is used with __cpacf_query()
or __cpacf_check_opcode() functions
- Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask
or /sys/bus/ap/aqmask sysfs file update with a relative mask value
- Fix "bindings complete" udev event which should be sent once all AP
devices have been bound to device drivers and again when unbind/bind
actions take place and all AP devices are bound again
- Facility list alt_stfle_fac_list is nowhere used in the decompressor,
therefore remove it there
- Remove custom kprobes insn slot allocator in favour of the standard
module_alloc() one, since kernel image and module areas are located
within 4GB
- Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid
calling memset() with a large byte count and get rid of the sparse
warning as result
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZkyx2BccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8PYZAP9KxEfTyUmIh61Gx8+m3BW5dy7p
E2Q8yotlUpGj49ul+AD8CEAyTiWR95AlMOVZZLV/0J7XIjhALvpKAGfiJWkvXgc=
=pife
-----END PGP SIGNATURE-----
Merge tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Switch read and write software bits for PUDs
- Add missing hardware bits for PUDs and PMDs
- Generate unwind information for C modules to fix GDB unwind error for
vDSO functions
- Create .build-id links for unstripped vDSO files to enable vDSO
debugging with symbols
- Use standard stack frame layout for vDSO generated stack frames to
manually walk stack frames without DWARF information
- Rework perf_callchain_user() and arch_stack_walk_user() functions to
reduce code duplication
- Skip first stack frame when walking user stack
- Add basic checks to identify invalid instruction pointers when
walking stack frames
- Introduce and use struct stack_frame_vdso_wrapper within vDSO user
wrapper code to automatically generate an asm-offset define. Also use
STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document
that the code works with user space stack
- Clear the backchain of the extra stack frame added by the vDSO user
wrapper code. This allows the user stack walker to detect and skip
the non-standard stack frame. Without this an incorrect instruction
pointer would be added to stack traces.
- Rewrite psw_idle() function in C to ease maintenance and further
enhancements
- Remove get_vtimer() function and use get_cpu_timer() instead
- Mark psw variable in __load_psw_mask() as __unitialized to avoid
superfluous clearing of PSW
- Remove obsolete and superfluous comment about removed TIF_FPU flag
- Replace memzero_explicit() and kfree() with kfree_sensitive() to fix
warnings reported by Coccinelle
- Wipe sensitive data and all copies of protected- or secure-keys from
stack when an IOCTL fails
- Both do_airq_interrupt() and do_io_interrupt() functions set
CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code
- Provide iucv_alloc_device() and iucv_release_device() helpers, which
can be used to deduplicate more or less identical IUCV device
allocation and release code in four different drivers
- Make use of iucv_alloc_device() and iucv_release_device() helpers to
get rid of quite some code and also remove a cast to an incompatible
function (clang W=1)
- There is no user of iucv_root outside of the core IUCV code left.
Therefore remove the EXPORT_SYMBOL
- __apply_alternatives() contains a runtime check which verifies that
the size of the to be patched code area is even. Convert this to a
compile time check
- Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands
from 128 to 240
- Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally
allowed
- Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of
IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL
block on 'scp_data' sysfs attribute update
- Initialize the correct fields of the NVMe dump block, which were
confused with FCP fields
- Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce
code duplication
- Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such
as dumpconf passing additional kernel command line parameters to a
stand-alone dumper
- Rework the CPACF query functions to use the correct RRE or RRF
instruction formats and set instruction register fields correctly
- Instead of calling BUG() at runtime force a link error during compile
when a unsupported opcode is used with __cpacf_query() or
__cpacf_check_opcode() functions
- Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask
or /sys/bus/ap/aqmask sysfs file update with a relative mask value
- Fix "bindings complete" udev event which should be sent once all AP
devices have been bound to device drivers and again when unbind/bind
actions take place and all AP devices are bound again
- Facility list alt_stfle_fac_list is nowhere used in the decompressor,
therefore remove it there
- Remove custom kprobes insn slot allocator in favour of the standard
module_alloc() one, since kernel image and module areas are located
within 4GB
- Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid
calling memset() with a large byte count and get rid of the sparse
warning as result
* tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390/zcrypt: Use kvcalloc() instead of kvmalloc_array()
s390/kprobes: Remove custom insn slot allocator
s390/boot: Remove alt_stfle_fac_list from decompressor
s390/ap: Fix bind complete udev event sent after each AP bus scan
s390/ap: Fix crash in AP internal function modify_bitmap()
s390/cpacf: Make use of invalid opcode produce a link error
s390/cpacf: Split and rework cpacf query functions
s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl
s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data'
s390/ipl: Fix incorrect initialization of nvme dump block
s390/ipl: Fix incorrect initialization of len fields in nvme reipl block
s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length
s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds
s390/alternatives: Convert runtime sanity check into compile time check
s390/iucv: Unexport iucv_root
tty: hvc-iucv: Make use of iucv_alloc_device()
s390/smsgiucv_app: Make use of iucv_alloc_device()
s390/netiucv: Make use of iucv_alloc_device()
s390/vmlogrdr: Make use of iucv_alloc_device()
s390/iucv: Provide iucv_alloc_device() / iucv_release_device()
...
documented (hopefully adequately) in the respective changelogs. Notable
series include:
- Lucas Stach has provided some page-mapping
cleanup/consolidation/maintainability work in the series "mm/treewide:
Remove pXd_huge() API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being allocated:
number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in largely
similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
Weiner has fixed the page allocator's handling of migratetype requests,
with resulting improvements in compaction efficiency.
- In the series "make the hugetlb migration strategy consistent" Baolin
Wang has fixed a hugetlb migration issue, which should improve hugetlb
allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when memory
almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting" Kairui
Song has optimized pagecache insertion, yielding ~10% performance
improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various page->flags
cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
"mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs. This
is a simple first-cut implementation for now. The series is "support
multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in the
series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts in
the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes
the initialization code so that migration between different memory types
works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant driver
in the series "mm: follow_pte() improvements and acrn follow_pte()
fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to folio
in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size THP's
in the series "mm: add per-order mTHP alloc and swpout counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His series
"mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback instrumentation
in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series "Fix
and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
series "Improve anon_vma scalability for anon VMAs". Intel's test bot
reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
=V3R/
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent
code generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
ZCSnZ31h7woGfNho
=D5B/
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent code
generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
kconfig: use sym_get_choice_menu() in sym_check_prop()
rapidio: remove choice for enumeration
kconfig: lxdialog: remove initialization with A_NORMAL
kconfig: m/nconf: merge two item_add_str() calls
kconfig: m/nconf: remove dead code to display value of bool choice
kconfig: m/nconf: remove dead code to display children of choice members
kconfig: gconf: show checkbox for choice correctly
kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
Makefile: remove redundant tool coverage variables
kbuild: provide reasonable defaults for tool coverage
modules: Drop the .export_symbol section from the final modules
kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
kconfig: use sym_get_choice_menu() in conf_write_defconfig()
kconfig: add sym_get_choice_menu() helper
kconfig: turn defaults and additional prompt for choice members into error
kconfig: turn missing prompt for choice members into error
kconfig: turn conf_choice() into void function
kconfig: use linked list in sym_set_changed()
kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
kconfig: gconf: remove debug code
...
- tracing/probes: Adding new pseudo-types %pd and %pD support for dumping
dentry name from 'struct dentry *' and file name from 'struct file *'.
- uprobes: Some performance optimizations have been done.
. Speed up the BPF uprobe event by delaying the fetching of the uprobe
event arguments that are not used in BPF.
. Avoid locking by speculatively checking whether uprobe event is valid.
. Reduce lock contention by using read/write_lock instead of spinlock for
uprobe list operation. This improved BPF uprobe benchmark result 43% on
average.
- rethook: Removes non-fatal warning messages when tracing stack from BPF
and skip rcu_is_watching() validation in rethook if possible.
- objpool: Optimizing objpool (which is used by kretprobes and fprobe as
rethook backend storage) by inlining functions and avoid caching nr_cpu_ids
because it is a const value.
- fprobe: Add entry/exit callbacks types (code cleanup)
- kprobes: Check ftrace was killed in kprobes if it uses ftrace.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmZFUxsbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b+fIH/A96/SeC5WRLhXmHfTCM
IvKUea2n0b0oV/2pVfHqfkCBTICuUZ97Opd9VH9jLtjBOTh0fUOGZ2DNVGdSYfWm
IIkS5dhuZxHXrSHEVYykwLHI3AOL7Q6Ny9EmOg1CNMidUkPMNtBvppsBYPlFU/B/
qQJAvOdkVOnNITCaas0+MNgepoVVKdJzdNQ1I4WrGyG8isCZBaCYKo2QcGyheCNN
y8NXvnVHgmgHQ8nTaeE5AawclFzFnhwHfPQPe1kiyGrx15b8K+VYmaZxPKv33A1a
KT3TKJ1Ep7s7iWFh2iPVJzIwOXCmSnvNTKfNx/MDuKtO7UVfFwytoMEaekbmv3bG
VqM=
=n/mW
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
- tracing/probes: Add new pseudo-types %pd and %pD support for dumping
dentry name from 'struct dentry *' and file name from 'struct file *'
- uprobes performance optimizations:
- Speed up the BPF uprobe event by delaying the fetching of the
uprobe event arguments that are not used in BPF
- Avoid locking by speculatively checking whether uprobe event is
valid
- Reduce lock contention by using read/write_lock instead of
spinlock for uprobe list operation. This improved BPF uprobe
benchmark result 43% on average
- rethook: Remove non-fatal warning messages when tracing stack from
BPF and skip rcu_is_watching() validation in rethook if possible
- objpool: Optimize objpool (which is used by kretprobes and fprobe as
rethook backend storage) by inlining functions and avoid caching
nr_cpu_ids because it is a const value
- fprobe: Add entry/exit callbacks types (code cleanup)
- kprobes: Check ftrace was killed in kprobes if it uses ftrace
* tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobe/ftrace: bail out if ftrace was killed
selftests/ftrace: Fix required features for VFS type test case
objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
objpool: enable inlining objpool_push() and objpool_pop() operations
rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
ftrace: make extra rcu_is_watching() validation check optional
uprobes: reduce contention on uprobes_tree access
rethook: Remove warning messages printed for finding return address of a frame.
fprobe: Add entry/exit callbacks types
selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
Documentation: tracing: add new type '%pd' and '%pD' for kprobe
tracing/probes: support '%pD' type for print struct file's name
tracing/probes: support '%pd' type for print struct dentry's name
uprobes: add speculative lockless system-wide uprobe filter check
uprobes: prepare uprobe args buffer lazily
uprobes: encapsulate preparation of uprobe args buffer
Since commit c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address
spaces") the kernel image and module area are within the same 4GB area.
This eliminates the need of a custom insn slot allocator for kprobes within
the kernel image, since standard module_alloc() allocated pages are
sufficient for PC relative instructions with a signed 32 bit offset.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
It is nowhere used in the decompressor, therefore remove it.
Fixes: 17e89e1340 ("s390/facilities: move stfl information from lowcore to global data")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.
This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:
[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer
sudo perf probe --add commit_creds
sudo perf trace -e probe:commit_creds
# In another terminal
make
sudo insmod ftrace_killer.ko # calls ftrace_kill(), simulating bug
# Back to perf terminal
# ctrl-c
sudo perf probe --del commit_creds
After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.
Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This is analogous to the reipl's sysfs attribute named equally and enables
tools such as s390-tools' dumpconf to pass additional kernel cmdline
parameters to a stand-alone dumper such as zfcpdump (e.g. to enable
debug output with 'dump_debug' parameter) or ngdump.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a refactoring change to reduce code duplication and improve code
reuse.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Initialize the correct fields of the nvme dump block.
This bug had not been detected before because first, the fcp and nvme fields
of struct ipl_parameter_block are part of the same union and, therefore,
overlap in memory and second, they are identical in structure and size.
Fixes: d70e38cb1d ("s390: nvme dump support")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use correct symbolic constants IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN
to initialize nvme reipl block when 'scp_data' sysfs attribute is
being updated. This bug had not been detected before because
the corresponding fcp and nvme symbolic constants are equal.
Fixes: 23a457b8d5 ("s390: nvme reipl")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The old implementation of vmcmd sysfs string attributes truncated passed
z/VM CP diagnose X'008' commands which were longer than the max allowed
number of characters but the reported number of written characters was
still equal to the entire length of a given string. This can result in
silent failures of some s390-tools (e.g. dumpconf) which can be very hard
to detect. Therefore, this commit makes a write attempt to a vmcmd sysfs
attribute
* fail with E2BIG error if a given string is longer than the maximum
allowed one
* never destroy the old data in the vmcmd sysfs attribute if the new data
doesn't fit into it entirely
* return the actual number of written characters if it succeeds
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
z/VM CP diagnose X'008' accepts commands of max 240 characters.
Using a smaller value as a buffer size makes kernel send truncated CP
commands which are longer than the old buffer size. This can result in
invalid CP commands passed to z/VM.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
__apply_alternatives() contains a runtime check which verifies that the
size of the to be patched code area is even. Convert this to a compile time
check using a similar ".org" trick, which is already used to verify that
old and new code areas have the same size.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Both do_airq_interrupt() and do_io_interrupt() set
CIF_NOHZ_DELAY. Move it to do_io_irq() to simplify
the code.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Now Kbuild provides reasonable defaults for objtool, sanitizers, and
profilers.
Remove redundant variables.
Note:
This commit changes the coverage for some objects:
- include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV
- include arch/sparc/vdso/vdso-image-*.o into UBSAN
- include arch/sparc/vdso/vma.o into UBSAN
- include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vma.o into GCOV, KCOV
- include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV
I believe these are positive effects because all of them are kernel
space objects.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
It has been removed in commit 2c6b96762f ("s390/fpu: remove TIF_FPU"),
so we should not mention TIF_FPU in the comment here anymore. Since the
remaining parts of the comment just document the obvious fact that
save_user_fpu_regs() saves the FPU state, simply remove the comment now
completely.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20240503080648.81461-1-thuth@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Instead of implementing get_vtimer() use get_cpu_timer()
which does the same.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
To ease maintenance and further enhancements, convert
the psw_idle() function to C.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Clear the backchain of the extra stack frame added by the vdso user wrapper
code. This allows the user stack walker to detect and skip the non-standard
stack frame. Without this an incorrect instruction pointer would be added
to stack traces, and stack frame walking would be continued with a more or
less random back chain.
Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Introduce and use struct stack_frame_vdso_wrapper within vdso user wrapper
code. With this structure it is possible to automatically generate an
asm-offset define which can be used to save and restore the return address
of the calling function.
Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to
document that the code works with user space stack frames with the standard
stack frame layout.
Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Add basic checks to identify invalid instruction pointers when walking
stack frames:
Instruction pointers must
- have even addresses
- be larger than mmap_min_addr
- lower than the asce_limit of the process
Alternatively it would also be possible to walk page tables similar to fast
GUP and verify that the mapping of the corresponding page is executable,
however that seems to be overkill.
Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
When walking user stack frames the first stack frame (where the stack
pointer points to) should be skipped: the return address of the current
function is saved in the previous stack frame, not the current stack frame,
which is allocated for to be called functions.
Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The two functions perf_callchain_user() and arch_stack_walk_user() are
nearly identical. Reduce code duplication and add a common helper which can
be called by both functions.
Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
By default user space is compiled with standard stack frame layout and not
with the packed stack layout. The vdso code however inherited the
-mpacked-stack compiler option from the kernel. Remove this option to make
sure the vdso is compiled with standard stack frame layout.
This makes sure that the stack frame backchain location for vdso generated
stack frames is the same like for calling code (if compiled with default
options). This allows to manually walk stack frames without DWARF
information, like the kernel is doing it e.g. with arch_stack_walk_user().
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
GDB fails to unwind vDSO functions with error message "PC not saved",
for instance when stepping through gettimeofday().
Add -fasynchronous-unwind-tables to CFLAGS to generate .eh_frame
DWARF unwind information for the vDSO C modules.
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
execmem does not depend on modules, on the contrary modules use
execmem.
To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.
This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.
The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
module_alloc() is used everywhere as a mean to allocate memory for code.
Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.
Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.
Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.
Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.
Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.
No functional changes.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
- Add cpufreq pressure feedback for the scheduler
- Rework misfit load-balancing wrt. affinity restrictions
- Clean up and simplify the code around ::overutilized and
::overload access.
- Simplify sched_balance_newidle()
- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.
- Rework & clean up <asm/vtime.h> interactions wrt. arch_vtime_task_switch()
- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix.
- Simplify the balancing flag code (sched_balance_running)
- Miscellaneous cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBtA0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gQEw//WiCiV7zTlWShSiG/g8GTfoAvl53QTWXF
0jQ8TUcoIhxB5VeGgxVG1srYt8f505UXjH7L0MJLrbC3nOgRCg4NK57WiQEachKK
HORIJHT0tMMsKIwX9D5Ovo4xYJn+j7mv7j/caB+hIlzZAbWk+zZPNWcS84p0ZS/4
appY6RIcp7+cI7bisNMGUuNZS14+WMdWoX3TgoI6ekgDZ7Ky+kQvkwGEMBXsNElO
qZOj6yS/QUE4Htwz0tVfd6h5svoPM/VJMIvl0yfddPGurfNw6jEh/fjcXnLdAzZ6
9mgcosETncQbm0vfSac116lrrZIR9ygXW/yXP5S7I5dt+r+5pCrBZR2E5g7U4Ezp
GjX1+6J9U6r6y12AMLRjadFOcDvxdwtszhZq4/wAcmS3B9dvupnH/w7zqY9ho3wr
hTdtDHoAIzxJh7RNEHgeUC0/yQX3wJ9THzfYltDRIIjHTuvl4d5lHgsug+4Y9ClE
pUIQm/XKouweQN9TZz2ULle4ZhRrR9sM9QfZYfirJ/RppmuKool4riWyQFQNHLCy
mBRMjFFsTpFIOoZXU6pD4EabOpWdNrRRuND/0yg3WbDat2gBWq6jvSFv2UN1/v7i
Un5jijTuN7t8yP5lY5Tyf47kQfLlA9bUx1v56KnF9mrpI87FyiDD3MiQVhDsvpGX
rP96BIOrkSo=
=obph
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Add cpufreq pressure feedback for the scheduler
- Rework misfit load-balancing wrt affinity restrictions
- Clean up and simplify the code around ::overutilized and
::overload access.
- Simplify sched_balance_newidle()
- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.
- Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()
- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix
- Simplify the balancing flag code (sched_balance_running)
- Miscellaneous cleanups & fixes
* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/pelt: Remove shift of thermal clock
sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
thermal/cpufreq: Remove arch_update_thermal_pressure()
sched/cpufreq: Take cpufreq feedback into account
cpufreq: Add a cpufreq pressure feedback for the scheduler
sched/fair: Fix update of rd->sg_overutilized
sched/vtime: Do not include <asm/vtime.h> header
s390/irq,nmi: Include <asm/vtime.h> header directly
s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
sched/vtime: Get rid of generic vtime_task_switch() implementation
sched/vtime: Remove confusing arch_vtime_task_switch() declaration
sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
sched/fair: Rename root_domain::overload to ::overloaded
sched/fair: Use helper functions to access root_domain::overload
sched/fair: Check root_domain::overload value before update
sched/fair: Combine EAS check with root_domain::overutilized access
sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
...
- Rework the AP initialization and add missing cleanups to the error path
- Swap IRQ and AP bus/device registration to avoid race conditions
- Export prot_virt_guest symbol
- Introduce AP configuration changes notifier interface to facilitate
modularization of the AP bus
- Add CONFIG_AP kernel configuration option to allow modularization of
the AP bus
- Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description and
dependency and rename it to CONFIG_AP_DEBUG
- Convert sprintf() and snprintf() to sysfs_emit() in CIO code
- Adjust indentation of RELOCS command build step
- Make crypto performance counters upward compatible
- Convert make_page_secure() and gmap_make_secure() to use folio
- Rework channel-utilization-block (CUB) handling in preparation of
introducing additional CUBs
- Use attribute groups to simplify registration, removal and extension
of measurement-related channel-path sysfs attributes
- Add a per-channel-path binary "ext_measurement" sysfs attribute that
provides access to extended channel-path measurement data
- Export measurement data for all channel-measurement-groups (CMG), not
only for a specific ones. This enables support of new CMG data formats
in userspace without the need for kernel changes
- Add a per-channel-path sysfs attribute "speed_bps" that provides the
operating speed in bits per second or 0 if the operating speed is not
available
- The CIO tracepoint subchannel-type field "st" is incorrectly set to
the value of subchannel-enabled SCHIB "ena" field. Fix that
- Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS
- Consider the maximum physical address available to a DCSS segment
(512GB) when memory layout is set up
- Simplify the virtual memory layout setup by reducing the size of
identity mapping vs vmemmap overlap
- Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
This will allow to place the kernel image next to kernel modules
- Move everyting KASLR related from <asm/setup.h> to <asm/page.h>
- Put virtual memory layout information into a structure to improve
code generation
- Currently __kaslr_offset is the kernel offset in both physical and
virtual memory spaces. Uncouple these offsets to allow uncoupling
of the addresses spaces
- Currently the identity mapping base address is implicit and is always
set to zero. Make it explicit by putting into __identity_base persistent
boot variable and use it in proper context
- Introduce .amode31 section start and end macros AMODE31_START and
AMODE31_END
- Introduce OS_INFO entries that do not reference any data in memory,
but rather provide only values
- Store virtual memory layout in OS_INFO. It is read out by makedumpfile,
crash and other tools
- Store virtual memory layout in VMCORE_INFO. It is read out by crash and
other tools when /proc/kcore device is used
- Create additional PT_LOAD ELF program header that covers kernel image
only, so that vmcore tools could locate kernel text and data when virtual
and physical memory spaces are uncoupled
- Uncouple physical and virtual address spaces
- Map kernel at fixed location when KASLR mode is disabled. The location is
defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration value.
- Rework deployment of kernel image for both compressed and uncompressed
variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration
value
- Move .vmlinux.relocs section in front of the compressed kernel.
The interim section rescue step is avoided as result
- Correct modules thunk offset calculation when branch target is more
than 2GB away
- Kernel modules contain their own set of expoline thunks. Now that the
kernel modules area is less than 4GB away from kernel expoline thunks,
make modules use kernel expolines. Also make EXPOLINE_EXTERN the default
if the compiler supports it
- userfaultfd can insert shared zeropages into processes running VMs,
but that is not allowed for s390. Fallback to allocating a fresh
zeroed anonymous folio and insert that instead
- Re-enable shared zeropages for non-PV and non-skeys KVM guests
- Rename hex2bitmap() to ap_hex2bitmap() and export it for external use
- Add ap_config sysfs attribute to provide the means for setting or
displaying adapters, domains and control domains assigned to a vfio-ap
mediated device in a single operation
- Make vfio_ap_mdev_link_queue() ignore duplicate link requests
- Add write support to ap_config sysfs attribute to allow atomic update
a vfio-ap mediated device state
- Document ap_config sysfs attribute
- Function os_info_old_init() is expected to be called only from a regular
kdump kernel. Enable it to be called from a stand-alone dump kernel
- Address gcc -Warray-bounds warning and fix array size in struct os_info
- s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks
- Use unwinder instead of __builtin_return_address() with ftrace to
prevent returning of undefined values
- Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled
- Compile kernel with -fPIC and link with -no-pie to allow kpatch feature
always succeed and drop the whole CONFIG_PIE_BUILD option-enabled code
- Add missing virt_to_phys() converter for VSIE facility and crypto
control blocks
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZjkp5xccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8D99AQCEby+KHssuZe9m0NvvikWREYBC
myqob4EmdU3KdTEbNAEAt2OB7mzSQc90yjawI+Je7vwVyh3uc2Nb4Qg05yO6owI=
=eOYN
-----END PGP SIGNATURE-----
Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Store AP Query Configuration Information in a static buffer
- Rework the AP initialization and add missing cleanups to the error
path
- Swap IRQ and AP bus/device registration to avoid race conditions
- Export prot_virt_guest symbol
- Introduce AP configuration changes notifier interface to facilitate
modularization of the AP bus
- Add CONFIG_AP kernel configuration option to allow modularization of
the AP bus
- Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description
and dependency and rename it to CONFIG_AP_DEBUG
- Convert sprintf() and snprintf() to sysfs_emit() in CIO code
- Adjust indentation of RELOCS command build step
- Make crypto performance counters upward compatible
- Convert make_page_secure() and gmap_make_secure() to use folio
- Rework channel-utilization-block (CUB) handling in preparation of
introducing additional CUBs
- Use attribute groups to simplify registration, removal and extension
of measurement-related channel-path sysfs attributes
- Add a per-channel-path binary "ext_measurement" sysfs attribute that
provides access to extended channel-path measurement data
- Export measurement data for all channel-measurement-groups (CMG), not
only for a specific ones. This enables support of new CMG data
formats in userspace without the need for kernel changes
- Add a per-channel-path sysfs attribute "speed_bps" that provides the
operating speed in bits per second or 0 if the operating speed is not
available
- The CIO tracepoint subchannel-type field "st" is incorrectly set to
the value of subchannel-enabled SCHIB "ena" field. Fix that
- Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS
- Consider the maximum physical address available to a DCSS segment
(512GB) when memory layout is set up
- Simplify the virtual memory layout setup by reducing the size of
identity mapping vs vmemmap overlap
- Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
This will allow to place the kernel image next to kernel modules
- Move everyting KASLR related from <asm/setup.h> to <asm/page.h>
- Put virtual memory layout information into a structure to improve
code generation
- Currently __kaslr_offset is the kernel offset in both physical and
virtual memory spaces. Uncouple these offsets to allow uncoupling of
the addresses spaces
- Currently the identity mapping base address is implicit and is always
set to zero. Make it explicit by putting into __identity_base
persistent boot variable and use it in proper context
- Introduce .amode31 section start and end macros AMODE31_START and
AMODE31_END
- Introduce OS_INFO entries that do not reference any data in memory,
but rather provide only values
- Store virtual memory layout in OS_INFO. It is read out by
makedumpfile, crash and other tools
- Store virtual memory layout in VMCORE_INFO. It is read out by crash
and other tools when /proc/kcore device is used
- Create additional PT_LOAD ELF program header that covers kernel image
only, so that vmcore tools could locate kernel text and data when
virtual and physical memory spaces are uncoupled
- Uncouple physical and virtual address spaces
- Map kernel at fixed location when KASLR mode is disabled. The
location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration
value.
- Rework deployment of kernel image for both compressed and
uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel
configuration value
- Move .vmlinux.relocs section in front of the compressed kernel. The
interim section rescue step is avoided as result
- Correct modules thunk offset calculation when branch target is more
than 2GB away
- Kernel modules contain their own set of expoline thunks. Now that the
kernel modules area is less than 4GB away from kernel expoline
thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN
the default if the compiler supports it
- userfaultfd can insert shared zeropages into processes running VMs,
but that is not allowed for s390. Fallback to allocating a fresh
zeroed anonymous folio and insert that instead
- Re-enable shared zeropages for non-PV and non-skeys KVM guests
- Rename hex2bitmap() to ap_hex2bitmap() and export it for external use
- Add ap_config sysfs attribute to provide the means for setting or
displaying adapters, domains and control domains assigned to a
vfio-ap mediated device in a single operation
- Make vfio_ap_mdev_link_queue() ignore duplicate link requests
- Add write support to ap_config sysfs attribute to allow atomic update
a vfio-ap mediated device state
- Document ap_config sysfs attribute
- Function os_info_old_init() is expected to be called only from a
regular kdump kernel. Enable it to be called from a stand-alone dump
kernel
- Address gcc -Warray-bounds warning and fix array size in struct
os_info
- s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks
- Use unwinder instead of __builtin_return_address() with ftrace to
prevent returning of undefined values
- Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is
disabled
- Compile kernel with -fPIC and link with -no-pie to allow kpatch
feature always succeed and drop the whole CONFIG_PIE_BUILD
option-enabled code
- Add missing virt_to_phys() converter for VSIE facility and crypto
control blocks
* tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
Revert "s390: Relocate vmlinux ELF data to virtual address space"
KVM: s390: vsie: Use virt_to_phys for crypto control block
s390: Relocate vmlinux ELF data to virtual address space
s390: Compile kernel with -fPIC and link with -no-pie
s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
s390/ftrace: Use unwinder instead of __builtin_return_address()
s390/pci: Drop unneeded reference to CONFIG_DMI
s390/os_info: Fix array size in struct os_info
s390/os_info: Initialize old os_info in standalone dump kernel
docs: Update s390 vfio-ap doc for ap_config sysfs attribute
s390/vfio-ap: Add write support to sysfs attr ap_config
s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue
s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state
s390/ap: Externalize AP bus specific bitmap reading function
s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
mm/userfaultfd: Do not place zeropages when zeropages are disallowed
s390/expoline: Make modules use kernel expolines
s390/nospec: Correct modules thunk offset calculation
s390/boot: Do not rescue .vmlinux.relocs section
s390/boot: Rework deployment of the kernel image
...
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:
src := $(obj)
When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.
This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.
To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.
Going forward, the variables used in Makefiles will have the following
meanings:
$(obj) - directory in the object tree
$(src) - directory in the source tree (changed by this commit)
$(objtree) - the top of the kernel object tree
$(srctree) - the top of the kernel source tree
Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
These are generated files. Prefix them with $(obj)/ instead of $(src)/.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
When the kernel is built with CONFIG_PIE_BUILD option enabled it
uses dynamic symbols, for which the linker does not allow more
than 64K number of entries. This can break features like kpatch.
Hence, whenever possible the kernel is built with CONFIG_PIE_BUILD
option disabled. For that support of unaligned symbols generated by
linker scripts in the compiler is necessary.
However, older compilers might lack such support. In that case the
build process resorts to CONFIG_PIE_BUILD option-enabled build.
Compile object files with -fPIC option and then link the kernel
binary with -no-pie linker option.
As result, the dynamic symbols are not generated and not only kpatch
feature succeeds, but also the whole CONFIG_PIE_BUILD option-enabled
code could be dropped.
[ agordeev: Reworded the commit message ]
Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
option is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled.
[ agordeev: Reworded the commit message ]
Fixes: 778666df60 ("s390: compile relocatable kernel without -fPIE")
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Using __builtin_return_address(n) might return undefined values
when used with values of n outside of the stack. This was noticed
when __builtin_return_address() was called in ftrace on top level
functions like the interrupt handlers.
As this behaviour cannot be fixed, use the s390 stack unwinder and
remove the ftrace compilation flags for unwind_bc.c and stacktrace.c
to prevent the unwinding function polluting function traces.
Another advantage is that this also works with clang.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
gcc's -Warray-bounds warned about an out-of-bounds access to
the entry array contained in struct os_info. This doesn't trigger
a bug right now because there's a large reserved space after the
array. Nevertheless fix this, and also add a BUILD_BUG_ON to make
sure struct os_info is always exactly on page in size.
Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The commit be42660d0c13 ("s390/crash: use old os_info to create PT_LOAD headers")
introduced use of the old os_info into standalone dump kernel.
Before this change os_info_old_init() expected to be called only from
a regular kdump kernel although the function itself is able to work
in standalone dump kernels as well (because copy_oldmem_kernel() is able
to handle both use cases). Therefore, fix the expectation of os_info_old_init()
and enable it to be called from a standalone dump kernel.
Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The return-address (RA) register r14 is specified as volatile in the
s390x ELF ABI [1]. Nevertheless proper CFI directives must be provided
for an unwinder to restore the return address, if the RA register
value is changed from its value at function entry, as it is the case.
[1]: s390x ELF ABI, https://github.com/IBM/s390x-abi/releases
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The recently added check to figure out if a fault happened on gmap ASCE
dereferences the gmap pointer in lowcore without checking that it is not
NULL. For all non-KVM processes the pointer is NULL, so that some value
from lowcore will be read. With the current layouts of struct gmap and
struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead
to any observable bug; at least currently.
Fix this by adding the missing NULL pointer check.
Fixes: 64c3431808 ("s390/entry: compare gmap asce to determine guest/host fault")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Fix offset calculation when branch target is more then 2Gb away.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework deployment of kernel image for both compressed and
uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED
kernel configuration variable.
In case CONFIG_KERNEL_UNCOMPRESSED is disabled avoid uncompressing
the kernel to a temporary buffer and copying it to the target
address. Instead, uncompress it directly to the target destination.
In case CONFIG_KERNEL_UNCOMPRESSED is enabled avoid moving the
kernel to default 0x100000 location when KASLR is disabled or
failed. Instead, use the uncompressed kernel image directly.
In case KASLR is disabled or failed .amode31 section location in
memory is not randomized and precedes the kernel image. In case
CONFIG_KERNEL_UNCOMPRESSED is disabled that location overlaps the
area used by the decompression algorithm. That is fine, since that
area is not used after the decompression finished and the size of
.amode31 section is not expected to exceed BOOT_HEAP_SIZE ever.
There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is
enabled. Therefore, rename decompress_kernel() to deploy_kernel(),
which better describes both uncompressed and compressed cases.
Introduce AMODE31_SIZE macro to avoid immediate value of 0x3000
(the size of .amode31 section) in the decompressor linker script.
Modify the vmlinux linker script to force the size of .amode31
section to AMODE31_SIZE (the value of (_eamode31 - _samode31)
could otherwise differ as result of compiler options used).
Introduce __START_KERNEL macro that defines the kernel ELF image
entry point and set it to the currrent value of 0x100000.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
The vmcore ELF program headers describe virtual memory
regions of a crashed kernel. User level tools use that
information for the kernel text and data analysis (e.g
vmcore-dmesg extracts the kernel log).
Currently the kernel image is covered by program headers
describing the identity mapping regions. But in the future
the kernel image will be mapped into separate region outside
of the identity mapping. Create the additional ELF program
header that covers kernel image only, so that vmcore tools
could locate kernel text and data.
Further, the identity mapping in crashed and capture kernels
will have different base address. Due to that __va() macro
can not be used in the capture kernel. Instead, read crashed
kernel identity mapping base address from os_info and use
it for PT_LOAD type program headers creation.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
The virtual memory layout is needed for address translation
by crash tool when /proc/kcore device is used as the memory
image.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
The virtual memory layout will be read out by makedumpfile,
crash and other user tools for virtual address translation.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Introduce entries that do not reference any data in memory,
but rather provide values. Set the size of such entries to
zero and do not compute checksum for them, since there is no
data which integrity needs to be checked. The integrity of
the value entries itself is still covered by the os_info
checksum.
Reserve the lowest unused entry index OS_INFO_RESERVED for
future use - presumably for the number of entries present.
That could later be used by user level tools. The existing
tools would not notice any difference.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Introduce .amode31 section address range AMODE31_START
and AMODE31_END macros for later use.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Currently the identity mapping base address is implicit
and is always set to zero. Make it explicit by putting
into __identity_base persistent boot variable and use it
in proper context - which is the value of PAGE_OFFSET.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Put virtual memory layout information into a structure
to improve code generation when accessing the structure
members, which are currently only ident_map_size and
__kaslr_offset.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but included
via <linux/vtime.h> header implicitly. Avoid the extra loop and
include <asm/vtime.h> header directly.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/3fb696637c0eb7e9d6ffd6cbf9e647d7c5986b3d.1712760275.git.agordeev@linux.ibm.com
Remove uses of deprecated page APIs, and move the check for large
folios to here to avoid taking the folio lock if the folio is too large.
We could do better here by attempting to split the large folio, but I'll
leave that improvement for someone who can test it.
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240322161149.2327518-3-willy@infradead.org
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
These page APIs are deprecated, so convert the incoming page to a folio
and use the folio APIs instead. The ultravisor API cannot handle large
folios, so return -EINVAL if one has slipped through.
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240322161149.2327518-2-willy@infradead.org
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The CPU Measurement facility crypto counter set functionality
is defined by the Second Counter Version Number. This number
varies between machine types, but is upward compatible.
Lessen the checks to reflect this behavior.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The inline function is_prot_virt_guest() in asm/uv.h makes use of the
prot_virt_guest symbol. As this inline function can be called by other
parts of the kernel (modules and built-in), the symbol should be
exported, similar to the prot_virt_host symbol.
One consumer of is_prot_virt_guest() will be the ap bus code.
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Holger Dengler <dengler@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Align system call table on 8 bytes. With sys_call_table entry size
of 8 bytes that eliminates the possibility of a system call pointer
crossing cache line boundary.
Cc: stable@kernel.org
Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In case of a sampling event, the PAI PMU device drivers need a
reference to this event. Currently to PMU device driver reference
is removed when a sampling event is destroyed. This may lead to
situations where the reference of the PMU device driver is removed
while being used by a different sampling event.
Reset the event reference pointer of the PMU device driver when
a sampling event is deleted and before the next one might be added.
Fixes: 39d62336f5 ("s390/pai: add support for cryptography counters")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Various virtual vs physical address usage fixes
- Add new bitwise types and helper functions and use them in s390 specific
drivers and code to make it easier to find virtual vs physical address
usage bugs. Right now virtual and physical addresses are identical for
s390, except for module, vmalloc, and similar areas. This will be
changed, hopefully with the next merge window, so that e.g. the kernel
image and modules will be located close to each other, allowing for
direct branches and also for some other simplifications.
As a prerequisite this requires to fix all misuses of virtual and
physical addresses. As it turned out people are so used to the concept
that virtual and physical addresses are the same, that new bugs got added
to code which was already fixed. In order to avoid that even more code
gets merged which adds such bugs add and use new bitwise types, so that
sparse can be used to find such usage bugs.
Most likely the new types can go away again after some time
- Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation
- Fix kprobe branch handling: if an out-of-line single stepped relative
branch instruction has a target address within a certain address area in
the entry code, the program check handler may incorrectly execute cleanup
code as if KVM code was executed, leading to crashes
- Fix reference counting of zcrypt card objects
- Various other small fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmX5lIYACgkQIg7DeRsp
bsJYxA//cYGSaopFLuHxQndQi5UcMx0NMcaD9auMyDraPWJH0/F+vIWnLPxgJ1wB
zfav3Ssvdd/41rPIXHSGxUzXzv+EHCY5+91t4plfCxWtd1SbpJ8gG2vrXTH512ql
RhJ7crRFQqoigIxlsdwUkwGSqd+L6H73ikKzQyAaJFKC/e9BEpCH4zb8NuTqCeJE
a81/A8BGSIo4Fk4qj1T5bHZzkznmxtisMPGXzoKa28LKhzwqqbGZYeHohnkaT6co
UpY/HMdhGH74kkKqpF0cLDI0Bd8ptfjbcVcKyJ8OD7vCfinIM3bZdYg0KIzyRMhu
d49JDXctPiePXTCi8X+AJnhNYFGlHuDciEpTMzkw97kwhmCfAOW5UaAlBo8dJqat
zt76Cxbl3D+hYI7Wbs9lt2QsXso4/1fMMNQbu9pwyMKxHlCBVe+ZqNIl0gP8OAXZ
51pghcLlcwYNeYRkSzvfEhO9aeUsRg40O5UBCklVMRScrx7qie2wsllEFavb7vZd
Ejv89dvn0KtzYIHG+U5SQ9d0x1JL9qApVM06sCGZGsBM9r4hHFGa8x57Pr+51ZPs
j0qbr7iAWgCGXN8c3p/Nrev6eqBio81CpD9PWik7QpJS38EKkussqfPdQitgQpsi
7r0Sx51oBzyAVadmVAf8/flUrbJTvD3BdbkzN99W4BXyARJk7CI=
=Vh2x
-----END PGP SIGNATURE-----
Merge tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Add new bitwise types and helper functions and use them in s390
specific drivers and code to make it easier to find virtual vs
physical address usage bugs.
Right now virtual and physical addresses are identical for s390,
except for module, vmalloc, and similar areas. This will be changed,
hopefully with the next merge window, so that e.g. the kernel image
and modules will be located close to each other, allowing for direct
branches and also for some other simplifications.
As a prerequisite this requires to fix all misuses of virtual and
physical addresses. As it turned out people are so used to the
concept that virtual and physical addresses are the same, that new
bugs got added to code which was already fixed. In order to avoid
that even more code gets merged which adds such bugs add and use new
bitwise types, so that sparse can be used to find such usage bugs.
Most likely the new types can go away again after some time
- Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation
- Fix kprobe branch handling: if an out-of-line single stepped relative
branch instruction has a target address within a certain address area
in the entry code, the program check handler may incorrectly execute
cleanup code as if KVM code was executed, leading to crashes
- Fix reference counting of zcrypt card objects
- Various other small fixes and cleanups
* tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
s390/entry: compare gmap asce to determine guest/host fault
s390/entry: remove OUTSIDE macro
s390/entry: add CIF_SIE flag and remove sie64a() address check
s390/cio: use while (i--) pattern to clean up
s390/raw3270: make class3270 constant
s390/raw3270: improve raw3270_init() readability
s390/tape: make tape_class constant
s390/vmlogrdr: make vmlogrdr_class constant
s390/vmur: make vmur_class constant
s390/zcrypt: make zcrypt_class constant
s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support
s390/vfio_ccw_cp: use new address translation helpers
s390/iucv: use new address translation helpers
s390/ctcm: use new address translation helpers
s390/lcs: use new address translation helpers
s390/qeth: use new address translation helpers
s390/zfcp: use new address translation helpers
s390/tape: fix virtual vs physical address confusion
s390/3270: use new address translation helpers
s390/3215: use new address translation helpers
...
With the current implementation, there are some cornercases where
a host fault would be treated as a guest fault, for example
when the sie instruction causes a program check. Therefore store
the gmap asce in ptregs, and use that to compare the primary asce
from the fault instead of matching instruction addresses.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>