Core
----
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used
to describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols
---------
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
path manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF
---
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
to better support decap on GRE tunnel devices not operating
in collect metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk
and bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols
by livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter
---------
- Remove the CLUSTERIP target. It has been marked as obsolete
for years, and we still have WARN splats wrt. races of
the out-of-band /proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to
the existing 'delete' commands, but do not return an error if
the referenced object (set, chain, rule...) did not exist.
Driver API
----------
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into multiple
files, drop some of the unnecessarily granular locks and factor out
common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
interface instead.
- Improve the CAN bit timing configuration. Use extack to return error
messages directly to user space, update the SJW handling, including
the definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
New hardware / drivers
----------------------
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers
-------
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- enetc: support XDP_REDIRECT for XDP non-linear buffers
- enetc: improve reconfig, avoid link flap and waiting for idle
- enetc: support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
=xXhC
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
API:
- Use kmap_local instead of kmap_atomic.
- Change request callback to take void pointer.
- Print FIPS status in /proc/crypto (when enabled).
Algorithms:
- Add rfc4106/gcm support on arm64.
- Add ARIA AVX2/512 support on x86.
Drivers:
- Add TRNG driver for StarFive SoC.
- Delete ux500/hash driver (subsumed by stm32/hash).
- Add zlib support in qat.
- Add RSA support in aspeed.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmPzAiwACgkQxycdCkmx
i6et8xAAoO3w5MZFGXMzWsYhfSZFdceXBEQfDR7JOCdHxpMIQhw0FLlb0uttFk6m
SeWrdP9wiifBDoCmw7qffFJml8ZftPL/XeXjob2d9v7jKbPyw3lDSIdsNfN/5EEL
oIc9915zwrgawvahPAa+PQ4Ue03qRjUyOcV42dpd1W3NYhzDVHoK5OUU+mEFYDvx
Sgw/YUugKf0VXkVDFzG5049+CPcheyRZqclAo9jyl2eZiXujgUyV33nxRCtqIA+t
7jlHKwi+6QzFHY0CX5BvShR8xyEuH5MLoU3H/jYGXnRb3nEpRYAEO4VZchIHqF0F
Y6pKIKc6Q8OyIVY8RsjQY3hioCqYnQFZ5Xtc1zGtOYEitVLbkmItMG0mVn0XOfyt
gJDi6gkEw5uPUbEQdI4R1xEgJ8eCckMsOJ+uRxqTm+uLqNDxPbsB9bohKniMogXV
lDlVXjU23AA9VeKtqU8FvWjfgqsN47X4aoq1j4/4aI7X9F7P9FOP21TZloP7+ssj
PFrzNaRXUrMEsvyS1wqPegIh987lj6WkH4hyU0wjzaIq4IQELidHsSXFS12iWIPH
kTEoC/trAVoYSr0zXKWUCs4h/x0FztVNbjs4KiDP2FLXX1RzeVZ0WlaXZhryHr+n
1+8yCuS6tVofAbSX0wNkZdf0x5+3CIBw4kqSIvjKDPYYEfIDaT0=
=dMYe
-----END PGP SIGNATURE-----
Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Use kmap_local instead of kmap_atomic
- Change request callback to take void pointer
- Print FIPS status in /proc/crypto (when enabled)
Algorithms:
- Add rfc4106/gcm support on arm64
- Add ARIA AVX2/512 support on x86
Drivers:
- Add TRNG driver for StarFive SoC
- Delete ux500/hash driver (subsumed by stm32/hash)
- Add zlib support in qat
- Add RSA support in aspeed"
* tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (156 commits)
crypto: x86/aria-avx - Do not use avx2 instructions
crypto: aspeed - Fix modular aspeed-acry
crypto: hisilicon/qm - fix coding style issues
crypto: hisilicon/qm - update comments to match function
crypto: hisilicon/qm - change function names
crypto: hisilicon/qm - use min() instead of min_t()
crypto: hisilicon/qm - remove some unused defines
crypto: proc - Print fips status
crypto: crypto4xx - Call dma_unmap_page when done
crypto: octeontx2 - Fix objects shared between several modules
crypto: nx - Fix sparse warnings
crypto: ecc - Silence sparse warning
tls: Pass rec instead of aead_req into tls_encrypt_done
crypto: api - Remove completion function scaffolding
tls: Remove completion function scaffolding
tipc: Remove completion function scaffolding
net: ipv6: Remove completion function scaffolding
net: ipv4: Remove completion function scaffolding
net: macsec: Remove completion function scaffolding
dm: Remove completion function scaffolding
...
- Large cleanup of the con3270/tty3270 driver. Among others this fixes:
* Background Color Support
* ASCII Line Character Support
* VT100 Support
* Geometries other than 80x24
- Cleanup and improve cmpxchg() code. Also add cmpxchg_user_key() to
uaccess functions, which will be used by KVM to access KVM guest memory
with a specific storage key.
- Add support for user space events counting to CPUMF.
- Cleanup the vfio/ccw code, which also allows now to properly support 2K
Format-2 IDALs.
- Move kernel page table allocation and initialization to decompressor,
which finally allows to enter the kernel with dynamic address translation
enabled. This in turn allows to get rid of code with special handling in
the kernel, which has to distinguish if DAT is on or off.
- Replace kretprobe with rethook.
- Various improvements to vfio/ap queue resets:
* Use TAPQ to verify completion of a reset in progress rather than
multiple invocations of ZAPQ.
* Check TAPQ response codes when verifying successful completion of ZAPQ.
* Fix erroneous handling of some error response codes.
* Increase the maximum amount of time to wait for successful completion
of ZAPQ.
- Rework system call wrappers to get rid of alias functions, which were
only left on s390.
- Cleanup diag288_wdt watchdog driver. It has been agreed on with Guenter
Roeck that this goes upstream via the s390 tree.
- Add missing loadparm parameter handling for list-directed ECKD ipl/reipl.
- Various improvements to memory detection code.
- Remove arch_cpu_idle_time() since the current implementation is broken,
and allows user space observable accounted idle times which can
temporarily decrease.
- Add Reset DAT-Protection support: (only) allow to change PTEs from RO to
RW with a new RDP instruction. Unlike the currently used IPTE
instruction, this does not necessarily guarantee that TLBs of all CPUs
are synchronously flushed; and that remote CPUs can see spurious
protection faults. The overall improvement for not requiring an all CPU
synchronization, like it is required with IPTE, should be beneficial.
- Fix KFENCE page fault reporting.
- Smaller cleanups and improvement all over the place.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmPzWhQACgkQIg7DeRsp
bsKkdQ//QbCwDMt6T3bmi6gGgs9HRSkLOTHAlOIRuetEzRiBzm/O4Gm8NycvPspl
BIcuXmQKt+gBS44tWikKpwuhmWrAtiFUxs/M1uPfRXqjUf+ZFJinPJgtPCBa/3rv
tQkh541QxpX4K5Ks71WKv2Kh0RjaTqw5Kj+rlDBYHsxZvb28mDigINRYoVSxNUKi
dTVlR0UgdGLecXfezpvWeEAbJu6Q2pbIkOT3tNOumNqRAoUN4cbH3P0agHJdq8oj
L/++d4tfVbwL8N/VCwIVBeW/AQzA0B2UCDVz75Pd55+FFrIGVp1hn7QC9QQieomL
fzGOTrL4D9U8JkAIJqhioA1NlcN1+QW2svoMVo0N3vBJpIbzX4bZKTDxZZ26dG9H
ox7YvhsZtJA7p34X5hetoObzZcmiYJStT+BDao7q1x3oLf4G31HaP+YUDIKBPmNW
ieZa+ujYbKor1pD6ysaMVX+c1qhfX6S/V0uBAikoqMWUVUvH/ZeuSxCSfMuvWUrQ
KFuc0HnPiiIO1Ux3wN5oN33+pWCSdUcJOeg4aj0jkkFT9Ct3TOBupIGBGyhOAh6r
OTp1iqJuQjwOmkWPLyRuGMzRmDDp+hWz9qNF/DFwIV1IMi9AzJjEOg31cMVxx3gG
iM25560uqvhhUwQFbyjVgJmj00dmqMX07q8QSVOgg9oUrnRhQrc=
=MW9x
-----END PGP SIGNATURE-----
Merge tag 's390-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Large cleanup of the con3270/tty3270 driver. Among others this fixes:
- Background Color Support
- ASCII Line Character Support
- VT100 Support
- Geometries other than 80x24
- Cleanup and improve cmpxchg() code. Also add cmpxchg_user_key() to
uaccess functions, which will be used by KVM to access KVM guest
memory with a specific storage key
- Add support for user space events counting to CPUMF
- Cleanup the vfio/ccw code, which also allows now to properly support
2K Format-2 IDALs
- Move kernel page table allocation and initialization to decompressor,
which finally allows to enter the kernel with dynamic address
translation enabled. This in turn allows to get rid of code with
special handling in the kernel, which has to distinguish if DAT is on
or off
- Replace kretprobe with rethook
- Various improvements to vfio/ap queue resets:
- Use TAPQ to verify completion of a reset in progress rather than
multiple invocations of ZAPQ.
- Check TAPQ response codes when verifying successful completion of
ZAPQ.
- Fix erroneous handling of some error response codes.
- Increase the maximum amount of time to wait for successful
completion of ZAPQ
- Rework system call wrappers to get rid of alias functions, which were
only left on s390
- Cleanup diag288_wdt watchdog driver. It has been agreed on with
Guenter Roeck that this goes upstream via the s390 tree
- Add missing loadparm parameter handling for list-directed ECKD
ipl/reipl
- Various improvements to memory detection code
- Remove arch_cpu_idle_time() since the current implementation is
broken, and allows user space observable accounted idle times which
can temporarily decrease
- Add Reset DAT-Protection support: (only) allow to change PTEs from RO
to RW with a new RDP instruction. Unlike the currently used IPTE
instruction, this does not necessarily guarantee that TLBs of all
CPUs are synchronously flushed; and that remote CPUs can see spurious
protection faults. The overall improvement for not requiring an all
CPU synchronization, like it is required with IPTE, should be
beneficial
- Fix KFENCE page fault reporting
- Smaller cleanups and improvement all over the place
* tag 's390-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (182 commits)
s390/irq,idle: simplify idle check
s390/processor: add test_and_set_cpu_flag() and test_and_clear_cpu_flag()
s390/processor: let cpu helper functions return boolean values
s390/kfence: fix page fault reporting
s390/zcrypt: introduce ctfm field in struct CPRBX
s390: remove confusing comment from uapi types header file
vfio/ccw: remove WARN_ON during shutdown
s390/entry: remove toolchain dependent micro-optimization
s390/mem_detect: do not truncate online memory ranges info
s390/vx: remove __uint128_t type from __vector128 struct again
s390/mm: add support for RDP (Reset DAT-Protection)
s390/mm: define private VM_FAULT_* reasons from top bits
Documentation: s390: correct spelling
s390/ap: fix status returned by ap_qact()
s390/ap: fix status returned by ap_aqic()
s390: vfio-ap: tighten the NIB validity check
Revert "s390/mem_detect: do not update output parameters on failure"
s390/idle: remove arch_cpu_idle_time() and corresponding code
s390/vx: use simple assignments to access __vector128 members
s390/vx: add 64 and 128 bit members to __vector128 struct
...
- Improve the scalability of the CFS bandwidth unthrottling logic
with large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with
the generic scheduler code. Add __cpuidle methods as noinstr to
objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
to query previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period,
to improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- ... Misc other cleanups, fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
skkQXoONNaM=
=l1nN
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Improve the scalability of the CFS bandwidth unthrottling logic with
large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with the
generic scheduler code. Add __cpuidle methods as noinstr to objtool's
noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period, to
improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- Misc other cleanups, fixes
* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
sched/rt: pick_next_rt_entity(): check list_entry
sched/deadline: Add more reschedule cases to prio_changed_dl()
sched/fair: sanitize vruntime of entity being placed
sched/fair: Remove capacity inversion detection
sched/fair: unlink misfit task from cpu overutilized
objtool: mem*() are not uaccess safe
cpuidle: Fix poll_idle() noinstr annotation
sched/clock: Make local_clock() noinstr
sched/clock/x86: Mark sched_clock() noinstr
x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
x86/atomics: Always inline arch_atomic64*()
cpuidle: tracing, preempt: Squash _rcuidle tracing
cpuidle: tracing: Warn about !rcu_is_watching()
cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
cpuidle: drivers: firmware: psci: Dont instrument suspend code
KVM: selftests: Fix build of rseq test
exit: Detect and fix irq disabled state in oops
cpuidle, arm64: Fix the ARM64 cpuidle logic
cpuidle: mvebu: Fix duplicate flags assignment
sched/fair: Limit sched slice duration
...
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzaHgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jYQg/+KRfobCevMQlZVnz09T3SsJ4ahJ587BL6
g2C6kobyUNfeChpFVroBkTR+yCb6Mq4xGr2nda9+2E978BYu9eanpx/u/bXNQ6NU
6YhLwgRrlFXonYn07kFfUJeELZ0W+zpPvymEN1KhTQWcrgXDfXRt2VfMwNsVxGRF
ZRyCWK+UOzSMU22FtW3I/xVLBB0vio9Y6wRC5QOpDVW5YtGwQGust7GJ53JPK43J
m2soJvWORauT+v0aqc7ggOtKd6pahVoXrDrbktxtq9N0ZGI+PubVCGevex++cXm/
B3QSf6VcMMuU6pfzxiEwRa8Whrc3XFeSDEfvMjC5v3becGNkdNBnGOJzYprwgRZJ
irb6/dSrv5P2lj6WphsO1Wzcm7EoWh8M7DVOMh/13Y/oODRdOrv48112Don9UURC
EPyvzAzizqdwdDopUmfiqUwuAXqb8uPZqCgmlz/NJkVz1/ijlfrmLgeDuf0vI7Aq
HznzzRwjFHzyCH7D+rtonFh3JDaqgaouY76tpC5yTtzKbZPlFT8kzeCvqkTMnGgH
czZnSNc/kBup0HDkNSlthK+TyrMXWKeVa8KQSY1E0NJHO4IBBCMzZywSoAaeofQK
hqfQyofX9XHmuHhCA4yIfv1XkZGlBTxpPAyDdHjgs9iJTsodSYMs8ESY08eW8DXn
Ld/35O6SylM=
=ztUT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
* tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
perf/x86/intel/uncore: Add Meteor Lake support
x86/perf/zhaoxin: Add stepping check for ZXC
perf/x86/intel/ds: Fix the conversion from TSC to perf time
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
perf/x86/uncore: Add a quirk for UPI on SPR
perf/x86/uncore: Ignore broken units in discovery table
perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
perf/x86/uncore: Factor out uncore_device_to_die()
perf/core: Call perf_prepare_sample() before running BPF
perf/core: Introduce perf_prepare_header()
perf/core: Do not pass header for sample ID init
perf/core: Set data->sample_flags in perf_prepare_sample()
perf/core: Add perf_sample_save_brstack() helper
perf/core: Add perf_sample_save_raw_data() helper
perf/core: Add perf_sample_save_callchain() helper
perf/core: Save the dynamic parts of sample data size
x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulation
perf/core: Change the layout of perf_sample_data
perf/x86/msr: Add Meteor Lake support
perf/x86/cstate: Add Meteor Lake support
...
* The last part of the cmpxchg patches
* A few fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmPkwH0ACgkQ41TmuOI4
ufhrshAAmv9OlCNVsGTmQLpEnGdnxGM2vBPDEygdi+oVHtpMBFn27R3fu295aUR0
v0o3xsSImhaOU03OxWrsLqPanEL5BqnicLwkL4xou3NXXD4Wo0Zrstd3ykfaODhq
bTDx7zC2zMQ5J+LPuwDaYUat5R0bHv7cULv1CKLdyISnPGafy0kpUPvC30nymJZi
nV7/DjvDYbuOFfhdTEOklGRXvMSEBPLGhIJk/cYZzJECNeNJFUeSs+00uNJ8P6WO
BQD/FLWie+Fn6lTGIUhulZCPf65KI4bHHLB6WFXA5Jy+O08urdtLiZwlBC4iNsFV
NFIwangpJ/RnupJoOMwQfw31op5SZuiOYn91njaGIiLpHgvA9+iaERsqXtjp8NW7
/ne1TZqtrGbYY71XvZ/yPQU5VGc/MG1CyCGX1CPNSQO7v4yl27BNChxdkBHzzm2u
C0IuLZuXl25XwAt8xbdi65fb84pJOeWRU4Zoe4cUZ3drBy5cZsmFXe3lhEAqs7nf
MB9XekTLpZ6pCqTE1u/BOrobVg5es/lDQiDeLCvDe1I3I5inSD6ehjJz7qjK0w8o
3pn0rb+Kb4Ijzfi4RNbgJXmBNzkwwSSPPwYt4THHOZtr8p0fZMBeGHqq1wTJmKcq
M/+9w4cZqgFpdyNqitj8NyTayX1Lj4LWayexCBYaGkLuHTD6cCk=
=HOly
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-6.3-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
* Two more V!=R patches
* The last part of the cmpxchg patches
* A few fixes
Use the per-cpu CIF_ENABLED_WAIT flag to decide if an interrupt
occurred while a cpu was idle, instead of checking two conditions
within the old psw.
Also move clearing of the CIF_ENABLED_WAIT bit to the early interrupt
handler, which in turn makes arch_vcpu_is_preempted() also a bit more
precise, since the flag is now cleared before interrupt handlers have
been called.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Let cpu helper functions return boolean values. This also allows to
make the code a bit simpler by getting rid of the "!!" construct.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Baoquan He reported lots of KFENCE reports when /proc/kcore is read,
e.g. with crash or even simpler with dd:
BUG: KFENCE: invalid read in copy_from_kernel_nofault+0x5e/0x120
Invalid read at 0x00000000f4f5149f:
copy_from_kernel_nofault+0x5e/0x120
read_kcore+0x6b2/0x870
proc_reg_read+0x9a/0xf0
vfs_read+0x94/0x270
ksys_read+0x70/0x100
__do_syscall+0x1d0/0x200
system_call+0x82/0xb0
The reason for this is that read_kcore() simply reads memory that might
have been unmapped by KFENCE with copy_from_kernel_nofault(). Any fault due
to pages being unmapped by KFENCE would be handled gracefully by the fault
handler (exception table fixup).
However the s390 fault handler first reports the fault, and only afterwards
would perform the exception table fixup. Most architectures have this in
reversed order, which also avoids the false positive KFENCE reports when an
unmapped page is accessed.
Therefore change the s390 fault handler so it handles exception table
fixups before KFENCE page faults are reported.
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20230213183858.1473681-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Modify the CPRBX struct to expose a new field ctfm for use with hardware
command filtering within a CEX8 crypto card in CCA coprocessor mode.
The field replaces a reserved byte padding field so that the layout of the
struct and the size does not change.
The new field is used only by user space applications which may use this to
expose the HW filtering facilities in the crypto firmware layers.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The comment for addr_t doesn't make too much sense. Given that also
the formatting is incorrect, just remove it.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of CONFIG_AS_IS_LLVM in entry.S to make the code a bit more
readable. This removes a micro-optimization, but given that the llvm IAS
limitation will likely stay, just use the version that works with llvm.
See commit 4c25f0ff63 ("s390/entry: workaround llvm's IAS limitations")
for further details.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit bf64f0517e ("s390/mem_detect: handle online memory limit
just once") introduced truncation of mem_detect online ranges
based on identity mapping size. For kdump case however the full
set of online memory ranges has to be feed into memblock_physmem_add
so that crashed system memory could be extracted.
Instead of truncating introduce a "usable limit" which is respected by
mem_detect api. Also add extra online memory ranges iterator which still
provides full set of online memory ranges disregarding the "usable limit".
Fixes: bf64f0517e ("s390/mem_detect: handle online memory limit just once")
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The __uint128_t member was only added for future convenience to the
__vector128 struct. However this is a uapi header file, 31/32 bit (aka
compat layer) is still supported, but doesn't know anything about this
type:
/usr/include/asm/types.h:27:17: error: unknown type name __uint128_t
27 | __uint128_t v;
Therefore remove it again.
Fixes: b0b7b43fcc ("s390/vx: add 64 and 128 bit members to __vector128 struct")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
RDP instruction allows to reset DAT-protection bit in a PTE, with less
CPU synchronization overhead than IPTE instruction. In particular, IPTE
can cause machine-wide synchronization overhead, and excessive IPTE usage
can negatively impact machine performance.
RDP can be used instead of IPTE, if the new PTE only differs in SW bits
and _PAGE_PROTECT HW bit, for PTE protection changes from RO to RW.
SW PTE bit changes are allowed, e.g. for dirty and young tracking, but none
of the other HW-defined part of the PTE must change. This is because the
architecture forbids such changes to an active and valid PTE, which
is why invalidation with IPTE is always used first, before writing a new
entry.
The RDP optimization helps mainly for fault-driven SW dirty-bit tracking.
Writable PTEs are initially always mapped with HW _PAGE_PROTECT bit set,
to allow SW dirty-bit accounting on first write protection fault, where
the DAT-protection would then be reset. The reset is now done with RDP
instead of IPTE, if RDP instruction is available.
RDP cannot always guarantee that the DAT-protection reset is propagated
to all CPUs immediately. This means that spurious TLB protection faults
on other CPUs can now occur. For this, common code provides a
flush_tlb_fix_spurious_fault() handler, which will now be used to do a
CPU-local TLB flush. However, this will clear the whole TLB of a CPU, and
not just the affected entry. For more fine-grained flushing, by simply
doing a (local) RDP again, flush_tlb_fix_spurious_fault() would need to
also provide the PTE pointer.
Note that spurious TLB protection faults cannot really be distinguished
from racing pagetable updates, where another thread already installed the
correct PTE. In such a case, the local TLB flush would be unnecessary
overhead, but overall reduction of CPU synchronization overhead by not
using IPTE is still expected to be beneficial.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current definition already collapse with the generic definition of
vm_fault_reason. Move the private definitions to allocate bits from the
top of uint so they won't collapse anymore.
Signed-off-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20230205231704.909536-4-peterx@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY+bZrwAKCRDbK58LschI
gzi4AP4+TYo0jnSwwkrOoN9l4f5VO9X8osmj3CXfHBv7BGWVxAD/WnvA3TDZyaUd
agIZTkRs6BHF9He8oROypARZxTeMLwM=
=nO1C
-----END PGP SIGNATURE-----
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
[...]
ice_set_netdev_features(netdev);
netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_XSK_ZEROCOPY;
ice_set_ops(netdev);
[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add a XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 159491f3b5 ("s390/ap: rework assembler functions to use
unions for in/out register variables") the function ap_qact() tries to
grab the status from the wrong part of the register. Thus we always end
up with zeros. Which is wrong, among others, because we detect failures
via status.response_code.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reported-by: Harald Freudenberger <freude@linux.ibm.com>
Fixes: 159491f3b5 ("s390/ap: rework assembler functions to use unions for in/out register variables")
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There function ap_aqic() tries to grab the status from the
wrong part of the register. Thus we always end up with
zeros. Which is wrong, among others, because we detect
failures via status.response_code.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reported-by: Janosch Frank <frankja@linux.ibm.com>
Fixes: 159491f3b5 ("s390/ap: rework assembler functions to use unions for in/out register variables")
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
Link: https://lkml.kernel.org/r/20230126193752.297968-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit cbc29f107e.
Get rid of the following smatch warnings:
arch/s390/include/asm/mem_detect.h:86 get_mem_detect_end() error: uninitialized symbol 'end'.
arch/s390/include/asm/mem_detect.h:86 get_mem_detect_end() error: uninitialized symbol 'end'.
arch/s390/boot/vmem.c:256 setup_vmem() error: uninitialized symbol 'start'.
arch/s390/boot/vmem.c:258 setup_vmem() error: uninitialized symbol 'end'.
Note that there is no bug in the code. This is purely to silence smatch.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
arch_cpu_idle_time() returns the idle time of any given cpu if it is in
idle, or zero if not. All if this is racy and partially incorrect. Time
stamps taken with store clock extended and store clock fast from different
cpus are compared, while the architecture states that this is nothing which
can be relied on (see Principles of Operation; Chapter 4, "Setting and
Inspecting the Clock").
A more fundamental problem is that the timestamp when a cpu is leaving idle
is taken early in the assembler part of the interrupt handler, and this
value is only transferred many cycles later to the cpu's per-cpu idle data
structure.
This per cpu data structure is read by arch_cpu_idle() to tell for which
period of time a remote cpu is idle: if only an idle_enter value is
present, the assumed idle time of the cpu is calculated by taking a local
timestamp and returning the difference of the local timestamp and the
idle_enter value. This is potentially incorrect, since the remote cpu may
have already left idle, but the taken timestamp may not have been
transferred to the per-cpu data structure. This in turn means that too much
idle time may be reported for a cpu, and a subsequent calculation of system
idle time may result in a smaller value.
Instead of coming up with even more complex code trying to fix this, just
remove this code, and only account idle time of a cpu, after idle state is
left.
Another minor bug is that it is assumed that timestamps are non-zero, which
is not necessarily the case for timestamps taken with store clock
fast. This however is just a very minor problem, since this can only happen
when the epoch increases.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add 64 and 128 bit members to __vector128 struct in order to allow reading
of the complete value, or the higher or lower part of vector register
contents instead of having to use casts.
Add an explicit __aligned(4) statement to avoid that the alignment of the
structure changes from 4 to 8. This should make sure that no breakage
happens because of this change.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
arch_cpu_idle() is marked noinstr and therefore must only call functions
which are also not instrumented.
Make sure that cpu flag helper functions are always inlined to avoid that
the compiler generates an out-of-line function for e.g. the call within
arch_cpu_idle().
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is no reason to do idle time accounting in arch_cpu_idle().
Do idle time accounting in account_idle_time_irq(), where it belongs
to. The accounted values don't change between account_idle_time_irq() and
arch_cpu_idle(); so the result is the same.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Fix virtual vs physical address confusion (which currently are the same).
In chsc_sgib(), do the virtual-physical conversion in the caller since
the caller needs to make sure it is a 31-bit address and zero has a
special meaning (disassociating the GIB).
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.ibm.com>
Link: https://lore.kernel.org/r/20221107085727.1533792-1-nrb@linux.ibm.com
Message-Id: <20221107085727.1533792-1-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
User space can use the MEM_OP ioctl to make storage key checked reads
and writes to the guest, however, it has no way of performing atomic,
key checked, accesses to the guest.
Extend the MEM_OP ioctl in order to allow for this, by adding a cmpxchg
op. For now, support this op for absolute accesses only.
This op can be used, for example, to set the device-state-change
indicator and the adapter-local-summary indicator atomically.
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230206164602.138068-13-scgl@linux.ibm.com
Message-Id: <20230206164602.138068-13-scgl@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Remove code duplication with regards to the CHECK_ONLY flag.
Decrease the number of indents.
No functional change indented.
Suggested-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/r/20230206164602.138068-12-scgl@linux.ibm.com
Message-Id: <20230206164602.138068-12-scgl@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Remove code duplication with regards to the CHECK_ONLY flag.
Decrease the number of indents.
No functional change indented.
Suggested-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230206164602.138068-11-scgl@linux.ibm.com
Message-Id: <20230206164602.138068-11-scgl@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Instead of having one function covering all mem_op operations,
have a function implementing absolute access and dispatch to that
function in its caller, based on the operation code.
This way additional future operations can be implemented by adding an
implementing function without changing existing operations.
Suggested-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230206164602.138068-10-scgl@linux.ibm.com
Message-Id: <20230206164602.138068-10-scgl@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
The vcpu and vm mem_op ioctl implementations share some functionality.
Move argument checking into a function and call it from both
implementations. This allows code reuse in case of additional future
mem_op operations.
Suggested-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230206164602.138068-9-scgl@linux.ibm.com
Message-Id: <20230206164602.138068-9-scgl@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Migration mode is a VM attribute which enables tracking of changes in
storage attributes (PGSTE). It assumes dirty tracking is enabled on all
memslots to keep a dirty bitmap of pages with changed storage attributes.
When enabling migration mode, we currently check that dirty tracking is
enabled for all memslots. However, userspace can disable dirty tracking
without disabling migration mode.
Since migration mode is pointless with dirty tracking disabled, disable
migration mode whenever userspace disables dirty tracking on any slot.
Also update the documentation to clarify that dirty tracking must be
enabled when enabling migration mode, which is already enforced by the
code in kvm_s390_vm_start_migration().
Also highlight in the documentation for KVM_S390_GET_CMMA_BITS that it
can now fail with -EINVAL when dirty tracking is disabled while
migration mode is on. Move all the error codes to a table so this stays
readable.
To disable migration mode, slots_lock should be held, which is taken
in kvm_set_memory_region() and thus held in
kvm_arch_prepare_memory_region().
Restructure the prepare code a bit so all the sanity checking is done
before disabling migration mode. This ensures migration mode isn't
disabled when some sanity check fails.
Cc: stable@vger.kernel.org
Fixes: 190df4a212 ("KVM: s390: CMMA tracking, ESSA emulation, migration mode")
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20230127140532.230651-2-nrb@linux.ibm.com
Message-Id: <20230127140532.230651-2-nrb@linux.ibm.com>
[frankja@linux.ibm.com: fixed commit message typo, moved api.rst error table upwards]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Facilities setup has to be done after "facilities" command line option
parsing, it might set extra or remove existing facilities bits for
testing purposes.
Fixes: bb1520d581 ("s390/mm: start kernel with DAT enabled")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
KASAN common code is able to handle memory hotplug and create KASAN shadow
memory on a fly. Online memory ranges are available from mem_detect,
use this information to avoid mapping KASAN shadow for standby memory.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
If kernel is build without KASAN support there is a chance that kernel
image is going to be positioned by KASLR code to overlap with identity
mapping page tables.
When kernel is build with KASAN support enabled memory which
is potentially going to be used for page tables and KASAN
shadow mapping is accounted for in KASLR with the use of
kasan_estimate_memory_needs(). Split this function and introduce
vmem_estimate_memory_needs() to cover decompressor's vmem identity
mapping page tables.
Fixes: bb1520d581 ("s390/mm: start kernel with DAT enabled")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add a function to get online memory in total. It is supposed to be used
in the decompressor as well as during early kernel startup.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Introduce mem_detect_truncate() to cut any online memory ranges above
established identity mapping size, so that mem_detect users wouldn't
have to do it over and over again.
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Allocation of mem_detect extended area was not considered neither
in commit 9641b8cc73 ("s390/ipl: read IPL report at early boot")
nor in commit b2d24b97b2 ("s390/kernel: add support for kernel address
space layout randomization (KASLR)"). As a result mem_detect extended
theoretically may overlap with ipl report or randomized kernel image
position. But as mem_detect code will allocate extended area only
upon exceeding 255 online regions (which should alternate with offline
memory regions) it is not seen in practice.
To make sure mem_detect extended area does not overlap with ipl report
or randomized kernel position extend usage of "safe_addr". Make initrd
handling and mem_detect extended area allocation code move it further
right and make KASLR takes in into consideration as well.
Fixes: 9641b8cc73 ("s390/ipl: read IPL report at early boot")
Fixes: b2d24b97b2 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In case sclp_early_get_memsize() fails but diag260() succeeds make sure
some sane value is returned. This error scenario is highly unlikely,
but this change makes system able to boot in such case.
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of this sparse warning:
arch/s390/kernel/diag.c:69:29: warning: symbol '__diag8c_tmp_amode31' was not declared. Should it be static?
Fixes: fbaee7464f ("s390/tty3270: add support for diag 8c")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Compiling the kernel with CONFIG_KPROBES disabled, but CONFIG_RETHOOK
enabled, results in this sparse warning:
arch/s390/kernel/rethook.c:26:15: warning: no previous prototype for 'arch_rethook_trampoline_callback' [-Wmissing-prototypes]
26 | unsigned long arch_rethook_trampoline_callback(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add a local rethook header file similar to riscv to address this.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 1a280f48c0 ("s390/kprobes: replace kretprobe with rethook")
Link: https://lore.kernel.org/all/202302030102.69dZIuJk-lkp@intel.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Kasan shadow memory area has been moved to the end of kernel address
space since commit 9a39abb7c9 ("s390/boot: simplify and fix kernel
memory layout setup"), therefore skipping any memory ranges above
VMALLOC_START in empty page tables cleanup code already handles
KASAN shadow memory intersection case and explicit checks could be
removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit b9ff81003c ("s390/vmem: cleanup empty page tables") introduced
empty page tables cleanup in vmem code, but when the kernel is built
with KASAN enabled the code has no effect due to wrong KASAN shadow
memory intersection condition, which effectively ignores any memory
range below KASAN shadow. Fix intersection condition to make code
work as anticipated.
Fixes: b9ff81003c ("s390/vmem: cleanup empty page tables")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Kasan shadow memory area has been moved to the end of kernel address
space since commit 9a39abb7c9 ("s390/boot: simplify and fix kernel
memory layout setup"). Change kasan memory layout note accordingly.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently if for some reason sclp_early_read_info() fails,
sclp_early_get_memsize() will not set max_physmem_end and it
will stay uninitialized. Any garbage value other than 0 will lead
to detect_memory() taking wrong path or returning a garbage value
as max_physmem_end. To avoid that simply initialize max_physmem_end.
Fixes: 73045a08cf ("s390: unify identity mapping limits handling")
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the current code each reipl type implements its own pair of loadparm
show/store functions. Add a macro to deduplicate the code a bit.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 87fd22e0ae ("s390/ipl: add eckd support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Replace alloc_zeroed_user_highpage_movable(). The main difference is
returning a folio containing a single page instead of returning the page,
but take the opportunity to rename the function to match other allocation
functions a little better and rewrite the documentation to place more
emphasis on the zeroing rather than the highmem aspect.
Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
support swp PTEs, so let's drop it.
Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Function __get_mem_detect_block() resets start and end
output parameters in case of invalid mem_detect array
index is provided. That violates the rule of sparing
the output on fail path and leads e.g to a below anomaly:
for_each_mem_detect_block(i, &start, &end)
continue;
One would expect start and end contain addresses of the
last memory block (if available), but in fact the two
will be reset to zeroes. That is not how an iterator is
expected to work.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Unbinding an I/O subchannel with a child-CCW device in disconnected
state sometimes causes a kernel-panic. The race condition was seen
mostly during testing, when setting all the CHPIDs of a device to
offline and at the same time, the unbinding the I/O subchannel driver.
The kernel-panic occurs because of double delete, the I/O subchannel
driver calls device_del on the CCW device while another device_del
invocation for the same device is in-flight. For instance, disabling
all the CHPIDs will trigger the ccw_device_remove function, which will
call a ccw_device_unregister(), which ends up calling the device_del()
which is asynchronous via cdev's todo workqueue. And unbinding the I/O
subchannel driver calls io_subchannel_remove() function which calls the
ccw_device_unregister() and device_del().
This double delete can be prevented by serializing all CCW device
registration/unregistration calls into the driver core. This patch
introduces a mutex which will be used for this purpose.
Signed-off-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Reported-by: Boris Fiuczynski <fiuczy@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---[ Real Memory Copy Area Start ]---
0x001bfffffffff000-0x001c000000000000 4K PTE I
---[ Kasan Shadow Start ]---
---[ Real Memory Copy Area End ]---
0x001c000000000000-0x001c000200000000 8G PMD RW NX
...
---[ Kasan Shadow End ]---
ptdump does a stable sort of markers. Move kasan markers after
memcpy real to avoid swapping.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
setup_vmem() already calls populate for all online memory regions.
pgtable_populate_end() could be removed.
Also rename pgtable_populate_begin() to pgtable_populate_init().
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit bb1520d581 ("s390/mm: start kernel with DAT enabled")
doesn't consider online memory holes due to potential memory offlining
and erroneously creates pgtables for stand-by memory, which bear RW+X
attribute and trigger a warning:
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x0000000c3fffffff 49G online yes 0-48
0x0000000c40000000-0x0000000c7fffffff 1G offline 49
0x0000000c80000000-0x0000000fffffffff 14G online yes 50-63
0x0000001000000000-0x00000013ffffffff 16G offline 64-79
s390/mm: Found insecure W+X mapping at address 0xc40000000
WARNING: CPU: 14 PID: 1 at arch/s390/mm/dump_pagetables.c:142 note_page+0x2cc/0x2d8
Map only online memory ranges which fit within identity mapping limit.
Fixes: bb1520d581 ("s390/mm: start kernel with DAT enabled")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Historically calls to __decompress() didn't specify "out_len" parameter
on many architectures including s390, expecting that no writes beyond
uncompressed kernel image are performed. This has changed since commit
2aa14b1ab2 ("zstd: import usptream v1.5.2") which includes zstd library
commit 6a7ede3dfccb ("Reduce size of dctx by reutilizing dst buffer
(#2751)"). Now zstd decompression code might store literal buffer in
the unwritten portion of the destination buffer. Since "out_len" is
not set, it is considered to be unlimited and hence free to use for
optimization needs. On s390 this might corrupt initrd or ipl report
which are often placed right after the decompressor buffer. Luckily the
size of uncompressed kernel image is already known to the decompressor,
so to avoid the problem simply specify it in the "out_len" parameter.
Link: https://github.com/facebook/zstd/commit/6a7ede3dfccb
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Link: https://lore.kernel.org/r/patch-1.thread-41c676.git-41c676c2d153.your-ad-here.call-01675030179-ext-9637@work.hours
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
SDc/Y/c=
=zpaD
-----END PGP SIGNATURE-----
Merge tag 'v6.2-rc6' into sched/core, to pick up fixes
Pick up fixes before merging another batch of cpuidle updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jason Gunthorpe says:
====================
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory a iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT to charge
its own memory.
However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain and these allocations are not tracked.
This series is the first step in fixing it.
The iommu driver contract already includes a 'gfp' argument to the
map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then
having the driver allocate the IOPTE tables with that flag will capture a
significant amount of the allocations.
Update the iommu_map() API to pass in the GFP argument, and fix all call
sites. Replace iommu_map_atomic().
Audit the "enterprise" iommu drivers to make sure they do the right thing.
Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is
problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are
already correct.
A follow up series will be needed to capture the allocations made when the
iommu_domain itself is allocated, which will complete the job.
====================
* 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/s390: Use GFP_KERNEL in sleepable contexts
iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s
iommu/intel: Use GFP_KERNEL in sleepable contexts
iommu/intel: Support the gfp argument to the map_pages op
iommu/intel: Add a gfp parameter to alloc_pgtable_page()
iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()
iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
iommu: Add a gfp parameter to iommu_map_sg()
iommu: Remove iommu_map_atomic()
iommu: Add a gfp parameter to iommu_map()
Link: https://lore.kernel.org/linux-iommu/0-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
CONFIG_VFIO_MDEV cannot be selected when VFIO itself is
disabled, otherwise we get a link failure:
WARNING: unmet direct dependencies detected for VFIO_MDEV
Depends on [n]: VFIO [=n]
Selected by [y]:
- SAMPLE_VFIO_MDEV_MTTY [=y] && SAMPLES [=y]
- SAMPLE_VFIO_MDEV_MDPY [=y] && SAMPLES [=y]
- SAMPLE_VFIO_MDEV_MBOCHS [=y] && SAMPLES [=y]
/home/arnd/cross/arm64/gcc-13.0.1-nolibc/x86_64-linux/bin/x86_64-linux-ld: samples/vfio-mdev/mdpy.o: in function `mdpy_remove':
mdpy.c:(.text+0x1e1): undefined reference to `vfio_unregister_group_dev'
/home/arnd/cross/arm64/gcc-13.0.1-nolibc/x86_64-linux/bin/x86_64-linux-ld: samples/vfio-mdev/mdpy.o: in function `mdpy_probe':
mdpy.c:(.text+0x149e): undefined reference to `_vfio_alloc_device'
Fixes: 8bf8c5ee1f ("vfio-mdev: turn VFIO_MDEV into a selectable symbol")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230126211211.1762319-1-arnd@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Implement calling kernel functions from eBPF. In general, the eBPF ABI
is fairly close to that of s390x, with one important difference: on
s390x callers should sign-extend signed arguments. Handle that by using
information returned by bpf_jit_find_kfunc_model().
Here is an example of how sign extensions works. Suppose we need to
call the following function from BPF:
; long noinline bpf_kfunc_call_test4(signed char a, short b, int c,
long d)
0000000000936a78 <bpf_kfunc_call_test4>:
936a78: c0 04 00 00 00 00 jgnop bpf_kfunc_call_test4
; return (long)a + (long)b + (long)c + d;
936a7e: b9 08 00 45 agr %r4,%r5
936a82: b9 08 00 43 agr %r4,%r3
936a86: b9 08 00 24 agr %r2,%r4
936a8a: c0 f4 00 1e 3b 27 jg <__s390_indirect_jump_r14>
As per the s390x ABI, bpf_kfunc_call_test4() has the right to assume
that a, b and c are sign-extended by the caller, which results in using
64-bit additions (agr) without any additional conversions. Without sign
extension we would have the following on the JITed code side:
; tmp = bpf_kfunc_call_test4(-3, -30, -200, -1000);
; 5: b4 10 00 00 ff ff ff fd w1 = -3
0x3ff7fdcdad4: llilf %r2,0xfffffffd
; 6: b4 20 00 00 ff ff ff e2 w2 = -30
0x3ff7fdcdada: llilf %r3,0xffffffe2
; 7: b4 30 00 00 ff ff ff 38 w3 = -200
0x3ff7fdcdae0: llilf %r4,0xffffff38
; 8: b7 40 00 00 ff ff fc 18 r4 = -1000
0x3ff7fdcdae6: lgfi %r5,-1000
0x3ff7fdcdaec: mvc 64(4,%r15),160(%r15)
0x3ff7fdcdaf2: lgrl %r1,bpf_kfunc_call_test4@GOT
0x3ff7fdcdaf8: brasl %r14,__s390_indirect_jump_r1
This first 3 llilfs are 32-bit loads, that need to be sign-extended
to 64 bits.
Note: at the moment bpf_jit_find_kfunc_model() does not seem to play
nicely with XDP metadata functions: add_kfunc_call() adds an "abstract"
bpf_*() version to kfunc_btf_tab, but then fixup_kfunc_call() puts the
concrete version into insn->imm, which bpf_jit_find_kfunc_model() cannot
find. But this seems to be a common code problem.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-7-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
arch_prepare_bpf_trampoline() is used for direct attachment of eBPF
programs to various places, bypassing kprobes. It's responsible for
calling a number of eBPF programs before, instead and/or after
whatever they are attached to.
Add a s390x implementation, paying attention to the following:
- Reuse the existing JIT infrastructure, where possible.
- Like the existing JIT, prefer making multiple passes instead of
backpatching. Currently 2 passes is enough. If literal pool is
introduced, this needs to be raised to 3. However, at the moment
adding literal pool only makes the code larger. If branch
shortening is introduced, the number of passes needs to be
increased even further.
- Support both regular and ftrace calling conventions, depending on
the trampoline flags.
- Use expolines for indirect calls.
- Handle the mismatch between the eBPF and the s390x ABIs.
- Sign-extend fmod_ret return values.
invoke_bpf_prog() produces about 120 bytes; it might be possible to
slightly optimize this, but reaching 50 bytes, like on x86_64, looks
unrealistic: just loading cookie, __bpf_prog_enter, bpf_func, insnsi
and __bpf_prog_exit as literals already takes at least 5 * 12 = 60
bytes, and we can't use relative addressing for most of them.
Therefore, lower BPF_MAX_TRAMP_LINKS on s390x.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-5-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_arch_text_poke() is used to hotpatch eBPF programs and trampolines.
s390x has a very strict hotpatching restriction: the only thing that is
allowed to be hotpatched is conditional branch mask.
Take the same approach as commit de5012b41e ("s390/ftrace: implement
hotpatching"): create a conditional jump to a "plt", which loads the
target address from memory and jumps to it; then first patch this
address, and then the mask.
Trampolines (introduced in the next patch) respect the ftrace calling
convention: the return address is in %r0, and %r1 is clobbered. With
that in mind, bpf_arch_text_poke() does not differentiate between jumps
and calls.
However, there is a simple optimization for jumps (for the epilogue_ip
case): if a jump already points to the destination, then there is no
"plt" and we can just flip the mask.
For simplicity, the "plt" template is defined in assembly, and its size
is used to define C arrays. There doesn't seem to be a way to convey
this size to C as a constant, so it's hardcoded and double-checked
during runtime.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All the indirect jumps in the eBPF JIT already use expolines, except
for the tail call one.
Fixes: de5cb6eb51 ("s390: use expoline thunks in the BPF JIT")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
warns while building objects in the purgatory folder:
clang-16: error: argument unused during compilation: '-MD' [-Werror,-Wunused-command-line-argument]
'-MMD' is always passed to the preprocessor via c_flags, even when
KBUILD_CFLAGS is overridden in a folder, so clang complains the addition
of '-MD' will be unused. Remove '-MD' to clear up this warning, as it is
unnecessary with '-MMD'.
Additionally, '-c' is also unnecessary, remove it while in the area.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
points out that there is a linking phase flag added to CFLAGS, which
will only be used for compiling
clang-16: error: argument unused during compilation: '-shared' [-Werror,-Wunused-command-line-argument]
'-shared' is already present in ldflags-y so it can just be dropped.
Fixes: 2b2a25845d ("s390/vdso: Use $(LD) instead of $(CC) to link vDSO")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
warns:
clang-16: error: argument unused during compilation: '-s' [-Werror,-Wunused-command-line-argument]
The compiler's '-s' flag is a linking option (it is passed along to the
linker directly), which means it does nothing when the linker is not
invoked by the compiler. The kernel builds all .o files with '-c', which
stops the compilation pipeline before linking, so '-s' can be safely
dropped from KBUILD_AFLAGS_64.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
bpftrace and friends only consider functions present in
/sys/kernel/tracing/available_filter_functions.
For system calls there is the s390 specific problem that the system call
function itself is present via __se_sys##name() while the system call
itself is wired up via an __s390x_sys##name() alias. The required DWARF
debug information however is only available for the original function, not
the alias, but within available_filter_functions only the functions with
__s390x_ prefix are available. Which means the required DWARF debug
information cannot be found.
While this could be solved via tooling, it is easier to change the s390
specific system call wrapper handling.
Therefore get rid of this alias handling and implement system call wrappers
like most other architectures are doing. In result the implementation
generates the following functions:
long __s390x_sys##name(struct pt_regs *regs)
static inline long __se_sys##name(...)
static inline long __do_sys##name(...)
__s390x_sys##name() is the visible system call function which is also wired
up in the system call table. Its only parameter is a pt_regs variable.
This function calls the corresponding __se_sys##name() function, which has
as many parameters like the system call definition. This function in turn
performs all zero and sign extensions of all system call parameters, taken
from the pt_regs structure, and finally calls __do_sys##name().
__do_sys##name() is the actual inlined system call function implementation.
For all 64 bit system calls there is a 31/32 bit system call function
__s390_sys##name() generated, which handles all system call parameters
correctly as required by compat handling. This function may be wired
up within the compat system call table, unless there exists an
explicit compat system call function, which is then used instead.
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move __S390_SYS_STUBx() the end of the CONFIG_COMPAT section, so both
variants (compat and non-compat) are close together and can be easily
compared.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove __SC_COMPAT_TYPE define which is an unused leftover.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
SYSCALL_METADATA() is only supposed to be used for non-compat system
calls. Otherwise there would be a name clash.
This also removes the inconsistency that s390 is the only architecture
which uses SYSCALL_METADATA() for compat system calls, and even that only
for compat system calls without parameters. Only two such compat system
calls exist.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When debugging vmlinux with QEMU + GDB, the following GDB error may
occur:
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint -1.
Cannot access memory at address 0xffffffffffff95c0
Command aborted.
(gdb)
The reason is that, when .interp section is present, GDB tries to
locate the file specified in it in memory and put a number of
breakpoints there (see enable_break() function in gdb/solib-svr4.c).
Sometimes GDB finds a bogus location that matches its heuristics,
fails to set a breakpoint and stops. This makes further debugging
impossible.
The .interp section contains misleading information anyway (vmlinux
does not need ld.so), so fix by discarding it.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Simplify the use of constants PMC_INIT and PMC_RELEASE.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
With no in-kernel user, the source files can be merged.
Move all functions and the variable definitions to file perf_cpum_cf.c
This file now contains all the necessary functions and definitions
for the CPU Measurement counter facility device driver.
The files cpu_mcf.h and perf_cpum_cf_common.c are deleted.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 17bebcc68e ("s390/cpum_cf: Add minimal in-kernel interface for
counter measurements") introduced a small in-kernel interface for CPU
Measurement counter facility.
There are no users of this interface, therefore remove it.
The following functions are removed:
kernel_cpumcf_alert(),
kernel_cpumcf_begin(),
kernel_cpumcf_end(),
kernel_cpumcf_avail()
there is no need for them anymore.
With the removal of function kernel_cpumcf_alert(), also remove
member alert in struct cpu_cf_events. Its purpose was to counter
measurement alert interrupts for the in-kernel interface.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Function stccm_avail() is defined in a header file and the
only user is one single source file. Move this function to the source
file where it is also used and remove it from the header file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Function cpum_cf_ctrset_size() is defined in one source file and the
only user is in another source file. Move this function to the source
file where it is used and remove its prototype from the header file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
To remove an event from the CPU Measurement counter facility
use the lock/unlock scheme as done in event creation. Remove
the atomic_add_unless function to make the code easier.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC
was added mechanically in the prior patches.
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
iommufd through s390_iommu_map_pages() and it should not be forced to
atomic. Thread the gfp parameter through the call chain starting from
s390_iommu_map_pages().
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
VFIO_MDEV is just a library with helpers for the drivers. Stop making
it a user choice and just select it by the drivers that use the helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Link: https://lore.kernel.org/r/20230110091009.474427-3-hch@lst.de
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
That's an adaptation of commit f3a112c0c4 ("x86,rethook,kprobes:
Replace kretprobe with rethook on x86") to s390.
Replaces the kretprobe code with rethook on s390. With this patch,
kretprobe on s390 uses the rethook instead of kretprobe specific
trampoline code.
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The CPU Measurement Sampling Facility (CPUM_SF) installs large buffers
to save samples collected by hardware. These buffers are organized
as Sample Data Buffer Tables (SDBT) and Sample Data Buffers (SDB).
SDBs contain the samples which are extracted and saved in the perf
ring buffer.
The SDBTs are chained using real addresses and refer to SDBs using
real addresses.
The diagnostic sampling setup uses buffers provided by the process
which invokes perf_event_open system call. The buffers are memory
mapped. The buffers have been allocated by the kernel event
subsystem.
Add proper virtual to phyiscal address translation to the buffer
chaining. The current constraint which requires virtual equals
real address layout is removed.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Macro AUX_SDB_NUM() has three parameters. The first one is not used.
Remove the first parameter. Also convert the macros to inline functions.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The CPU Measurement Sampling Facility (CPUM_SF) installs large buffers
to save samples collected by hardware. These buffers are organized
as Sample Data Buffer Tables (SDBT) and Sample Data Buffers (SDB).
SDBs contain the samples which are extracted and saved in the perf
ring buffer.
The SDBTs are chained using real addresses and refer to SDBs using
real addresses.
Adds proper virtual to phyiscal address translation to the buffer
chaining. The current constraint which requires virtual equals real
address layout is removed.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove debug statements from function setup_pmc_cpu().
The debug statement displays a pointer value to a per cpu variable.
This pointer value is printed nowhere else, so it has no use for
cross reference.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Some inline helper functions are defined in a header file but used
in only one source file. Move these functions to the source file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Swap is a function interface that provides exchange function. To avoid
code duplication, we can use swap function.
./arch/s390/mm/vmem.c:680:10-11: WARNING opportunity for swap().
[hca@linux.ibm.com: get rid of all temp variables]
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3786
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230117060223.58583-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas. While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma. In addition, the mmu notification call within zap_page
range does not correctly handle ranges that span multiple vmas. When
crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
the new vma should be made.
Instead of fixing zap_page_range, do the following:
- Create a new routine zap_vma_pages() that will remove all pages within
the passed vma. Most users of zap_page_range pass the entire vma and
can use this new routine.
- For callers of zap_page_range not passing the entire vma, instead call
zap_page_range_single().
- Remove zap_page_range.
[1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Factor out perf_prepare_header() so that it can call
perf_prepare_sample() without a header if not needed.
Also it checks the filtered_sample_type to avoid duplicate
work when perf_prepare_sample() is called twice (or more).
Suggested-by: Peter Zijlstr <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size. To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
GCC 11.1.0 and 11.2.0 generate a wrong warning when compiling the
kernel e.g. with allmodconfig:
arch/s390/kernel/setup.c: In function ‘setup_lowcore_dat_on’:
./include/linux/fortify-string.h:57:33: error: ‘__builtin_memcpy’ reading 128 bytes from a region of size 0 [-Werror=stringop-overread]
...
arch/s390/kernel/setup.c:526:9: note: in expansion of macro ‘memcpy’
526 | memcpy(abs_lc->cregs_save_area, S390_lowcore.cregs_save_area,
| ^~~~~~
This could be addressed by using absolute_pointer() with the
S390_lowcore macro, but this is not a good idea since this generates
worse code for performance critical paths.
Therefore simply use a for loop to copy the array in question and get
rid of the warning.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move __amode31_base declaration to proper header file to get rid of
arch/s390/boot/startup.c:24:15:
warning: symbol '__amode31_base' was not declared. Should it be static?
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move Absolute Lowcore Area allocation to the decompressor.
As result, get_abs_lowcore() and put_abs_lowcore() access
brackets become really straight and do not require complex
execution context analysis and LAP and interrupts tackling.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move Real Memory Copy Area allocation to the decompressor.
As result, memcpy_real() and memcpy_real_iter() movers
become usable since the very moment the kernel starts.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently the decompressor sets up only identity mapping.
Allow adding more address range types as a prerequisite
for allocation of kernel fixed mappings.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The identity mapping is created in the decompressor,
there is no need to have the same functionality in
the kasan setup code. Thus, remove it.
Remove the 4KB pages check for first 1MB since there
is no need to take care of the lowcore pages.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As the kernel is executed in DAT-on mode only, remove
unnecessary DAT bit check together with the dead code.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The setup of the kernel virtual address space is spread
throughout the sources, boot stages and config options
like this:
1. The available physical memory regions are queried
and stored as mem_detect information for later use
in the decompressor.
2. Based on the physical memory availability the virtual
memory layout is established in the decompressor;
3. If CONFIG_KASAN is disabled the kernel paging setup
code populates kernel pgtables and turns DAT mode on.
It uses the information stored at step [1].
4. If CONFIG_KASAN is enabled the kernel early boot
kasan setup populates kernel pgtables and turns DAT
mode on. It uses the information stored at step [1].
The kasan setup creates early_pg_dir directory and
directly overwrites swapper_pg_dir entries to make
shadow memory pages available.
Move the kernel virtual memory setup to the decompressor
and start the kernel with DAT turned on right from the
very first istruction. That completely eliminates the
boot phase when the kernel runs in DAT-off mode, simplies
the overall design and consolidates pgtables setup.
The identity mapping is created in the decompressor, while
kasan shadow mappings are still created by the early boot
kernel code.
Share with decompressor the existing kasan memory allocator.
It decreases the size of a newly requested memory block from
pgalloc_pos and ensures that kernel image is not overwritten.
pgalloc_low and pgalloc_pos pointers are made preserved boot
variables for that.
Use the bootdata infrastructure to setup swapper_pg_dir
and invalid_pg_dir directories used by the kernel later.
The interim early_pg_dir directory established by the
kasan initialization code gets eliminated as result.
As the kernel runs in DAT-on mode only the PSW_KERNEL_BITS
define gets PSW_MASK_DAT bit by default. Additionally, the
setup_lowcore_dat_off() and setup_lowcore_dat_on() routines
get merged, since there is no DAT-off mode stage anymore.
The memory mappings are created with RW+X protection that
allows the early boot code setting up all necessary data
and services for the kernel being booted. Just before the
paging is enabled the memory protection is changed to
RO+X for text, RO+NX for read-only data and RW+NX for
kernel data and the identity mapping.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Detect and enable memory facilities which is a
prerequisite for pgtables setup in the decompressor.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Similar to existing PAGE_KERNEL_EXEC and SEGMENT_KERNEL_EXEC
memory protection add REGION3_KERNEL_EXEC attribute that
could be set on PUD pgtable entries.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Convert setup of pgtable entries to use set_pXe_bit()
helpers as the preferred way in MM code.
Locally introduce pgprot_clear_bit() helper, which is
strictly speaking a generic function. However, it is
only x86 pgprot_clear_protnone_bits() helper, which
does a similar thing, so do not make it public.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Fix variables initialization coding style and setup zero
pgtable same way region and segment pgtables are set up.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The kasan early boot memory allocators operate on pgalloc_pos
and segment_pos physical address pointers, but fail to convert
it to the corresponding virtual pointers.
Currently it is not a problem, since virtual and physical
addresses on s390 are the same. Nevertheless, should they
ever differ, this would cause an invalid pointer access.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit ada1da31ce ("s390/sclp: sort out physical vs
virtual pointers usage") fixed the notion of virtual
address for sclp_early_sccb pointer. However, it did
not take into account that kasan_early_init() can also
output messages and sclp_early_sccb should be adjusted
by the time kasan_early_init() is called.
Currently it is not a problem, since virtual and physical
addresses on s390 are the same. Nevertheless, should they
ever differ, this would cause an invalid pointer access.
Fixes: ada1da31ce ("s390/sclp: sort out physical vs virtual pointers usage")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.
However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.
Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
Idle code is very like entry code in that RCU isn't available. As
such, add a little validation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.373461409@infradead.org
s390 doesn't use irq_domains, so it has no place to set
IRQ_DOMAIN_FLAG_ISOLATED_MSI. Instead of continuing to abuse the iommu
subsystem to convey this information add a simple define which s390 can
make statically true. The define will cause msi_device_has_isolated() to
return true.
Remove IOMMU_CAP_INTR_REMAP from the s390 iommu driver.
Link: https://lore.kernel.org/r/8-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use READ_ONCE() before cmpxchg() to prevent that the compiler generates
code that fetches the to be compared old value several times from memory.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20230109145456.2895385-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make sure that *ptr__ within arch_this_cpu_to_op_simple() is only
dereferenced once by using READ_ONCE(). Otherwise the compiler could
generate incorrect code.
Cc: <stable@vger.kernel.org>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current cmpxchg_double() loops within the perf hw sampling code do not
have READ_ONCE() semantics to read the old value from memory. This allows
the compiler to generate code which reads the "old" value several times
from memory, which again allows for inconsistencies.
For example:
/* Reset trailer (using compare-double-and-swap) */
do {
te_flags = te->flags & ~SDB_TE_BUFFER_FULL_MASK;
te_flags |= SDB_TE_ALERT_REQ_MASK;
} while (!cmpxchg_double(&te->flags, &te->overflow,
te->flags, te->overflow,
te_flags, 0ULL));
The compiler could generate code where te->flags used within the
cmpxchg_double() call may be refetched from memory and which is not
necessarily identical to the previous read version which was used to
generate te_flags. Which in turn means that an incorrect update could
happen.
Fix this by adding READ_ONCE() semantics to all cmpxchg_double()
loops. Given that READ_ONCE() cannot generate code on s390 which atomically
reads 16 bytes, use a private compare-and-swap-double implementation to
achieve that.
Also replace cmpxchg_double() with the private implementation to be able to
re-use the old value within the loops.
As a side effect this converts the whole code to only use bit fields
to read and modify bits within the hws trailer header.
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/linux-s390/Y71QJBhNTIatvxUT@osiris/T/#ma14e2a5f7aa8ed4b94b6f9576799b3ad9c60f333
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add missing header include to get rid of
arch/s390/crypto/arch_random.c:15:1:
warning: symbol 's390_arch_random_available' was not declared. Should it be static?
arch/s390/crypto/arch_random.c:17:12:
warning: symbol 's390_arch_random_counter' was not declared. Should it be static?
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This commit addresses the following erroneous situation with file-based
kdump executed on a system with a valid IPL report.
On s390, a kdump kernel, its initrd and IPL report if present are loaded
into a special and reserved on boot memory region - crashkernel. When
a system crashes and kdump was activated before, the purgatory code
is entered first which swaps the crashkernel and [0 - crashkernel size]
memory regions. Only after that the kdump kernel is entered. For this
reason, the pointer to an IPL report in lowcore must point to the IPL report
after the swap and not to the address of the IPL report that was located in
crashkernel memory region before the swap. Failing to do so, makes the
kdump's decompressor try to read memory from the crashkernel memory region
which already contains the production's kernel memory.
The situation described above caused spontaneous kdump failures/hangs
on systems where the Secure IPL is activated because on such systems
an IPL report is always present. In that case kdump's decompressor tried
to parse an IPL report which frequently lead to illegal memory accesses
because an IPL report contains addresses to various data.
Cc: <stable@vger.kernel.org>
Fixes: 99feaa717e ("s390/kexec_file: Create ipl report and pass to next kernel")
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The idal_nr_words() routine works well for 4K IDAWs, but lost its
ability to handle the old 2K formats with the removal of 31-bit
builds in commit 5a79859ae0 ("s390: remove 31 bit support").
Since there's nothing preventing a guest from generating this IDAW
format, let's re-introduce the math for them and use both when
calculating the number of IDAWs based on the bits specified in
the ORB.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use __packed __aligned instead of __attribute__((packed, aligned(X)));
to match the rest of the file.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In order to use the fs3270 one would need at least the ioctl definitions
in uapi. Add two new include files in uapi, which contain:
fs3270: ioctl number declarations + returned struct for TUBGETMOD.
raw3270: all the orders, attributes and similar stuff used with 3270
terminals.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current code uses diag210 to infer the 3270 geometry from the
model number when running on z/VM. This doesn't work well as almost
all 3270 software clients report as 3279-2 with a custom resolution.
tty3270 assumes it has a 80x24 terminal connected because of the -2
suffix. Use diag 8c to fetch the realy geometry from z/VM.
Note that this doesn't allow dynamic resizing, i.e. reconnecting to
a z/VM session with a different geometry.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
cputime_t was a core kernel type, removed by commits
ed5c8c854f2b..b672592f0221. As explained in commit b672592f02
("sched/cputime: Remove generic asm headers"), the final cleanup is for
the arch to provide cputime_to_nsec[s](). Commit e53051e757
("s390/cputime: provide archicture specific cputime_to_nsecs") did that,
but just didn't remove the then-unused cputime_to_usecs() and associated
remnants.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20221006105635.115775-1-npiggin@gmail.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
CPU Measurement counting facility events PROBLEM_STATE_CPU_CYCLES(32)
and PROBLEM_STATE_INSTRUCTIONS(33) are valid events. However the device
driver returns error -EOPNOTSUPP when these event are to be installed.
Fix this and allow installation of events PROBLEM_STATE_CPU_CYCLES,
PROBLEM_STATE_CPU_CYCLES:u, PROBLEM_STATE_INSTRUCTIONS and
PROBLEM_STATE_INSTRUCTIONS:u.
Kernel space counting only is still not supported by s390.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Since commit 80b6093b55 ("kbuild: add -Wundef to KBUILD_CPPFLAGS
for W=1 builds"), building with W=1 detects misuse of #(el)if.
$ make W=1 ARCH=s390 CROSS_COMPILE=s390x-linux-gnu-
[snip]
arch/s390/boot/decompressor.c:28:7: warning: "CONFIG_KERNEL_ZSTD" is not defined, evaluates to 0 [-Wundef]
28 | #elif CONFIG_KERNEL_ZSTD
| ^~~~~~~~~~~~~~~~~~
This issue has been hidden because arch/s390/boot/Makefile overwrites
KBUILD_CFLAGS, dropping -Wundef.
CONFIG_KERNEL_ZSTD is a bool option. #elif defined() should be used.
The line #ifdef CONFIG_KERNEL_BZIP2 is fine, but I changed it for
consistency.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20230106161024.2373602-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Nathan Chancellor reports that the s390 vmlinux fails to link with
GNU ld < 2.36 since commit 99cb0d917f ("arch: fix broken BuildID
for arm64 and riscv").
It happens for defconfig, or more specifically for CONFIG_EXPOLINE=y.
$ s390x-linux-gnu-ld --version | head -n1
GNU ld (GNU Binutils for Debian) 2.35.2
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- allnoconfig
$ ./scripts/config -e CONFIG_EXPOLINE
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- olddefconfig
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu-
`.exit.text' referenced in section `.s390_return_reg' of drivers/base/dd.o: defined in discarded section `.exit.text' of drivers/base/dd.o
make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
make: *** [Makefile:1252: vmlinux] Error 2
arch/s390/kernel/vmlinux.lds.S wants to keep EXIT_TEXT:
.exit.text : {
EXIT_TEXT
}
But, at the same time, EXIT_TEXT is thrown away by DISCARD because
s390 does not define RUNTIME_DISCARD_EXIT.
I still do not understand why the latter wins after 99cb0d917f,
but defining RUNTIME_DISCARD_EXIT seems correct because the comment
line in arch/s390/kernel/vmlinux.lds.S says:
/*
* .exit.text is discarded at runtime, not link time,
* to deal with references from __bug_table
*/
Nathan also found that binutils commit 21401fc7bf67 ("Duplicate output
sections in scripts") cured this issue, so we cannot reproduce it with
binutils 2.36+, but it is better to not rely on it.
Fixes: 99cb0d917f ("arch: fix broken BuildID for arm64 and riscv")
Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20230105031306.1455409-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Symbols _edata and _end in the linker script are the
only unaligned expicitly on page boundary. Although
_end is aligned implicitly by BSS_SECTION macro that
is still inconsistent and could lead to a bug if a tool
or function would assume that _edata is as aligned as
others.
For example, vmem_map_init() function does not align
symbols _etext, _einittext etc. Should these symbols
be unaligned as well, the size of ranges to update
were short on one page.
Instead of fixing every occurrence of this kind in the
code and external tools just force the alignment on
these two symbols.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Using DEBUG_H without a prefix is very generic and inconsistent with
other header guards in arch/s390/include/asm. In fact it collides with
the same name in the ath9k wireless driver though that depends on !S390
via disabled wireless support. Let's just use a consistent header guard
name and prevent possible future trouble.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
xts_fallback_setkey() in xts_aes_set_key() will now enforce key size
rule in FIPS mode when setting up the fallback algorithm keys, which
makes the check in xts_aes_set_key() redundant or unreachable. So just
drop this check.
xts_fallback_setkey() now makes a key size check in xts_verify_key():
xts_fallback_setkey()
crypto_skcipher_setkey() [ skcipher_setkey_unaligned() ]
cipher->setkey() { .setkey = xts_setkey }
xts_setkey()
xts_verify_key()
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
xts_check_key() is obsoleted by xts_verify_key(). Over time XTS crypto
drivers adopted the newer xts_verify_key() variant, but xts_check_key()
is still used by a number of drivers. Switch drivers to use the newer
xts_verify_key() and make a couple of cleanups. This allows us to drop
xts_check_key() completely and avoid redundancy.
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
__cmpxchg_user_key() uses 128 bit types which, depending on compiler
and config options, may lead to an __ashlti3() library call.
Get rid of that by simply casting the 128 bit values to 32 bit values.
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Fixes: 51098f0eb2 ("s390/cmpxchg: make loop condition for 1,2 byte cases precise")
Link: https://lore.kernel.org/all/4b96b112d5415d08a81d30657feec2c8c3000f7c.camel@linux.ibm.com/
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Allow architectures to opt out of the generic hardware enabling logic,
and opt out on both s390 and PPC, which don't need to manually enable
virtualization as it's always on (when available).
In addition to letting s390 and PPC drop a bit of dead code, this will
hopefully also allow ARM to clean up its related code, e.g. ARM has its
own per-CPU flag to track which CPUs have enable hardware due to the
need to keep hardware enabled indefinitely when pKVM is enabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-50-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_check_processor_compat() and its support code now that all
architecture implementations are nops.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tag __kvm_s390_init() and its unique helpers as __init. These functions
are only ever called during module_init(), but could not be tagged
accordingly while they were invoked from the common kvm_arch_init(),
which is not __init because of x86.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20221130230934.1014142-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the guts of kvm_arch_init() into a new helper, __kvm_s390_init(),
and invoke the new helper directly from kvm_s390_init() instead of
bouncing through kvm_init(). Invoking kvm_arch_init() is the very
first action performed by kvm_init(), i.e. this is a glorified nop.
Moving setup to __kvm_s390_init() will allow tagging more functions as
__init, and emptying kvm_arch_init() will allow dropping the hook
entirely once all architecture implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221130230934.1014142-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_arch_hardware_setup() is called immediately after
kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20221130230934.1014142-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for folding kvm_arch_hardware_setup() into kvm_arch_init(),
unwind initialization one step at a time instead of simply calling
kvm_arch_exit(). Using kvm_arch_exit() regardless of which initialization
step failed relies on all affected state playing nice with being undone
even if said state wasn't first setup. That holds true for state that is
currently configured by kvm_arch_init(), but not for state that's handled
by kvm_arch_hardware_setup(), e.g. calling gmap_unregister_pte_notifier()
without first registering a notifier would result in list corruption due
to attempting to delete an entry that was never added to the list.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20221130230934.1014142-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The archs that use cputime_to_nsecs() internally provide their own
definition and don't need the fallback. cputime_to_usecs() unused except
in this fallback, and is not defined anywhere.
This removes the final remnant of the cputime_t code from the kernel.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
The <asm/archrandom.h> header is a random.c private detail, not
something to be called by other code. As such, don't make it
automatically available by way of random.h.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Including:
- Core code:
- map/unmap_pages() cleanup
- SVA and IOPF refactoring
- Clean up and document return codes from device/domain
attachment code
- AMD driver:
- Rework and extend parsing code for ivrs_ioapic, ivrs_hpet
and ivrs_acpihid command line options
- Some smaller cleanups
- Intel driver:
- Blocking domain support
- Cleanups
- S390 driver:
- Fixes and improvements for attach and aperture handling
- PAMU driver:
- Resource leak fix and cleanup
- Rockchip driver:
- Page table permission bit fix
- Mediatek driver:
- Improve safety from invalid dts input
- Smaller fixes and improvements
- Exynos driver:
- Fix driver initialization sequence
- Sun50i driver:
- Remove IOMMU_DOMAIN_IDENTITY as it has not been working
forever
- Various other fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmOd1PQACgkQK/BELZcB
GuO7NxAAiwJUO99pTwvqnByzcC783AuE/fqKHDb9DZaN6Cr0VXSbKEwm8Lc2PC00
2CTwK/zGhy8BKBQnPiooJ+YOMPjE4yhFIF9jr5ASH5AVWv8EEFpo8zIFKAcF5rh/
c2Y5RIUwsGXuhR7U3lMTw84r39TZG2eHPwTEU6KvEJ1LCOMyD8IBYrZK2rvpGpem
3swXUfF5bQGAT8LlIFN7p+qsVs6ZtuD40qre3kerjrBtCPUMlxIIV5TJ8oQTecsk
vKpD51mEVW+rjUKvqui8NDYuPfT76F2FPS37dfA1F36p8dmsMGSrtWngNm73r546
AmY8Gui6wKsv4Qn7Mxv49f/WZIXzdRTXOKx/zhYvvGxu7keqQIRIWYcLSxqfaGku
cqJT401Ws1NHmRpx/t90lMH/anY5+kUMRTQG9Iq5ruLhExskd0SJcffa1i7YIGIe
lPCTDf7MOXfDudR0Dtp87pGZQBaSkrSzZvb7qZY3Bj83WGZnLPpl6Z3N8KbkGzEO
zNNvv1CtxZnIPrdOaKvfxQlAKiWKxkPRHuqk1TE8hkoNOe5ZgdOSJP5SeCrZ5tEf
qljPXvDVF9f8CYw7QlfEDnbLnqDMGZpPAGqKPItbaijQLPZx4Jm4dw6+7i9hETIa
wJ+1R9iAf+qiR0rlqueALKRaI4DjE8RU8yYSDpn2kn0BUOhWmb8=
=ZM/m
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core code:
- map/unmap_pages() cleanup
- SVA and IOPF refactoring
- Clean up and document return codes from device/domain attachment
AMD driver:
- Rework and extend parsing code for ivrs_ioapic, ivrs_hpet and
ivrs_acpihid command line options
- Some smaller cleanups
Intel driver:
- Blocking domain support
- Cleanups
S390 driver:
- Fixes and improvements for attach and aperture handling
PAMU driver:
- Resource leak fix and cleanup
Rockchip driver:
- Page table permission bit fix
Mediatek driver:
- Improve safety from invalid dts input
- Smaller fixes and improvements
Exynos driver:
- Fix driver initialization sequence
Sun50i driver:
- Remove IOMMU_DOMAIN_IDENTITY as it has not been working forever
- Various other fixes"
* tag 'iommu-updates-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (74 commits)
iommu/mediatek: Fix forever loop in error handling
iommu/mediatek: Fix crash on isr after kexec()
iommu/sun50i: Remove IOMMU_DOMAIN_IDENTITY
iommu/amd: Fix typo in macro parameter name
iommu/mediatek: Remove unused "mapping" member from mtk_iommu_data
iommu/mediatek: Improve safety for mediatek,smi property in larb nodes
iommu/mediatek: Validate number of phandles associated with "mediatek,larbs"
iommu/mediatek: Add error path for loop of mm_dts_parse
iommu/mediatek: Use component_match_add
iommu/mediatek: Add platform_device_put for recovering the device refcnt
iommu/fsl_pamu: Fix resource leak in fsl_pamu_probe()
iommu/vt-d: Use real field for indication of first level
iommu/vt-d: Remove unnecessary domain_context_mapped()
iommu/vt-d: Rename domain_add_dev_info()
iommu/vt-d: Rename iommu_disable_dev_iotlb()
iommu/vt-d: Add blocking domain support
iommu/vt-d: Add device_block_translation() helper
iommu/vt-d: Allocate pasid table in device probe path
iommu/amd: Check return value of mmu_notifier_register()
iommu/amd: Fix pci device refcount leak in ppr_notifier()
...
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
* Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
* Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
"Fix a number of issues with MTE, such as races on the tags being
initialised vs the PG_mte_tagged flag as well as the lack of support
for VM_SHARED when KVM is involved. Patches from Catalin Marinas and
Peter Collingbourne").
* Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
* Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
* Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
* Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
* Second batch of the lazy destroy patches
* First batch of KVM changes for kernel virtual != physical address support
* Removal of a unused function
x86:
* Allow compiling out SMM support
* Cleanup and documentation of SMM state save area format
* Preserve interrupt shadow in SMM state save area
* Respond to generic signals during slow page faults
* Fixes and optimizations for the non-executable huge page errata fix.
* Reprogram all performance counters on PMU filter change
* Cleanups to Hyper-V emulation and tests
* Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
running on top of a L1 Hyper-V hypervisor)
* Advertise several new Intel features
* x86 Xen-for-KVM:
** Allow the Xen runstate information to cross a page boundary
** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
** Add support for 32-bit guests in SCHEDOP_poll
* Notable x86 fixes and cleanups:
** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
** Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
** Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
** Advertise (on AMD) that the SMM_CTL MSR is not supported
** Remove unnecessary exports
Generic:
* Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
* Add support for pinning vCPUs in dirty_log_perf_test.
* Rename the so called "perf_util" framework to "memstress".
* Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress tests.
* Add a common ucall implementation; code dedup and pre-work for running
SEV (and beyond) guests in selftests.
* Provide a common constructor and arch hook, which will eventually be
used by x86 to automatically select the right hypercall (AMD vs. Intel).
* A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
breakpoints, stage-2 faults and access tracking.
* x86-specific selftest changes:
** Clean up x86's page table management.
** Clean up and enhance the "smaller maxphyaddr" test, and add a related
test to cover generic emulation failure.
** Clean up the nEPT support checks.
** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
** Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
Documentation:
* Remove deleted ioctls from documentation
* Clean up the docs for the x86 MSR filter.
* Various fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
=QAYX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
- More userfaultfs work from Peter Xu.
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
- Some filemap cleanups from Vishal Moola.
- David Hildenbrand added the ability to selftest anon memory COW handling.
- Some cpuset simplifications from Liu Shixin.
- Addition of vmalloc tracing support by Uladzislau Rezki.
- Some pagecache folioifications and simplifications from Matthew Wilcox.
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword. This series shold have been in the
non-MM tree, my bad.
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages.
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages.
- Peter Xu utilized the PTE marker code for handling swapin errors.
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient.
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand.
- zram support for multiple compression streams from Sergey Senozhatsky.
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway.
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations.
- Vishal Moola removed the try_to_release_page() wrapper.
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache.
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking.
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend.
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range().
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen.
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect.
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages().
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting.
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines.
- Many singleton patches, as usual.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
CodAgiA51qwzId3GRytIo/tfWZSezgA=
=d19R
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- More userfaultfs work from Peter Xu
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying
- Some filemap cleanups from Vishal Moola
- David Hildenbrand added the ability to selftest anon memory COW
handling
- Some cpuset simplifications from Liu Shixin
- Addition of vmalloc tracing support by Uladzislau Rezki
- Some pagecache folioifications and simplifications from Matthew
Wilcox
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
it
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword.
This series should have been in the non-MM tree, my bad
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages
- Peter Xu utilized the PTE marker code for handling swapin errors
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand
- zram support for multiple compression streams from Sergey Senozhatsky
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations
- Vishal Moola removed the try_to_release_page() wrapper
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range()
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages()
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines
- Many singleton patches, as usual
* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
mm: mmu_gather: allow more than one batch of delayed rmaps
mm: fix typo in struct pglist_data code comment
kmsan: fix memcpy tests
mm: add cond_resched() in swapin_walk_pmd_entry()
mm: do not show fs mm pc for VM_LOCKONFAULT pages
selftests/vm: ksm_functional_tests: fixes for 32bit
selftests/vm: cow: fix compile warning on 32bit
selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
mm,thp,rmap: fix races between updates of subpages_mapcount
mm: memcg: fix swapcached stat accounting
mm: add nodes= arg to memory.reclaim
mm: disable top-tier fallback to reclaim on proactive reclaim
selftests: cgroup: make sure reclaim target memcg is unprotected
selftests: cgroup: refactor proactive reclaim code to reclaim_until()
mm: memcg: fix stale protection of reclaim target memcg
mm/mmap: properly unaccount memory on mas_preallocate() failure
omfs: remove ->writepage
jfs: remove ->writepage
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
- Thoroughly rewrite the data structures that implement perf task context handling,
with the goal of fixing various quirks and unfeatures both in already merged,
and in upcoming proposed code.
The old data structure is the per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and
a single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology discovery
on various hardware models.
- Misc fixes & cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXjuURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j+VhAAknimsLwenTHCGQp7yqsWSKfBr9KI2UgD
ZgtQuuwRwSzwqAEwC5Mt6zcIkxRNhU1ookFPqQbpY3XA0W4aNakUk8bDF8QIEKW0
MFWxn7PtReWqKcUay2oEGGurqZ5OtfpljJGxigQh5oVeMGc+itIwHF2JefeyoRnu
pq7R2qDgOBb7Np4lWTdqXGmKufzp04/nely2IZQBO8x80cGRZiKQIrGrch6vLUf7
3iEz9rwmvPyz0aczYSpa/duEZDMLm4lWNK4oMUEXuUWC8gU7CUzBJsJ3AS5NgxAu
yGBXe/s7GHqwtc/F30l5gK/J5WAyK83IF7sckxTj0dBUpyC6wQwwYPm8BaCAMoqN
X6mU7Ve938Siih1TyOBZfZsrtDDILhV2N/nku2erb3iqes26u0RcT25rWtu9Yqvn
hm4Gm6cmkHWq4EOHSBvAdC7l7lDZ3fyVI5+8nN9ly9Qv867HjG70dvIr9iEEolpX
rhFAz8r/NwTXhDY0AmFZcOkrnNV3IuHtibJ/9wJlgJNqDPqN12Wxqdzy0Nj3HH6G
EsukBO05cWaDS0gB8MpaO6Q6YtqAr87ZY+afHDBwcfkME50/CyBLr5rd47dTR+Ip
B+zreYKcaNHdEMd1A9KULRTTDnEjlXYMwjVVJiPRV0jcmA3dHmM46HN5Ae9NdO6+
R2BAWv9XR6M=
=KNaI
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Thoroughly rewrite the data structures that implement perf task
context handling, with the goal of fixing various quirks and
unfeatures both in already merged, and in upcoming proposed code.
The old data structure is the per task and per cpu
perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and a
single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology
discovery on various hardware models.
- Misc fixes & cleanups
* tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
perf/x86/intel/uncore: Fix reference count leak in __uncore_imc_init_box()
perf/x86/intel/uncore: Fix reference count leak in snr_uncore_mmio_map()
perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox()
perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()
perf/x86/intel/uncore: Make set_mapping() procedure void
perf/x86/intel/uncore: Update sysfs-devices-mapping file
perf/x86/intel/uncore: Enable UPI topology discovery for Sapphire Rapids
perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server
perf/x86/intel/uncore: Get UPI NodeID and GroupID
perf/x86/intel/uncore: Enable UPI topology discovery for Skylake Server
perf/x86/intel/uncore: Generalize get_topology() for SKX PMUs
perf/x86/intel/uncore: Disable I/O stacks to PMU mapping on ICX-D
perf/x86/intel/uncore: Clear attr_update properly
perf/x86/intel/uncore: Introduce UPI topology type
perf/x86/intel/uncore: Generalize IIO topology support
perf/core: Don't allow grouping events from different hw pmus
perf/amd/ibs: Make IBS a core pmu
perf: Fix function pointer case
perf/x86/amd: Remove the repeated declaration
perf: Fix possible memleak in pmu_dev_alloc()
...
- Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and the
work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown
timer. Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync() should
be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding rearm
attempts silently. A warning for the case that a rearm attempt of a
shutdown timer is detected would not be really helpful because it's
entirely unclear how it should be acted upon. The only way to address
such a case is to add 'if (in_shutdown)' conditionals all over the
place. This is error prone and in most cases of teardown not required
all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
- Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUuC0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodpZD/9kCDi009n65QFF1J4kE5aZuABbRMtO
7sy66fJpDyB/MtcbPPH29uzQUEs1VMTQVB+ZM+7e1YGoxSWuSTzeoFH+yK1w4tEZ
VPbOcvUEjG0esKUehwYFeOjSnIjy6M1Y41aOUaDnq00/azhfTrzLxQA1BbbFbkpw
S7u2hllbyRJ8KdqQyV9cVpXmze6fcpdtNhdQeoA7qQCsSPnJ24MSpZ/PG9bAovq8
75IRROT7CQRd6AMKAVpA9Ov8ak9nbY3EgQmoKcp5ZXfXz8kD3nHky9Lste7djgYB
U085Vwcelt39V5iXevDFfzrBYRUqrMKOXIf2xnnoDNeF5Jlj5gChSNVZwTLO38wu
RFEVCjCjuC41GQJWSck9LRSYdriW/htVbEE8JLc6uzUJGSyjshgJRn/PK4HjpiLY
AvH2rd4rAap/rjDKvfWvBqClcfL7pyBvavgJeyJ8oXyQjHrHQwapPcsMFBm0Cky5
soF0Lr3hIlQ9u+hwUuFdNZkY9mOg09g9ImEjW1AZTKY0DfJMc5JAGjjSCfuopVUN
Uf/qqcUeQPSEaC+C9xiFs0T3svYFxBqpgPv4B6t8zAnozon9fyZs+lv5KdRg4X77
qX395qc6PaOSQlA7gcxVw3vjCPd0+hljXX84BORP7z+uzcsomvIH1MxJepIHmgaJ
JrYbSZ5qzY5TTA==
=JlDe
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timers, timekeeping and drivers:
Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and
the work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown timer.
Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync()
should be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding
rearm attempts silently.
A warning for the case that a rearm attempt of a shutdown timer is
detected would not be really helpful because it's entirely unclear
how it should be acted upon. The only way to address such a case is
to add 'if (in_shutdown)' conditionals all over the place. This is
error prone and in most cases of teardown not required all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place"
* tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
dt-bindings: timer: renesas,cmt: Add r8a779g0 CMT support
dt-bindings: timer: renesas,tmu: Add r8a779g0 support
clocksource/drivers/arm_arch_timer: Use kstrtobool() instead of strtobool()
clocksource/drivers/timer-ti-dm: Fix missing clk_disable_unprepare in dmtimer_systimer_init_clock()
clocksource/drivers/timer-ti-dm: Clear settings on probe and free
clocksource/drivers/timer-ti-dm: Make timer_get_irq static
clocksource/drivers/timer-ti-dm: Fix warning for omap_timer_match
clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error
clocksource/drivers/timer-npcm7xx: Enable timer 1 clock before use
dt-bindings: timer: nuvoton,npcm7xx-timer: Allow specifying all clocks
dt-bindings: timer: rockchip: Add rockchip,rk3128-timer
clockevents: Repair kernel-doc for clockevent_delta2ns()
clocksource/drivers/ingenic-ost: Define pm functions properly in platform_driver struct
clocksource/drivers/sh_cmt: Access registers according to spec
vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy
Bluetooth: hci_qca: Fix the teardown problem for real
timers: Update the documentation to reflect on the new timer_shutdown() API
timers: Provide timer_shutdown[_sync]()
timers: Add shutdown mechanism to the internal functions
timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode
...
- Factor out handle_write() function and simplify 3215 console
write operation.
- When 3170 terminal emulator is connected to the 3215 console
driver the boot time could be very long due to limited buffer
space or missing operator input. Add con3215_drop command line
parameter and con3215_drop sysfs attribute file to instruct
the kernel drop console data when such conditions are met.
- Fix white space errors in 3215 console driver.
- Move enum paiext_mode definition to a header file and rename
it to paievt_mode to indicate this is now used for several
events. Rename PAI_MODE_COUNTER to PAI_MODE_COUNTING to make
consistent with PAI_MODE_SAMPLING.
- Simplify the logic of PMU pai_crypto mapped buffer reference
counter and make it consistent with PMU pai_ext.
- Rename PMU pai_crypto mapped buffer structure member users
to active_events to make it consistent with PMU pai_ext.
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option.
This results in saving of 12K per 1M hugetlb page (~1.2%)
and 32764K per 2G hugetlb page (~1.6%).
- Use generic serial.h, bugs.h, shmparam.h and vga.h header
files and scrap s390-specific versions.
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning
and provide sane CPU-to-node and CPU-to-CPU distance mappings
implementat a minimal version of setup_per_cpu_areas().
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes.
- Avoid unnecessary lookup of a pointer to MSI descriptor when
setting IRQ affinity for a PCI device.
- Get rid of "an incompatible function type cast" warning by
changing debug_sprintf_format_fn() function prototype so it
matches the debug_format_proc_t function type.
- Remove unused info_blk_hdr__pcpus() and get_page_state()
functions.
- Get rid of clang "unused unused insn cache ops function"
warning by moving s390_insn definition to a private header.
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE
is enabled.
- Use kstrobool() to parse sclp_con_drop parameter to make it
identical to the con3215_drop parameter and allow passing
values like "yes" and "true".
- Use sysfs_emit() for all SCLP sysfs show functions, which is
the current standard way to generate output strings.
- Make SCLP con_drop sysfs attribute also writable and allow to
change its value during runtime. This makes SCLP console drop
handling consistent with the 3215 device driver.
- Virtual and physical addresses are indentical on s390. However,
there is still a confusion when pointers are directly casted to
physical addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers.
- Support for power managemant has been removed from s390 since
quite some time. Remove unused power managemant code from the
appldata device driver.
- Allow memory tools like KASAN see memory accesses from the
checksum code. Switch to GENERIC_CSUM if KASAN is enabled,
just like x86 does.
- Add support of ECKD DASDs disks so it could be used as boot
and dump devices.
- Follow checkpatch recommendations and use octal values instead
of S_IRUGO and S_IWUSR for dump device attributes in sysfs.
- Changes to vx-insn.h do not cause a recompile of C files that
use asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include
header file to avoid this problem.
- Use vector instruction macros instead of byte patterns to
increase register validation routine readability.
- The current machine check register validation handling does not
take into account various scenarios and might lead to killing a
wrong user process or potentially ignore corrupted FPU registers.
Simplify logic of the machine check handler and stop the whole
machine if the previous context was kerenel mode. If the previous
context was user mode, kill the current task.
- Introduce sclp_emergency_printk() function which can be used to
emit a message in emergency cases. It is supposed to be used in
cases where regular console device drivers may not work anymore,
e.g. unrecoverable machine checks.
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation.
- In case a system will be stopped because of an unrecoverable
machine check error print the machine check interruption code
to give a hint of what went wrong.
- Move storage error checking from the assembly entry code to C
in order to simplify machine check handling. Enter the handler
with DAT turned on, which simplifies the entry code even more.
- The machine check extended save areas are allocated using
a private "nmi_save_areas" slab cache which guarantees a
required power-of-two alignment. Get rid of that cache in
favour of kmalloc().
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY5ckrhccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8NlrAQD8NCLeEAkhGCRnzdTyngExCrzV
Mw//cEnksUkIPqalJgEArbyFjGh05ecNaiDQduH8Gh94/qOhGE4obMdTgMWq7QY=
=3aou
-----END PGP SIGNATURE-----
Merge tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Factor out handle_write() function and simplify 3215 console write
operation
- When 3170 terminal emulator is connected to the 3215 console driver
the boot time could be very long due to limited buffer space or
missing operator input. Add con3215_drop command line parameter and
con3215_drop sysfs attribute file to instruct the kernel drop console
data when such conditions are met
- Fix white space errors in 3215 console driver
- Move enum paiext_mode definition to a header file and rename it to
paievt_mode to indicate this is now used for several events. Rename
PAI_MODE_COUNTER to PAI_MODE_COUNTING to make consistent with
PAI_MODE_SAMPLING
- Simplify the logic of PMU pai_crypto mapped buffer reference counter
and make it consistent with PMU pai_ext
- Rename PMU pai_crypto mapped buffer structure member users to
active_events to make it consistent with PMU pai_ext
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option. This
results in saving of 12K per 1M hugetlb page (~1.2%) and 32764K per
2G hugetlb page (~1.6%)
- Use generic serial.h, bugs.h, shmparam.h and vga.h header files and
scrap s390-specific versions
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning and
provide sane CPU-to-node and CPU-to-CPU distance mappings implementat
a minimal version of setup_per_cpu_areas()
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes
- Avoid unnecessary lookup of a pointer to MSI descriptor when setting
IRQ affinity for a PCI device
- Get rid of "an incompatible function type cast" warning by changing
debug_sprintf_format_fn() function prototype so it matches the
debug_format_proc_t function type
- Remove unused info_blk_hdr__pcpus() and get_page_state() functions
- Get rid of clang "unused unused insn cache ops function" warning by
moving s390_insn definition to a private header
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE is
enabled
- Use kstrobool() to parse sclp_con_drop parameter to make it identical
to the con3215_drop parameter and allow passing values like "yes" and
"true"
- Use sysfs_emit() for all SCLP sysfs show functions, which is the
current standard way to generate output strings
- Make SCLP con_drop sysfs attribute also writable and allow to change
its value during runtime. This makes SCLP console drop handling
consistent with the 3215 device driver
- Virtual and physical addresses are indentical on s390. However, there
is still a confusion when pointers are directly casted to physical
addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers
- Support for power managemant has been removed from s390 since quite
some time. Remove unused power managemant code from the appldata
device driver
- Allow memory tools like KASAN see memory accesses from the checksum
code. Switch to GENERIC_CSUM if KASAN is enabled, just like x86 does
- Add support of ECKD DASDs disks so it could be used as boot and dump
devices
- Follow checkpatch recommendations and use octal values instead of
S_IRUGO and S_IWUSR for dump device attributes in sysfs
- Changes to vx-insn.h do not cause a recompile of C files that use
asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include header
file to avoid this problem
- Use vector instruction macros instead of byte patterns to increase
register validation routine readability
- The current machine check register validation handling does not take
into account various scenarios and might lead to killing a wrong user
process or potentially ignore corrupted FPU registers. Simplify logic
of the machine check handler and stop the whole machine if the
previous context was kerenel mode. If the previous context was user
mode, kill the current task
- Introduce sclp_emergency_printk() function which can be used to emit
a message in emergency cases. It is supposed to be used in cases
where regular console device drivers may not work anymore, e.g.
unrecoverable machine checks
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation
- In case a system will be stopped because of an unrecoverable machine
check error print the machine check interruption code to give a hint
of what went wrong
- Move storage error checking from the assembly entry code to C in
order to simplify machine check handling. Enter the handler with DAT
turned on, which simplifies the entry code even more
- The machine check extended save areas are allocated using a private
"nmi_save_areas" slab cache which guarantees a required power-of-two
alignment. Get rid of that cache in favour of kmalloc()
* tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (38 commits)
s390/nmi: get rid of private slab cache
s390/nmi: move storage error checking back to C, enter with DAT on
s390/nmi: print machine check interruption code before stopping system
s390/sclp: introduce sclp_emergency_printk()
s390/sclp: keep sclp_early_sccb
s390/nmi: rework register validation handling
s390/nmi: use vector instruction macros instead of byte patterns
s390/vx: add vx-insn.h wrapper include file
s390/ipl: use octal values instead of S_* macros
s390/ipl: add eckd dump support
s390/ipl: add eckd support
vfio/ccw: identify CCW data addresses as physical
vfio/ccw: sort out physical vs virtual pointers usage
s390/checksum: support GENERIC_CSUM, enable it for KASAN
s390/appldata: remove power management callbacks
s390/cio: sort out physical vs virtual pointers usage
s390/sclp: allow to change sclp_console_drop during runtime
s390/sclp: convert to use sysfs_emit()
s390/sclp: use kstrobool() to parse sclp_con_drop parameter
s390/3270: make raw3270_state_final() depend on CONFIG_TN3270_CONSOLE
...
ACPI:
* Enable FPDT support for boot-time profiling
* Fix CPU PMU probing to work better with PREEMPT_RT
* Update SMMUv3 MSI DeviceID parsing to latest IORT spec
* APMT support for probing Arm CoreSight PMU devices
CPU features:
* Advertise new SVE instructions (v2.1)
* Advertise range prefetch instruction
* Advertise CSSC ("Common Short Sequence Compression") scalar
instructions, adding things like min, max, abs, popcount
* Enable DIT (Data Independent Timing) when running in the kernel
* More conversion of system register fields over to the generated
header
CPU misfeatures:
* Workaround for Cortex-A715 erratum #2645198
Dynamic SCS:
* Support for dynamic shadow call stacks to allow switching at
runtime between Clang's SCS implementation and the CPU's
pointer authentication feature when it is supported (complete
with scary DWARF parser!)
Tracing and debug:
* Remove static ftrace in favour of, err, dynamic ftrace!
* Seperate 'struct ftrace_regs' from 'struct pt_regs' in core
ftrace and existing arch code
* Introduce and implement FTRACE_WITH_ARGS on arm64 to replace
the old FTRACE_WITH_REGS
* Extend 'crashkernel=' parameter with default value and fallback
to placement above 4G physical if initial (low) allocation
fails
SVE:
* Optimisation to avoid disabling SVE unconditionally on syscall
entry and just zeroing the non-shared state on return instead
Exceptions:
* Rework of undefined instruction handling to avoid serialisation
on global lock (this includes emulation of user accesses to the
ID registers)
Perf and PMU:
* Support for TLP filters in Hisilicon's PCIe PMU device
* Support for the DDR PMU present in Amlogic Meson G12 SoCs
* Support for the terribly-named "CoreSight PMU" architecture
from Arm (and Nvidia's implementation of said architecture)
Misc:
* Tighten up our boot protocol for systems with memory above
52 bits physical
* Const-ify static keys to satisty jump label asm constraints
* Trivial FFA driver cleanups in preparation for v1.1 support
* Export the kernel_neon_* APIs as GPL symbols
* Harden our instruction generation routines against
instrumentation
* A bunch of robustness improvements to our arch-specific selftests
* Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmOPLFAQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNPRcCACLyDTvkimiqfoPxzzgdkx/6QOvw9s3/mXg
UcTORSZBR1VnYkiMYEKVz/tTfG99dnWtD8/0k/rz48NbhBfsF2sN4ukyBBXVf0zR
fjnaVyVC11LUgBgZKPo6maV+jf/JWf9hJtpPl06KTiPb2Hw2JX4DXg+PeF8t2hGx
NLH4ekQOrlDM8mlsN5mc0YsHbiuO7Xe/NRuet8TsgU4bEvLAwO6bzOLVUMqDQZNq
bQe2ENcGVAzAf7iRJb38lj9qB/5hrQTHRXqLXMSnJyyVjQEwYca0PeJMa7x30bXF
ZZ+xQ8Wq0mxiffZraf6SE34yD4gaYS4Fziw7rqvydC15vYhzJBH1
=hV+2
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights this time are support for dynamically enabling and
disabling Clang's Shadow Call Stack at boot and a long-awaited
optimisation to the way in which we handle the SVE register state on
system call entry to avoid taking unnecessary traps from userspace.
Summary:
ACPI:
- Enable FPDT support for boot-time profiling
- Fix CPU PMU probing to work better with PREEMPT_RT
- Update SMMUv3 MSI DeviceID parsing to latest IORT spec
- APMT support for probing Arm CoreSight PMU devices
CPU features:
- Advertise new SVE instructions (v2.1)
- Advertise range prefetch instruction
- Advertise CSSC ("Common Short Sequence Compression") scalar
instructions, adding things like min, max, abs, popcount
- Enable DIT (Data Independent Timing) when running in the kernel
- More conversion of system register fields over to the generated
header
CPU misfeatures:
- Workaround for Cortex-A715 erratum #2645198
Dynamic SCS:
- Support for dynamic shadow call stacks to allow switching at
runtime between Clang's SCS implementation and the CPU's pointer
authentication feature when it is supported (complete with scary
DWARF parser!)
Tracing and debug:
- Remove static ftrace in favour of, err, dynamic ftrace!
- Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace
and existing arch code
- Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the
old FTRACE_WITH_REGS
- Extend 'crashkernel=' parameter with default value and fallback to
placement above 4G physical if initial (low) allocation fails
SVE:
- Optimisation to avoid disabling SVE unconditionally on syscall
entry and just zeroing the non-shared state on return instead
Exceptions:
- Rework of undefined instruction handling to avoid serialisation on
global lock (this includes emulation of user accesses to the ID
registers)
Perf and PMU:
- Support for TLP filters in Hisilicon's PCIe PMU device
- Support for the DDR PMU present in Amlogic Meson G12 SoCs
- Support for the terribly-named "CoreSight PMU" architecture from
Arm (and Nvidia's implementation of said architecture)
Misc:
- Tighten up our boot protocol for systems with memory above 52 bits
physical
- Const-ify static keys to satisty jump label asm constraints
- Trivial FFA driver cleanups in preparation for v1.1 support
- Export the kernel_neon_* APIs as GPL symbols
- Harden our instruction generation routines against instrumentation
- A bunch of robustness improvements to our arch-specific selftests
- Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits)
arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK
arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler
arm64: Prohibit instrumentation on arch_stack_walk()
arm64:uprobe fix the uprobe SWBP_INSN in big-endian
arm64: alternatives: add __init/__initconst to some functions/variables
arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init()
kselftest/arm64: Allow epoll_wait() to return more than one result
kselftest/arm64: Don't drain output while spawning children
kselftest/arm64: Hold fp-stress children until they're all spawned
arm64/sysreg: Remove duplicate definitions from asm/sysreg.h
arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation
arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation
arm64/sysreg: Convert MVFR2_EL1 to automatic generation
arm64/sysreg: Convert MVFR1_EL1 to automatic generation
arm64/sysreg: Convert MVFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation
...
This pull request contains the following branches:
doc.2022.10.20a: Documentation updates. This is the second
in a series from an ongoing review of the RCU documentation.
fixes.2022.10.21a: Miscellaneous fixes.
lazy.2022.11.30a: Introduces a default-off Kconfig option that depends
on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or
rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce
delays. These delays result in significant power savings on
nearly idle Android and ChromeOS systems. These savings range
from a few percent to more than ten percent.
This series also includes several commits that change call_rcu()
to a new call_rcu_hurry() function that avoids these delays in
a few cases, for example, where timely wakeups are required.
Several of these are outside of RCU and thus have acks and
reviews from the relevant maintainers.
srcunmisafe.2022.11.09a: Creates an srcu_read_lock_nmisafe() and an
srcu_read_unlock_nmisafe() for architectures that support NMIs,
but which do not provide NMI-safe this_cpu_inc(). These NMI-safe
SRCU functions are required by the upcoming lockless printk()
work by John Ogness et al.
That printk() series depends on these commits, so if you pull
the printk() series before this one, you will have already
pulled in this branch, plus two more SRCU commits:
0cd7e350ab ("rcu: Make SRCU mandatory")
51f5f78a4f ("srcu: Make Tiny synchronize_srcu() check for readers")
These two commits appear to work well, but do not have
sufficient testing exposure over a long enough time for me to
feel comfortable pushing them unless something in mainline is
definitely going to use them immediately, and currently only
the new printk() work uses them.
torture.2022.10.18c: Changes providing minor but important increases
in test coverage for the new RCU polled-grace-period APIs.
torturescript.2022.10.20a: Changes that avoid redundant kernel builds,
thus providing about a 30% speedup for the torture.sh acceptance
test.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmOKnS8THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jCMiD/4weraRjmcLhZ3tz2vgTI8ZsXdIiCfU
vCln0AOKroVo37S4BhViVfryV2D4VFfEb1UY6EgxNFu7Jd3z0seQShZh/5r8bFMU
p0E6TC8PwyKUpQstTOwOynkw6BWGW1qeL620PpBNRAy4MkxL8AGv40tHRIHEeAzc
cCTax2+xW9ae0ZtAZHDDCUAzpYpcjScIf4OZ3tkSaFCcpWZijg+dN60dnsZ9l7h9
DtqKH61rszXAtxkmN9Fs9OY5MPCXi9Es6LVYq6KN06jqxwJRqmYf+pai3apmNIOf
P8isXOQG58tbhBLpNCG58UBSkjI2GG8Lcq6hYr6d/7Ukm7RF49q8eL7OQlVrJMuQ
Zi2DVTEAu2U3pzdTC14gi3RvqP7dO+psBs+LpGXtj4RxYvAP99e9KSRcG14j/Wwa
L52AetBzBXTCS5nhPOG8RP22d8HRZLxMe9x7T8iVCDuwH4M1zTF5cVzLeEdgPAD7
tdX4eV16PLt1AvhCEuHU/2v520gc2K9oGXLI1A6kzquXh7FflcPWl5WS+sYUbB/p
gBsblz7C3I5GgSoW4aAMnkukZiYgSvVql8ZyRwQuRzvLpYcofMpoanZbcufDjuw9
N5QzAaMmzHnBu3hOJS2WaSZRZ73fed3NO8jo8q8EMfYeWK3NAHybBdaQqSTgsO8i
s+aN+LZ4s5MnRw==
=eMOr
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates. This is the second in a series from an ongoing
review of the RCU documentation.
- Miscellaneous fixes.
- Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU
that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument
CPU lists, causes call_rcu() to introduce delays.
These delays result in significant power savings on nearly idle
Android and ChromeOS systems. These savings range from a few percent
to more than ten percent.
This series also includes several commits that change call_rcu() to a
new call_rcu_hurry() function that avoids these delays in a few
cases, for example, where timely wakeups are required. Several of
these are outside of RCU and thus have acks and reviews from the
relevant maintainers.
- Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe()
for architectures that support NMIs, but which do not provide
NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required
by the upcoming lockless printk() work by John Ogness et al.
- Changes providing minor but important increases in torture test
coverage for the new RCU polled-grace-period APIs.
- Changes to torturescript that avoid redundant kernel builds, thus
providing about a 30% speedup for the torture.sh acceptance test.
* tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
net: devinet: Reduce refcount before grace period
net: Use call_rcu_hurry() for dst_release()
workqueue: Make queue_rcu_work() use call_rcu_hurry()
percpu-refcount: Use call_rcu_hurry() for atomic switch
scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu()
rcu/rcutorture: Use call_rcu_hurry() where needed
rcu/rcuscale: Use call_rcu_hurry() for async reader test
rcu/sync: Use call_rcu_hurry() instead of call_rcu
rcuscale: Add laziness and kfree tests
rcu: Shrinker for lazy rcu
rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
rcu: Make call_rcu() lazy to save power
rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC
srcu: Debug NMI safety even on archs that don't require it
srcu: Explain the reason behind the read side critical section on GP start
srcu: Warn when NMI-unsafe API is used in NMI
arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
rcu-tasks: Make grace-period-age message human-readable
...
* s390: fix multi-epoch extension in nested guests
* x86: fix uninitialized variable on nested triple fault
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOPcBoUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroN9sAgArbccE7GCXUP3snd08rdV2Y1efKq3
SsZWlze/esVlsRBr4Z+lzMamCIdMwUT242XEq50pOjQeqYB0PjaIraDhd/lnhbk/
pYITFCoOmRtT4PyUrfYZIFeQBo+yhuhrYN8WCLFCatqY4ec+9p5nCiaFk1Wn49ZD
zFm+uVXhlTa+DGNjGzjBi3mAPqTpRNAxdlPAEHv+kyaQgfRlTSo37H8Xs6PfcLVe
YnsGG47Ozx3pPe5Zo3Sr1k56CQLjNcY77lVWXskKIiABsM3G5YJ2BlGgG007zlSB
4G0/GyJhb351c1MHk1O+AfMNPfD4shJfLdIm5bJNGdiFPz9XFdpqa49BNQ==
=SKVi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Unless anything comes from the ARM side, this should be the last pull
request for this release - and it's mostly documentation:
- Document the interaction between KVM_CAP_HALT_POLL and halt_poll_ns
- s390: fix multi-epoch extension in nested guests
- x86: fix uninitialized variable on nested triple fault"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Document the interaction between KVM_CAP_HALT_POLL and halt_poll_ns
KVM: Move halt-polling documentation into common directory
KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULT
KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) field
Get rid of private "nmi_save_areas" slab cache. The only reason this was
introduced years ago was that with some slab debugging options allocations
would only guarantee a minimum alignment of ARCH_KMALLOC_MINALIGN, which
was eight bytes back then. This is not sufficient for the extended machine
check save area.
However since commit 59bb47985c ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)") kmalloc guarantees a power-of-two
alignment even with debugging options enabled.
Therefore the private slab cache can be removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Checking for storage errors in machine check entry code was done in order
to handle also storage errors on kernel page tables. However this is
extremely unlikely and some basic assumptions what works on machine check
entry are necessary anyway. In order to simplify machine check handling
delay checking for storage errors to C code.
With this also change the machine check new PSW to have DAT on, which
simplifies the entry code even further.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In case a system will be stopped because of e.g. missing validity bits
print the machine check interruption code before the system is stopped.
This is helpful, since up to now no message was printed in such a
case. Only a disabled wait PSW was loaded, which doesn't give a hint of
what went wrong.
Improve this by printing a message with debug information.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Introduce sclp_emergency_printk() which can be used to emit a message
in emergency cases. sclp_emergency_printk() is only supposed to be
used in cases where it can be assumed that regular console device
drivers may not work anymore.
For example this may be the case for unrecoverable machine checks.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Keep sclp_early_sccb so it can also be used after initdata has been
freed. This is a prerequisite to allow printing a message from the
machine check handler.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
If a machine check happens in kernel mode, and the machine check
interruption code indicates that e.g. vector register contents in the
machine check area are not valid, the logic is to kill current.
The idea behind this was that if within kernel context vector
registers are not used then it is sufficient to kill the current user
space process to avoid that it continues with potentially corrupt
register contents. This however does not necessarily work, since the
current code does not take into account that a machine check can also
happen when a kernel thread is running (= no user space context), and
in addition there is no way to distinguish between the "previous" and
"next" user process task, if the machine check happens when a task
switch happens.
Given that machine checks with invalid saved register contents in the
machine check save area are extremely rare, simplify the logic: if
register contents are invalid and the previous context was kernel
mode, stop the whole machine. If the previous context was user mode,
kill the corresponding task.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use vector instruction macros instead of byte patterns to increase
readability. The generated code is nearly identical:
- 1e8: e7 0f 10 00 00 36 vlm %v0,%v15,0(%r1)
- 1ee: e7 0f 11 00 0c 36 vlm %v16,%v31,256(%r1)
+ 1e8: e7 0f 10 00 30 36 vlm %v0,%v15,0(%r1),3
+ 1ee: e7 0f 11 00 3c 36 vlm %v16,%v31,256(%r1),3
By using the VLM macro the alignment hint is automatically specified
too. Even though from a performance perspective it doesn't matter at
all for the machine check code, this shows yet another benefit when
using the macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The vector instruction macros can also be used in inline assemblies. For
this the magic
asm(".include \"asm/vx-insn.h\"\n");
must be added to C files in order to avoid that the pre-processor
eliminates the __ASSEMBLY__ guarded macros. This however comes with the
problem that changes to asm/vx-insn.h do not cause a recompile of C files
which have only this magic statement instead of a proper include statement.
This can be observed with the arch/s390/kernel/fpu.c file.
In order to fix this problem and also to avoid that the include must
be specified twice, add a wrapper include header file which will do
all necessary steps.
This way only the vx-insn.h header file needs to be included and changes to
the new vx-insn-asm.h header file cause a recompile of all dependent files
like it should.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
octal values are easier to read and checkpatch also recommends
to use them, so replace all the S_* macros with their counterparts.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to use ECKD disks as dump device
to linux. The new dump type is called 'eckd_dump', parameters
are the same as for eckd ipl.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to IPL from ECKD DASDs to linux.
It introduces a few sysfs files in /sys/firmware/reipl/eckd:
bootprog: the boot program selector
clear: whether to issue a diag308 LOAD_NORMAL or LOAD_CLEAR
device: the device to ipl from
br_chr: Cylinder/Head/Record number to read the bootrecord from.
Might be '0' or 'auto' if it should be read from the
volume label.
scpdata: data to be passed to the ipl'd program.
The new ipl type is called 'eckd'.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
address post-6.0 issues, which is hopefully a sign that things are
converging.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4pQpQAKCRDdBJ7gKXxA
jquxAP9Lqif7CGDgdq8uWY2hHS/Ujc3k7Ohgyzs37olnCuU8KwEA6/J7SpjsBgtY
OfzvnwxpCTh8Kfzu/oNckIHo/EEiIA8=
=o6qT
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"15 hotfixes, 11 marked cc:stable.
Only three or four of the latter address post-6.0 issues, which is
hopefully a sign that things are converging"
* tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible"
Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
mm/khugepaged: fix GUP-fast interaction by sending IPI
mm/khugepaged: take the right locks for page table retraction
mm: migrate: fix THP's mapcount on isolation
mm: introduce arch_has_hw_nonleaf_pmd_young()
mm: add dummy pmd_young() for architectures not having it
mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
madvise: use zap_page_range_single for madvise dontneed
mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
This is the s390 variant of commit d911c67e10 ("x86: kasan: kmsan:
support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN"). Even
though most of the s390 specific checksum code is written in C there is
still the csum_partial() inline assembly which could prevent KASAN and
KMSAN from seeing all memory accesses.
Therefore switch to GENERIC_CSUM if KASAN is enabled just like x86.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Support for power managemant has been removed from s390 since quite
some time. Therefore remove unused power managemant code from the
appldata device driver.
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
In __gmap_segment_gaddr() pmd level page table page is being extracted
from the pmd pointer, similar to pmd_pgtable_page() implementation. This
reduces some redundancy by directly using pmd_pgtable_page() instead,
though first making it available.
Link: https://lkml.kernel.org/r/20221125034502.1559986-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.
However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with the
page table lock because the page had already been made invisible to rmap.
And that is a problem, because while the TLB entry exists, we could end up
with the following situation:
(a) one CPU could come in and clean it, never seeing our mapping of the
page
(b) another CPU could continue to use the stale and dirty TLB entry and
continue to write to said page
resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.
End result: possibly lost dirty data.
This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing the
TLB". It uses the newly introduced 'encoded page pointer' to do that
without having to keep separate data around.
Note, this is complicated by a couple of issues:
- we want to delay the rmap removal, but not past the page table lock,
because that simplifies the memcg accounting
- only SMP configurations want to delay TLB flushing, since on UP
there are obviously no remote TLBs to worry about, and the page
table lock means there are no preemption issues either
- s390 has its own mmu_gather model that doesn't delay TLB flushing,
and as a result also does not want the delayed rmap. As such, we can
treat S390 like the UP case and use a common fallback for the "no
delays" case.
- we can track an enormous number of pages in our mmu_gather structure,
with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
all set up to be approximately 10k pending pages.
We do not want to have a huge number of batched pages that we then
need to check for delayed rmap handling inside the page table lock.
Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps pending,
in such a way that only the last batch (the so-called "active batch") in
the mmu_gather structure can have any delayed entries.
NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.
So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly even
when user space goes off the rails.
[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.
The code currently always sets the flag to zero and doesn't use it yet,
but now it's tracking the type state along. The next step will be to
actually start using it.
Link: https://lkml.kernel.org/r/20221109203051.1835763-3-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a
fallback. This is required for the later patch "mm: introduce
arch_has_hw_nonleaf_pmd_young()".
Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmN/eYwACgkQ41TmuOI4
ufjoxA/9Et38aXO/IhmUt8v0QhA4yec+sc5GSFfQSYehej/1Vqhw0DXx+ORUiRgg
+rtiXJSSqkuD2dL+BDffY2xoul6nzNdVf4AbkcnrWscfWr6xwVYlPvuL0ymGI6J2
U/IPedRoKw0bHw/wHs05yV5PubrRwDFERKhtyXWYGbPJhX0w2n3IFOoKH1oWBhLW
Dc8jEs6t3gDbJ71Er0xoeBUoiuu+PgZG06cpOvzBZ0KclRgjADXyISqqk8/4mu8w
R+/Wf8NcrbQYV1jfCeq5zIsKC8uvnFj25UuyTLumn5vh+dNNsvE72Khe4tz7LI0I
ZPZ+GZuemu7Yi12dKjw4Sw3ui0ejWH/5XL1SVB0X/xYIWrBqOot+Lq6538GCng+c
tJt+zsu64VFgXCCZ8O9qO4uE4DBL70H3ThT7VZxIghSTZtY0xh3uFc64f3/3d9dy
K4WTJHrmMxhXaA/rqtIa8I53JvFl8CztofZATiQQesyPuc7lZ01w1Co5el4xYaxe
YknyMTq11qf/iYqVOW7sjoWW/YRuuMZ4+FhpI3o/SllVdN98iTwkk1kP3wcoBO5P
bvzpm+WXHbv9OxifPrqkqv34+upbjfEmSogHudQzagBX4vl3rZRfBCdQGCAha0Uc
ZYyg68kiil5sWmHI/Ln/ZjANYfbS5sF0CreuWxnmqcwKl2NSN/E=
=/1yt
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
- Fix size of incorrectly increased from four to eight bytes
TOD field of crash dump save area. As result in case of
kdump NT_S390_TODPREG ELF notes section contains correct
value and "detected read beyond size of field" compiler
warning goes away.
- Fix memory leak in cryptographic Adjunct Processors (AP)
module on initialization failure path.
- Add Gerald Schaefer <gerald.schaefer@linux.ibm.com> and
Alexander Gordeev <agordeev@linux.ibm.com> as S390 memory
management maintainers. Also rename the S390 section to
S390 ARCHITECTURE to be a bit more precise.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY4DkjxccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8JrmAP928nAXDg4Gfctxka3tANgqu6uZ
SLQ79G717Hq+OHyYjwD9GHFrKjfaimYocAsyf9jPyKnnac4jVydA42CLdLbV2AI=
=Vl1I
-----END PGP SIGNATURE-----
Merge tag 's390-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Fix size of incorrectly increased from four to eight bytes TOD field
of crash dump save area. As result in case of kdump NT_S390_TODPREG
ELF notes section contains correct value and "detected read beyond
size of field" compiler warning goes away.
- Fix memory leak in cryptographic Adjunct Processors (AP) module on
initialization failure path.
- Add Gerald Schaefer <gerald.schaefer@linux.ibm.com> and Alexander
Gordeev <agordeev@linux.ibm.com> as S390 memory management
maintainers. Also rename the S390 section to S390 ARCHITECTURE to be
a bit more precise.
* tag 's390-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
MAINTAINERS: add S390 MM section
s390/crashdump: fix TOD programmable field size
s390/ap: fix memory leak in ap_init_qci_info()
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We recently experienced some weird huge time jumps in nested guests when
rebooting them in certain cases. After adding some debug code to the epoch
handling in vsie.c (thanks to David Hildenbrand for the idea!), it was
obvious that the "epdx" field (the multi-epoch extension) did not get set
to 0xff in case the "epoch" field was negative.
Seems like the code misses to copy the value from the epdx field from
the guest to the shadow control block. By doing so, the weird time
jumps are gone in our scenarios.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2140899
Fixes: 8fa1696ea7 ("KVM: s390: Multiple Epoch Facility support")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Cc: stable@vger.kernel.org # 4.19+
Link: https://lore.kernel.org/r/20221123090833.292938-1-thuth@redhat.com
Message-Id: <20221123090833.292938-1-thuth@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
The size of the TOD programmable field was incorrectly increased from
four to eight bytes with commit 1a2c5840ac ("s390/dump: cleanup CPU
save area handling").
This leads to an elf notes section NT_S390_TODPREG which has a size of
eight instead of four bytes in case of kdump, however even worse is
that the contents is incorrect: it is supposed to contain only the
contents of the TOD programmable field, but in fact contains a mix of
the TOD programmable field (32 bit upper bits) and parts of the CPU
timer register (lower 32 bits).
Fix this by simply changing the size of the todpreg field within the
save area structure. This will implicitly also fix the size of the
corresponding elf notes sections.
This also gets rid of this compile time warning:
in function ‘fortify_memcpy_chk’,
inlined from ‘save_area_add_regs’ at arch/s390/kernel/crash_dump.c:99:2:
./include/linux/fortify-string.h:413:25: error: call to ‘__read_overflow2_field’
declared with attribute warning: detected read beyond size of field
(2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 1a2c5840ac ("s390/dump: cleanup CPU save area handling")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an unused insn cache ops function:
arch/s390/kernel/kprobes.c:34:1:
error: unused function 'is_kprobe_s390_insn_slot' [-Werror,-Wunused-function]
DEFINE_INSN_CACHE_OPS(s390_insn);
^
./include/linux/kprobes.h:335:20: note: expanded from macro 'DEFINE_INSN_CACHE_OPS'
static inline bool is_kprobe_##__name##_slot(unsigned long addr) \
^
<scratch space>:88:1: note: expanded from here
is_kprobe_s390_insn_slot
^
Move the definition to a private header file, which is also similar to
the generic insn cache ops.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an incompatible function type cast:
CC arch/s390/kernel/debug.o
arch/s390/kernel/debug.c:142:2: error: cast from 'int (*)(debug_info_t *, struct debug_view *, char *, debug_sprintf_entry_t *)' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, debug_sprintf_entry_t *)') to 'debug_format_proc_t *' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, const char *)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
(debug_format_proc_t *)&debug_sprintf_format_fn,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Get rid of this warning by changing debug_sprintf_format_fn() so it matches
the debug_format_proc_t function type, and do the cast of the last
parameter within the function itself.
This is the standard way of handling such cases anyway.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
No point in doing another lookup of irq_data, it's already provided as
an argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/linux-s390/8735aoui07.ffs@tglx/
[agordeev@linux.ibm.com added Link tag]
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Link: https://lore.kernel.org/all/cover.1667336095.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390 allows to enable CONFIG_NUMA, mainly to enable a couple of system
calls which are only present if NUMA is enabled. The NUMA specific system
calls are required by a couple of applications, which wouldn't work if the
system calls wouldn't be present.
The NUMA implementation itself maps all CPUs and memory to node 0. A
special case is the generic percpu setup code, which doesn't expect an s390
like implementation and therefore emits a message/warning:
"percpu: cpu 0 has no node -1 or node-local memory".
In order to get rid of this message, and also to provide sane CPU to node
and CPU distance mappings implement a minimal setup_per_cpu_areas()
function, which is very close to the generic variant.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an unused function:
arch/s390/kvm/interrupt.c:317:20:
error: unused function 'gisa_clear_ipm_gisc' [-Werror,-Wunused-function]
static inline void gisa_clear_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc)
Remove gisa_clear_ipm_gisc(), since it is unused and get rid of this
warning.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20221118151133.2974602-1-hca@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Add the module parameter "async_destroy", to allow the asynchronous
destroy mechanism to be switched off. This might be useful for
debugging purposes.
The parameter is enabled by default since the feature is opt-in anyway.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-7-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-7-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Add support for the Destroy Secure Configuration Fast Ultravisor call,
and take advantage of it for asynchronous destroy.
When supported, the protected guest is destroyed immediately using the
new UVC, leaving only the memory to be cleaned up asynchronously.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-6-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-6-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
If the appropriate UV feature bit is set, there is no need to perform
an export before import.
The misc feature indicates, among other things, that importing a shared
page from a different protected VM will automatically also transfer its
ownership.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-5-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-5-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE to signal that the
KVM_PV_ASYNC_DISABLE and KVM_PV_ASYNC_DISABLE_PREPARE commands for the
KVM_S390_PV_COMMAND ioctl are available.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-4-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-4-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.
This patch implements an asynchronous destroy mechanism, that allows a
protected guest to reboot significantly faster than previously.
This is achieved by clearing the pages of the old guest in background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.
The old protected guest is then only destroyed when all of its memory
has been destroyed or otherwise made non protected.
Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl:
KVM_PV_ASYNC_CLEANUP_PREPARE: set aside the current protected VM for
later asynchronous teardown. The current KVM VM will then continue
immediately as non-protected. If a protected VM had already been
set aside for asynchronous teardown, but without starting the teardown
process, this call will fail. There can be at most one VM set aside at
any time. Once it is set aside, the protected VM only exists in the
context of the Ultravisor, it is not associated with the KVM VM
anymore. Its protected CPUs have already been destroyed, but not its
memory. This command can be issued again immediately after starting
KVM_PV_ASYNC_CLEANUP_PERFORM, without having to wait for completion.
KVM_PV_ASYNC_CLEANUP_PERFORM: tears down the protected VM previously
set aside using KVM_PV_ASYNC_CLEANUP_PREPARE. Ideally the
KVM_PV_ASYNC_CLEANUP_PERFORM PV command should be issued by userspace
from a separate thread. If a fatal signal is received (or if the
process terminates naturally), the command will terminate immediately
without completing. All protected VMs whose teardown was interrupted
will be put in the need_cleanup list. The rest of the normal KVM
teardown process will take care of properly cleaning up all remaining
protected VMs, including the ones on the need_cleanup list.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-2-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-2-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
cmpxchg_user_key() for byte and short values is implemented via a one word
cmpxchg loop. Give up trying to perform the cmpxchg if it fails too often
because of contention on the cache line. This ensures that the thread
cannot become stuck in the kernel.
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/r/20221117100745.3253896-1-scgl@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The cmpxchg implementation for 1 and 2 bytes consists of a 4 byte
cmpxchg loop. Currently, the decision to retry is imprecise, looping if
bits outside the target byte(s) change instead of retrying until the
target byte(s) differ from the old value.
E.g. if an attempt to exchange (prev_left_0 old_bytes prev_right_0) is
made and it fails because the word at the address is
(prev_left_1 x prev_right_1) where both x != old_bytes and one of the
prev_*_1 values differs from the respective prev_*_0 value, the cmpxchg
is retried, even if by a semantic equivalent to a normal cmpxchg, the
exchange would fail.
Instead exit the loop if x != old_bytes and retry otherwise.
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/r/20221116144711.3811011-1-scgl@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add cmpxchg_user_key() which allows to execute a compare and exchange
on a user space address. This allows also to specify a storage key
which makes sure that key-controlled protection is considered.
This is based on a patch written by Janis Schoetterl-Glausch.
Link: https://lore.kernel.org/all/20220930210751.225873-2-scgl@linux.ibm.com
Cc: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/r/Y2J8axs+bcQ2dO/l@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add new exception table type which is able to handle register
pairs. If an exception is recognized on such an instruction the
specified register pair will be zeroed, and the specified error
register will be modified so it contains -EFAULT, similar to the
existing EX_TABLE_UA_LOAD_REG() macro.
Link: https://lore.kernel.org/r/Y2J8RSW2khWLgpPo@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Instead of using a digit for input constraints simply initialize the
corresponding output operand in C code and use a "+" constraint
modifier.
Link: https://lore.kernel.org/r/Y2J8H82B6JhJhrp2@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make variables local to each case label. This limits the scope of
variables and allows to use proper types everywhere.
Link: https://lore.kernel.org/r/Y2J7+HqgAZwnfxsh@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
I/O translation tables on s390 use 8 byte page table entries and tables
which are allocated lazily but only freed when the entire I/O
translation table is torn down. Also each IOVA can at any time only
translate to one physical address Furthermore I/O table accesses by the
IOMMU hardware are cache coherent. With a bit of care we can thus use
atomic updates to manipulate the translation table without having to use
a global lock at all. This is done analogous to the existing I/O
translation table handling code used on Intel and AMD x86 systems.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20221109142903.4080275-6-schnelle@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The s390_domain->devices list is only added to when new devices are
attached but is iterated through in read-only fashion for every mapping
operation as well as for I/O TLB flushes and thus in performance
critical code causing contention on the s390_domain->list_lock.
Fortunately such a read-mostly linked list is a standard use case for
RCU. This change closely follows the example fpr RCU protected list
given in Documentation/RCU/listRCU.rst.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20221109142903.4080275-4-schnelle@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If a zPCI device is in the error state while switching IOMMU domains
zpci_register_ioat() will fail and we would end up with the device not
attached to any domain. In this state since zdev->dma_table == NULL
a reset via zpci_hot_reset_device() would wrongfully re-initialize the
device for DMA API usage using zpci_dma_init_device(). As automatic
recovery is currently disabled while attached to an IOMMU domain this
only affects slot resets triggered through other means but will affect
automatic recovery once we switch to using dma-iommu.
Additionally with that switch common code expects attaching to the
default domain to always work so zpci_register_ioat() should only fail
if there is no chance to recover anyway, e.g. if the device has been
unplugged.
Improve the robustness of attach by specifically looking at the status
returned by zpci_mod_fc() to determine if the device is unavailable and
in this case simply ignore the error. Once the device is reset
zpci_hot_reset_device() will then correctly set the domain's DMA
translation tables.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20221109142903.4080275-2-schnelle@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
- Fix deadlock in discontiguous saved segments (DCSS) block device
driver. When adding a disk and scanning partitions the scan would
not break out early without a missed flag.
- Avoid using global register variable for current_stack_pointer
due to an old bug in gcc versions prior to gcc-8.4. Due to this
bug a broken code is generated, which leads to stack corruptions.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY3edLRccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8MtqAQCIla7FAmztUpqFlW1yzOAb7JLd
9VtTX9lxiAwmKeL3rQD+Ow6nNhJiT7VX1+7YS+1a5lbq0fcckEbHoMtiWxRJVwc=
=eAUy
-----END PGP SIGNATURE-----
Merge tag 's390-6.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Fix deadlock in discontiguous saved segments (DCSS) block device
driver. When adding a disk and scanning partitions the scan would not
break out early without a missed flag.
- Avoid using global register variable for current_stack_pointer due to
an old bug in gcc versions prior to gcc-8.4. Due to this bug a broken
code is generated, which leads to stack corruptions.
* tag 's390-6.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: avoid using global register for current_stack_pointer
s390/dcssblk: fix deadlock when adding a DCSS
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly necessary
in the core ftrace code.
This patch adds new ftrace_regs_{get,set}_*() helpers which can be used
to manipulate ftrace_regs. When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y,
these can always be used on any ftrace_regs, and when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n these can be used when regs are
available. A new ftrace_regs_has_args(fregs) helper is added which code
can use to check when these are usable.
Co-developed-by: Florent Revest <revest@chromium.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In subsequent patches we'll add a sew of ftrace_regs_{get,set}_*()
helpers. In preparation, this patch renames
ftrace_instruction_pointer_set() to
ftrace_regs_set_instruction_pointer().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly
necessary in the core ftrace code.
This patch changes the prototype of arch_ftrace_set_direct_caller() to
take ftrace_regs rather than pt_regs, and moves the extraction of the
pt_regs into arch_ftrace_set_direct_caller().
On x86, arch_ftrace_set_direct_caller() can be used even when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n, and <linux/ftrace.h> defines
struct ftrace_regs. Due to this, it's necessary to define
arch_ftrace_set_direct_caller() as a macro to avoid using an incomplete
type. I've also moved the body of arch_ftrace_set_direct_caller() after
the CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y defineidion of struct
ftrace_regs.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
It's very unusual to have both a command line option and a compile time
option, and apparently that's confusing to people. Also, basically
everybody enables the compile time option now, which means people who
want to disable this wind up having to use the command line option to
ensure that anyway. So just reduce the number of moving pieces and nix
the compile time option in favor of the more versatile command line
option.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
The generic vga.h contains a couple of defines, which do no harm on
s390. Therefore use the generic version and git rid of the s390
specific empty header file.
Suggested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use generic shmparam.h header file since the contents are identical.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use the generic bugs.h header file. Except for an
excellent comment the header files are identical.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
There is no serial driver on s390, especially none that relies on a
bogus BASE_BAUD define. Therefore use the generic header file.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Commit 30de14b188 ("s390: current_stack_pointer shouldn't be a
function") made current_stack_pointer a global register variable like
on many other architectures. Unfortunately on s390 it uncovers old
gcc bug which is fixed only since gcc-9.1 [gcc commit 3ad7fed1cc87
("S/390: Fix PR89775. Stackpointer save/restore instructions removed")]
and backported to gcc-8.4 and later. Due to this bug gcc versions prior
to 8.4 generate broken code which leads to stack corruptions.
Current minimal gcc version required to build the kernel is declared
as 5.1. It is not possible to fix all old gcc versions, so work
around this problem by avoiding using global register variable for
current_stack_pointer.
Fixes: 30de14b188 ("s390: current_stack_pointer shouldn't be a function")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
- fix memcpy warning about field-spanning write in zcrypt driver.
- minor updates to defconfigs.
- Remove CONFIG_DEBUG_INFO_BTF from all defconfigs and add btf.config
addon config file. It significantly decreases compile time and allows
quickly enabling that option into the current kernel config.
- Add kasan.config addon config file which allows to easily enable
KASAN into the current kernel config.
- binutils commit 906f69cf65da ("IBM zSystems: Issue error for *DBL
relocs on misaligned symbols") caused several link errors.
Always build relocatable kernel to avoid this problem.
- Raise the minimum clang version to 15.0.0 to avoid silent generation
of a corrupted code.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY2566RccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8M0nAQDRrcyOb3BILKYNlYkb9H8Cw/0x
TMl/zEqcjD14XBGXXAD+L6M9nNuWd8GKLYzE7AxEeQ80kAQLqUPGGxhjc09hqAI=
=gkiM
-----END PGP SIGNATURE-----
Merge tag 's390-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- fix memcpy warning about field-spanning write in zcrypt driver
- minor updates to defconfigs
- remove CONFIG_DEBUG_INFO_BTF from all defconfigs and add btf.config
addon config file. It significantly decreases compile time and allows
quickly enabling that option into the current kernel config
- add kasan.config addon config file which allows to easily enable
KASAN into the current kernel config
- binutils commit 906f69cf65da ("IBM zSystems: Issue error for *DBL
relocs on misaligned symbols") caused several link errors. Always
build relocatable kernel to avoid this problem
- raise the minimum clang version to 15.0.0 to avoid silent generation
of a corrupted code
* tag 's390-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
scripts/min-tool-version.sh: raise minimum clang version to 15.0.0 for s390
s390: always build relocatable kernel
s390/configs: add kasan.config addon config file
s390/configs: move CONFIG_DEBUG_INFO_BTF into btf.config addon config
s390: update defconfigs
s390/zcrypt: fix warning about field-spanning write
Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP for s390.
With this, vmemmap pages used to back struct pages for compound tail
pages of hugetlb pages are freed and remapped to compound head page
frame as RO, see also Documentation/vm/vmemmap_dedup.rst.
For 1M hugetlb pages, this results in freeing 3 of 4 vmemmap pages,
saving 12K of memory for each 1M hugetlb page (~1.2%).
/sys/kernel/debug/kernel_page_tables will show the impact:
---[ vmemmap Area Start ]---
[...]
0x0000037202d84000-0x0000037202d85000 4K PTE RW NX
0x0000037202d85000-0x0000037202d88000 12K PTE RO NX
For 2G hugetlb pages, this results in freeing 8191 of 8192 vmemmap
pages, saving 32764K of memory for each 2G hugetlb page (~1.6%)
/sys/kernel/debug/kernel_page_tables will show the impact:
---[ vmemmap Area Start ]---
[...]
0x000003720a000000-0x000003720a001000 4K PTE RW NX
0x000003720a001000-0x000003720c000000 32764K PTE RO NX
The memory savings come with some costs:
- vmemmap mapping for compound hugetlb pages is not a PMD mapping any
more, but split to 4K PTE mappings, and it will not be coalesced back
to PMD mapping after freeing hugetlb pages from the pool.
Apart from theoretical performance impact, this will also (slightly)
relativize the memory savings because of additional 2K PTE pagetable
allocations.
- Workload using "on the fly" hugetlb allocations via
"nr_overcommit_hugepages" instead of using the hugetlb pool via
"nr_hugepages" will suffer from considerably increased fault handling
time, see also description from commit 78f39084b4
("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl").
- Freeing hugetlb pages from the pool will require re-allocation of the
freed struct pages, and therefore needs some memory available to the
kernel. This might fail in memory constrained scenarios.
- For the same reason, memory offline might fail even for ZONE_MOVABLE
when hugetlb pages are present (but not for s390, since we do not
support ARCH_ENABLE_HUGEPAGE_MIGRATION, and therefore cannot have
hugetlb pages in ZONE_MOVABLE).
- General increased complexity and overhead in kernel handling of
compound (head) pages.
Therefore, this feature is disabled by default, and has to be enabled
explicitly either by adding "hugetlb_free_vmemmap=on" kernel parameter,
or during run-time via "/proc/sys/vm/hugetlb_optimize_vmemmap" sysctl.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
directory (i.e. not from arch/*/include) for the sole purpose of retrieving
irqchip_in_kernel.
Making the function inline in a header that is already included,
such as asm/kvm_host.h, is not possible because it needs to look at
struct kvm which is defined after asm/kvm_host.h is included. So add a
kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
only performance critical on arm64 and x86, and the non-inline function
is enough on all other architectures.
irq.h can then be deleted from all architectures except x86.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most architectures (except arm64/x86/sparc) simply return 1 for
kern_addr_valid(), which is only used in read_kcore(), and it calls
copy_from_kernel_nofault() which could check whether the address is a
valid kernel address. So as there is no need for kern_addr_valid(), let's
remove it.
Link: https://lkml.kernel.org/r/20221018074014.185687-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Helge Deller <deller@gmx.de> [parisc]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Guo Ren <guoren@kernel.org> [csky]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: <aou@eecs.berkeley.edu>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Chancellor reported several link errors on s390 with
CONFIG_RELOCATABLE disabled, after binutils commit 906f69cf65da ("IBM
zSystems: Issue error for *DBL relocs on misaligned symbols"). The binutils
commit reveals potential miscompiles that might have happened already
before with linker script defined symbols at odd addresses.
A similar bug was recently fixed in the kernel with commit c9305b6c1f
("s390: fix nospec table alignments").
See https://github.com/ClangBuiltLinux/linux/issues/1747 for an analysis
from Ulich Weigand.
Therefore always build a relocatable kernel to avoid this problem. There is
hardly any use-case for non-relocatable kernels, so this shouldn't be
controversial.
Link: https://github.com/ClangBuiltLinux/linux/issues/1747
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20221030182202.2062705-1-hca@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Add kasan.config addon config file which allows to easily enable KASAN
into the current kernel config.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
CONFIG_DEBUG_INFO_BTF significantly increases compile time for the
kernel. E.g. when changing a single C file compile time for a new bzImage
is increased by ~50% if BTF debug info is generated.
Therefore remove CONFIG_DEBUG_INFO_BTF from all defconfigs and introduce a
btf.config addon config file. Quickly enabling CONFIG_DEBUG_INFO_BTF into
the current kernel config can be done by simply invoking
make btf.config
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
swiotlb passes virtual addresses to set_memory_encrypted() and
set_memory_decrypted(), but uv_remove_shared() and uv_set_shared()
expect physical addresses. This currently works, because virtual
and physical addresses are the same.
Add virt_to_phys() to resolve the virtual-physical confusion.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20221107121221.156274-2-nrb@linux.ibm.com
Message-Id: <20221107121221.156274-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
The 'kzdev' field of struct 'zpci_aift' is an array of pointers to
'kvm_zdev' structs. Allocate the proper size accordingly.
Reported by Coccinelle:
WARNING: Use correct pointer type argument for sizeof
Fixes: 98b1d33dac ("KVM: s390: pci: do initial setup for AEN interpretation")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20221026013234.960859-1-rafaelmendsr@gmail.com
Message-Id: <20221026013234.960859-1-rafaelmendsr@gmail.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
When running under PV, the guest's TOD clock is under control of the
ultravisor and the hypervisor isn't allowed to change it. Hence, don't
allow userspace to change the guest's TOD clock by returning
-EOPNOTSUPP.
When userspace changes the guest's TOD clock, KVM updates its
kvm.arch.epoch field and, in addition, the epoch field in all state
descriptions of all VCPUs.
But, under PV, the ultravisor will ignore the epoch field in the state
description and simply overwrite it on next SIE exit with the actual
guest epoch. This leads to KVM having an incorrect view of the guest's
TOD clock: it has updated its internal kvm.arch.epoch field, but the
ultravisor ignores the field in the state description.
Whenever a guest is now waiting for a clock comparator, KVM will
incorrectly calculate the time when the guest should wake up, possibly
causing the guest to sleep for much longer than expected.
With this change, kvm_s390_set_tod() will now take the kvm->lock to be
able to call kvm_s390_pv_is_protected(). Since kvm_s390_set_tod_clock()
also takes kvm->lock, use __kvm_s390_set_tod_clock() instead.
The function kvm_s390_set_tod_clock is now unused, hence remove it.
Update the documentation to indicate the TOD clock attr calls can now
return -EOPNOTSUPP.
Fixes: 0f30350471 ("KVM: s390: protvirt: Do only reset registers that are accessible")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20221011160712.928239-2-nrb@linux.ibm.com
Message-Id: <20221011160712.928239-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
The struct s390_domain_device serves the sole purpose as list entry for
the devices list of a struct s390_domain. As it contains no additional
information besides a list_head and a pointer to the struct zpci_dev we
can simplify things and just thread the device list through struct
zpci_dev directly. This removes the need to allocate during domain
attach and gets rid of one level of indirection during mapping
operations.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20221025115657.1666860-3-schnelle@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).
Notably:
- HW breakpoint PMU
- ARM big.little PMU / Intel ADL PMU
- Intel Branch Monitoring PMU
- AMD IBS PMU
- S390 cpum_cf PMU
- PowerPC trace_imc PMU
*Current design:*
Currently we have a per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task related perf_event_context's have a
pointer back to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.
Each perf_event is then associated with its PMU and one
perf_event_context.
*Proposed design:*
New design proposed by this patch reduce to a single task context and
a single CPU context but adds some intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
With the new design, perf_event_context will hold all events for all
pmus in the (respective pinned/flexible) rbtrees. This can be achieved
by adding pmu to rbtree key:
{cpu, pmu, cgroup, group_index}
Each perf_event_context carries a list of perf_event_pmu_context which
is used to hold per-pmu-per-context state. For example, it keeps track
of currently active events for that pmu, a pmu specific task_ctx_data,
a flag to tell whether rotation is required or not etc.
Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu
state like hrtimer details to drive the event rotation, a pointer to
perf_event_pmu_context of currently running task and some other
ancillary information.
Each perf_event is associated to it's pmu, perf_event_context and
perf_event_pmu_context.
Further optimizations to current implementation are possible. For
example, ctx_resched() can be optimized to reschedule only single pmu
events.
Much thanks to Ravi for picking this up and pushing it towards
completion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
Commit 838d9bb62d ("perf: Use sample_flags for raw_data")
changed the way the raw data of an event is collected.
Adjust the PMU pai_ext to the new scheme.
Fixes: 838d9bb62d ("perf: Use sample_flags for raw_data")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This patch enhances the kernel image adding a trailer as required for
secure boot by future firmware versions.
Cc: <stable@vger.kernel.org> # 5.2+
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For some exception types the instruction address points behind the
instruction that caused the exception. Take that into account and add
the missing exception table entry.
Cc: <stable@vger.kernel.org>
Fixes: f058599e22 ("s390/pci: Fix s390_mmio_read/write with MIO")
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For some exception types the instruction address points behind the
instruction that caused the exception. Take that into account and add
the missing exception table entry.
Cc: <stable@vger.kernel.org>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For some exception types the instruction address points behind the
instruction that caused the exception. Take that into account and add
the missing exception table entries.
Cc: <stable@vger.kernel.org>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename structure member users to active_events to make it consistent
with PMU pai_ext. Also use the same prefix syntax for increment and
decrement operators in both PMUs.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rework the mapped buffer reference count in PMU pai_crypto
to match the same technique as in PMU pai_ext.
This simplifies the logic.
Do not count the individual number of counter and sampling
processes. Remember the type of access and the total number of
references to the buffer.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move enum definition to header file. This is done in preparation
for a follow on patch where this enum will be used in another source
file.
Also change the enum name from paiext_mode to paievt_mode
to indicate this enum is now used for several events.
Make naming consistent and rename PAI_MODE_COUNTER to PAI_MODE_COUNTING.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
pin_guest_page() used page_to_virt() to calculate the hpa of the pinned
page. This currently works, because virtual and physical addresses are
the same. Use page_to_phys() instead to resolve the virtual-real address
confusion.
One caller of pin_guest_page() actually expected the hpa to be a hva, so
add the missing phys_to_virt() conversion here.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20221025082039.117372-2-nrb@linux.ibm.com
Message-Id: <20221025082039.117372-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>