Specify how is SRSO mitigated when SMT is disabled. Also, correct the
SMT check for that.
Fixes: e9fbc47b81 ("x86/srso: Disable the mitigation on unaffected configurations")
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20230814200813.p5czl47zssuej7nv@treble
Similar to how it doesn't make sense to have UNTRAIN_RET have two
untrain calls, it also doesn't make sense for VMEXIT to have an extra
IBPB call.
This cures VMEXIT doing potentially unret+IBPB or double IBPB.
Also, the (SEV) VMEXIT case seems to have been overlooked.
Redefine the meaning of the synthetic IBPB flags to:
- ENTRY_IBPB -- issue IBPB on entry (was: entry + VMEXIT)
- IBPB_ON_VMEXIT -- issue IBPB on VMEXIT
And have 'retbleed=ibpb' set *BOTH* feature flags to ensure it retains
the previous behaviour and issues IBPB on entry+VMEXIT.
The new 'srso=ibpb_vmexit' option only sets IBPB_ON_VMEXIT.
Create UNTRAIN_RET_VM specifically for the VMEXIT case, and have that
check IBPB_ON_VMEXIT.
All this avoids having the VMEXIT case having to check both ENTRY_IBPB
and IBPB_ON_VMEXIT and simplifies the alternatives.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.109557833@infradead.org
Since there can only be one active return_thunk, there only needs be
one (matching) untrain_ret. It fundamentally doesn't make sense to
allow multiple untrain_ret at the same time.
Fold all the 3 different untrain methods into a single (temporary)
helper stub.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.042774962@infradead.org
Rename the original retbleed return thunk and untrain_ret to
retbleed_return_thunk() and retbleed_untrain_ret().
No functional changes.
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.909378169@infradead.org
Use the existing configurable return thunk. There is absolute no
justification for having created this __x86_return_thunk alternative.
To clarify, the whole thing looks like:
Zen3/4 does:
srso_alias_untrain_ret:
nop2
lfence
jmp srso_alias_return_thunk
int3
srso_alias_safe_ret: // aliasses srso_alias_untrain_ret just so
add $8, %rsp
ret
int3
srso_alias_return_thunk:
call srso_alias_safe_ret
ud2
While Zen1/2 does:
srso_untrain_ret:
movabs $foo, %rax
lfence
call srso_safe_ret (jmp srso_return_thunk ?)
int3
srso_safe_ret: // embedded in movabs instruction
add $8,%rsp
ret
int3
srso_return_thunk:
call srso_safe_ret
ud2
While retbleed does:
zen_untrain_ret:
test $0xcc, %bl
lfence
jmp zen_return_thunk
int3
zen_return_thunk: // embedded in the test instruction
ret
int3
Where Zen1/2 flush the BTB entry using the instruction decoder trick
(test,movabs) Zen3/4 use BTB aliasing. SRSO adds a return sequence
(srso_safe_ret()) which forces the function return instruction to
speculate into a trap (UD2). This RET will then mispredict and
execution will continue at the return site read from the top of the
stack.
Pick one of three options at boot (evey function can only ever return
once).
[ bp: Fixup commit message uarch details and add them in a comment in
the code too. Add a comment about the srso_select_mitigation()
dependency on retbleed_select_mitigation(). Add moar ifdeffery for
32-bit builds. Add a dummy srso_untrain_ret_alias() definition for
32-bit alternatives needing the symbol. ]
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.842775684@infradead.org
There is infrastructure to rewrite return thunks to point to any
random thunk one desires, unwrap that from CALL_THUNKS, which up to
now was the sole user of that.
[ bp: Make the thunks visible on 32-bit and add ifdeffery for the
32-bit builds. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.775293785@infradead.org
Skip the srso cmd line parsing which is not needed on Zen1/2 with SMT
disabled and with the proper microcode applied (latter should be the
case anyway) as those are not affected.
Fixes: 5a15d83488 ("x86/srso: Tie SBPB bit setting to microcode patch detection")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230813104517.3346-1-bp@alien8.de
Initially, it was thought that doing an innocuous division in the #DE
handler would take care to prevent any leaking of old data from the
divider but by the time the fault is raised, the speculation has already
advanced too far and such data could already have been used by younger
operations.
Therefore, do the innocuous division on every exit to userspace so that
userspace doesn't see any potentially old data from integer divisions in
kernel space.
Do the same before VMRUN too, to protect host data from leaking into the
guest too.
Fixes: 77245f1c3c ("x86/CPU/AMD: Do not leak quotient data after a division by 0")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230811213824.10025-1-bp@alien8.de
leads to an EFI config table ending up unmapped
- Use the correct segment selector in the 32-bit version of getcpu() in
the vDSO
- Make sure vDSO and VVAR regions are placed in the 47-bit VA range even
on 5-level paging systems
- Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTXVWMACgkQEsHwGGHe
VUrrpg//f0BFZV4dbANuYAX47A/hhrm8n5KwV9Mjq8JDeG/TumUN/vAK9+n8outO
pLpPOoXRSCSPybXBSmI1ryVwLaDq6BJy0fYyq4kHOvBFXYKodfTNdO8Oec+le3V/
DBGnY6TQ5x1PgWuAUE9WoeGQuTN1d3Dxm0V/pG2LU/qW3mr+GlBXsSKUVjwp9/OW
JPNw1XQDFVuxT+heRLxQONPkdTxkwwKxZBDRwnSSj7chbJ/jSbnX9a5xinwBvMZY
Q6nelt/AMwAgfO2oz38y1tnR0bfd8fM08SUUgoajWWghKemZK5uNAgZZJd4tPNq6
lBonNc6jF9UGohzgQbNOAjfmDomtyxe4JYGl2SnyWcfwFzGANpcSKSbM9H91LMaI
Sh7hKykZQNGmDctbLsxvlPFgIoxWFLK1DfNeM2yQRzlayqDF8CRlDIcWXHEtYOXf
AOZFMJtOJBZjLbSeWIBW278atNG3ONWEb75kqiKRYxsX9QzvwAZYCYZH+NGe6sLO
kkCm7g3NcAItf1qrPNGZd/k9fA/+3RXEWNdjYsmegMvU8vQBPY0w4NVwGtU9LCkq
jspQxnNlVy1ayqr/TQXRhzn5+d7CQ1PLNwVsGh4S+diCEFu2aEdOhrBhS1uaujLv
5iLpmyyh0yO9aHebK/u4cciAvwVB7WuzQqWYIGzdsolrc+lbY54=
=zPFr
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Do not parse the confidential computing blob on non-AMD hardware as
it leads to an EFI config table ending up unmapped
- Use the correct segment selector in the 32-bit version of getcpu() in
the vDSO
- Make sure vDSO and VVAR regions are placed in the 47-bit VA range
even on 5-level paging systems
- Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs
* tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h
x86/vdso: Choose the right GDT_ENTRY_CPUNODE for 32-bit getcpu() on 64-bit kernel
x86/sev: Do not try to parse for the CC blob on non-AMD hardware
Commit
522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
provided a fix for the Zen2 VZEROUPPER data corruption bug affecting
a range of CPU models, but the AMD Custom APU 0405 found on SteamDeck
was not listed, although it is clearly affected by the vulnerability.
Add this CPU variant to the Zenbleed erratum list, in order to
unconditionally enable the fallback fix until a proper microcode update
is available.
Fixes: 522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230811203705.1699914-1-cristian.ciocaltea@collabora.com
Under certain circumstances, an integer division by 0 which faults, can
leave stale quotient data from a previous division operation on Zen1
microarchitectures.
Do a dummy division 0/1 before returning from the #DE exception handler
in order to avoid any leaks of potentially sensitive data.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add Base GDS mitigation
* Support GDS_NO under KVM
* Fix a documentation typo
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTJh5YACgkQaDWVMHDJ
krAzAw/8DzjhAYEa7a1AodCBMNg8uNOPnLNoRPPNhaN5Iw6W3zXYDBDKT9PyjAIx
RoIM0aHx/oY9nCpK441o25oCWAAyzk6E5/+q9hMa7B4aHUGKqiDUC6L9dC8UiiSN
yvoBv4g7F81QnmyazwYI64S6vnbr4Cqe7K/mvVqQ/vbJiugD25zY8mflRV9YAuMk
Oe7Ff/mCA+I/kqyKhJE3cf3qNhZ61FsFI886fOSvIE7g4THKqo5eGPpIQxR4mXiU
Ri2JWffTaeHr2m0sAfFeLH4VTZxfAgBkNQUEWeG6f2kDGTEKibXFRsU4+zxjn3gl
xug+9jfnKN1ceKyNlVeJJZKAfr2TiyUtrlSE5d+subIRKKBaAGgnCQDasaFAluzd
aZkOYz30PCebhN+KTrR84FySHCaxnev04jqdtVGAQEDbTvyNagFUdZFGhWijJShV
l2l4A0gFSYJmPfPVuuAwOJnnZtA1sRH9oz/Sny3+z9BKloZh+Nc/+Cu9zC8SLjaU
BF3Qv2gU9HKTJ+MSy2JrGS52cONfpO5ngFHoOMilZ1KBHrfSb1eiy32PDT+vK60Y
PFEmI8SWl7bmrO1snVUCfGaHBsHJSu5KMqwBGmM4xSRzJpyvRe493xC7+nFvqNLY
vFOFc4jGeusOXgiLPpfGduppkTGcM7sy75UMLwTSLcQbDK99mus=
=ZAPY
-----END PGP SIGNATURE-----
Merge tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/gds fixes from Dave Hansen:
"Mitigate Gather Data Sampling issue:
- Add Base GDS mitigation
- Support GDS_NO under KVM
- Fix a documentation typo"
* tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Fix backwards on/off logic about YMM support
KVM: Add GDS_NO support to KVM
x86/speculation: Add Kconfig option for GDS
x86/speculation: Add force option to GDS mitigation
x86/speculation: Add Gather Data Sampling mitigation
vulnerability on AMD processors. In short, this is yet another issue
where userspace poisons a microarchitectural structure which can then be
used to leak privileged information through a side channel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTQs1gACgkQEsHwGGHe
VUo1UA/8C34PwJveZDcerdkaxSF+WKx7AjOI/L2ws1qn9YVFA3ItFMgVuFTrlY6c
1eYKYB3FS9fVN3KzGOXGyhho6seHqfY0+8cyYupR+PVLn9rSy7GqHaIMr37FdQ2z
yb9xu26v+gsvuPEApazS6MxijYS98u71rHhmg97qsHCnUiMJ01+TaGucntukNJv8
FfwjZJvgeUiBPQ/6IeA/O0413tPPJ9weawPyW+sV1w7NlXjaUVkNXwiq/Xxbt9uI
sWwMBjFHpSnhBRaDK8W5Blee/ZfsS6qhJ4jyEKUlGtsElMnZLPHbnrbpxxqA9gyE
K+3ZhoHf/W1hhvcZcALNoUHLx0CvVekn0o41urAhPfUutLIiwLQWVbApmuW80fgC
DhPedEFu7Wp6Okj5+Bqi/XOsOOWN2WRDSzdAq10o1C+e+fzmkr6y4E6gskfz1zXU
ssD9S4+uAJ5bccS5lck4zLffsaA03nAYTlvl1KRP4pOz5G9ln6eyO20ar1WwfGAV
o5ZsTJVGQMyVA49QFkksj+kOI3chkmDswPYyGn2y8OfqYXU4Ip4eN+VkjorIAo10
zIec3Z0bCGZ9UUMylUmdtH3KAm8q0wVNoFrUkMEmO8j6nn7ew2BhwLMn4uu+nOnw
lX2AG6PNhRLVDVaNgDsWMwejaDsitQPoWRuCIAZ0kQhbeYuwfpM=
=73JY
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/srso fixes from Borislav Petkov:
"Add a mitigation for the speculative RAS (Return Address Stack)
overflow vulnerability on AMD processors.
In short, this is yet another issue where userspace poisons a
microarchitectural structure which can then be used to leak privileged
information through a side channel"
* tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Tie SBPB bit setting to microcode patch detection
x86/srso: Add a forgotten NOENDBR annotation
x86/srso: Fix return thunks in generated code
x86/srso: Add IBPB on VMEXIT
x86/srso: Add IBPB
x86/srso: Add SRSO_NO support
x86/srso: Add IBPB_BRTYPE support
x86/srso: Add a Speculative RAS Overflow mitigation
x86/bugs: Increase the x86 bugs vector size to two u32s
The SBPB bit in MSR_IA32_PRED_CMD is supported only after a microcode
patch has been applied so set X86_FEATURE_SBPB only then. Otherwise,
guests would attempt to set that bit and #GP on the MSR write.
While at it, make SMT detection more robust as some guests - depending
on how and what CPUID leafs their report - lead to cpu_smt_control
getting set to CPU_SMT_NOT_SUPPORTED but SRSO_NO should be set for any
guest incarnation where one simply cannot do SMT, for whatever reason.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
injection protection (STIBP) for user processes. Enable STIBP on such
systems.
- Do not delete (but put the ref instead) of AMD MCE error thresholding
sysfs kobjects when destroying them in order not to delete the kernfs
pointer prematurely
- Restore annotation in ret_from_fork_asm() in order to fix kthread
stack unwinding from being marked as unreliable and thus breaking
livepatching
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTGLFUACgkQEsHwGGHe
VUpgDRAAm3uatlqiY2M1Gu9BMMmchTkjr2Fq06TmDQ53SGc6FqLKicltBCZsxbrm
kOrAtmw0jYPTTzqiDy8llyAt+1BC200nAKWTABKhKBrgUiD2crIIC8Rr6YycZ4tm
ueepk4CCxzh+ffcvGau2OuH05SHwQLeTNPr5Rgk4BlVPToaMdXAJChZA/JXsj4gR
3EiWV5/UnC6znzmQKN5PG+BmDrrOlsyDCJXYBVH+vQFa0Udit/rx0YZQ5ZOcD8Tn
D7Ix10pGQV/ESOsD+UFq/u1LPZvJSD2GDsMpWitrw65wnC2TF/XTxBc+pK0mbyKL
3XmH2NPlp1igv3EZ3hltXUcw6Rv8u3hX7VE5S+eQ0FRXJGjxSwoLC9ndw28oPful
FlMjrmI9SE5ojssZ6evLN0/dPXHEz8HvRgw5UTy5I+RqpelMWtML5iDIipaMwoUT
yB9JNIsufY1CM1IHiZBVLZkqIl8X8RtllbJR/RWGfYEHuiXworumgMDp9MsEFY2C
MHr9+/j9E1vU71CvjIYAaJCfWU1Ce+lYCUZ+1SxyDDe3watJKlduuAXbalmyYe0w
ExE5Wt+3ghOzwgj4OtofUivXLWMXr4IgpKliO5TrZ3lGyS3LWQv1dJstCZUnknLZ
A5D/qUSvIXkUdrJbkXrYLQJxtd0ambHc+6ymAIjtMBM8/HF0pR4=
=49ii
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- AMD's automatic IBRS doesn't enable cross-thread branch target
injection protection (STIBP) for user processes. Enable STIBP on such
systems.
- Do not delete (but put the ref instead) of AMD MCE error thresholding
sysfs kobjects when destroying them in order not to delete the kernfs
pointer prematurely
- Restore annotation in ret_from_fork_asm() in order to fix kthread
stack unwinding from being marked as unreliable and thus breaking
livepatching
* tag 'x86_urgent_for_v6.5_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled
x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks
x86: Fix kthread unwind
Add the option to flush IBPB only on VMEXIT in order to protect from
malicious guests but one otherwise trusts the software that runs on the
hypervisor.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add the option to mitigate using IBPB on a kernel entry. Pull in the
Retbleed alternative so that the IBPB call from there can be used. Also,
if Retbleed mitigation is done using IBPB, the same mitigation can and
must be used here.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add support for the synthetic CPUID flag which "if this bit is 1,
it indicates that MSR 49h (PRED_CMD) bit 0 (IBPB) flushes all branch
type predictions from the CPU branch predictor."
This flag is there so that this capability in guests can be detected
easily (otherwise one would have to track microcode revisions which is
impossible for guests).
It is also needed only for Zen3 and -4. The other two (Zen1 and -2)
always flush branch type predictions by default.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add a mitigation for the speculative return address stack overflow
vulnerability found on AMD processors.
The mitigation works by ensuring all RET instructions speculate to
a controlled location, similar to how speculation is controlled in the
retpoline sequence. To accomplish this, the __x86_return_thunk forces
the CPU to mispredict every function return using a 'safe return'
sequence.
To ensure the safety of this mitigation, the kernel must ensure that the
safe return sequence is itself free from attacker interference. In Zen3
and Zen4, this is accomplished by creating a BTB alias between the
untraining function srso_untrain_ret_alias() and the safe return
function srso_safe_ret_alias() which results in evicting a potentially
poisoned BTB entry and using that safe one for all function returns.
In older Zen1 and Zen2, this is accomplished using a reinterpretation
technique similar to Retbleed one: srso_untrain_ret() and
srso_safe_ret().
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not
provide protection to processes running at CPL3/user mode, see section
"Extended Feature Enable Register (EFER)" in the APM v2 at
https://bugzilla.kernel.org/attachment.cgi?id=304652
Explicitly enable STIBP to protect against cross-thread CPL3
branch target injections on systems with Automatic IBRS enabled.
Also update the relevant documentation.
Fixes: e7862eda30 ("x86/cpu: Support AMD Automatic IBRS")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230720194727.67022-1-kim.phillips@amd.com
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
Therefore, the threshold_bank structure for bank 4, and its threshold_block
structures, will be initialized once at boot time. And the kobject for the
shared bank will be added to each of the CPUs that share it. Furthermore,
the threshold_blocks for the shared bank will be added again to the bank's
kobject. These additions will increase the refcount for the bank's kobject.
For example, a shared bank with two blocks and shared across two CPUs will
be set up like this:
CPU0 init
bank create and add; bank refcount = 1; threshold_create_bank()
block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
CPU1 init
bank add; bank refcount = 3; threshold_create_bank()
block 0 add; bank refcount = 4; __threshold_add_blocks()
block 1 add; bank refcount = 5; __threshold_add_blocks()
Currently in threshold_remove_bank(), if the bank is shared then
__threshold_remove_blocks() is called. Here the shared bank's kobject and
the bank's blocks' kobjects are deleted. This is done on the first call
even while the structures are still shared. Subsequent calls from other
CPUs that share the structures will attempt to delete the kobjects.
During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.
Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.
However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL.
Use kobject_put() on the shared bank's kobject when "removing" blocks. This
decrements the bank's refcount while keeping kobjects enabled until the
bank is no longer shared. At that point, kobject_put() will be called on
the blocks which drives their refcount to 0 and deletes them and also
decrementing the bank's refcount. And finally kobject_put() will be called
on the bank driving its refcount to 0 and deleting it.
The same example above:
CPU1 shutdown
bank is shared; bank refcount = 5; threshold_remove_bank()
block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
CPU0 shutdown
bank is no longer shared; bank refcount = 3; threshold_remove_bank()
block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
put bank; bank refcount = 0; threshold_remove_bank()
Fixes: 7f99cb5e60 ("x86/CPU/AMD: Use default_groups in kobj_type")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
Gather Data Sampling (GDS) is a transient execution attack using
gather instructions from the AVX2 and AVX512 extensions. This attack
allows malicious code to infer data that was previously stored in
vector registers. Systems that are not vulnerable to GDS will set the
GDS_NO bit of the IA32_ARCH_CAPABILITIES MSR. This is useful for VM
guests that may think they are on vulnerable systems that are, in
fact, not affected. Guests that are running on affected hosts where
the mitigation is enabled are protected as if they were running
on an unaffected system.
On all hosts that are not affected or that are mitigated, set the
GDS_NO bit.
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Gather Data Sampling (GDS) is mitigated in microcode. However, on
systems that haven't received the updated microcode, disabling AVX
can act as a mitigation. Add a Kconfig option that uses the microcode
mitigation if available and disables AVX otherwise. Setting this
option has no effect on systems not affected by GDS. This is the
equivalent of setting gather_data_sampling=force.
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
The Gather Data Sampling (GDS) vulnerability allows malicious software
to infer stale data previously stored in vector registers. This may
include sensitive data such as cryptographic keys. GDS is mitigated in
microcode, and systems with up-to-date microcode are protected by
default. However, any affected system that is running with older
microcode will still be vulnerable to GDS attacks.
Since the gather instructions used by the attacker are part of the
AVX2 and AVX512 extensions, disabling these extensions prevents gather
instructions from being executed, thereby mitigating the system from
GDS. Disabling AVX2 is sufficient, but we don't have the granularity
to do this. The XCR0[2] disables AVX, with no option to just disable
AVX2.
Add a kernel parameter gather_data_sampling=force that will enable the
microcode mitigation if available, otherwise it will disable AVX on
affected systems.
This option will be ignored if cmdline mitigations=off.
This is a *big* hammer. It is known to break buggy userspace that
uses incomplete, buggy AVX enumeration. Unfortunately, such userspace
does exist in the wild:
https://www.mail-archive.com/bug-coreutils@gnu.org/msg33046.html
[ dhansen: add some more ominous warnings about disabling AVX ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Gather Data Sampling (GDS) is a hardware vulnerability which allows
unprivileged speculative access to data which was previously stored in
vector registers.
Intel processors that support AVX2 and AVX512 have gather instructions
that fetch non-contiguous data elements from memory. On vulnerable
hardware, when a gather instruction is transiently executed and
encounters a fault, stale data from architectural or internal vector
registers may get transiently stored to the destination vector
register allowing an attacker to infer the stale data using typical
side channel techniques like cache timing attacks.
This mitigation is different from many earlier ones for two reasons.
First, it is enabled by default and a bit must be set to *DISABLE* it.
This is the opposite of normal mitigation polarity. This means GDS can
be mitigated simply by updating microcode and leaving the new control
bit alone.
Second, GDS has a "lock" bit. This lock bit is there because the
mitigation affects the hardware security features KeyLocker and SGX.
It needs to be enabled and *STAY* enabled for these features to be
mitigated against GDS.
The mitigation is enabled in the microcode by default. Disable it by
setting gather_data_sampling=off or by disabling all mitigations with
mitigations=off. The mitigation status can be checked by reading:
/sys/devices/system/cpu/vulnerabilities/gather_data_sampling
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Add a fix for the Zen2 VZEROUPPER data corruption bug where under
certain circumstances executing VZEROUPPER can cause register
corruption or leak data.
The optimal fix is through microcode but in the case the proper
microcode revision has not been applied, enable a fallback fix using
a chicken bit.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
The cmpxchg128() family of functions is basically & functionally
the same as cmpxchg_double(), but with a saner interface: instead
of a 6-parameter horror that forced u128 - u64/u64-halves layout
details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128 types.
- Restructure the generated atomic headers, and add
kerneldoc comments for all of the generic atomic{,64,_long}_t
operations. Generated definitions are much cleaner now,
and come with documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering
when taking multiple locks of the same type. This gets rid of
one use of lockdep_set_novalidate_class() in the bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
variable shadowing generating garbage code on Clang on certain
ARM builds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
x9Nt+Tp0Ze4=
=DsYj
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
used in
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSa9coACgkQEsHwGGHe
VUoKuA/6ApbRnAYvnwv6/o6Jnw9Jafzv1RpN6zIH9sjBX/4ghap5OI03tOO2maag
qQRnYZrjx5jCIctc/i4XS/is51b0Gnv6Bu6uesvuC8xynYnS0jIO4ONqvous/3Jj
7BsduLzFptvmwQV4jf17hv3OD4CqzMDKyKFDIX7zBeq6xdc66AqB+ba+oBmvNVDI
wHCkESdPnzSsjqsSQvfhxTasbBVV/exBpQst6oPT1WBiDscxgV/ArMsF7ZgzQjuF
+WHmyc6KfEgY41iPxBJ6FkUOGBM0fpJl2QmkgS3+WAHxFF3QDK/oKrwo+OPtih8P
Lec3y/JUncvfaKHcmqNEkI4GCfaZEOqQaP4/MUluJasetbzeLrDr+NEbtBPEXSTi
cUfKMdwoaALoGE0YJL8wIr9TxfgmU9kONstQIm9Crl3XJ4e/zHMyRtSoMMv4vOCx
E6amvwaBMsEbs/BRPjP/5f5aYfVI2e81b6QFOlXIjhjogN5AYlBms9sY+CfCw7Fm
LBP1h6OH6FZrAKfDHNSpLNLFllplmN5sVImrFotSdtCJe37WGbMXNEULJJklp7dZ
rEJCdxC0B66YOiYhbSIxQOUcEI9db56qXRfDHq6udXEovenGSV2Ke8SUTyq/QIDP
YWdUKuyxGtW6eKVMo4ineOQTrZ5e5kNpeSt2NOinSiCOl4RAUbs=
=haGW
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SGX update from Borislav Petkov:
- A fix to avoid using a list iterator variable after the loop it is
used in
* tag 'x86_sgx_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Avoid using iterator after loop in sgx_mmu_notifier_release()
mechanism in order to look up the memory type of a region easily. Also
address memory range lookup issues like returning an invalid memory
type. Furthermore, this handles the decoupling of PAT from MTRR more
naturally. All work by Juergen Gross
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSazOIACgkQEsHwGGHe
VUqltQ/8D1oA4LrgnbFO25J/27U/MwKo7ZI3hN6/OkH2FfdBgqeOlOV4TnndDL88
l/UrzOfWJQxpVLTO3SLMtDla0VrT24B4HZ4hvzDEdJZ8f1DLZ+gLN7sKOMjoIcO9
fvBZ5+/gFtVxSquwZmWvM0qKiCkKxmznJfpOx1/lt9UtKyKmpPSMVdrpqOeufL7k
xxWqGRh2s104ZKwfMOj4dgvCVK9ZUlsPqqiARzkqc0bCg7SeIyPea/S2eljhTl15
BTOA/wW/lcVQ9yWmDD8inzxrZI4EHEohEaNMfof3AqFyYCOU4RzvE9tpAFEK3GXp
NilxYkZ+JbEljq2QiEt0Ll8XEVKedi7YC1oN3ciiy9RS6+rWSPIvuMFV9tgPRjr1
AbWYmDoiLz+5ePI+0fckStRRntWKiao+hOaXb5RbEcg+85hkDHZZC7b0tCAUvnh7
OwuQfbzAqipn2G1hg+LThHDSjI4qHfHJlpeuPcsAxWef1diJbe15StdVWm+ttRE0
MTXSn3J9qT9MoY5y6m4KSybp0c1nSFlCK/ZkNvzwWHmkAG6M7wuFmBn3pVzEaCew
fneGZcX9Ija4MY8Ygajp8GI1aQ4mBNif+uVE7UUY17hH9qAf8vI8Joqs+4L35u8h
SZl/IqJO9ziEmVLdy9ajgm1xW04AFE1RYRfa6aH6K6tRaIoh8bE=
=Dmx5
-----END PGP SIGNATURE-----
Merge tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mtrr updates from Borislav Petkov:
"A serious scrubbing of the MTRR code including adding a new map
mechanism in order to look up the memory type of a region easily.
Also address memory range lookup issues like returning an invalid
memory type. Furthermore, this handles the decoupling of PAT from MTRR
more naturally.
All work by Juergen Gross"
* tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Set default memory type for PV guests to WB
x86/mtrr: Unify debugging printing
x86/mtrr: Remove unused code
x86/mm: Only check uniform after calling mtrr_type_lookup()
x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
x86/mtrr: Use new cache_map in mtrr_type_lookup()
x86/mtrr: Add mtrr=debug command line option
x86/mtrr: Construct a memory map with cache modes
x86/mtrr: Add get_effective_type() service function
x86/mtrr: Allocate mtrr_value array dynamically
x86/mtrr: Move 32-bit code from mtrr.c to legacy.c
x86/mtrr: Have only one set_mtrr() variant
x86/mtrr: Replace vendor tests in MTRR code
x86/xen: Set MTRR state when running as Xen PV initial domain
x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest
x86/mtrr: Support setting MTRR state for software defined MTRRs
x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept
x86/mtrr: Remove physical address size calculation
the early loading procedure
- Cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSap2cACgkQEsHwGGHe
VUobhA/7B78dVDsm4yoOwx3TiPP1Md/i9vazbR6fyZ0EJQBMqPeujtmuKAd6L0hQ
u+LU3j5FN4MBmdy3elqpTKW0NXTQkPdu/Syc9gwIEMtrsBj/9XbgjiHz70FgkPog
nTkWyIqxiHC0krlXLQfD3yIeDWqSKMZbpIE33ckSdJs6xCvh78uc8cbXGjpG1+dl
QWPjxXNWUVtz5eUqI52tQK2DN5jxzUZVb5/G4kB8icQUFLj2ji+5JB/4zJkgq59r
nNNZ7E9kDAifPH5qS+NRBgoCrZw51l9EBClRQb2xKXaejloMVT+rK9rQrOStEgct
gIEXRdLJeCDzW8OGrDM+FzPZqHx+IEpHduBDoOqyDLxgTpPSxbaeT3zJiAeh/+ox
BDDh2+OrHrMRTSkEGUoILVz5Hr/UwYM7FQkscuweex9mJWKBnXC4jJggXA93w+Ej
USJloYEyMl0sOkMKpzrNsfACl2rpH29Yefg1CJ6aJiX6lkbqSNntyHo8XIrB8jFG
0a3r79kRhIG6AflZ5PK2DGr/KKVBi70K61+9h1vEeSNWjr31eOfyCS82T9ukcs4b
Gmmj5lKHcJMjoD6IGjV7CZA/F6SPXwZjdoz4Py5GzDDf4uVfvr0P92h2KUhULAP6
GlfkSX9D5pLYF0Q0TEllKAzBBFf/0GTYmXq5Pz7BAxr8+srxBu8=
=2i55
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader updates from Borislav Petkov:
- Load late on both SMT threads on AMD, just like it is being done in
the early loading procedure
- Cleanups
* tag 'x86_microcode_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Load late on both threads too
x86/microcode/amd: Remove unneeded pointer arithmetic
x86/microcode/AMD: Get rid of __find_equiv_id()
and assert __x86_return_thunk's alignment so that future changes to
the symbol macros do not accidentally break them.
- Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
pointless
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ1wgACgkQEsHwGGHe
VUrXlRAAhIonFM1suIHo6w085jY5YA1XnsziJr/bT3e16FdHrF1i3RBEX4ml0m3O
ADwa9dMsC9UJIa+/TKRNFfQvfRcLE/rsUKlS1Rluf/IRIxuSt/Oa4bFQHGXFRwnV
eSlnWTNiaWrRs/vJEYAnMOe98oRyElHWa9kZ7K5FC+Ksfn/WO1U1RQ2NWg2A2wkN
8MHJiS41w2piOrLU/nfUoI7+esHgHNlib222LoptDGHuaY8V2kBugFooxAEnTwS3
PCzWUqCTgahs393vbx6JimoIqgJDa7bVdUMB0kOUHxtpbBiNdYYVy6e7UKnV1yjB
qP3v9jQW4+xIyRmlFiErJXEZx7DjAIP5nulGRrUMzRfWEGF8mdRZ+ugGqFMHCeC8
vXI+Ixp2vvsfhG3N/algsJUdkjlpt3hBpElRZCfR08M253KAbAmUNMOr4sx4RPi5
ymC+pLIHd1K0G9jiZaFnOMaY71gAzWizwxwjFKLQMo44q+lpNJvsVO00cr+9RBYj
LQL2APkONVEzHPMYR/LrXCslYaW//DrfLdRQjNbzUTonxFadkTO2Eu8J90B/5SFZ
CqC1NYKMQPVFeg4XuGWCgZEH+jokCGhl8vvmXClAMcOEOZt0/s4H89EKFkmziyon
L1ZrA/U72gWV8EwD7GLtuFJmnV4Ayl/hlek2j0qNKaj6UUgTFg8=
=LcUq
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Compute the purposeful misalignment of zen_untrain_ret automatically
and assert __x86_return_thunk's alignment so that future changes to
the symbol macros do not accidentally break them.
- Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
pointless
* tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retbleed: Add __x86_return_thunk alignment checks
x86/cpu: Remove X86_FEATURE_NAMES
x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
of application containers with dynamically changing task lists
- When reading the tasks file, show the tasks' pid which are only in
the current namespace as opposed to showing the pids from the init
namespace too
- Other fixes and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZzAwACgkQEsHwGGHe
VUqp6g//dJ3OMAj8q0g9TO5M9caPGtY67tP488dvIhcemvRwwsSFr/qiXHB35l0c
sCbmXVvF0lhaGbEIE94VZyKN7GXpvSKof29lJ2zaB/cgc4qbkm0wzkDA6e3j11xT
OXB8R/cU4FXm1nQ0irT9Bf8w4KrpWr8f3SVbLQkGsc9+vYaSMZHbFIvZ1RmDaFBU
T7WtpmRgfr97updmd4QkkBsHfIUNK/4HamGVBUKsdYX/seYuffRLKHzuiBqr7kDr
WKqdR3iVOqbLFQbH2QIXuAL9+29Z2lfMVUD04e0I62TU6KCckrdObBQ67WgXGAgG
GzCsbsNf1So7nIQmaMZSbR3OeuifXOzQPlFPtIh52SSVyafl1I66nw1tkTsMAqkd
waqaB2dSLFHTND8hE7pUHdz84RFaGoE9/O6JiSxt0qkXQIycJY6hBthLxw76XQe7
HnUqyL/0t7H5FT7TEwbQ26cNGFghA87x4fCc2AolIrZFK/chtPnvyFz3oTvSyLW2
J1YhdJ+mcid70cvGmhe9w9OjFI1e4O22l1uc9jJyTPwWPk6/IxKeT9mx+bJusT7d
mHvglK60L/x9NQHeV7FZIM9NXKhePLGi84aaE+Ly8ZOhoWcRmJTuEN/CGq+Qyuks
KmDhvGIfd4GKRxOwMuKHQEn8WfyRvX5YDvhU24V2Zzb8I3nyFQQ=
=nu66
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Implement a rename operation in resctrlfs to facilitate handling of
application containers with dynamically changing task lists
- When reading the tasks file, show the tasks' pid which are only in
the current namespace as opposed to showing the pids from the init
namespace too
- Other fixes and improvements
* tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Documentation for MON group move feature
x86/resctrl: Implement rename op for mon groups
x86/resctrl: Factor rdtgroup lock for multi-file ops
x86/resctrl: Only show tasks' pid in current pid namespace
Those GPUs and CPUs are connected together through the coherent fabric
and the GPU memory controllers report errors through x86's MCA so EDAC
needs to support them. The amd64_edac driver supports now HBM (High
Bandwidth Memory) and thus such heterogeneous memory controller
systems
- Other small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZiUwACgkQEsHwGGHe
VUphSQ/+JLXTAQ06CNos98MR8iCGdThVujhWt1pBIgjhQFJuf4JlEEtKs9htjbud
9HZvgnGbHahRoO8pMCB0jwtz0ATrPbaOvz4BofVp3SIRiR5jMI0tfmyl8iSrnA3Q
m5pbMh6uiIAlH8aPqQXret2iwp7JXOjnBWksgbmUWkI7d2qseKu98ikXyC4QoCaD
AGRJJ6OCA3P85rdT9qabOuXh6yoELOPKw3j243s22sTLiqn+EuoTE+QX5ZjrQ8Ts
DyXN/pYI/vGVP7sECkWf7PsEf1BkL6m5KeXDB4Ij2YJesQnBlBZQdAcxdGdY8z3M
f/qpLdrYvpcLHQy42Jm5VnnISOvMvAl8YWqCEyUmBjXcLwSPNIKHN9LQuznhnQHr
vssRVqQUg1J+/UWAoIzHdrAQ6zvgv1xlX2dG2YOw3t1WMDnMhztW3eoQv04etD3d
fqQH3MrkGHI4qeq1Mice1Gz+NWQG/PXVhgBzbTBDDCiRJkg1Dhxce1OMRUiM4tUW
0JABoU+KS0RZAKXAwine6v5duYmwK36Vl1SSCCWjqFMeR7XMwWWHA9d7t8+wdT1l
KBIEiRTcRnXaZXyLUPSPRbEF5ALS25RgWVPCA3ibuSUnJjGU7Z7/rbwlQryAefVB
nqjATed0zat4fbL9bvnDuOKQEzkuySvUWpU+Eozxbct6oRu5ms0=
=Vcif
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Add initial support for RAS hardware found on AMD server GPUs (MI200).
Those GPUs and CPUs are connected together through the coherent
fabric and the GPU memory controllers report errors through x86's MCA
so EDAC needs to support them. The amd64_edac driver supports now HBM
(High Bandwidth Memory) and thus such heterogeneous memory controller
systems
- Other small cleanups and improvements
* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/amd64: Cache and use GPU node map
EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
EDAC/amd64: Document heterogeneous system enumeration
x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
x86/amd_nb: Re-sort and re-indent PCI defines
x86/amd_nb: Add MI200 PCI IDs
ras/debugfs: Fix error checking for debugfs_create_dir()
x86/MCE: Check a hw error's address to determine proper recovery action
- Parallel CPU bringup
The reason why people are interested in parallel bringup is to shorten
the (kexec) reboot time of cloud servers to reduce the downtime of the
VM tenants.
The current fully serialized bringup does the following per AP:
1) Prepare callbacks (allocate, intialize, create threads)
2) Kick the AP alive (e.g. INIT/SIPI on x86)
3) Wait for the AP to report alive state
4) Let the AP continue through the atomic bringup
5) Let the AP run the threaded bringup to full online state
There are two significant delays:
#3 The time for an AP to report alive state in start_secondary() on
x86 has been measured in the range between 350us and 3.5ms
depending on vendor and CPU type, BIOS microcode size etc.
#4 The atomic bringup does the microcode update. This has been
measured to take up to ~8ms on the primary threads depending on
the microcode patch size to apply.
On a two socket SKL server with 56 cores (112 threads) the boot CPU
spends on current mainline about 800ms busy waiting for the APs to come
up and apply microcode. That's more than 80% of the actual onlining
procedure.
This can be reduced significantly by splitting the bringup mechanism
into two parts:
1) Run the prepare callbacks and kick the AP alive for each AP which
needs to be brought up.
The APs wake up, do their firmware initialization and run the low
level kernel startup code including microcode loading in parallel
up to the first synchronization point. (#1 and #2 above)
2) Run the rest of the bringup code strictly serialized per CPU
(#3 - #5 above) as it's done today.
Parallelizing that stage of the CPU bringup might be possible in
theory, but it's questionable whether required surgery would be
justified for a pretty small gain.
If the system is large enough the first AP is already waiting at the
first synchronization point when the boot CPU finished the wake-up of
the last AP. That reduces the AP bringup time on that SKL from ~800ms
to ~80ms, i.e. by a factor ~10x.
The actual gain varies wildly depending on the system, CPU, microcode
patch size and other factors. There are some opportunities to reduce
the overhead further, but that needs some deep surgery in the x86 CPU
bringup code.
For now this is only enabled on x86, but the core functionality
obviously works for all SMP capable architectures.
- Enhancements for SMP function call tracing so it is possible to locate
the scheduling and the actual execution points. That allows to measure
IPI delivery time precisely.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
TRiSzvssbYYmaw==
=Y8if
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP updates from Thomas Gleixner:
"A large update for SMP management:
- Parallel CPU bringup
The reason why people are interested in parallel bringup is to
shorten the (kexec) reboot time of cloud servers to reduce the
downtime of the VM tenants.
The current fully serialized bringup does the following per AP:
1) Prepare callbacks (allocate, intialize, create threads)
2) Kick the AP alive (e.g. INIT/SIPI on x86)
3) Wait for the AP to report alive state
4) Let the AP continue through the atomic bringup
5) Let the AP run the threaded bringup to full online state
There are two significant delays:
#3 The time for an AP to report alive state in start_secondary()
on x86 has been measured in the range between 350us and 3.5ms
depending on vendor and CPU type, BIOS microcode size etc.
#4 The atomic bringup does the microcode update. This has been
measured to take up to ~8ms on the primary threads depending
on the microcode patch size to apply.
On a two socket SKL server with 56 cores (112 threads) the boot CPU
spends on current mainline about 800ms busy waiting for the APs to
come up and apply microcode. That's more than 80% of the actual
onlining procedure.
This can be reduced significantly by splitting the bringup
mechanism into two parts:
1) Run the prepare callbacks and kick the AP alive for each AP
which needs to be brought up.
The APs wake up, do their firmware initialization and run the
low level kernel startup code including microcode loading in
parallel up to the first synchronization point. (#1 and #2
above)
2) Run the rest of the bringup code strictly serialized per CPU
(#3 - #5 above) as it's done today.
Parallelizing that stage of the CPU bringup might be possible
in theory, but it's questionable whether required surgery
would be justified for a pretty small gain.
If the system is large enough the first AP is already waiting at
the first synchronization point when the boot CPU finished the
wake-up of the last AP. That reduces the AP bringup time on that
SKL from ~800ms to ~80ms, i.e. by a factor ~10x.
The actual gain varies wildly depending on the system, CPU,
microcode patch size and other factors. There are some
opportunities to reduce the overhead further, but that needs some
deep surgery in the x86 CPU bringup code.
For now this is only enabled on x86, but the core functionality
obviously works for all SMP capable architectures.
- Enhancements for SMP function call tracing so it is possible to
locate the scheduling and the actual execution points. That allows
to measure IPI delivery time precisely"
* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
trace,smp: Add tracepoints for scheduling remotelly called functions
trace,smp: Add tracepoints around remotelly called functions
MAINTAINERS: Add CPU HOTPLUG entry
x86/smpboot: Fix the parallel bringup decision
x86/realmode: Make stack lock work in trampoline_compat()
x86/smp: Initialize cpu_primary_thread_mask late
cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
x86/smpboot: Support parallel startup of secondary CPUs
x86/smpboot: Implement a bit spinlock to protect the realmode stack
x86/apic: Save the APIC virtual base address
cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
x86/apic: Provide cpu_primary_thread mask
x86/smpboot: Enable split CPU startup
cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
cpu/hotplug: Reset task stack state in _cpu_up()
cpu/hotplug: Remove unused state functions
riscv: Switch to hotplug core state synchronization
parisc: Switch to hotplug core state synchronization
...
- Initialize FPU late.
Right now FPU is initialized very early during boot. There is no real
requirement to do so. The only requirement is to have it done before
alternatives are patched.
That's done in check_bugs() which does way more than what the function
name suggests.
So first rename check_bugs() to arch_cpu_finalize_init() which makes it
clear what this is about.
Move the invocation of arch_cpu_finalize_init() earlier in
start_kernel() as it has to be done before fork_init() which needs to
know the FPU register buffer size.
With those prerequisites the FPU initialization can be moved into
arch_cpu_finalize_init(), which removes it from the early and fragile
part of the x86 bringup.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZdNYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaNBEACWtVd1uhqQldIFgSvZYujsrWXlmkU+
pok6gDzKQNwZADiXW/tn5fP8SBLWT0pgLM9d+oZ5mEaLaOW7HcZLEHcVrn74e3TT
53xN8e1zCzyjCJ/x22vrKH4sn/bU+bQyzSNVu9Disqn9Fl+ts37FqAHDv/ExbneD
DaYXXCLgQsyGbPLD8B7yGOpJTGBUTJxNQS1ZFElBaRsAaw0mYZOEoPvuTFK4o7Uz
GUB2vGefmeNfX+EgLYKG9QoS0F3SMS9X2IYswy1H76ZnV/eXmTsA1S3u3X9yX7kC
XBnPtCC+iX+7o3xFkTpa0oQUdzEyGOItExZZgce6jEQu4Fl7NoIJxhlMg9/Y+vcF
ntipEKSWFLAi1GkZzeKRwSSsoWqRaFxOKLy8qhn9kud09k+UtMBkNrF1CSp9laAz
QParu3B1oHPEzx/jS0bSOCMN+AQZH8rX7LxRp4kpBOeBSZNCnfaBUzfIvmccPls+
EJTO/0JUpRm5LsPSDiJhypPRoOOIP26IloR6OoZTcI3p76NrnYblRvisvuFAgDU6
bk7Belf+GDx0kBZugqQgok7nDaHIBR7vEmca1NV8507UrffVyxLAiI4CiWPcFdOq
ovhO8K+gP4xvzZx4cXZBwYwusjvl/oxKy8yQiGgoftDiWU4sdUCSrwX3x27+hUYL
2P1OLDOXSGwESQ==
=yxMj
-----END PGP SIGNATURE-----
Merge tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Thomas Gleixner:
"Initialize FPU late.
Right now FPU is initialized very early during boot. There is no real
requirement to do so. The only requirement is to have it done before
alternatives are patched.
That's done in check_bugs() which does way more than what the function
name suggests.
So first rename check_bugs() to arch_cpu_finalize_init() which makes
it clear what this is about.
Move the invocation of arch_cpu_finalize_init() earlier in
start_kernel() as it has to be done before fork_init() which needs to
know the FPU register buffer size.
With those prerequisites the FPU initialization can be moved into
arch_cpu_finalize_init(), which removes it from the early and fragile
part of the x86 bringup"
* tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
x86/fpu: Mark init functions __init
x86/fpu: Remove cpuinfo argument from init functions
x86/init: Initialize signal frame size late
init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
init: Invoke arch_cpu_finalize_init() earlier
init: Remove check_bugs() leftovers
um/cpu: Switch to arch_cpu_finalize_init()
sparc/cpu: Switch to arch_cpu_finalize_init()
sh/cpu: Switch to arch_cpu_finalize_init()
mips/cpu: Switch to arch_cpu_finalize_init()
m68k/cpu: Switch to arch_cpu_finalize_init()
loongarch/cpu: Switch to arch_cpu_finalize_init()
ia64/cpu: Switch to arch_cpu_finalize_init()
ARM: cpu: Switch to arch_cpu_finalize_init()
x86/cpu: Switch to arch_cpu_finalize_init()
init: Provide arch_cpu_finalize_init()
Initializing the FPU during the early boot process is a pointless
exercise. Early boot is convoluted and fragile enough.
Nothing requires that the FPU is set up early. It has to be initialized
before fork_init() because the task_struct size depends on the FPU register
buffer size.
Move the initialization to arch_cpu_finalize_init() which is the perfect
place to do so.
No functional change.
This allows to remove quite some of the custom early command line parsing,
but that's subject to the next installment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.902376621@linutronix.de
No point in doing this during really early boot. Move it to an early
initcall so that it is set up before possible user mode helpers are started
during device initialization.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.727330699@linutronix.de
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and
remove the weak fallback from the core code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.670360645@linutronix.de
check_bugs() is a dumping ground for finalizing the CPU bringup. Only parts of
it has to do with actual CPU bugs.
Split it apart into arch_cpu_finalize_init() and cpu_select_mitigations().
Fixup the bogus 32bit comments while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613224545.019583869@linutronix.de
If &encl_mm->encl->mm_list does not contain the searched 'encl_mm',
'tmp' will not point to a valid sgx_encl_mm struct.
Linus proposed to avoid any use of the list iterator variable after the
loop, in the attempt to move the list iterator variable declaration into
the macro to avoid any potential misuse after the loop. Using it in
a pointer comparison after the loop is undefined behavior and should be
omitted if possible, see Link tag.
Instead, just use a 'found' boolean to indicate if an element was found.
[ bp: Massage, fix typos. ]
Signed-off-by: Jakob Koschel <jkl820.git@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Link: https://lore.kernel.org/r/20230206-sgx-use-after-iter-v2-1-736ca621adc3@gmail.com
Patch series "remove the vmas parameter from GUP APIs", v6.
(pin_/get)_user_pages[_remote]() each provide an optional output parameter
for an array of VMA objects associated with each page in the input range.
These provide the means for VMAs to be returned, as long as mm->mmap_lock
is never released during the GUP operation (i.e. the internal flag
FOLL_UNLOCKABLE is not specified).
In addition, these VMAs can only be accessed with the mmap_lock held and
become invalidated the moment it is released.
The vast majority of invocations do not use this functionality and of
those that do, all but one case retrieve a single VMA to perform checks
upon.
It is not egregious in the single VMA cases to simply replace the
operation with a vma_lookup(). In these cases we duplicate the (fast)
lookup on a slow path already under the mmap_lock, abstracted to a new
get_user_page_vma_remote() inline helper function which also performs
error checking and reference count maintenance.
The special case is io_uring, where io_pin_pages() specifically needs to
assert that the VMAs underlying the range do not result in broken
long-term GUP file-backed mappings.
As GUP now internally asserts that FOLL_LONGTERM mappings are not
file-backed in a broken fashion (i.e. requiring dirty tracking) - as
implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to
file-backed mappings" - this logic is no longer required and so we can
simply remove it altogether from io_uring.
Eliminating the vmas parameter eliminates an entire class of danging
pointer errors that might have occured should the lock have been
incorrectly released.
In addition, the API is simplified and now clearly expresses what it is
intended for - applying the specified GUP flags and (if pinning) returning
pinned pages.
This change additionally opens the door to further potential improvements
in GUP and the possible marrying of disparate code paths.
I have run this series against gup_test with no issues.
Thanks to Matthew Wilcox for suggesting this refactoring!
This patch (of 6):
No invocation of get_user_pages() use the vmas parameter, so remove it.
The GUP API is confusing and caveated. Recent changes have done much to
improve that, however there is more we can do. Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.
Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.
This is part of a patch series aiming to remove the vmas parameter
altogether.
Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts)
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sean Christopherson <seanjc@google.com> (KVM)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To change the resources allocated to a large group of tasks, such as an
application container, a container manager must write all of the tasks'
IDs into the tasks file interface of the new control group. This is
challenging when the container's task list is always changing.
In addition, if the container manager is using monitoring groups to
separately track the bandwidth of containers assigned to the same
control group, when moving a container, it must first move the
container's tasks to the default monitoring group of the new control
group before it can move these tasks into the container's replacement
monitoring group under the destination control group. This is
undesirable because it makes bandwidth usage during the move
unattributable to the correct tasks and resets monitoring event counters
and cache usage information for the group.
Implement the rename operation only for resctrlfs monitor groups to
enable users to move a monitoring group from one control group to
another. This effects a change in resources allocated to all the tasks
in the monitoring group while otherwise leaving the monitoring data
intact.
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com
rdtgroup_kn_lock_live() can only release a kernfs reference for a single
file before waiting on the rdtgroup_mutex, limiting its usefulness for
operations on multiple files, such as rename.
Factor the work needed to respectively break and unbreak active
protection on an individual file into rdtgroup_kn_{get,put}().
No functional change.
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20230419125015.693566-2-peternewman@google.com
The MI200 (Aldebaran) series of devices introduced a new SMCA bank type
for Unified Memory Controllers. The MCE subsystem already has support
for this new type. The MCE decoder module will decode the common MCA
error information for the new bank type, but it will not pass the
information to the AMD64 EDAC module for detailed memory error decoding.
Have the MCE decoder module recognize the new bank type as an SMCA UMC
memory error and pass the MCA information to AMD64 EDAC.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515113537.1052146-3-muralimk@amd.com