mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-23 20:24:12 +08:00
Some late arriving documentation changes. In particular, this contains the
conversion of the x86 docs to RST, which has been in the works for some
time but needed a couple of final tweaks.
-----BEGIN PGP SIGNATURE-----

iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAlzVlVoPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YPWgH/1z+HO4QiLZ72kVxLf2U5r6FAo4CtQYLymL/
GiDabC7Jt7hobXdFQmDXhFnLOR/ibMnawJw2JAgWXDo33KenKGbE2OiW8ecsebSb
hd1F3pU6P3gVTYItcuM8dZ6/0C/F98/J/O3O3sOhZ0Uup2WPxW5XdNOp7LjFQScc
ENkgm2C5trs1wGjVswXWztGxSTcYrF7ehhjpWsFr9MUnUOI6ghvXX1akN3cEo7eo
7D8nvG2/HWOkf9Oq87/1uQxF6lERRqOQE+HN1J80XUsNTV5Hn40RP40FeebVv1rr
1GjUu+mKk/5uV+OlRWFqLbt10cU4+TKKfNTqfEchHyDOMpJD+S0=
=hfly
-----END PGP SIGNATURE-----

Merge tag 'docs-5.2a' of git://git.lwn.net/linux

Pull more documentation updates from Jonathan Corbet:
 "Some late arriving documentation changes. In particular, this contains
  the conversion of the x86 docs to RST, which has been in the works for
  some time but needed a couple of final tweaks"

* tag 'docs-5.2a' of git://git.lwn.net/linux: (29 commits)
  Documentation: x86: convert x86_64/machinecheck to reST
  Documentation: x86: convert x86_64/cpu-hotplug-spec to reST
  Documentation: x86: convert x86_64/fake-numa-for-cpusets to reST
  Documentation: x86: convert x86_64/5level-paging.txt to reST
  Documentation: x86: convert x86_64/mm.txt to reST
  Documentation: x86: convert x86_64/uefi.txt to reST
  Documentation: x86: convert x86_64/boot-options.txt to reST
  Documentation: x86: convert i386/IO-APIC.txt to reST
  Documentation: x86: convert usb-legacy-support.txt to reST
  Documentation: x86: convert orc-unwinder.txt to reST
  Documentation: x86: convert resctrl_ui.txt to reST
  Documentation: x86: convert microcode.txt to reST
  Documentation: x86: convert pti.txt to reST
  Documentation: x86: convert amd-memory-encryption.txt to reST
  Documentation: x86: convert intel_mpx.txt to reST
  Documentation: x86: convert protection-keys.txt to reST
  Documentation: x86: convert pat.txt to reST
  Documentation: x86: convert mtrr.txt to reST
  Documentation: x86: convert tlb.txt to reST
  Documentation: x86: convert zero-page.txt to reST
  ...
commit 1fb3b526df
@ -112,6 +112,7 @@ implementation.
.. toctree::
   :maxdepth: 2

   x86/index
   sh/index

Filesystem Documentation

@ -1915,7 +1915,10 @@ The following commonly-used handler.action pairs are available:
|
||||
|
||||
The 'matching.event' specification is simply the fully qualified
|
||||
event name of the event that matches the target event for the
|
||||
onmatch() functionality, in the form 'system.event_name'.
|
||||
onmatch() functionality, in the form 'system.event_name'. Histogram
|
||||
keys of both events are compared to find if events match. In case
|
||||
multiple histogram keys are used, they all must match in the specified
|
||||
order.
|
||||
|
||||
Finally, the number and type of variables/fields in the 'param
|
||||
list' must match the number and types of the fields in the
|
||||
@ -1978,9 +1981,9 @@ The following commonly-used handler.action pairs are available:
|
||||
/sys/kernel/debug/tracing/events/sched/sched_waking/trigger
|
||||
|
||||
Then, when the corresponding thread is actually scheduled onto the
|
||||
CPU by a sched_switch event, calculate the latency and use that
|
||||
along with another variable and an event field to generate a
|
||||
wakeup_latency synthetic event::
|
||||
CPU by a sched_switch event (saved_pid matches next_pid), calculate
|
||||
the latency and use that along with another variable and an event field
|
||||
to generate a wakeup_latency synthetic event::
|
||||
|
||||
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:\
|
||||
onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,\
|
||||
|
@ -249,13 +249,13 @@ essere categorizzate in:
|
||||
|
||||
|
|
||||
|
||||
2. Licenze non raccomandate:
|
||||
2. Licenze deprecate:
|
||||
|
||||
Questo tipo di licenze dovrebbero essere usate solo per codice già esistente
|
||||
o quando si prende codice da altri progetti. Le licenze sono disponibili
|
||||
nei sorgenti del kernel nella cartella::
|
||||
|
||||
LICENSES/other/
|
||||
LICENSES/deprecated/
|
||||
|
||||
I file in questa cartella contengono il testo completo della licenza e i
|
||||
`Metatag`_. Il nome di questi file è lo stesso usato come identificatore
|
||||
@ -263,14 +263,14 @@ essere categorizzate in:
|
||||
|
||||
Esempi::
|
||||
|
||||
LICENSES/other/ISC
|
||||
LICENSES/deprecated/ISC
|
||||
|
||||
Contiene il testo della licenza Internet System Consortium e i suoi
|
||||
metatag::
|
||||
|
||||
LICENSES/other/ZLib
|
||||
LICENSES/deprecated/GPL-1.0
|
||||
|
||||
Contiene il testo della licenza ZLIB e i suoi metatag.
|
||||
Contiene il testo della versione 1 della licenza GPL e i suoi metatag.
|
||||
|
||||
Metatag:
|
||||
|
||||
@ -294,7 +294,55 @@ essere categorizzate in:
|
||||
|
||||
|
|
||||
|
||||
3. _`Eccezioni`:
|
||||
3. Solo per doppie licenze
|
||||
|
||||
Queste licenze dovrebbero essere usate solamente per codice licenziato in
|
||||
combinazione con un'altra licenza che solitamente è quella preferita.
|
||||
Queste licenze sono disponibili nei sorgenti del kernel nella cartella::
|
||||
|
||||
LICENSES/dual
|
||||
|
||||
I file in questa cartella contengono il testo completo della rispettiva
|
||||
licenza e i suoi `Metatags`_. I nomi dei file sono identici agli
|
||||
identificatori di licenza SPDX che dovrebbero essere usati nei file
|
||||
sorgenti.
|
||||
|
||||
Esempi::
|
||||
|
||||
LICENSES/dual/MPL-1.1
|
||||
|
||||
Questo file contiene il testo della versione 1.1 della licenza *Mozilla
|
||||
Public License* e i metatag necessari::
|
||||
|
||||
LICENSES/dual/Apache-2.0
|
||||
|
||||
Questo file contiene il testo della versione 2.0 della licenza Apache e i
|
||||
metatag necessari.
|
||||
|
||||
Metatag:
|
||||
|
||||
I requisiti per le 'altre' ('*other*') licenze sono identici a quelli per le
|
||||
`Licenze raccomandate`_.
|
||||
|
||||
Esempio del formato del file::
|
||||
|
||||
Valid-License-Identifier: MPL-1.1
|
||||
SPDX-URL: https://spdx.org/licenses/MPL-1.1.html
|
||||
Usage-Guide:
|
||||
Do NOT use. The MPL-1.1 is not GPL2 compatible. It may only be used for
|
||||
dual-licensed files where the other license is GPL2 compatible.
|
||||
If you end up using this it MUST be used together with a GPL2 compatible
|
||||
license using "OR".
|
||||
To use the Mozilla Public License version 1.1 put the following SPDX
|
||||
tag/value pair into a comment according to the placement guidelines in
|
||||
the licensing rules documentation:
|
||||
SPDX-License-Identifier: MPL-1.1
|
||||
License-Text:
|
||||
Full license text
|
||||
|
||||
|
|
||||
|
||||
4. _`Eccezioni`:
|
||||
|
||||
Alcune licenze possono essere corrette con delle eccezioni che forniscono
|
||||
diritti aggiuntivi. Queste eccezioni sono disponibili nei sorgenti del
|
||||
|
@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=====================
|
||||
AMD Memory Encryption
|
||||
=====================
|
||||
|
||||
Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are
|
||||
features found on AMD processors.
|
||||
|
||||
@ -34,7 +40,7 @@ is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware
|
||||
forces the memory encryption bit to 1.
|
||||
|
||||
Support for SME and SEV can be determined through the CPUID instruction. The
|
||||
CPUID function 0x8000001f reports information related to SME:
|
||||
CPUID function 0x8000001f reports information related to SME::
|
||||
|
||||
0x8000001f[eax]:
|
||||
Bit[0] indicates support for SME
|
||||
@ -48,14 +54,14 @@ CPUID function 0x8000001f reports information related to SME:
|
||||
addresses)
|
||||
|
||||
If support for SME is present, MSR 0xc0010010 (MSR_K8_SYSCFG) can be used to
|
||||
determine if SME is enabled and/or to enable memory encryption:
|
||||
determine if SME is enabled and/or to enable memory encryption::
|
||||
|
||||
0xc0010010:
|
||||
Bit[23] 0 = memory encryption features are disabled
|
||||
1 = memory encryption features are enabled
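
The two checks described above (CPUID 0x8000001f EAX bit 0 for SME support, MSR 0xc0010010 bit 23 for SME being enabled) can be sketched as plain bit tests. This is a minimal illustration that assumes the raw register values have already been read by some other means; the function names are hypothetical and do not come from the kernel sources.

```python
# Illustrative sketch only: decodes register values that were read elsewhere.
MSR_K8_SYSCFG = 0xC0010010      # MSR address given in the text
SME_SUPPORT_BIT = 0             # CPUID 0x8000001f, EAX bit 0
SYSCFG_MEM_ENCRYPT_BIT = 23     # MSR_K8_SYSCFG bit 23


def sme_supported(cpuid_8000001f_eax):
    """Return True if CPUID 0x8000001f EAX reports SME support."""
    return bool((cpuid_8000001f_eax >> SME_SUPPORT_BIT) & 1)


def sme_enabled(syscfg_value):
    """Return True if bit 23 of the MSR_K8_SYSCFG value is set."""
    return bool((syscfg_value >> SYSCFG_MEM_ENCRYPT_BIT) & 1)


if __name__ == "__main__":
    print(sme_supported(0x1), sme_enabled(1 << 23))
```

Keeping the decoding separate from the privileged register read makes the logic easy to test on any machine.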
|
||||
|
||||
If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if
|
||||
SEV is active:
|
||||
SEV is active::
|
||||
|
||||
0xc0010131:
|
||||
Bit[0] 0 = memory encryption is not active
|
||||
@ -68,6 +74,7 @@ requirements for the system. If this bit is not set upon Linux startup then
|
||||
Linux itself will not set it and memory encryption will not be possible.
|
||||
|
||||
The state of SME in the Linux kernel can be documented as follows:
|
||||
|
||||
- Supported:
|
||||
The CPU supports SME (determined through CPUID instruction).
|
||||
|
@ -1,5 +1,8 @@
|
||||
THE LINUX/x86 BOOT PROTOCOL
|
||||
---------------------------
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================
|
||||
The Linux/x86 Boot Protocol
|
||||
===========================
|
||||
|
||||
On the x86 platform, the Linux kernel uses a rather complicated boot
|
||||
convention. This has evolved partially due to historical aspects, as
|
||||
@ -10,65 +13,69 @@ real-mode DOS as a mainstream operating system.
|
||||
|
||||
Currently, the following versions of the Linux/x86 boot protocol exist.
|
||||
|
||||
Old kernels: zImage/Image support only. Some very early kernels
|
||||
============= ============================================================
|
||||
Old kernels zImage/Image support only. Some very early kernels
|
||||
may not even support a command line.
|
||||
|
||||
Protocol 2.00: (Kernel 1.3.73) Added bzImage and initrd support, as
|
||||
Protocol 2.00 (Kernel 1.3.73) Added bzImage and initrd support, as
|
||||
well as a formalized way to communicate between the
|
||||
boot loader and the kernel. setup.S made relocatable,
|
||||
although the traditional setup area still assumed
|
||||
writable.
|
||||
|
||||
Protocol 2.01: (Kernel 1.3.76) Added a heap overrun warning.
|
||||
Protocol 2.01 (Kernel 1.3.76) Added a heap overrun warning.
|
||||
|
||||
Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
|
||||
Protocol 2.02 (Kernel 2.4.0-test3-pre3) New command line protocol.
|
||||
Lower the conventional memory ceiling. No overwrite
|
||||
of the traditional setup area, thus making booting
|
||||
safe for systems which use the EBDA from SMM or 32-bit
|
||||
BIOS entry points. zImage deprecated but still
|
||||
supported.
|
||||
|
||||
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
Protocol 2.03 (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
initrd address available to the bootloader.
|
||||
|
||||
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
Protocol 2.04 (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
|
||||
Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable.
|
||||
Protocol 2.05 (Kernel 2.6.20) Make protected mode kernel relocatable.
|
||||
Introduce relocatable_kernel and kernel_alignment fields.
|
||||
|
||||
Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
|
||||
Protocol 2.06 (Kernel 2.6.22) Added a field that contains the size of
|
||||
the boot command line.
|
||||
|
||||
Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol.
|
||||
Protocol 2.07 (Kernel 2.6.24) Added paravirtualised boot protocol.
|
||||
Introduced hardware_subarch and hardware_subarch_data
|
||||
and KEEP_SEGMENTS flag in load_flags.
|
||||
|
||||
Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format
|
||||
Protocol 2.08 (Kernel 2.6.26) Added crc32 checksum and ELF format
|
||||
payload. Introduced payload_offset and payload_length
|
||||
fields to aid in locating the payload.
|
||||
|
||||
Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical
|
||||
Protocol 2.09 (Kernel 2.6.26) Added a field of 64-bit physical
|
||||
pointer to single linked list of struct setup_data.
|
||||
|
||||
Protocol 2.10: (Kernel 2.6.31) Added a protocol for relaxed alignment
|
||||
Protocol 2.10 (Kernel 2.6.31) Added a protocol for relaxed alignment
|
||||
beyond the kernel_alignment added, new init_size and
|
||||
pref_address fields. Added extended boot loader IDs.
|
||||
|
||||
Protocol 2.11: (Kernel 3.6) Added a field for offset of EFI handover
|
||||
Protocol 2.11 (Kernel 3.6) Added a field for offset of EFI handover
|
||||
protocol entry point.
|
||||
|
||||
Protocol 2.12: (Kernel 3.8) Added the xloadflags field and extension fields
|
||||
Protocol 2.12 (Kernel 3.8) Added the xloadflags field and extension fields
|
||||
to struct boot_params for loading bzImage and ramdisk
|
||||
above 4G in 64bit.
|
||||
|
||||
Protocol 2.13: (Kernel 3.14) Support 32- and 64-bit flags being set in
|
||||
Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in
|
||||
xloadflags to support booting a 64-bit kernel from 32-bit
|
||||
EFI
|
||||
============= ============================================================
|
||||
|
||||
**** MEMORY LAYOUT
|
||||
|
||||
Memory Layout
|
||||
=============
|
||||
|
||||
The traditional memory map for the kernel loader, used for Image or
|
||||
zImage kernels, typically looks like:
|
||||
zImage kernels, typically looks like::
|
||||
|
||||
| |
|
||||
0A0000 +------------------------+
|
||||
@ -92,7 +99,6 @@ zImage kernels, typically looks like:
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
|
||||
When using bzImage, the protected-mode kernel was relocated to
|
||||
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
|
||||
setup, and stack/heap) was made relocatable to any address between
|
||||
@ -116,7 +122,7 @@ zImage or old bzImage kernels, which need data written into the
|
||||
above the 0x9A000 point; too many BIOSes will break above that point.
|
||||
|
||||
For a modern bzImage kernel with boot protocol version >= 2.02, a
|
||||
memory layout like the following is suggested:
|
||||
memory layout like the following is suggested::
|
||||
|
||||
~ ~
|
||||
| Protected-mode kernel |
|
||||
@ -141,11 +147,11 @@ X +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
... where the address X is as low as the design of the boot loader
|
||||
permits.
|
||||
... where the address X is as low as the design of the boot loader permits.
|
||||
|
||||
|
||||
**** THE REAL-MODE KERNEL HEADER
|
||||
The Real-Mode Kernel Header
|
||||
===========================
|
||||
|
||||
In the following text, and anywhere in the kernel boot sequence, "a
|
||||
sector" refers to 512 bytes. It is independent of the actual sector
|
||||
@ -159,12 +165,12 @@ sectors (1K) and then examine the bootup sector size.
|
||||
|
||||
The header looks like:
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
01F1/1 ALL(1 setup_sects The size of the setup in sectors
|
||||
=========== ======== ===================== ============================================
|
||||
Offset/Size Proto Name Meaning
|
||||
=========== ======== ===================== ============================================
|
||||
01F1/1 ALL(1) setup_sects The size of the setup in sectors
|
||||
01F2/2 ALL root_flags If set, the root is mounted readonly
|
||||
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
|
||||
01F4/4 2.04+(2) syssize The size of the 32-bit code in 16-byte paras
|
||||
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
|
||||
01FA/2 ALL vid_mode Video mode control
|
||||
01FC/2 ALL root_dev Default root device number
|
||||
@ -183,8 +189,8 @@ Offset Proto Name Meaning
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only
|
||||
0224/2 2.01+ heap_end_ptr Free memory after setup end
|
||||
0226/1 2.02+(3 ext_loader_ver Extended boot loader version
|
||||
0227/1 2.02+(3 ext_loader_type Extended boot loader ID
|
||||
0226/1 2.02+(3) ext_loader_ver Extended boot loader version
|
||||
0227/1 2.02+(3) ext_loader_type Extended boot loader ID
|
||||
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
|
||||
022C/4 2.03+ initrd_addr_max Highest legal initrd address
|
||||
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
|
||||
@ -201,7 +207,9 @@ Offset Proto Name Meaning
|
||||
0258/8 2.10+ pref_address Preferred loading address
|
||||
0260/4 2.10+ init_size Linear memory required during initialization
|
||||
0264/4 2.11+ handover_offset Offset of handover entry point
|
||||
=========== ======== ===================== ============================================
|
||||
|
||||
.. note::
|
||||
(1) For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
|
||||
@ -213,7 +221,7 @@ Offset Proto Name Meaning
|
||||
|
||||
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
|
||||
the boot protocol version is "old". Loading an old kernel, the
|
||||
following parameters should be assumed:
|
||||
following parameters should be assumed::
|
||||
|
||||
Image type = zImage
|
||||
initrd not supported
|
||||
@ -225,7 +233,8 @@ setting fields in the header, you must make sure only to set fields
|
||||
supported by the protocol version in use.
|
||||
|
||||
|
||||
**** DETAILS OF HEADER FIELDS
|
||||
Details of Header Fields
|
||||
========================
|
||||
|
||||
For each field, some are information from the kernel to the bootloader
|
||||
("read"), some are expected to be filled out by the bootloader
|
||||
@ -239,106 +248,132 @@ boot loaders can ignore those fields.
|
||||
|
||||
The byte order of all fields is little-endian (this is x86, after all).
|
||||
|
||||
============ ===========
|
||||
Field name: setup_sects
|
||||
Type: read
|
||||
Offset/size: 0x1f1/1
|
||||
Protocol: ALL
|
||||
============ ===========
|
||||
|
||||
The size of the setup code in 512-byte sectors. If this field is
|
||||
0, the real value is 4. The real-mode code consists of the boot
|
||||
sector (always one 512-byte sector) plus the setup code.
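
As a minimal sketch of the rule above, the size of the real-mode code can be computed from setup_sects as follows. The helper name is hypothetical; the zero-means-four default and the extra boot sector come from the text.

```python
SECTOR_SIZE = 512  # the boot protocol always counts in 512-byte sectors


def real_mode_size(setup_sects):
    """Return the size in bytes of the boot sector plus the setup code."""
    if setup_sects == 0:
        setup_sects = 4  # backwards-compatibility default
    # One boot sector plus setup_sects sectors of setup code.
    return (setup_sects + 1) * SECTOR_SIZE


if __name__ == "__main__":
    print(hex(real_mode_size(0)))
```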
|
||||
|
||||
============ =================
|
||||
Field name: root_flags
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1f2/2
|
||||
Protocol: ALL
|
||||
============ =================
|
||||
|
||||
If this field is nonzero, the root defaults to readonly. The use of
|
||||
this field is deprecated; use the "ro" or "rw" options on the
|
||||
command line instead.
|
||||
|
||||
============ ===============================================
|
||||
Field name: syssize
|
||||
Type: read
|
||||
Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
|
||||
Protocol: 2.04+
|
||||
============ ===============================================
|
||||
|
||||
The size of the protected-mode code in units of 16-byte paragraphs.
|
||||
For protocol versions older than 2.04 this field is only two bytes
|
||||
wide, and therefore cannot be trusted for the size of a kernel if
|
||||
the LOAD_HIGH flag is set.
|
||||
|
||||
============ ===============
|
||||
Field name: ram_size
|
||||
Type: kernel internal
|
||||
Offset/size: 0x1f8/2
|
||||
Protocol: ALL
|
||||
============ ===============
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
============ ===================
|
||||
Field name: vid_mode
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x1fa/2
|
||||
============ ===================
|
||||
|
||||
Please see the section on SPECIAL COMMAND LINE OPTIONS.
|
||||
|
||||
============ =================
|
||||
Field name: root_dev
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1fc/2
|
||||
Protocol: ALL
|
||||
============ =================
|
||||
|
||||
The default root device number. The use of this field is
deprecated; use the "root=" option on the command line instead.
|
||||
|
||||
============ =========
|
||||
Field name: boot_flag
|
||||
Type: read
|
||||
Offset/size: 0x1fe/2
|
||||
Protocol: ALL
|
||||
============ =========
|
||||
|
||||
Contains 0xAA55. This is the closest thing old Linux kernels have
|
||||
to a magic number.
|
||||
|
||||
============ =======
|
||||
Field name: jump
|
||||
Type: read
|
||||
Offset/size: 0x200/2
|
||||
Protocol: 2.00+
|
||||
============ =======
|
||||
|
||||
Contains an x86 jump instruction, 0xEB followed by a signed offset
|
||||
relative to byte 0x202. This can be used to determine the size of
|
||||
the header.
|
||||
|
||||
============ =======
|
||||
Field name: header
|
||||
Type: read
|
||||
Offset/size: 0x202/4
|
||||
Protocol: 2.00+
|
||||
============ =======
|
||||
|
||||
Contains the magic number "HdrS" (0x53726448).
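
A boot loader might test for the magic number along these lines. This is a sketch assuming the start of the kernel image is available as a bytes object; the function name is hypothetical.

```python
import struct

HDRS_MAGIC = 0x53726448  # the bytes "HdrS" read as a little-endian u32
HEADER_OFFSET = 0x202    # offset of the "header" field


def has_hdrs_magic(image):
    """Return True if the image carries the "HdrS" magic at offset 0x202."""
    if len(image) < HEADER_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from("<I", image, HEADER_OFFSET)
    return magic == HDRS_MAGIC


if __name__ == "__main__":
    fake = bytearray(0x300)
    fake[HEADER_OFFSET:HEADER_OFFSET + 4] = b"HdrS"
    print(has_hdrs_magic(bytes(fake)))
```

If the check fails, the image has to be treated as an "old" protocol kernel, as described earlier in the document.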
|
||||
|
||||
============ =======
|
||||
Field name: version
|
||||
Type: read
|
||||
Offset/size: 0x206/2
|
||||
Protocol: 2.00+
|
||||
============ =======
|
||||
|
||||
Contains the boot protocol version, in (major << 8)+minor format,
|
||||
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
|
||||
10.17.
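
The (major << 8)+minor encoding can be unpacked with a shift and a mask. The helper names below are hypothetical; the two sample values are the ones quoted in the text.

```python
def decode_version(raw):
    """Split the 16-bit version field into (major, minor)."""
    return raw >> 8, raw & 0xFF


def format_version(raw):
    """Render the version field the way the protocol table writes it."""
    major, minor = decode_version(raw)
    return "%d.%02d" % (major, minor)


if __name__ == "__main__":
    print(format_version(0x0204), format_version(0x0A11))
```

Because the major number sits in the high byte, comparing raw values (for example `raw >= 0x0202`) orders versions correctly without decoding them.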
|
||||
|
||||
============ =================
|
||||
Field name: realmode_swtch
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x208/4
|
||||
Protocol: 2.00+
|
||||
============ =================
|
||||
|
||||
Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
============ =============
|
||||
Field name: start_sys_seg
|
||||
Type: read
|
||||
Offset/size: 0x20c/2
|
||||
Protocol: 2.00+
|
||||
============ =============
|
||||
|
||||
The load low segment (0x1000). Obsolete.
|
||||
|
||||
============ ==============
|
||||
Field name: kernel_version
|
||||
Type: read
|
||||
Offset/size: 0x20e/2
|
||||
Protocol: 2.00+
|
||||
============ ==============
|
||||
|
||||
If set to a nonzero value, contains a pointer to a NUL-terminated
|
||||
human-readable kernel version number string, less 0x200. This can
|
||||
@ -348,17 +383,19 @@ Protocol: 2.00+
|
||||
For example, if this value is set to 0x1c00, the kernel version
|
||||
number string can be found at offset 0x1e00 in the kernel file.
|
||||
This is a valid value if and only if the "setup_sects" field
|
||||
contains the value 15 or higher, as:
|
||||
contains the value 15 or higher, as::
|
||||
|
||||
0x1c00 < 15*0x200 (= 0x1e00) but
|
||||
0x1c00 >= 14*0x200 (= 0x1c00)
|
||||
|
||||
0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15.
|
||||
0x1c00 >> 9 = 14, so the minimum value for setup_sects is 15.
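
The arithmetic in this example can be sketched as a small validity check. The rule that kernel_version must be smaller than 0x200 * setup_sects is inferred here from the worked example (0x1c00 is accepted for 15 sectors and rejected for 14); the function name is hypothetical.

```python
def version_string_offset(kernel_version, setup_sects):
    """Return the file offset of the version string, or None if unusable."""
    if kernel_version == 0:
        return None                      # no version string provided
    if setup_sects == 0:
        setup_sects = 4                  # backwards-compatibility default
    if kernel_version >= setup_sects * 0x200:
        return None                      # pointer lies beyond the setup code
    return kernel_version + 0x200        # the field is stored "less 0x200"


if __name__ == "__main__":
    offset = version_string_offset(0x1C00, 15)
    print(hex(offset) if offset is not None else None)
```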
|
||||
|
||||
============ ==================
|
||||
Field name: type_of_loader
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x210/1
|
||||
Protocol: 2.00+
|
||||
============ ==================
|
||||
|
||||
If your boot loader has an assigned id (see table below), enter
|
||||
0xTV here, where T is an identifier for the boot loader and V is
|
||||
@ -369,7 +406,7 @@ Protocol: 2.00+
|
||||
Similarly, the ext_loader_ver field can be used to provide more than
|
||||
four bits for the bootloader version.
|
||||
|
||||
For example, for T = 0x15, V = 0x234, write:
|
||||
For example, for T = 0x15, V = 0x234, write::
|
||||
|
||||
type_of_loader <- 0xE4
|
||||
ext_loader_type <- 0x05
|
||||
@ -377,9 +414,12 @@ Protocol: 2.00+
|
||||
|
||||
Assigned boot loader ids (hexadecimal):
|
||||
|
||||
0 LILO (0x00 reserved for pre-2.00 bootloader)
|
||||
== =======================================
|
||||
0 LILO
|
||||
(0x00 reserved for pre-2.00 bootloader)
|
||||
1 Loadlin
|
||||
2 bootsect-loader (0x20, all other values reserved)
|
||||
2 bootsect-loader
|
||||
(0x20, all other values reserved)
|
||||
3 Syslinux
|
||||
4 Etherboot/gPXE/iPXE
|
||||
5 ELILO
|
||||
@ -393,52 +433,67 @@ Protocol: 2.00+
|
||||
E Extended (see ext_loader_type)
|
||||
F Special (0xFF = undefined)
|
||||
10 Reserved
|
||||
11 Minimal Linux Bootloader <http://sebastian-plotz.blogspot.de>
|
||||
11 Minimal Linux Bootloader
|
||||
<http://sebastian-plotz.blogspot.de>
|
||||
12 OVMF UEFI virtualization stack
|
||||
== =======================================
|
||||
|
||||
Please contact <hpa@zytor.com> if you need a bootloader ID
|
||||
value assigned.
|
||||
Please contact <hpa@zytor.com> if you need a bootloader ID value assigned.
|
||||
|
||||
============ ===================
|
||||
Field name: loadflags
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x211/1
|
||||
Protocol: 2.00+
|
||||
============ ===================
|
||||
|
||||
This field is a bitmask.
|
||||
|
||||
Bit 0 (read): LOADED_HIGH
|
||||
|
||||
- If 0, the protected-mode code is loaded at 0x10000.
|
||||
- If 1, the protected-mode code is loaded at 0x100000.
|
||||
|
||||
Bit 1 (kernel internal): KASLR_FLAG
|
||||
|
||||
- Used internally by the compressed kernel to communicate
|
||||
KASLR status to kernel proper.
|
||||
If 1, KASLR enabled.
|
||||
If 0, KASLR disabled.
|
||||
|
||||
- If 1, KASLR enabled.
|
||||
- If 0, KASLR disabled.
|
||||
|
||||
Bit 5 (write): QUIET_FLAG
|
||||
|
||||
- If 0, print early messages.
|
||||
- If 1, suppress early messages.
|
||||
|
||||
This requests the kernel (decompressor and early
kernel) not to write early messages that require
accessing the display hardware directly.
|
||||
|
||||
Bit 6 (write): KEEP_SEGMENTS
|
||||
|
||||
Protocol: 2.07+
|
||||
|
||||
- If 0, reload the segment registers in the 32bit entry point.
|
||||
- If 1, do not reload the segment registers in the 32bit entry point.
|
||||
|
||||
Assume that %cs %ds %ss %es are all set to flat segments with
|
||||
a base of 0 (or the equivalent for their environment).
|
||||
|
||||
Bit 7 (write): CAN_USE_HEAP
|
||||
|
||||
Set this bit to 1 to indicate that the value entered in the
|
||||
heap_end_ptr is valid. If this field is clear, some setup code
|
||||
functionality will be disabled.
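
The bit assignments listed above can be decoded with a small table. This is an illustrative sketch with hypothetical helper names; only the bit positions and the two load addresses are taken from the text.

```python
# Bit positions of the loadflags field, as listed above.
LOADFLAG_BITS = {
    "LOADED_HIGH": 0,
    "KASLR_FLAG": 1,
    "QUIET_FLAG": 5,
    "KEEP_SEGMENTS": 6,
    "CAN_USE_HEAP": 7,
}


def decode_loadflags(loadflags):
    """Return the names of the flags set in loadflags, lowest bit first."""
    return [name for name, bit in LOADFLAG_BITS.items()
            if loadflags & (1 << bit)]


def protected_mode_load_address(loadflags):
    """Return where the protected-mode code is loaded (LOADED_HIGH bit)."""
    if loadflags & (1 << LOADFLAG_BITS["LOADED_HIGH"]):
        return 0x100000
    return 0x10000


if __name__ == "__main__":
    print(decode_loadflags(0x81), hex(protected_mode_load_address(0x81)))
```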
|
||||
|
||||
|
||||
============ ===================
|
||||
Field name: setup_move_size
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x212/2
|
||||
Protocol: 2.00-2.01
|
||||
============ ===================
|
||||
|
||||
When using protocol 2.00 or 2.01, if the real mode kernel is not
|
||||
loaded at 0x90000, it gets moved there later in the loading
|
||||
@ -451,10 +506,12 @@ Protocol: 2.00-2.01
|
||||
This field can be ignored when the protocol is 2.02 or higher, or
|
||||
if the real-mode code is loaded at 0x90000.
|
||||
|
||||
============ ========================
|
||||
Field name: code32_start
|
||||
Type: modify (optional, reloc)
|
||||
Offset/size: 0x214/4
|
||||
Protocol: 2.00+
|
||||
============ ========================
|
||||
|
||||
The address to jump to in protected mode. This defaults to the load
|
||||
address of the kernel, and can be used by the boot loader to
|
||||
@ -462,47 +519,57 @@ Protocol: 2.00+
|
||||
|
||||
This field can be modified for two purposes:
|
||||
|
||||
1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
1. as a boot loader hook (see Advanced Boot Loader Hooks below.)
|
||||
|
||||
2. if a bootloader which does not install a hook loads a
|
||||
relocatable kernel at a nonstandard address it will have to modify
|
||||
this field to point to the load address.
|
||||
|
||||
============ ==================
|
||||
Field name: ramdisk_image
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x218/4
|
||||
Protocol: 2.00+
|
||||
============ ==================
|
||||
|
||||
The 32-bit linear address of the initial ramdisk or ramfs. Leave at
|
||||
zero if there is no initial ramdisk/ramfs.
|
||||
|
||||
============ ==================
|
||||
Field name: ramdisk_size
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x21c/4
|
||||
Protocol: 2.00+
|
||||
============ ==================
|
||||
|
||||
Size of the initial ramdisk or ramfs. Leave at zero if there is no
|
||||
initial ramdisk/ramfs.
|
||||
|
||||
============ ===============
|
||||
Field name: bootsect_kludge
|
||||
Type: kernel internal
|
||||
Offset/size: 0x220/4
|
||||
Protocol: 2.00+
|
||||
============ ===============
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
============ ==================
|
||||
Field name: heap_end_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x224/2
|
||||
Protocol: 2.01+
|
||||
============ ==================
|
||||
|
||||
Set this field to the offset (from the beginning of the real-mode
|
||||
code) of the end of the setup stack/heap, minus 0x0200.
|
||||
|
||||
============ ================
|
||||
Field name: ext_loader_ver
|
||||
Type: write (optional)
|
||||
Offset/size: 0x226/1
|
||||
Protocol: 2.02+
|
||||
============ ================
|
||||
|
||||
This field is used as an extension of the version number in the
|
||||
type_of_loader field. The total version number is considered to be
|
||||
@ -514,10 +581,12 @@ Protocol: 2.02+
|
||||
Kernels prior to 2.6.31 did not recognize this field, but it is safe
|
||||
to write for protocol version 2.02 or higher.
|
||||
|
||||
============ =====================================================
|
||||
Field name: ext_loader_type
|
||||
Type: write (obligatory if (type_of_loader & 0xf0) == 0xe0)
|
||||
Offset/size: 0x227/1
|
||||
Protocol: 2.02+
|
||||
============ =====================================================
|
||||
|
||||
This field is used as an extension of the type number in
|
||||
type_of_loader field. If the type in type_of_loader is 0xE, then
|
||||
@ -528,10 +597,12 @@ Protocol: 2.02+
|
||||
Kernels prior to 2.6.31 did not recognize this field, but it is safe
|
||||
to write for protocol version 2.02 or higher.
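
The way type_of_loader, ext_loader_type and ext_loader_ver fit together can be sketched as an encoder. The combining formulas are cut off by the hunk boundaries in this excerpt, so the sketch assumes the usual rule: version = (type_of_loader & 0x0f) + (ext_loader_ver << 4), and, when the type nibble is 0xE, type = ext_loader_type + 0x10. Those assumptions reproduce the T = 0x15, V = 0x234 example given earlier. The function name is hypothetical.

```python
def encode_loader_id(loader_type, version):
    """Return (type_of_loader, ext_loader_type, ext_loader_ver)."""
    if not 0 <= version <= 0xFFF:
        raise ValueError("version must fit in 12 bits")
    if loader_type in (0xE, 0xF):
        raise ValueError("0xE and 0xF are not ordinary loader ids")
    if 0 <= loader_type <= 0xD:
        type_nibble, ext_type = loader_type, 0
    elif 0x10 <= loader_type <= 0x10F:
        # Extended ids: 0xE in the type nibble, the remainder in ext_loader_type.
        type_nibble, ext_type = 0xE, loader_type - 0x10
    else:
        raise ValueError("loader type out of range")
    type_of_loader = (type_nibble << 4) | (version & 0x0F)
    ext_loader_ver = version >> 4
    return type_of_loader, ext_type, ext_loader_ver


if __name__ == "__main__":
    print([hex(v) for v in encode_loader_id(0x15, 0x234)])
```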
|
||||
|
||||
============ ==================
|
||||
Field name: cmd_line_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x228/4
|
||||
Protocol: 2.02+
|
||||
============ ==================
|
||||
|
||||
Set this field to the linear address of the kernel command line.
|
||||
The kernel command line can be located anywhere between the end of
|
||||
@ -544,10 +615,12 @@ Protocol: 2.02+
|
||||
zero, the kernel will assume that your boot loader does not support
|
||||
the 2.02+ protocol.
|
||||
|
||||
============ ===============
|
||||
Field name: initrd_addr_max
|
||||
Type: read
|
||||
Offset/size: 0x22c/4
|
||||
Protocol: 2.03+
|
||||
============ ===============
|
||||
|
||||
The maximum address that may be occupied by the initial
|
||||
ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this
|
||||
@ -556,10 +629,12 @@ Protocol: 2.03+
|
||||
your ramdisk is exactly 131072 bytes long and this field is
|
||||
0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
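
The ramdisk placement in this example follows from treating initrd_addr_max as the address of the last byte the ramdisk may occupy. A minimal sketch with a hypothetical function name:

```python
def highest_initrd_start(initrd_addr_max, ramdisk_size):
    """Return the highest start address that keeps the ramdisk in bounds."""
    # initrd_addr_max is the last usable byte, hence the +1.
    return initrd_addr_max + 1 - ramdisk_size


if __name__ == "__main__":
    print(hex(highest_initrd_start(0x37FFFFFF, 131072)))
```

A real loader may additionally round the result down to its own alignment requirement; that step is not shown here.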
|
||||
|
||||
============ ============================
|
||||
Field name: kernel_alignment
|
||||
Type: read/modify (reloc)
|
||||
Offset/size: 0x230/4
|
||||
Protocol: 2.05+ (read), 2.10+ (modify)
|
||||
============ ============================
|
||||
|
||||
Alignment unit required by the kernel (if relocatable_kernel is
|
||||
true.) A relocatable kernel that is loaded at an alignment
|
||||
@ -571,25 +646,29 @@ Protocol: 2.05+ (read), 2.10+ (modify)
|
||||
loader to modify this field to permit a lesser alignment. See the
|
||||
min_alignment and pref_address fields below.
|
||||
|
||||
============ ==================
|
||||
Field name: relocatable_kernel
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x234/1
|
||||
Protocol: 2.05+
|
||||
============ ==================
|
||||
|
||||
If this field is nonzero, the protected-mode part of the kernel can
|
||||
be loaded at any address that satisfies the kernel_alignment field.
|
||||
After loading, the boot loader must set the code32_start field to
|
||||
point to the loaded code, or to a boot loader hook.
|
||||
|
||||
============ =============
|
||||
Field name: min_alignment
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x235/1
|
||||
Protocol: 2.10+
|
||||
============ =============
|
||||
|
||||
This field, if nonzero, indicates as a power of two the minimum
|
||||
alignment required, as opposed to preferred, by the kernel to boot.
|
||||
If a boot loader makes use of this field, it should update the
|
||||
kernel_alignment field with the alignment unit desired; typically:
|
||||
kernel_alignment field with the alignment unit desired; typically::
|
||||
|
||||
kernel_alignment = 1 << min_alignment
|
||||
|
||||
@ -597,44 +676,56 @@ Protocol: 2.10+
|
||||
misaligned kernel. Therefore, a loader should typically try each
|
||||
power-of-two alignment from kernel_alignment down to this alignment.
|
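The fallback from kernel_alignment down to the minimum alignment might be sketched as follows. This is only an illustration under stated assumptions: the loader already has a candidate load address, kernel_alignment is a power of two, and the function name `largest_usable_alignment` is hypothetical.

```c
#include <stdint.h>

/*
 * Return the largest power-of-two alignment, between the preferred
 * kernel_alignment and the minimum (1 << min_alignment), that the
 * candidate address load_addr satisfies; return 0 if none does.
 * Assumes kernel_alignment is a power of two.
 */
static uint32_t largest_usable_alignment(uint64_t load_addr,
                                         uint32_t kernel_alignment,
                                         uint8_t min_alignment)
{
    uint32_t min;

    if (kernel_alignment == 0 || min_alignment > 31)
        return 0;

    /* A zero min_alignment means the field is unused: no fallback. */
    min = min_alignment ? (UINT32_C(1) << min_alignment) : kernel_alignment;

    /* Try each power of two from the preferred value downwards. */
    for (uint32_t a = kernel_alignment; a >= min; a >>= 1) {
        if ((load_addr & (a - 1)) == 0)
            return a;
    }
    return 0;
}
```

A loader that found a usable value smaller than the preferred one would then write it back into kernel_alignment, as described above.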
||||
|
||||
============ ==========
|
||||
Field name: xloadflags
|
||||
Type: read
|
||||
Offset/size: 0x236/2
|
||||
Protocol: 2.12+
|
||||
============ ==========
|
||||
|
||||
This field is a bitmask.
|
||||
|
||||
Bit 0 (read): XLF_KERNEL_64
|
||||
|
||||
- If 1, this kernel has the legacy 64-bit entry point at 0x200.
|
||||
|
||||
Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G
|
||||
|
||||
- If 1, kernel/boot_params/cmdline/ramdisk can be above 4G.
|
||||
|
||||
Bit 2 (read): XLF_EFI_HANDOVER_32
|
||||
|
||||
- If 1, the kernel supports the 32-bit EFI handoff entry point
|
||||
given at handover_offset.
|
||||
|
||||
Bit 3 (read): XLF_EFI_HANDOVER_64
|
||||
|
||||
- If 1, the kernel supports the 64-bit EFI handoff entry point
|
||||
given at handover_offset + 0x200.
|
||||
|
||||
Bit 4 (read): XLF_EFI_KEXEC
|
||||
|
||||
- If 1, the kernel supports kexec EFI boot with EFI runtime support.
|
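A boot loader might test the xloadflags bits along these lines. The macro names are taken from the list above; the two helper functions are hypothetical and only sketch how the bits relate to handover_offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the xloadflags description. */
#define XLF_KERNEL_64               (1u << 0)
#define XLF_CAN_BE_LOADED_ABOVE_4G  (1u << 1)
#define XLF_EFI_HANDOVER_32         (1u << 2)
#define XLF_EFI_HANDOVER_64         (1u << 3)
#define XLF_EFI_KEXEC               (1u << 4)

/* Hypothetical helper: true if `flag` is set in xloadflags. */
static bool xlf_has(uint16_t xloadflags, uint16_t flag)
{
    return (xloadflags & flag) != 0;
}

/*
 * Hypothetical helper: offset of the EFI handover entry point for the
 * requested bitness, or -1 if the kernel does not advertise it.
 */
static int64_t efi_handover_entry(uint16_t xloadflags,
                                  uint32_t handover_offset,
                                  bool want_64bit)
{
    if (want_64bit)
        return xlf_has(xloadflags, XLF_EFI_HANDOVER_64)
               ? (int64_t)handover_offset + 0x200 : -1;
    return xlf_has(xloadflags, XLF_EFI_HANDOVER_32)
           ? (int64_t)handover_offset : -1;
}
```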
||||
|
||||
|
||||
============ ============
|
||||
Field name: cmdline_size
|
||||
Type: read
|
||||
Offset/size: 0x238/4
|
||||
Protocol: 2.06+
|
||||
============ ============
|
||||
|
||||
The maximum size of the command line without the terminating
|
||||
zero. This means that the command line can contain at most
|
||||
cmdline_size characters. With protocol version 2.05 and earlier, the
|
||||
maximum size was 255.
|
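The length rule above could be checked as in the following sketch. It assumes the protocol version is encoded as 0x0206 for 2.06, the same encoding used in the `protocol >= 0x0200` test elsewhere in this document; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Maximum number of command line characters, excluding the
 * terminating zero. Before protocol 2.06 the limit was fixed at 255.
 */
static uint32_t cmdline_max_chars(uint16_t version, uint32_t cmdline_size)
{
    return version >= 0x0206 ? cmdline_size : 255;
}

/* True if `cmdline` fits within the limit for this kernel. */
static bool cmdline_fits(uint16_t version, uint32_t cmdline_size,
                         const char *cmdline)
{
    return strlen(cmdline) <= cmdline_max_chars(version, cmdline_size);
}
```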
||||
|
||||
============ ====================================
|
||||
Field name: hardware_subarch
|
||||
Type: write (optional, defaults to x86/PC)
|
||||
Offset/size: 0x23c/4
|
||||
Protocol: 2.07+
|
||||
============ ====================================
|
||||
|
||||
In a paravirtualized environment the hardware low level architectural
|
||||
pieces such as interrupt handling, page table handling, and
|
||||
@ -643,25 +734,31 @@ Protocol: 2.07+
|
||||
This field allows the bootloader to inform the kernel we are in one
|
||||
of those environments.
|
||||
|
||||
========== ==============================
|
||||
0x00000000 The default x86/PC environment
|
||||
0x00000001 lguest
|
||||
0x00000002 Xen
|
||||
0x00000003 Moorestown MID
|
||||
0x00000004 CE4100 TV Platform
|
||||
========== ==============================
|
||||
|
||||
============ =========================
|
||||
Field name: hardware_subarch_data
|
||||
Type: write (subarch-dependent)
|
||||
Offset/size: 0x240/8
|
||||
Protocol: 2.07+
|
||||
============ =========================
|
||||
|
||||
A pointer to data that is specific to the hardware subarch.
|
||||
This field is currently unused for the default x86/PC environment;
|
||||
do not modify.
|
||||
|
||||
============ ==============
|
||||
Field name: payload_offset
|
||||
Type: read
|
||||
Offset/size: 0x248/4
|
||||
Protocol: 2.08+
|
||||
============ ==============
|
||||
|
||||
If non-zero then this field contains the offset from the beginning
|
||||
of the protected-mode code to the payload.
|
||||
@ -674,22 +771,26 @@ Protocol: 2.08+
|
||||
02 21). The uncompressed payload is currently always ELF (magic
|
||||
number 7F 45 4C 46).
|
||||
|
||||
============ ==============
|
||||
Field name: payload_length
|
||||
Type: read
|
||||
Offset/size: 0x24c/4
|
||||
Protocol: 2.08+
|
||||
============ ==============
|
||||
|
||||
The length of the payload.
|
||||
|
||||
============ ===============
|
||||
Field name: setup_data
|
||||
Type: write (special)
|
||||
Offset/size: 0x250/8
|
||||
Protocol: 2.09+
|
||||
============ ===============
|
||||
|
||||
The 64-bit physical pointer to a NULL-terminated singly linked list of
|
||||
struct setup_data. This is used to define a more extensible boot
|
||||
parameters passing mechanism. The definition of struct setup_data is
|
||||
as follows:
|
||||
as follows::
|
||||
|
||||
struct setup_data {
|
||||
u64 next;
|
||||
@ -708,10 +809,12 @@ Protocol: 2.09+
|
||||
sure to consider the case where the linked list already contains
|
||||
entries.
|
||||
|
||||
============ ============
|
||||
Field name: pref_address
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x258/8
|
||||
Protocol: 2.10+
|
||||
============ ============
|
||||
|
||||
This field, if nonzero, represents a preferred load address for the
|
||||
kernel. A relocating bootloader should attempt to load at this
|
||||
@ -720,9 +823,11 @@ Protocol: 2.10+
|
||||
A non-relocatable kernel will unconditionally move itself and run
|
||||
at this address.
|
||||
|
||||
============ =======
|
||||
Field name: init_size
|
||||
Type: read
|
||||
Offset/size: 0x260/4
|
||||
============ =======
|
||||
|
||||
This field indicates the amount of linear contiguous memory starting
|
||||
at the kernel runtime start address that the kernel needs before it
|
||||
@ -731,16 +836,18 @@ Offset/size: 0x260/4
|
||||
be used by a relocating boot loader to help select a safe load
|
||||
address for the kernel.
|
||||
|
||||
The kernel runtime start address is determined by the following algorithm:
|
||||
The kernel runtime start address is determined by the following algorithm::
|
||||
|
||||
if (relocatable_kernel)
|
||||
runtime_start = align_up(load_address, kernel_alignment)
|
||||
else
|
||||
runtime_start = pref_address
|
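In C, the algorithm above might look like the following sketch. `align_up` is written out here under the assumption that kernel_alignment is a nonzero power of two; `init_region_fits` is a hypothetical helper showing how a relocating loader could combine the result with init_size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round value up to a multiple of alignment (a nonzero power of two). */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Direct transcription of the runtime start algorithm. */
static uint64_t kernel_runtime_start(bool relocatable_kernel,
                                     uint64_t load_address,
                                     uint32_t kernel_alignment,
                                     uint64_t pref_address)
{
    if (relocatable_kernel)
        return align_up(load_address, kernel_alignment);
    return pref_address;
}

/*
 * Hypothetical check: true if init_size bytes starting at
 * runtime_start end at or before region_end.
 */
static bool init_region_fits(uint64_t runtime_start, uint32_t init_size,
                             uint64_t region_end)
{
    return runtime_start <= region_end &&
           region_end - runtime_start >= init_size;
}
```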
||||
|
||||
============ ===============
|
||||
Field name: handover_offset
|
||||
Type: read
|
||||
Offset/size: 0x264/4
|
||||
============ ===============
|
||||
|
||||
This field is the offset from the beginning of the kernel image to
|
||||
the EFI handover protocol entry point. Boot loaders using the EFI
|
||||
@ -749,7 +856,8 @@ Offset/size: 0x264/4
|
||||
See EFI HANDOVER PROTOCOL below for more details.
|
||||
|
||||
|
||||
**** THE IMAGE CHECKSUM
|
||||
The Image Checksum
|
||||
==================
|
||||
|
||||
From boot protocol version 2.08 onwards the CRC-32 is calculated over
|
||||
the entire file using the characteristic polynomial 0x04C11DB7 and an
|
||||
@ -758,7 +866,8 @@ file; therefore the CRC of the file up to the limit specified in the
|
||||
syssize field of the header is always 0.
|
||||
|
||||
|
||||
**** THE KERNEL COMMAND LINE
|
||||
The Kernel Command Line
|
||||
=======================
|
||||
|
||||
The kernel command line has become an important way for the boot
|
||||
loader to communicate with the kernel. Some of its options are also
|
||||
@ -778,19 +887,20 @@ heap and 0xA0000.
|
||||
If the protocol version is *not* 2.02 or higher, the kernel
|
||||
command line is entered using the following protocol:
|
||||
|
||||
At offset 0x0020 (word), "cmd_line_magic", enter the magic
|
||||
- At offset 0x0020 (word), "cmd_line_magic", enter the magic
|
||||
number 0xA33F.
|
||||
|
||||
At offset 0x0022 (word), "cmd_line_offset", enter the offset
|
||||
- At offset 0x0022 (word), "cmd_line_offset", enter the offset
|
||||
of the kernel command line (relative to the start of the
|
||||
real-mode kernel).
|
||||
|
||||
The kernel command line *must* be within the memory region
|
||||
- The kernel command line *must* be within the memory region
|
||||
covered by setup_move_size, so you may need to adjust this
|
||||
field.
|
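For the pre-2.02 protocol described in the list above, filling in the two words could be sketched as follows. The offsets and the magic number come from the text; the function names, and the choice to write the words byte by byte in little-endian order, are assumptions of this sketch.

```c
#include <stdint.h>

#define OLD_CMDLINE_MAGIC 0xA33F  /* value for cmd_line_magic */

/* Store a 16-bit word in little-endian order, as x86 expects. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/*
 * Hypothetical helper: record the command line location using the
 * old protocol. real_mode_base points at the start of the real-mode
 * kernel; cmdline_offset is relative to that start.
 */
static void set_old_cmdline(uint8_t *real_mode_base, uint16_t cmdline_offset)
{
    put_le16(real_mode_base + 0x0020, OLD_CMDLINE_MAGIC); /* cmd_line_magic */
    put_le16(real_mode_base + 0x0022, cmdline_offset);    /* cmd_line_offset */
}
```

The caller remains responsible for keeping the command line inside the region covered by setup_move_size.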
||||
|
||||
|
||||
**** MEMORY LAYOUT OF THE REAL-MODE CODE
|
||||
Memory Layout of The Real-Mode Code
|
||||
===================================
|
||||
|
||||
The real-mode code requires a stack/heap to be set up, as well as
|
||||
memory allocated for the kernel command line. This needs to be done
|
||||
@ -806,7 +916,8 @@ segment has to be used:
|
||||
- When loading a zImage kernel ((loadflags & 0x01) == 0).
|
||||
- When loading a 2.01 or earlier boot protocol kernel.
|
||||
|
||||
-> For the 2.00 and 2.01 boot protocols, the real-mode code
|
||||
.. note::
|
||||
For the 2.00 and 2.01 boot protocols, the real-mode code
|
||||
can be loaded at another address, but it is internally
|
||||
relocated to 0x90000. For the "old" protocol, the
|
||||
real-mode code must be loaded at 0x90000.
|
||||
@ -822,24 +933,29 @@ The kernel command line should not be located below the real-mode
|
||||
code, nor should it be located in high memory.
|
||||
|
||||
|
||||
**** SAMPLE BOOT CONFIGURATION
|
||||
Sample Boot Configuration
|
||||
=========================
|
||||
|
||||
As a sample configuration, assume the following layout of the real
|
||||
mode segment:
|
||||
mode segment.
|
||||
|
||||
When loading below 0x90000, use the entire segment:
|
||||
|
||||
============= ===================
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0xdfff Stack and heap
|
||||
0xe000-0xffff Kernel command line
|
||||
============= ===================
|
||||
|
||||
When loading at 0x90000 OR the protocol version is 2.01 or earlier:
|
||||
|
||||
============= ===================
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0x97ff Stack and heap
|
||||
0x9800-0x9fff Kernel command line
|
||||
============= ===================
|
||||
|
||||
Such a boot loader should enter the following fields in the header:
|
||||
Such a boot loader should enter the following fields in the header::
|
||||
|
||||
unsigned long base_ptr; /* base address for real-mode segment */
|
||||
|
||||
@ -898,7 +1014,8 @@ Such a boot loader should enter the following fields in the header:
|
||||
}
|
||||
|
||||
|
||||
**** LOADING THE REST OF THE KERNEL
|
||||
Loading The Rest of The Kernel
|
||||
==============================
|
||||
|
||||
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
|
||||
in the kernel file (again, if setup_sects == 0 the real value is 4.)
|
||||
@ -906,7 +1023,7 @@ It should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
0x100000 for bzImage kernels.
|
||||
|
||||
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
|
||||
bit (LOAD_HIGH) in the loadflags field is set:
|
||||
bit (LOAD_HIGH) in the loadflags field is set::
|
||||
|
||||
is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
|
||||
load_address = is_bzImage ? 0x100000 : 0x10000;
|
||||
@ -916,8 +1033,8 @@ the entire 0x10000-0x90000 range of memory. This means it is pretty
|
||||
much a requirement for these kernels to load the real-mode part at
|
||||
0x90000. bzImage kernels allow much more flexibility.
|
||||
|
||||
|
||||
**** SPECIAL COMMAND LINE OPTIONS
|
||||
Special Command Line Options
|
||||
============================
|
||||
|
||||
If the command line provided by the boot loader is entered by the
|
||||
user, the user may expect the following command line options to work.
|
||||
@ -966,7 +1083,8 @@ or configuration-specified command line. Otherwise, "init=/bin/sh"
|
||||
gets confused by the "auto" option.
|
||||
|
||||
|
||||
**** RUNNING THE KERNEL
|
||||
Running the Kernel
|
||||
==================
|
||||
|
||||
The kernel is started by jumping to the kernel entry point, which is
|
||||
located at *segment* offset 0x20 from the start of the real mode
|
||||
@ -980,7 +1098,7 @@ interrupts should be disabled. Furthermore, to guard against bugs in
|
||||
the kernel, it is recommended that the boot loader sets fs = gs = ds =
|
||||
es = ss.
|
||||
|
||||
In our example from above, we would do:
|
||||
In our example from above, we would do::
|
||||
|
||||
/* Note: in the case of the "old" kernel protocol, base_ptr must
|
||||
be == 0x90000 at this point; see the previous sample code */
|
||||
@ -1003,7 +1121,8 @@ switched off, especially if the loaded kernel has the floppy driver as
|
||||
a demand-loaded module!
|
||||
|
||||
|
||||
**** ADVANCED BOOT LOADER HOOKS
|
||||
Advanced Boot Loader Hooks
|
||||
==========================
|
||||
|
||||
If the boot loader runs in a particularly hostile environment (such as
|
||||
LOADLIN, which runs under DOS) it may be impossible to follow the
|
||||
@ -1032,7 +1151,8 @@ IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
|
||||
(relocated, if appropriate.)
|
||||
|
||||
|
||||
**** 32-bit BOOT PROTOCOL
|
||||
32-bit Boot Protocol
|
||||
====================
|
||||
|
||||
For machines with firmware other than legacy BIOS, such as EFI,
|
||||
LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
|
||||
@ -1045,7 +1165,7 @@ traditionally known as "zero page"). The memory for struct boot_params
|
||||
should be allocated and initialized to all zero. Then the setup header
|
||||
from offset 0x01f1 of kernel image on should be loaded into struct
|
||||
boot_params and examined. The end of setup header can be calculated as
|
||||
follows:
|
||||
follows::
|
||||
|
||||
0x0202 + byte value at offset 0x0201
|
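The end-of-header calculation above can be expressed as a small sketch. It assumes `image` points at the start of the kernel image in memory and is at least 0x0202 bytes long; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SETUP_HEADER_START 0x01f1  /* offset of the setup header */

/* End offset of the setup header: 0x0202 + byte at offset 0x0201. */
static size_t setup_header_end(const uint8_t *image)
{
    return 0x0202 + (size_t)image[0x0201];
}

/* Number of header bytes to copy into struct boot_params. */
static size_t setup_header_len(const uint8_t *image)
{
    return setup_header_end(image) - SETUP_HEADER_START;
}
```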
||||
|
||||
@ -1069,7 +1189,8 @@ must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
|
||||
must be __BOOT_DS; interrupts must be disabled; %esi must hold the base
|
||||
address of the struct boot_params; %ebp, %edi and %ebx must be zero.
|
||||
|
||||
**** 64-bit BOOT PROTOCOL
|
||||
64-bit Boot Protocol
|
||||
====================
|
||||
|
||||
For machines with 64-bit CPUs and a 64-bit kernel, we could use a 64-bit bootloader
|
||||
and we need a 64-bit boot protocol.
|
||||
@ -1080,7 +1201,7 @@ traditionally known as "zero page"). The memory for struct boot_params
|
||||
could be allocated anywhere (even above 4G) and initialized to all zero.
|
||||
Then, the setup header at offset 0x01f1 of kernel image on should be
|
||||
loaded into struct boot_params and examined. The end of setup header
|
||||
can be calculated as follows:
|
||||
can be calculated as follows::
|
||||
|
||||
0x0202 + byte value at offset 0x0201
|
||||
|
||||
@ -1107,7 +1228,8 @@ must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
|
||||
must be __BOOT_DS; interrupts must be disabled; %rsi must hold the base
|
||||
address of the struct boot_params.
|
||||
|
||||
**** EFI HANDOVER PROTOCOL
|
||||
EFI Handover Protocol
|
||||
=====================
|
||||
|
||||
This protocol allows boot loaders to defer initialisation to the EFI
|
||||
boot stub. The boot loader is required to load the kernel/initrd(s)
|
||||
@ -1115,7 +1237,7 @@ from the boot media and jump to the EFI handover protocol entry point
|
||||
which is hdr->handover_offset bytes from the beginning of
|
||||
startup_{32,64}.
|
||||
|
||||
The function prototype for the handover entry point looks like this,
|
||||
The function prototype for the handover entry point looks like this::
|
||||
|
||||
efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
|
||||
|
||||
@ -1124,11 +1246,11 @@ firmware, 'table' is the EFI system table - these are the first two
|
||||
arguments of the "handoff state" as described in section 2.3 of the
|
||||
UEFI specification. 'bp' is the boot loader-allocated boot params.
|
||||
|
||||
The boot loader *must* fill out the following fields in bp,
|
||||
The boot loader *must* fill out the following fields in bp::
|
||||
|
||||
o hdr.code32_start
|
||||
o hdr.cmd_line_ptr
|
||||
o hdr.ramdisk_image (if applicable)
|
||||
o hdr.ramdisk_size (if applicable)
|
||||
- hdr.code32_start
|
||||
- hdr.cmd_line_ptr
|
||||
- hdr.ramdisk_image (if applicable)
|
||||
- hdr.ramdisk_size (if applicable)
|
||||
|
||||
All other fields should be zero.
|
@ -1,18 +1,24 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============
|
||||
Early Printk
|
||||
============
|
||||
|
||||
Mini-HOWTO for using the earlyprintk=dbgp boot option with a
|
||||
USB2 Debug port key and a debug cable, on x86 systems.
|
||||
|
||||
You need two computers, the 'USB debug key' special gadget
|
||||
and two USB cables, connected like this:
|
||||
and two USB cables, connected like this::
|
||||
|
||||
[host/target] <-------> [USB debug key] <-------> [client/console]
|
||||
|
||||
1. There are a number of specific hardware requirements:
|
||||
Hardware requirements
|
||||
=====================
|
||||
|
||||
a.) Host/target system needs to have USB debug port capability.
|
||||
a) Host/target system needs to have USB debug port capability.
|
||||
|
||||
You can check this capability by looking at a 'Debug port' bit in
|
||||
the lspci -vvv output:
|
||||
the lspci -vvv output::
|
||||
|
||||
# lspci -vvv
|
||||
...
|
||||
@ -32,20 +38,20 @@ and two USB cables, connected like this:
|
||||
Kernel modules: ehci-hcd
|
||||
...
|
||||
|
||||
( If your system does not list a debug port capability then you probably
|
||||
won't be able to use the USB debug key. )
|
||||
.. note::
|
||||
If your system does not list a debug port capability then you probably
|
||||
won't be able to use the USB debug key.
|
||||
|
||||
b.) You also need a NetChip USB debug cable/key:
|
||||
b) You also need a NetChip USB debug cable/key:
|
||||
|
||||
http://www.plxtech.com/products/NET2000/NET20DC/default.asp
|
||||
|
||||
This is a small blue plastic connector with two USB connections;
|
||||
it draws power from its USB connections.
|
||||
|
||||
c.) You need a second client/console system with a high speed USB 2.0
|
||||
port.
|
||||
c) You need a second client/console system with a high speed USB 2.0 port.
|
||||
|
||||
d.) The NetChip device must be plugged directly into the physical
|
||||
d) The NetChip device must be plugged directly into the physical
|
||||
debug port on the "host/target" system. You cannot use a USB hub in
|
||||
between the physical debug port and the "host/target" system.
|
||||
|
||||
@ -65,29 +71,31 @@ and two USB cables, connected like this:
|
||||
to the hardware vendor, because there is no reason not to wire
|
||||
this port into one of the physically accessible ports.
|
||||
|
||||
e.) It is also important to note, that many versions of the NetChip
|
||||
e) It is also important to note that many versions of the NetChip
|
||||
device require the "client/console" system to be plugged into the
|
||||
right hand side of the device (with the product logo facing up and
|
||||
readable left to right). The reason is that the 5 volt
|
||||
power supply is taken from only one side of the device and it
|
||||
must be the side that does not get rebooted.
|
||||
|
||||
2. Software requirements:
|
||||
Software requirements
|
||||
=====================
|
||||
|
||||
a.) On the host/target system:
|
||||
a) On the host/target system:
|
||||
|
||||
You need to enable the following kernel config option:
|
||||
You need to enable the following kernel config option::
|
||||
|
||||
CONFIG_EARLY_PRINTK_DBGP=y
|
||||
|
||||
And you need to add the boot command line: "earlyprintk=dbgp".
|
||||
|
||||
(If you are using Grub, append it to the 'kernel' line in
|
||||
.. note::
|
||||
If you are using Grub, append it to the 'kernel' line in
|
||||
/etc/grub.conf. If you are using Grub2 on a BIOS firmware system,
|
||||
append it to the 'linux' line in /boot/grub2/grub.cfg. If you are
|
||||
using Grub2 on an EFI firmware system, append it to the 'linux'
|
||||
or 'linuxefi' line in /boot/grub2/grub.cfg or
|
||||
/boot/efi/EFI/<distro>/grub.cfg.)
|
||||
/boot/efi/EFI/<distro>/grub.cfg.
|
||||
|
||||
On systems with more than one EHCI debug controller you must
|
||||
specify the correct EHCI debug controller number. The ordering
|
||||
@ -96,14 +104,15 @@ and two USB cables, connected like this:
|
||||
controller. To use the second EHCI debug controller, you would
|
||||
use the command line: "earlyprintk=dbgp1"
|
||||
|
||||
NOTE: normally earlyprintk console gets turned off once the
|
||||
.. note::
|
||||
Normally the earlyprintk console gets turned off once the
|
||||
regular console is alive - use "earlyprintk=dbgp,keep" to keep
|
||||
this channel open beyond early bootup. This can be useful for
|
||||
debugging crashes under Xorg, etc.
|
||||
|
||||
b.) On the client/console system:
|
||||
b) On the client/console system:
|
||||
|
||||
You should enable the following kernel config option:
|
||||
You should enable the following kernel config option::
|
||||
|
||||
CONFIG_USB_SERIAL_DEBUG=y
|
||||
|
||||
@ -115,22 +124,23 @@ and two USB cables, connected like this:
|
||||
it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to
|
||||
see the raw output.
|
||||
|
||||
c.) On Nvidia Southbridge based systems: the kernel will try to probe
|
||||
c) On Nvidia Southbridge based systems: the kernel will try to probe
|
||||
and find out which port has a debug device connected.
|
||||
|
||||
3. Testing that it works fine:
|
||||
Testing
|
||||
=======
|
||||
|
||||
You can test the output by using earlyprintk=dbgp,keep and provoking
|
||||
kernel messages on the host/target system. You can provoke a harmless
|
||||
kernel message by for example doing:
|
||||
kernel message by for example doing::
|
||||
|
||||
echo h > /proc/sysrq-trigger
|
||||
|
||||
On the host/target system you should see this help line in "dmesg" output:
|
||||
On the host/target system you should see this help line in "dmesg" output::
|
||||
|
||||
SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
|
||||
|
||||
On the client/console system do:
|
||||
On the client/console system do::
|
||||
|
||||
cat /dev/ttyUSB0
|
||||
|
@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============
|
||||
Kernel Entries
|
||||
==============
|
||||
|
||||
This file documents some of the kernel entries in
|
||||
arch/x86/entry/entry_64.S. A lot of this explanation is adapted from
|
||||
an email from Ingo Molnar:
|
||||
@ -59,7 +65,7 @@ Now, there's a secondary complication: there's a cheap way to test
|
||||
which mode the CPU is in and an expensive way.
|
||||
|
||||
The cheap way is to pick this info off the entry frame on the kernel
|
||||
stack, from the CS of the ptregs area of the kernel stack:
|
||||
stack, from the CS of the ptregs area of the kernel stack::
|
||||
|
||||
xorl %ebx,%ebx
|
||||
testl $3,CS+8(%rsp)
|
||||
@ -67,7 +73,7 @@ stack, from the CS of the ptregs area of the kernel stack:
|
||||
SWAPGS
|
||||
|
||||
The expensive (paranoid) way is to read back the MSR_GS_BASE value
|
||||
(which is what SWAPGS modifies):
|
||||
(which is what SWAPGS modifies)::
|
||||
|
||||
movl $1,%ebx
|
||||
movl $MSR_GS_BASE,%ecx
|
@ -1,4 +1,9 @@
|
||||
Kernel level exception handling in Linux
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============================
|
||||
Kernel level exception handling
|
||||
===============================
|
||||
|
||||
Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
|
||||
|
||||
When a process runs in kernel mode, it often has to access user
|
||||
@ -25,7 +30,7 @@ How does this work?
|
||||
|
||||
Whenever the kernel tries to access an address that is currently not
|
||||
accessible, the CPU generates a page fault exception and calls the
|
||||
page fault handler
|
||||
page fault handler::
|
||||
|
||||
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
|
||||
|
||||
@ -57,10 +62,11 @@ as an example. The definition is somewhat hard to follow, so let's peek at
|
||||
the code generated by the preprocessor and the compiler. I selected
|
||||
the get_user call in drivers/char/sysrq.c for a detailed examination.
|
||||
|
||||
The original code in sysrq.c line 587:
|
||||
The original code in sysrq.c line 587::
|
||||
|
||||
get_user(c, buf);
|
||||
|
||||
The preprocessor output (edited to become somewhat readable):
|
||||
The preprocessor output (edited to become somewhat readable)::
|
||||
|
||||
(
|
||||
{
|
||||
@ -123,7 +129,7 @@ The preprocessor output (edited to become somewhat readable):
|
||||
);
|
||||
|
||||
WOW! Black GCC/assembly magic. This is impossible to follow, so let's
|
||||
see what code gcc generates:
|
||||
see what code gcc generates::
|
||||
|
||||
> xorl %edx,%edx
|
||||
> movl current_set,%eax
|
||||
@ -154,7 +160,7 @@ understand. Can we? The actual user access is quite obvious. Thanks
|
||||
to the unified address space we can just access the address in user
|
||||
memory. But what does the .section stuff do?????
|
||||
|
||||
To understand this we have to look at the final kernel:
|
||||
To understand this we have to look at the final kernel::
|
||||
|
||||
> objdump --section-headers vmlinux
|
||||
>
|
||||
@ -181,7 +187,7 @@ To understand this we have to look at the final kernel:
|
||||
|
||||
There are obviously 2 non standard ELF sections in the generated object
|
||||
file. But first we want to find out what happened to our code in the
|
||||
final kernel executable:
|
||||
final kernel executable::
|
||||
|
||||
> objdump --disassemble --section=.text vmlinux
|
||||
>
|
||||
@ -199,7 +205,7 @@ final kernel executable:
|
||||
The whole user memory access is reduced to 10 x86 machine instructions.
|
||||
The instructions bracketed in the .section directives are no longer
|
||||
in the normal execution path. They are located in a different section
|
||||
of the executable file:
|
||||
of the executable file::
|
||||
|
||||
> objdump --disassemble --section=.fixup vmlinux
|
||||
>
|
||||
@ -207,14 +213,15 @@ of the executable file:
|
||||
> c0199ffa <.fixup+10ba> xorb %dl,%dl
|
||||
> c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
|
||||
|
||||
And finally:
|
||||
And finally::
|
||||
|
||||
> objdump --full-contents --section=__ex_table vmlinux
|
||||
>
|
||||
> c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
|
||||
> c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
|
||||
> c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
|
||||
|
||||
or in human readable byte order:
|
||||
or in human readable byte order::
|
||||
|
||||
> c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................
|
||||
> c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
|
||||
@ -222,18 +229,22 @@ or in human readable byte order:
|
||||
this is the interesting part!
|
||||
> c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................
|
||||
|
||||
What happened? The assembly directives
|
||||
What happened? The assembly directives::
|
||||
|
||||
.section .fixup,"ax"
|
||||
.section __ex_table,"a"
|
||||
|
||||
told the assembler to move the following code to the specified
|
||||
sections in the ELF object file. So the instructions
|
||||
sections in the ELF object file. So the instructions::
|
||||
|
||||
3: movl $-14,%eax
|
||||
xorb %dl,%dl
|
||||
jmp 2b
|
||||
ended up in the .fixup section of the object file and the addresses
|
||||
|
||||
ended up in the .fixup section of the object file and the addresses::
|
||||
|
||||
.long 1b,3b
|
||||
|
||||
ended up in the __ex_table section of the object file. 1b and 3b
|
||||
are local labels. The local label 1b (1b stands for next label 1
|
||||
backward) is the address of the instruction that might fault, i.e.
|
||||
@ -246,34 +257,38 @@ the fault, in our case the actual value is c0199ff5:
|
||||
the original assembly code: > 3: movl $-14,%eax
|
||||
and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
|
||||
|
||||
The assembly code
|
||||
The assembly code::
|
||||
|
||||
> .section __ex_table,"a"
|
||||
> .align 4
|
||||
> .long 1b,3b
|
||||
|
||||
becomes the value pair
|
||||
becomes the value pair::
|
||||
|
||||
> c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
|
||||
^this is ^this is
|
||||
1b 3b
|
||||
|
||||
c017e7a5,c0199ff5 in the exception table of the kernel.
|
||||
|
||||
So, what actually happens if a fault from kernel mode with no suitable
|
||||
vma occurs?
|
||||
|
||||
1.) access to invalid address:
|
||||
#. access to invalid address::
|
||||
|
||||
> c017e7a5 <do_con_write+e1> movb (%ebx),%dl
|
||||
2.) MMU generates exception
|
||||
3.) CPU calls do_page_fault
|
||||
4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
|
||||
5.) search_exception_table looks up the address c017e7a5 in the
|
||||
#. MMU generates exception
|
||||
#. CPU calls do_page_fault
|
||||
#. do_page_fault calls search_exception_table (regs->eip == c017e7a5);
|
||||
#. search_exception_table looks up the address c017e7a5 in the
|
||||
exception table (i.e. the contents of the ELF section __ex_table)
|
||||
and returns the address of the associated fault handle code c0199ff5.
|
||||
6.) do_page_fault modifies its own return address to point to the fault
|
||||
#. do_page_fault modifies its own return address to point to the fault
|
||||
handle code and returns.
|
||||
7.) execution continues in the fault handling code.
|
||||
8.) 8a) EAX becomes -EFAULT (== -14)
|
||||
8b) DL becomes zero (the value we "read" from user space)
|
||||
8c) execution continues at local label 2 (address of the
|
||||
#. execution continues in the fault handling code.
|
||||
#. a) EAX becomes -EFAULT (== -14)
|
||||
b) DL becomes zero (the value we "read" from user space)
|
||||
c) execution continues at local label 2 (address of the
|
||||
instruction immediately after the faulting user access).
|
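The lookup in the steps above can be illustrated with the address pairs from the objdump output earlier in this document. This is a sketch, not the kernel's search_exception_table: it assumes the table is sorted by faulting address (the dumped entries happen to be) and uses a plain binary search; the names `ex_entry` and `lookup_fixup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One (faulting instruction, fixup code) address pair. */
struct ex_entry {
    uint32_t insn;
    uint32_t fixup;
};

/* The pairs shown in the __ex_table dump, in human readable order. */
static const struct ex_entry ex_table[] = {
    { 0xc017c093, 0xc0199fe0 },
    { 0xc017c097, 0xc017c099 },
    { 0xc017c2f6, 0xc0199fe9 },
    { 0xc017e7a5, 0xc0199ff5 },  /* the get_user example */
    { 0xc0180a08, 0xc019a001 },
    { 0xc0180a0a, 0xc019a004 },
};

/* Binary search for addr; returns the fixup address, or 0 if absent. */
static uint32_t lookup_fixup(const struct ex_entry *table, size_t n,
                             uint32_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == addr)
            return table[mid].fixup;
        if (table[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

Looking up c017e7a5 returns c0199ff5, the address of the fixup code that sets EAX to -EFAULT.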
||||
|
||||
The steps 8a to 8c in a certain way emulate the faulting instruction.
|
||||
@ -295,14 +310,15 @@ Things changed when 64-bit support was added to x86 Linux. Rather than
|
||||
double the size of the exception table by expanding the two entries
|
||||
from 32 bits to 64 bits, a clever trick was used to store addresses
|
||||
as relative offsets from the table itself. The assembly code changed
|
||||
from:
|
||||
from::
|
||||
|
||||
.long 1b,3b
|
||||
to:
|
||||
.long (from) - .
|
||||
.long (to) - .
|
||||
|
||||
and the C-code that uses these values converts back to absolute addresses
|
||||
like this:
|
||||
like this::
|
||||
|
||||
ex_insn_addr(const struct exception_table_entry *x)
|
||||
{
|
||||
@ -313,15 +329,18 @@ In v4.6 the exception table entry was expanded with a new field "handler".
|
||||
This is also 32-bits wide and contains a third relative function
|
||||
pointer which points to one of:
|
||||
|
||||
1) int ex_handler_default(const struct exception_table_entry *fixup)
|
||||
1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
|
||||
This is the legacy case that just jumps to the fixup code.
|
||||
2) int ex_handler_fault(const struct exception_table_entry *fixup)
|
||||
|
||||
2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
|
||||
This case provides the fault number of the trap that occurred at
|
||||
entry->insn. It is used to distinguish page faults from machine
|
||||
check.
|
||||
3) int ex_handler_ext(const struct exception_table_entry *fixup)
|
||||
|
||||
3) ``int ex_handler_ext(const struct exception_table_entry *fixup)``
|
||||
This case is used for uaccess_err ... we need to set a flag
|
||||
in the task structure. Before the handler functions existed this
|
||||
case was handled by adding a large offset to the fixup to tag
|
||||
it as special.
|
||||
|
||||
More functions can easily be added.
|
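The relative-offset encoding described above (`.long (from) - .`) can be sketched in C as follows. Each 32-bit field stores the distance from its own address to the target, so converting back means adding the field's address to its value. The struct and function names here are hypothetical, and the sketch assumes every target lies within a signed 32-bit distance of the table.

```c
#include <stdint.h>

/* Hypothetical two-field entry holding self-relative offsets. */
struct rel_ex_entry {
    int32_t insn;
    int32_t fixup;
};

/* Encode: distance from the field itself to the target address. */
static int32_t to_relative(const int32_t *field, uintptr_t target)
{
    return (int32_t)(intptr_t)(target - (uintptr_t)field);
}

/* Decode: field address plus the signed offset stored in it. */
static uintptr_t to_absolute(const int32_t *field)
{
    return (uintptr_t)field + (uintptr_t)(intptr_t)*field;
}
```

Because the offset is relative to the field rather than to a fixed base, the table keeps 32-bit entries on a 64-bit kernel.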
@ -1,3 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======
|
||||
IO-APIC
|
||||
=======
|
||||
|
||||
:Author: Ingo Molnar <mingo@kernel.org>
|
||||
|
||||
Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
|
||||
which is an enhanced interrupt controller. It enables us to route
|
||||
hardware interrupts to multiple CPUs, or to CPU groups. Without an
|
||||
@ -13,9 +21,8 @@ usually worked around by the kernel. If your MP-compliant SMP board does
|
||||
not boot Linux, then consult the linux-smp mailing list archives first.
|
||||
|
||||
If your box boots fine with enabled IO-APIC IRQs, then your
|
||||
/proc/interrupts will look like this one:
|
||||
/proc/interrupts will look like this one::
|
||||
|
||||
---------------------------->
|
||||
hell:~> cat /proc/interrupts
|
||||
CPU0
|
||||
0: 1360293 IO-APIC-edge timer
|
||||
@ -28,7 +35,6 @@ If your box boots fine with enabled IO-APIC IRQs, then your
|
||||
NMI: 0
|
||||
ERR: 0
|
||||
hell:~>
|
||||
<----------------------------
|
||||
|
||||
Some interrupts are still listed as 'XT PIC', but this is not a problem;
|
||||
none of those IRQ sources is performance-critical.
|
||||
@ -37,14 +43,14 @@ none of those IRQ sources is performance-critical.
|
||||
In the unlikely case that your board does not create a working mp-table,
|
||||
you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
|
||||
is non-trivial though and cannot be automated. One sample /etc/lilo.conf
|
||||
entry:
|
||||
entry::
|
||||
|
||||
append="pirq=15,11,10"
|
||||
|
||||
The actual numbers depend on your system, on your PCI cards and on their
|
||||
PCI slot position. Usually PCI slots are 'daisy chained' before they are
|
||||
connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
|
||||
lines):
|
||||
lines)::
|
||||
|
||||
,-. ,-. ,-. ,-. ,-.
|
||||
PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| |
|
||||
@ -56,7 +62,7 @@ lines):
|
||||
PIRQ1 ----| |- `----| |- `----| |- `----| |--------| |
|
||||
`-' `-' `-' `-' `-'
|
||||
|
||||
Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:
|
||||
Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD::
|
||||
|
||||
,-.
|
||||
INTD--| |
|
||||
@ -78,19 +84,19 @@ to have non shared interrupts). Slot5 should be used for videocards, they
|
||||
do not use interrupts normally, thus they are not daisy chained either.
|
||||
|
||||
so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
|
||||
Slot2, then you'll have to specify this pirq= line:
|
||||
Slot2, then you'll have to specify this pirq= line::
|
||||
|
||||
append="pirq=11,9"
|
||||
|
||||
the following script tries to figure out such a default pirq= line from
|
||||
your PCI configuration:
|
||||
your PCI configuration::
|
||||
|
||||
echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
|
||||
|
||||
note that this script won't work if you have skipped a few slots or if your
|
||||
board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
|
||||
connected in some strange way). E.g. if in the above case you have your SCSI
|
||||
card (IRQ11) in Slot3, and have Slot1 empty:
|
||||
card (IRQ11) in Slot3, and have Slot1 empty::
|
||||
|
||||
append="pirq=0,9,11"
|
||||
|
||||
@ -105,7 +111,7 @@ won't function properly (e.g. if it's inserted as a module).
|
||||
If you have 2 PCI buses, then you can use up to 8 pirq values, although such
|
||||
boards tend to have a good configuration.
|
||||
|
||||
Be prepared that it might happen that you need some strange pirq line:
|
||||
Be prepared that it might happen that you need some strange pirq line::
|
||||
|
||||
append="pirq=0,0,0,0,0,0,9,11"
|
||||
|
||||
@ -115,5 +121,3 @@ Good luck and mail to linux-smp@vger.kernel.org or
|
||||
linux-kernel@vger.kernel.org if you have any problems that are not covered
|
||||
by this document.
|
||||
|
||||
-- mingo
|
||||
|
10
Documentation/x86/i386/index.rst
Normal file
@ -0,0 +1,10 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============
|
||||
i386 Support
|
||||
============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
IO-APIC
|
30
Documentation/x86/index.rst
Normal file
@ -0,0 +1,30 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
x86-specific Documentation
|
||||
==========================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:numbered:
|
||||
|
||||
boot
|
||||
topology
|
||||
exception-tables
|
||||
kernel-stacks
|
||||
entry_64
|
||||
earlyprintk
|
||||
orc-unwinder
|
||||
zero-page
|
||||
tlb
|
||||
mtrr
|
||||
pat
|
||||
protection-keys
|
||||
intel_mpx
|
||||
amd-memory-encryption
|
||||
pti
|
||||
microcode
|
||||
resctrl_ui
|
||||
usb-legacy-support
|
||||
i386/index
|
||||
x86_64/index
|
@ -1,5 +1,11 @@
|
||||
1. Intel(R) MPX Overview
|
||||
========================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================================
|
||||
Intel(R) Memory Protection Extensions (MPX)
|
||||
===========================================
|
||||
|
||||
Intel(R) MPX Overview
|
||||
=====================
|
||||
|
||||
Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
|
||||
introduced into Intel Architecture. Intel MPX provides hardware features
|
||||
@ -7,7 +13,7 @@ that can be used in conjunction with compiler changes to check memory
|
||||
references, for those references whose compile-time normal intentions are
|
||||
usurped at runtime due to buffer overflow or underflow.
|
||||
|
||||
You can tell if your CPU supports MPX by looking in /proc/cpuinfo:
|
||||
You can tell if your CPU supports MPX by looking in /proc/cpuinfo::
|
||||
|
||||
cat /proc/cpuinfo | grep ' mpx '
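The same check can be done from a C program once the flags line has been read. The sketch below is a hypothetical helper (`has_cpu_flag()` is not an existing API) that looks for a flag as a whole word. Unlike the grep pattern above, it also accepts the flag at the very start or end of the line.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if flag appears as a whole space-separated word in flags. */
static bool has_cpu_flag(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	assert(len > 0);
	while ((p = strstr(p, flag)) != NULL) {
		/* A match counts only if it is delimited on both sides. */
		bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += 1;	/* keep searching after a partial match */
	}
	return false;
}
```

A caller would read /proc/cpuinfo line by line and pass each line that begins with "flags" to this helper.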
|
||||
|
||||
@ -21,8 +27,8 @@ can be downloaded from
|
||||
http://software.intel.com/en-us/articles/intel-software-development-emulator
|
||||
|
||||
|
||||
2. How to get the advantage of MPX
|
||||
==================================
|
||||
How to get the advantage of MPX
|
||||
===============================
|
||||
|
||||
For MPX to work, changes are required in the kernel, binutils and compiler.
|
||||
No source changes are required for applications, just a recompile.
|
||||
@ -84,14 +90,15 @@ Kernel MPX Code:
|
||||
is unmapped.
|
||||
|
||||
|
||||
3. How does MPX kernel code work
|
||||
================================
|
||||
How does MPX kernel code work
|
||||
=============================
|
||||
|
||||
Handling #BR faults caused by MPX
|
||||
---------------------------------
|
||||
|
||||
When MPX is enabled, there are 2 new situations that can generate
|
||||
#BR faults.
|
||||
|
||||
* new bounds tables (BT) need to be allocated to save bounds.
|
||||
* bounds violation caused by MPX instructions.
|
||||
|
||||
@ -124,9 +131,9 @@ the kernel. It can theoretically be done completely from userspace. Here
|
||||
are a few ways this could be done. We don't think any of them are practical
|
||||
in the real-world, but here they are.
|
||||
|
||||
Q: Can virtual space simply be reserved for the bounds tables so that we
|
||||
:Q: Can virtual space simply be reserved for the bounds tables so that we
|
||||
never have to allocate them?
|
||||
A: MPX-enabled application will possibly create a lot of bounds tables in
|
||||
:A: MPX-enabled application will possibly create a lot of bounds tables in
|
||||
process address space to save bounds information. These tables can take
|
||||
up huge swaths of memory (as much as 80% of the memory on the system)
|
||||
even if we clean them up aggressively. In the worst-case scenario, the
|
||||
@ -140,19 +147,19 @@ A: MPX-enabled application will possibly create a lot of bounds tables in
|
||||
consumes 2GB of virtual *AND* physical memory. IOW, it's completely
|
||||
infeasible to prepopulate bounds directories.
|
||||
|
||||
Q: Can we preallocate bounds table space at the same time memory is
|
||||
:Q: Can we preallocate bounds table space at the same time memory is
|
||||
allocated which might contain pointers that might eventually need
|
||||
bounds tables?
|
||||
A: This would work if we could hook the site of each and every memory
|
||||
:A: This would work if we could hook the site of each and every memory
|
||||
allocation syscall. This can be done for small, constrained applications.
|
||||
But, it isn't practical at a larger scale since a given app has no
|
||||
way of controlling how all the parts of the app might allocate memory
|
||||
(think libraries). The kernel is really the only place to intercept
|
||||
these calls.
|
||||
|
||||
Q: Could a bounds fault be handed to userspace and the tables allocated
|
||||
:Q: Could a bounds fault be handed to userspace and the tables allocated
|
||||
there in a signal handler instead of in the kernel?
|
||||
A: mmap() is not on the list of safe async handler functions and even
|
||||
:A: mmap() is not on the list of safe async handler functions and even
|
||||
if mmap() would work it still requires locking or nasty tricks to
|
||||
keep track of the allocation state there.
|
||||
|
||||
@ -167,7 +174,7 @@ If a #BR is generated due to a bounds violation caused by MPX.
|
||||
We need to decode MPX instructions to get violation address and
|
||||
set this address into extended struct siginfo.
|
||||
|
||||
The _sigfault field of struct siginfo is extended as follow:
|
||||
The _sigfault field of struct siginfo is extended as follow::
|
||||
|
||||
87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
|
||||
88 struct {
|
||||
@ -209,6 +216,7 @@ Adding new prctl commands
|
||||
|
||||
Two new prctl commands are added to enable and disable MPX bounds tables
|
||||
management in kernel.
|
||||
::
|
||||
|
||||
155 #define PR_MPX_ENABLE_MANAGEMENT 43
|
||||
156 #define PR_MPX_DISABLE_MANAGEMENT 44
|
||||
@ -223,8 +231,8 @@ into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT
|
||||
command execution.
|
||||
|
||||
|
||||
4. Special rules
|
||||
================
|
||||
Special rules
|
||||
=============
|
||||
|
||||
1) If userspace is requesting help from the kernel to do the management
|
||||
of bounds tables, it may not create or modify entries in the bounds directory.
|
@ -1,5 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=============
|
||||
Kernel Stacks
|
||||
=============
|
||||
|
||||
Kernel stacks on x86-64 bit
|
||||
---------------------------
|
||||
===========================
|
||||
|
||||
Most of the text from Keith Owens, hacked by AK
|
||||
|
||||
@ -57,7 +63,7 @@ IST events with the same code to be nested. However in most cases, the
|
||||
stack size allocated to an IST assumes no nesting for the same code.
|
||||
If that assumption is ever broken then the stacks will become corrupt.
|
||||
|
||||
The currently assigned IST stacks are :-
|
||||
The currently assigned IST stacks are:
|
||||
|
||||
* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
@ -103,7 +109,7 @@ For more details see the Intel IA32 or AMD AMD64 architecture manuals.
|
||||
|
||||
|
||||
Printing backtraces on x86
|
||||
--------------------------
|
||||
==========================
|
||||
|
||||
The question about the '?' preceding function names in an x86 stacktrace
|
||||
keeps popping up, here's an in-depth explanation. It helps if the reader
|
||||
@ -113,7 +119,7 @@ arch/x86/kernel/dumpstack.c.
|
||||
Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:
|
||||
|
||||
We always scan the full kernel stack for return addresses stored on
|
||||
the kernel stack(s) [*], from stack top to stack bottom, and print out
|
||||
the kernel stack(s) [1]_, from stack top to stack bottom, and print out
|
||||
anything that 'looks like' a kernel text address.
|
||||
|
||||
If it fits into the frame pointer chain, we print it without a question
|
||||
@ -141,6 +147,6 @@ that look like kernel text addresses, so if debug information is wrong,
|
||||
we still print out the real call chain as well - just with more question
|
||||
marks than ideal.
|
||||
|
||||
[*] For things like IRQ and IST stacks, we also scan those stacks, in
|
||||
.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
|
||||
the right order, and try to cross from one stack into another
|
||||
reconstructing the call chain. This works most of the time.
|
@ -1,7 +1,11 @@
|
||||
The Linux Microcode Loader
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Authors: Fenghua Yu <fenghua.yu@intel.com>
|
||||
Borislav Petkov <bp@suse.de>
|
||||
==========================
|
||||
The Linux Microcode Loader
|
||||
==========================
|
||||
|
||||
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
|
||||
- Borislav Petkov <bp@suse.de>
|
||||
|
||||
The kernel has an x86 microcode loading facility which is supposed to
|
||||
provide microcode loading methods in the OS. Potential use cases are
|
||||
@ -10,8 +14,8 @@ and updating the microcode on long-running systems without rebooting.
|
||||
|
||||
The loader supports three loading methods:
|
||||
|
||||
1. Early load microcode
|
||||
=======================
|
||||
Early load microcode
|
||||
====================
|
||||
|
||||
The kernel can update microcode very early during boot. Loading
|
||||
microcode early can fix CPU issues before they are observed during
|
||||
@ -26,8 +30,10 @@ loader parses the combined initrd image during boot.
|
||||
|
||||
The microcode files in cpio name space are:
|
||||
|
||||
on Intel: kernel/x86/microcode/GenuineIntel.bin
|
||||
on AMD : kernel/x86/microcode/AuthenticAMD.bin
|
||||
on Intel:
|
||||
kernel/x86/microcode/GenuineIntel.bin
|
||||
on AMD :
|
||||
kernel/x86/microcode/AuthenticAMD.bin
|
||||
|
||||
During BSP (BootStrapping Processor) boot (pre-SMP), the kernel
|
||||
scans the microcode file in the initrd. If microcode matching the
|
||||
@ -42,8 +48,8 @@ Here's a crude example how to prepare an initrd with microcode (this is
|
||||
normally done automatically by the distribution, when recreating the
|
||||
initrd, so you don't really have to do it yourself. It is documented
|
||||
here for future reference only).
|
||||
::
|
||||
|
||||
---
|
||||
#!/bin/bash
|
||||
|
||||
if [ -z "$1" ]; then
|
||||
@ -76,15 +82,15 @@ here for future reference only).
|
||||
cat ucode.cpio $INITRD.orig > $INITRD
|
||||
|
||||
rm -rf $TMPDIR
|
||||
---
|
||||
|
||||
|
||||
The system needs to have the microcode packages installed into
|
||||
/lib/firmware or you need to fixup the paths above if yours are
|
||||
somewhere else and/or you've downloaded them directly from the processor
|
||||
vendor's site.
|
||||
|
||||
2. Late loading
|
||||
===============
|
||||
Late loading
|
||||
============
|
||||
|
||||
There are two legacy user space interfaces to load microcode, either through
|
||||
/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
|
||||
@ -94,7 +100,7 @@ The /dev/cpu/microcode method is deprecated because it needs a special
|
||||
userspace tool for that.
|
||||
|
||||
The easier method is simply installing the microcode packages your distro
|
||||
supplies and running:
|
||||
supplies and running::
|
||||
|
||||
# echo 1 > /sys/devices/system/cpu/microcode/reload
|
||||
|
||||
@ -104,19 +110,19 @@ The loading mechanism looks for microcode blobs in
|
||||
/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
|
||||
packages already put them there.
|
||||
|
||||
3. Builtin microcode
|
||||
====================
|
||||
Builtin microcode
|
||||
=================
|
||||
|
||||
The loader also supports loading of builtin microcode supplied through
|
||||
the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is
|
||||
currently supported.
|
||||
|
||||
Here's an example:
|
||||
Here's an example::
|
||||
|
||||
CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
|
||||
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
|
||||
|
||||
This basically means, you have the following tree structure locally:
|
||||
This basically means, you have the following tree structure locally::
|
||||
|
||||
/lib/firmware/
|
||||
|-- amd-ucode
|
354
Documentation/x86/mtrr.rst
Normal file
@ -0,0 +1,354 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================================
|
||||
MTRR (Memory Type Range Register) control
|
||||
=========================================
|
||||
|
||||
:Authors: - Richard Gooch <rgooch@atnf.csiro.au> - 3 Jun 1999
|
||||
- Luis R. Rodriguez <mcgrof@do-not-panic.com> - April 9, 2015
|
||||
|
||||
|
||||
Phasing out MTRR use
|
||||
====================
|
||||
|
||||
MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by
drivers on Linux is now completely phased out. Device drivers should use
arch_phys_wc_add() in combination with ioremap_wc(), which makes MTRR
effective on non-PAT systems while being a no-op, but equally effective, on
PAT-enabled systems.
|
||||
|
||||
Even if Linux does not use MTRRs directly, some x86 platform firmware may still
set up MTRRs early before booting the OS. They do this as some platform
firmware may still have implemented access to MTRRs which would be controlled
and handled by the platform firmware directly. One example of platform use of
MTRRs is SMI handlers: for fan control, say, the platform code would need
uncachable access to some of its fan control registers. Such platform access
does not need any Operating System MTRR code in place other than
mtrr_type_lookup() to ensure any OS specific mapping requests are aligned with
platform MTRR setup. If MTRRs are only set up by the platform firmware code,
though, and the OS does not make any specific MTRR mapping requests,
mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
|
||||
|
||||
For details refer to :doc:`pat`.
|
||||
|
||||
.. tip::
|
||||
On Intel P6 family processors (Pentium Pro, Pentium II and later)
|
||||
the Memory Type Range Registers (MTRRs) may be used to control
|
||||
processor access to memory ranges. This is most useful when you have
|
||||
a video (VGA) card on a PCI or AGP bus. Enabling write-combining
|
||||
allows bus write transfers to be combined into a larger transfer
|
||||
before bursting over the PCI/AGP bus. This can increase performance
|
||||
of image write operations 2.5 times or more.
|
||||
|
||||
The Cyrix 6x86, 6x86MX and M II processors have Address Range
|
||||
Registers (ARRs) which provide a similar functionality to MTRRs. For
|
||||
these, the ARRs are used to emulate the MTRRs.
|
||||
|
||||
The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
|
||||
MTRRs. These are supported. The AMD Athlon family provide 8 Intel
|
||||
style MTRRs.
|
||||
|
||||
The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
|
||||
are supported.
|
||||
|
||||
The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
|
||||
|
||||
The CONFIG_MTRR option creates a /proc/mtrr file which may be used
|
||||
to manipulate your MTRRs. Typically the X server should use
|
||||
this. This should have a reasonably generic interface so that
|
||||
similar control registers on other processors can be easily
|
||||
supported.
|
||||
|
||||
There are two interfaces to /proc/mtrr: one is an ASCII interface
|
||||
which allows you to read and write. The other is an ioctl()
|
||||
interface. The ASCII interface is meant for administration. The
|
||||
ioctl() interface is meant for C programs (i.e. the X server). The
|
||||
interfaces are described below, with sample commands and C code.
|
||||
|
||||
|
||||
Reading MTRRs from the shell
|
||||
============================
|
||||
::
|
||||
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
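A program that prefers the ASCII interface can parse each output line with sscanf(). The sketch below assumes the line layout shown above; `struct mtrr_line` and `parse_mtrr_line()` are hypothetical names, and the accepted unit letters (k, K, M, G) are an assumption about what may appear before the "B".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one parsed line of /proc/mtrr output. */
struct mtrr_line {
	unsigned int regnum;
	unsigned long long base;	/* base address in bytes */
	unsigned long size;		/* size, counted in "unit" bytes */
	char unit;			/* 'k', 'K', 'M' or 'G' */
	char type[32];			/* e.g. "write-back" */
	unsigned int count;
};

/* Parse one line; return 0 on success, -1 if the layout does not match. */
static int parse_mtrr_line(const char *line, struct mtrr_line *out)
{
	unsigned long base_mb;	/* parsed, but redundant with base */
	char unit[2];
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line,
		   "reg%u: base=%llx (%luMB), size=%lu%1[kKMG]B: %31[^,], count=%u",
		   &out->regnum, &out->base, &base_mb, &out->size, unit,
		   out->type, &out->count);
	if (n != 7)
		return -1;
	out->unit = unit[0];
	return 0;
}
```

Whitespace in the format string matches any amount of padding, so the column alignment of the real output does not matter.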
|
||||
|
||||
Creating MTRRs from the C-shell::
|
||||
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
|
||||
|
||||
or if you use bash::
|
||||
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
|
||||
|
||||
And the result thereof::
|
||||
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
|
||||
reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1
|
||||
|
||||
This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
|
||||
find out your base address, you need to look at the output of your X
|
||||
server, which tells you where the linear framebuffer address is. A
|
||||
typical line that you may get is::
|
||||
|
||||
(--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
|
||||
|
||||
Note that you should only use the value from the X server, as it may
|
||||
move the framebuffer base address, so the only value you can trust is
|
||||
that reported by the X server.
|
||||
|
||||
To find out the size of your framebuffer (what, you don't actually
|
||||
know?), the following line will tell you::
|
||||
|
||||
(--) S3: videoram: 4096k
|
||||
|
||||
That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
|
||||
A patch is being written for XFree86 which will make this automatic:
|
||||
in other words the X server will manipulate /proc/mtrr using the
|
||||
ioctl() interface, so users won't have to do anything. If you use a
|
||||
commercial X server, lobby your vendor to add support for MTRRs.
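The arithmetic above (4096k of video RAM is 0x400000 bytes) and the request string written to /proc/mtrr can be sketched as a small helper. `mtrr_format_request()` is a hypothetical name used only for illustration; it builds the string and does not write it anywhere.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a "base=... size=... type=..." request from a framebuffer base
 * address and a video RAM size given in KiB (as the X server reports it).
 * Returns 0 on success, -1 if the type is empty or the buffer is too small.
 */
static int mtrr_format_request(char *buf, size_t len, unsigned long long base,
			       unsigned long videoram_kib, const char *type)
{
	unsigned long long size = (unsigned long long)videoram_kib * 1024;
	int n;

	assert(buf != NULL && type != NULL);
	if (strlen(type) == 0)
		return -1;
	n = snprintf(buf, len, "base=0x%llx size=0x%llx type=%s",
		     base, size, type);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With base 0xf8000000, 4096 KiB and type "write-combining", this yields the same string as the echo commands shown earlier.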
|
||||
|
||||
|
||||
Creating overlapping MTRRs
|
||||
==========================
|
||||
::
|
||||
|
||||
%echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
|
||||
%echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
|
||||
|
||||
And the results::
|
||||
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1
|
||||
reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1
|
||||
reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1
|
||||
|
||||
Some cards (especially Voodoo Graphics boards) need this 4 kB area
|
||||
excluded from the beginning of the region because it is used for
|
||||
registers.
|
||||
|
||||
NOTE: You can only create a type=uncachable region if the first
region that you created is type=write-combining.
|
||||
|
||||
|
||||
Removing MTRRs from the C-shell
===============================
|
||||
::
|
||||
|
||||
% echo "disable=2" >! /proc/mtrr
|
||||
|
||||
or using bash::
|
||||
|
||||
% echo "disable=2" >| /proc/mtrr
|
||||
|
||||
|
||||
Reading MTRRs from a C program using ioctl()'s
|
||||
==============================================
|
||||
::
|
||||
|
||||
/* mtrr-show.c
|
||||
|
||||
Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This program will use an ioctl() on /proc/mtrr to show the current MTRR
|
||||
settings. This is an alternative to reading /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main ()
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_gentry gentry;
|
||||
|
||||
if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (1);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (2);
|
||||
}
|
||||
for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
|
||||
++gentry.regnum)
|
||||
{
|
||||
if (gentry.size < 1)
|
||||
{
|
||||
fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
|
||||
continue;
|
||||
}
|
||||
fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
|
||||
gentry.regnum, gentry.base, gentry.size,
|
||||
mtrr_strings[gentry.type]);
|
||||
}
|
||||
if (errno == EINVAL) exit (0);
|
||||
fprintf (stderr, "Error doing ioctl(2) on /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (3);
|
||||
} /* End Function main */
|
||||
|
||||
|
||||
Creating MTRRs from a C programme using ioctl()'s
|
||||
=================================================
|
||||
::
|
||||
|
||||
/* mtrr-add.c
|
||||
|
||||
Source file for mtrr-add (example programme to add an MTRR using ioctl())
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This programme will use an ioctl() on /proc/mtrr to add an entry. The first
|
||||
available mtrr is used. This is an alternative to writing /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <stdlib.h>
|
||||
#include <unistd.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main (int argc, char **argv)
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_sentry sentry;
|
||||
|
||||
if (argc != 4)
|
||||
{
|
||||
fprintf (stderr, "Usage:\tmtrr-add base size type\n");
|
||||
exit (1);
|
||||
}
|
||||
sentry.base = strtoul (argv[1], NULL, 0);
|
||||
sentry.size = strtoul (argv[2], NULL, 0);
|
||||
for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
|
||||
{
|
||||
if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
|
||||
}
|
||||
if (sentry.type >= MTRR_NUM_TYPES)
|
||||
{
|
||||
fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
|
||||
exit (2);
|
||||
}
|
||||
if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (3);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (4);
|
||||
}
|
||||
if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
|
||||
{
|
||||
fprintf (stderr, "Error doing ioctl(2) on /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (5);
|
||||
}
|
||||
fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
|
||||
sleep (5);
|
||||
close (fd);
|
||||
fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
|
||||
stderr);
|
||||
} /* End Function main */
|
@ -1,329 +0,0 @@
|
||||
MTRR (Memory Type Range Register) control
|
||||
|
||||
Richard Gooch <rgooch@atnf.csiro.au> - 3 Jun 1999
|
||||
Luis R. Rodriguez <mcgrof@do-not-panic.com> - April 9, 2015
|
||||
|
||||
===============================================================================
|
||||
Phasing out MTRR use
|
||||
|
||||
MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by
|
||||
drivers on Linux is now completely phased out, device drivers should use
|
||||
arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on
|
||||
non-PAT systems while a no-op but equally effective on PAT enabled systems.
|
||||
|
||||
Even if Linux does not use MTRRs directly, some x86 platform firmware may still
|
||||
set up MTRRs early before booting the OS. They do this as some platform
|
||||
firmware may still have implemented access to MTRRs which would be controlled
|
||||
and handled by the platform firmware directly. An example of platform use of
|
||||
MTRRs is through the use of SMI handlers, one case could be for fan control,
|
||||
the platform code would need uncachable access to some of its fan control
|
||||
registers. Such platform access does not need any Operating System MTRR code in
|
||||
place other than mtrr_type_lookup() to ensure any OS specific mapping requests
|
||||
are aligned with platform MTRR setup. If MTRRs are only set up by the platform
|
||||
firmware code though and the OS does not make any specific MTRR mapping
|
||||
requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
|
||||
|
||||
For details refer to Documentation/x86/pat.txt.
|
||||
|
||||
===============================================================================
|
||||
|
||||
On Intel P6 family processors (Pentium Pro, Pentium II and later)
|
||||
the Memory Type Range Registers (MTRRs) may be used to control
|
||||
processor access to memory ranges. This is most useful when you have
|
||||
a video (VGA) card on a PCI or AGP bus. Enabling write-combining
|
||||
allows bus write transfers to be combined into a larger transfer
|
||||
before bursting over the PCI/AGP bus. This can increase performance
|
||||
of image write operations 2.5 times or more.
|
||||
|
||||
The Cyrix 6x86, 6x86MX and M II processors have Address Range
|
||||
Registers (ARRs) which provide a similar functionality to MTRRs. For
|
||||
these, the ARRs are used to emulate the MTRRs.
|
||||
|
||||
The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
|
||||
MTRRs. These are supported. The AMD Athlon family provide 8 Intel
|
||||
style MTRRs.
|
||||
|
||||
The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
|
||||
are supported.
|
||||
|
||||
The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
|
||||
|
||||
The CONFIG_MTRR option creates a /proc/mtrr file which may be used
|
||||
to manipulate your MTRRs. Typically the X server should use
|
||||
this. This should have a reasonably generic interface so that
|
||||
similar control registers on other processors can be easily
|
||||
supported.
|
||||
|
||||
|
||||
There are two interfaces to /proc/mtrr: one is an ASCII interface
|
||||
which allows you to read and write. The other is an ioctl()
|
||||
interface. The ASCII interface is meant for administration. The
|
||||
ioctl() interface is meant for C programs (i.e. the X server). The
|
||||
interfaces are described below, with sample commands and C code.
|
||||
|
||||
===============================================================================
|
||||
Reading MTRRs from the shell:
|
||||
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
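A program that prefers the ASCII interface can pick the fields out of each
line with sscanf(). The following is only a sketch: struct mtrr_line,
parse_mtrr_line() and mtrr_line_is_type() are made-up names for illustration,
not part of any kernel header, and the format string assumes lines shaped
exactly like the sample output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of /proc/mtrr (not a kernel type). */
struct mtrr_line {
    unsigned int regnum;   /* NN in "regNN" */
    unsigned long base;    /* base address in bytes */
    unsigned long size;    /* size, counted in the unit below */
    char unit;             /* 'M' for MB, 'k' for kB */
    char type[32];         /* memory type name, e.g. "write-back" */
    unsigned int count;    /* usage count */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
static int parse_mtrr_line(const char *line, struct mtrr_line *out)
{
    unsigned long base_mb; /* "(NNNMB)" repeats the base in MB; discarded */
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line,
               "reg%u: base=%lx ( %luMB), size= %lu%cB: %31[^,], count=%u",
               &out->regnum, &out->base, &base_mb, &out->size, &out->unit,
               out->type, &out->count);
    return n == 7 ? 0 : -1;
}

/* Return non-zero if the parsed line carries the given memory type name. */
static int mtrr_line_is_type(const struct mtrr_line *l, const char *type)
{
    return strcmp(l->type, type) == 0;
}
```

Feeding each line read from /proc/mtrr to parse_mtrr_line() fills in the
register number, base, size and type. Lines that do not match the assumed
layout are rejected rather than guessed at.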
|
||||
===============================================================================
|
||||
Creating MTRRs from the C-shell:
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
|
||||
or if you use bash:
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
|
||||
|
||||
And the result thereof:
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
|
||||
reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1
|
||||
|
||||
This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
|
||||
find out your base address, you need to look at the output of your X
|
||||
server, which tells you where the linear framebuffer address is. A
|
||||
typical line that you may get is:
|
||||
|
||||
(--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
|
||||
|
||||
Note that you should only use the value from the X server, as it may
|
||||
move the framebuffer base address, so the only value you can trust is
|
||||
that reported by the X server.
|
||||
|
||||
To find out the size of your framebuffer (what, you don't actually
|
||||
know?), the following line will tell you:
|
||||
|
||||
(--) S3: videoram: 4096k
|
||||
|
||||
That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
|
||||
A patch is being written for XFree86 which will make this automatic:
|
||||
in other words the X server will manipulate /proc/mtrr using the
|
||||
ioctl() interface, so users won't have to do anything. If you use a
|
||||
commercial X server, lobby your vendor to add support for MTRRs.
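The arithmetic above (4096k of video RAM is 0x400000 bytes) and the request
string written to /proc/mtrr can be sketched in C. The helper names
kb_to_bytes() and format_mtrr_request() are invented for this illustration.
The list of accepted type names is taken from the mtrr_strings table in the
example programs. The code only builds the string; it does not write it
anywhere.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Type names as listed in the mtrr_strings table of the example programs. */
static const char *const known_types[] = {
    "uncachable", "write-combining", "write-through",
    "write-protect", "write-back",
};

/* Convert a size reported in kilobytes (e.g. "videoram: 4096k") to bytes. */
static unsigned long kb_to_bytes(unsigned long kb)
{
    return kb * 1024UL;
}

/*
 * Build the line that would be echoed into /proc/mtrr.
 * Returns 0 on success, -1 for an unknown type or a buffer that is too small.
 */
static int format_mtrr_request(char *buf, size_t len, unsigned long base,
                               unsigned long size, const char *type)
{
    const size_t ntypes = sizeof known_types / sizeof known_types[0];
    size_t i;
    int n;

    assert(buf != NULL && type != NULL);
    for (i = 0; i < ntypes; i++)
        if (strcmp(type, known_types[i]) == 0)
            break;
    if (i == ntypes)
        return -1;
    n = snprintf(buf, len, "base=0x%lx size=0x%lx type=%s", base, size, type);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the values from the X server output quoted above (base 0xf8000000,
videoram 4096k), this produces the same string as the echo command shown
earlier.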
|
||||
===============================================================================
|
||||
Creating overlapping MTRRs:
|
||||
|
||||
%echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
|
||||
%echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
|
||||
|
||||
And the results: cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1
|
||||
reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1
|
||||
reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1
|
||||
|
||||
Some cards (especially Voodoo Graphics boards) need this 4 kB area
|
||||
excluded from the beginning of the region because it is used for
|
||||
registers.
|
||||
|
||||
NOTE: You can only create a type=uncachable region if the first
region that you created is type=write-combining.
|
||||
===============================================================================
|
||||
Removing MTRRs from the C-shell:
|
||||
% echo "disable=2" >! /proc/mtrr
|
||||
or using bash:
|
||||
% echo "disable=2" >| /proc/mtrr
|
||||
===============================================================================
|
||||
Reading MTRRs from a C program using ioctl()'s:
|
||||
|
||||
/* mtrr-show.c
|
||||
|
||||
Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This program will use an ioctl() on /proc/mtrr to show the current MTRR
|
||||
settings. This is an alternative to reading /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main ()
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_gentry gentry;
|
||||
|
||||
if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (1);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (2);
|
||||
}
|
||||
for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
|
||||
++gentry.regnum)
|
||||
{
|
||||
if (gentry.size < 1)
|
||||
{
|
||||
fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
|
||||
continue;
|
||||
}
|
||||
fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
|
||||
gentry.regnum, gentry.base, gentry.size,
|
||||
mtrr_strings[gentry.type]);
|
||||
}
|
||||
if (errno == EINVAL) exit (0);
|
||||
fprintf (stderr, "Error doing ioctl(2) on /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (3);
|
||||
} /* End Function main */
|
||||
===============================================================================
|
||||
Creating MTRRs from a C programme using ioctl()'s:
|
||||
|
||||
/* mtrr-add.c
|
||||
|
||||
Source file for mtrr-add (example programme to add an MTRR using ioctl())
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This programme will use an ioctl() on /proc/mtrr to add an entry. The first
|
||||
available mtrr is used. This is an alternative to writing /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <stdlib.h>
|
||||
#include <unistd.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main (int argc, char **argv)
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_sentry sentry;
|
||||
|
||||
if (argc != 4)
|
||||
{
|
||||
fprintf (stderr, "Usage:\tmtrr-add base size type\n");
|
||||
exit (1);
|
||||
}
|
||||
sentry.base = strtoul (argv[1], NULL, 0);
|
||||
sentry.size = strtoul (argv[2], NULL, 0);
|
||||
for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
|
||||
{
|
||||
if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
|
||||
}
|
||||
if (sentry.type >= MTRR_NUM_TYPES)
|
||||
{
|
||||
fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
|
||||
exit (2);
|
||||
}
|
||||
if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (3);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (4);
|
||||
}
|
||||
if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
|
||||
{
|
||||
fprintf (stderr, "Error doing ioctl(2) on /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (5);
|
||||
}
|
||||
fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
|
||||
sleep (5);
|
||||
close (fd);
|
||||
fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
|
||||
stderr);
|
||||
} /* End Function main */
|
||||
===============================================================================
|
@ -1,8 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============
|
||||
ORC unwinder
|
||||
============
|
||||
|
||||
Overview
|
||||
--------
|
||||
========
|
||||
|
||||
The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
|
||||
similar in concept to a DWARF unwinder. The difference is that the
|
||||
@ -23,12 +26,12 @@ correlate instruction addresses with their stack states at run time.
|
||||
|
||||
|
||||
ORC vs frame pointers
|
||||
---------------------
|
||||
=====================
|
||||
|
||||
With frame pointers enabled, GCC adds instrumentation code to every
|
||||
function in the kernel. The kernel's .text size increases by about
|
||||
3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
|
||||
Gorman [1] have shown a slowdown of 5-10% for some workloads.
|
||||
Gorman [1]_ have shown a slowdown of 5-10% for some workloads.
|
||||
|
||||
In contrast, the ORC unwinder has no effect on text size or runtime
|
||||
performance, because the debuginfo is out of band. So if you disable
|
||||
@ -55,7 +58,7 @@ depending on the kernel config.
|
||||
|
||||
|
||||
ORC vs DWARF
|
||||
------------
|
||||
============
|
||||
|
||||
ORC debuginfo's advantage over DWARF itself is that it's much simpler.
|
||||
It gets rid of the complex DWARF CFI state machine and also gets rid of
|
||||
@ -65,7 +68,7 @@ mission critical oops code.
|
||||
|
||||
The simpler debuginfo format also enables the unwinder to be much faster
|
||||
than DWARF, which is important for perf and lockdep. In a basic
|
||||
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
|
||||
performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x
|
||||
faster than an out-of-tree DWARF unwinder. (Note: That measurement was
|
||||
taken before some performance tweaks were added, which doubled
|
||||
performance, so the speedup over DWARF may be closer to 40x.)
|
||||
@ -85,7 +88,7 @@ still be able to control the format, e.g. no complex state machines.
|
||||
|
||||
|
||||
ORC unwind table generation
|
||||
---------------------------
|
||||
===========================
|
||||
|
||||
The ORC data is generated by objtool. With the existing compile-time
|
||||
stack metadata validation feature, objtool already follows all code
|
||||
@ -133,7 +136,7 @@ objtool follows GCC code quite well.
|
||||
|
||||
|
||||
Unwinder implementation details
|
||||
-------------------------------
|
||||
===============================
|
||||
|
||||
Objtool generates the ORC data by integrating with the compile-time
|
||||
stack metadata validation feature, which is described in detail in
|
||||
@ -154,7 +157,7 @@ subset of the table needs to be searched.
|
||||
|
||||
|
||||
Etymology
|
||||
---------
|
||||
=========
|
||||
|
||||
Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
|
||||
enemies. Similarly, the ORC unwinder was created in opposition to the
|
||||
@ -162,7 +165,7 @@ complexity and slowness of DWARF.
|
||||
|
||||
"Although Orcs rarely consider multiple solutions to a problem, they do
|
||||
excel at getting things done because they are creatures of action, not
|
||||
thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
|
||||
thought." [3]_ Similarly, unlike the esoteric DWARF unwinder, the
|
||||
veracious ORC unwinder wastes no time or siloconic effort decoding
|
||||
variable-length zero-extended unsigned-integer byte-coded
|
||||
state-machine-based debug information entries.
|
||||
@ -174,6 +177,6 @@ brutal, unyielding efficiency.
|
||||
ORC stands for Oops Rewind Capability.
|
||||
|
||||
|
||||
[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
|
||||
[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
|
||||
[3] http://dustin.wikidot.com/half-orcs-and-orcs
|
||||
.. [1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
|
||||
.. [2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
|
||||
.. [3] http://dustin.wikidot.com/half-orcs-and-orcs
|
242
Documentation/x86/pat.rst
Normal file
@ -0,0 +1,242 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
PAT (Page Attribute Table)
|
||||
==========================
|
||||
|
||||
x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
|
||||
page level granularity. PAT is complementary to the MTRR settings which allow
|
||||
for setting of memory types over physical address ranges. However, PAT is
|
||||
more flexible than MTRR due to its capability to set attributes at page level
|
||||
and also due to the fact that there are no hardware limitations on number of
|
||||
such attribute settings allowed. Added flexibility comes with guidelines for
|
||||
not having memory type aliasing for the same physical memory with multiple
|
||||
virtual addresses.
|
||||
|
||||
PAT allows for different types of memory attributes. The most commonly used
|
||||
ones that will be supported at this time are:
|
||||
|
||||
=== ==============
WB  Write-back
UC  Uncached
WC  Write-combined
WT  Write-through
UC- Uncached Minus
=== ==============
|
||||
|
||||
|
||||
PAT APIs
|
||||
========
|
||||
|
||||
There are many different APIs in the kernel that allow setting of memory
|
||||
attributes at the page level. In order to avoid aliasing, these interfaces
|
||||
should be used thoughtfully. Below is a table of interfaces available,
|
||||
their intended usage and their memory attribute relationships. Internally,
|
||||
these APIs use a reserve_memtype()/free_memtype() interface on the physical
|
||||
address range to avoid any aliasing.
|
||||
|
||||
+------------------------+----------+--------------+------------------+
| API                    | RAM      | ACPI,...     | Reserved/Holes   |
+------------------------+----------+--------------+------------------+
| ioremap                |    --    |    UC-       |       UC-        |
+------------------------+----------+--------------+------------------+
| ioremap_cache          |    --    |    WB        |       WB         |
+------------------------+----------+--------------+------------------+
| ioremap_uc             |    --    |    UC        |       UC         |
+------------------------+----------+--------------+------------------+
| ioremap_nocache        |    --    |    UC-       |       UC-        |
+------------------------+----------+--------------+------------------+
| ioremap_wc             |    --    |    --        |       WC         |
+------------------------+----------+--------------+------------------+
| ioremap_wt             |    --    |    --        |       WT         |
+------------------------+----------+--------------+------------------+
| set_memory_uc,         |    UC-   |    --        |       --         |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| set_memory_wc,         |    WC    |    --        |       --         |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| set_memory_wt,         |    WT    |    --        |       --         |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci sysfs resource     |    --    |    --        |       UC-        |
+------------------------+----------+--------------+------------------+
| pci sysfs resource_wc  |    --    |    --        |       WC         |
| is IORESOURCE_PREFETCH |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci proc               |    --    |    --        |       UC-        |
| !PCIIOC_WRITE_COMBINE  |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci proc               |    --    |    --        |       WC         |
| PCIIOC_WRITE_COMBINE   |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               |    --    |  WB/WC/UC-   |    WB/WC/UC-     |
| read-write             |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               |    --    |    UC-       |       UC-        |
| mmap SYNC flag         |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               |    --    |  WB/WC/UC-   |    WB/WC/UC-     |
| mmap !SYNC flag        |          |              |                  |
| and                    |          |(from existing|  (from existing  |
| any alias to this area |          |alias)        |  alias)          |
+------------------------+----------+--------------+------------------+
| /dev/mem               |    --    |    WB        |       WB         |
| mmap !SYNC flag        |          |              |                  |
| no alias to this area  |          |              |                  |
| and                    |          |              |                  |
| MTRR says WB           |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               |    --    |    --        |       UC-        |
| mmap !SYNC flag        |          |              |                  |
| no alias to this area  |          |              |                  |
| and                    |          |              |                  |
| MTRR says !WB          |          |              |                  |
+------------------------+----------+--------------+------------------+
|
||||
|
||||
|
||||
Advanced APIs for drivers
|
||||
=========================
|
||||
|
||||
A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
|
||||
vmf_insert_pfn.
|
||||
|
||||
Drivers wanting to export some pages to userspace do it by using mmap
|
||||
interface and a combination of:
|
||||
|
||||
1) pgprot_noncached()
|
||||
2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
|
||||
|
||||
With PAT support, a new API pgprot_writecombine is being added. So, drivers can
|
||||
continue to use the above sequence, with either pgprot_noncached() or
|
||||
pgprot_writecombine() in step 1, followed by step 2.
|
||||
|
||||
In addition, step 2 internally tracks the region as UC or WC in memtype
|
||||
list in order to ensure no conflicting mapping.
|
||||
|
||||
Note that this set of APIs only works with IO (non RAM) regions. If a driver
wants to export a RAM region, it has to call set_memory_uc() or set_memory_wc()
as step 0 above, track the usage of those pages, and call set_memory_wb()
before the pages are returned to the free pool.
|
||||
|
||||
MTRR effects on PAT / non-PAT systems
|
||||
=====================================
|
||||
|
||||
The following table provides the effects of using write-combining MTRRs when
|
||||
using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally
|
||||
mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will
|
||||
be a no-op on PAT enabled systems. The region over which an arch_phys_wc_add()
call is made should already have been ioremapped with WC attributes or PAT
entries; this can be done by using ioremap_wc() / set_memory_wc(). Devices which
combine areas of IO memory desired to remain uncacheable with areas where
write-combining is desirable should consider use of ioremap_uc() followed by
set_memory_wc() to white-list effective write-combined areas. Such use is
nevertheless discouraged as the effective memory type is considered
implementation defined, yet this strategy can be used as a last resort on
devices with size-constrained regions where MTRR write-combining would
otherwise not be effective.
|
||||
::
|
||||
|
||||
  ====  =======  ===  =========================  =====================
  MTRR  Non-PAT  PAT  Linux ioremap value        Effective memory type
  ====  =======  ===  =========================  =====================
        PAT                                        Non-PAT |  PAT
        |PCD                                               |
        ||PWT                                              |
        |||                                                |
  WC    000      WB   _PAGE_CACHE_MODE_WB             WC   |   WC
  WC    001      WC   _PAGE_CACHE_MODE_WC             WC*  |   WC
  WC    010      UC-  _PAGE_CACHE_MODE_UC_MINUS       WC*  |   UC
  WC    011      UC   _PAGE_CACHE_MODE_UC             UC   |   UC
  ====  =======  ===  =========================  =====================
|
||||
|
||||
(*) denotes implementation defined and is discouraged
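The table above can also be read as a small lookup keyed by the ioremap cache
mode. The sketch below is userspace C for illustration only: the enum and
struct names are invented here and are not the kernel's own
_PAGE_CACHE_MODE_* definitions, and the values are copied row by row from the
table, under the assumption that a write-combining MTRR covers the region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Rows of the table, named after the "Linux ioremap value" column. */
enum cache_mode { CM_WB, CM_WC, CM_UC_MINUS, CM_UC, CM_COUNT };

/* Effective memory types that occur in the table. */
enum mem_type { MT_WC, MT_UC };

struct wc_mtrr_effect {
    enum mem_type non_pat;     /* effective type on a non-PAT system */
    bool non_pat_impl_defined; /* entry marked (*) in the table */
    enum mem_type pat;         /* effective type on a PAT system */
};

/* One entry per table row; the MTRR type is WC in every row. */
static const struct wc_mtrr_effect wc_mtrr_effects[CM_COUNT] = {
    [CM_WB]       = { MT_WC, false, MT_WC },
    [CM_WC]       = { MT_WC, true,  MT_WC },
    [CM_UC_MINUS] = { MT_WC, true,  MT_UC },
    [CM_UC]       = { MT_UC, false, MT_UC },
};

/* Return the table row for a cache mode, or NULL if it is out of range. */
static const struct wc_mtrr_effect *wc_mtrr_effect(enum cache_mode mode)
{
    if ((unsigned int)mode >= CM_COUNT)
        return NULL;
    return &wc_mtrr_effects[mode];
}

/* Effective type for a valid cache mode on a PAT or non-PAT system. */
static enum mem_type effective_type(enum cache_mode mode, bool pat_enabled)
{
    const struct wc_mtrr_effect *e = wc_mtrr_effect(mode);

    assert(e != NULL);
    return pat_enabled ? e->pat : e->non_pat;
}
```

For example, effective_type(CM_UC_MINUS, true) yields MT_UC, while the same
mode on a non-PAT system yields MT_WC with non_pat_impl_defined set, which is
the discouraged case from the table.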
|
||||
|
||||
.. note:: -- in the PAT APIs table above means "Not suggested usage for the
   API". Some of the --'s are strictly enforced by the kernel. Some others are
   not really enforced today, but may be enforced in future.
|
||||
|
||||
For ioremap and pci access through /sys or /proc - The actual type returned
|
||||
can be more restrictive, in case of any existing aliasing for that address.
|
||||
For example: If there is an existing uncached mapping, a new ioremap_wc can
|
||||
return uncached mapping in place of write-combine requested.
|
||||
|
||||
set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver
|
||||
will first make a region uc, wc or wt and switch it back to wb after use.
|
||||
|
||||
Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
interfaces. Users writing to /proc/mtrr are advised to use the above interfaces
instead.
|
||||
|
||||
Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
|
||||
types.
|
||||
|
||||
Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges.
|
||||
|
||||
|
||||
PAT debugging
|
||||
=============
|
||||
|
||||
With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by::
|
||||
|
||||
# mount -t debugfs debugfs /sys/kernel/debug
|
||||
# cat /sys/kernel/debug/x86/pat_memtype_list
|
||||
PAT memtype list:
|
||||
uncached-minus @ 0x7fadf000-0x7fae0000
|
||||
uncached-minus @ 0x7fb19000-0x7fb1a000
|
||||
uncached-minus @ 0x7fb1a000-0x7fb1b000
|
||||
uncached-minus @ 0x7fb1b000-0x7fb1c000
|
||||
uncached-minus @ 0x7fb1c000-0x7fb1d000
|
||||
uncached-minus @ 0x7fb1d000-0x7fb1e000
|
||||
uncached-minus @ 0x7fb1e000-0x7fb25000
|
||||
uncached-minus @ 0x7fb25000-0x7fb26000
|
||||
uncached-minus @ 0x7fb26000-0x7fb27000
|
||||
uncached-minus @ 0x7fb27000-0x7fb28000
|
||||
uncached-minus @ 0x7fb28000-0x7fb2e000
|
||||
uncached-minus @ 0x7fb2e000-0x7fb2f000
|
||||
uncached-minus @ 0x7fb2f000-0x7fb30000
|
||||
uncached-minus @ 0x7fb31000-0x7fb32000
|
||||
uncached-minus @ 0x80000000-0x90000000
|
||||
|
||||
This list shows physical address ranges and various PAT settings used to
|
||||
access those physical address ranges.
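A tool that post-processes this list might parse each entry as follows. This
is a hedged userspace sketch: struct memtype_range, parse_memtype_line() and
memtype_range_size() are hypothetical names, and the format string assumes the
"type @ start-end" layout shown in the sample output above, which other kernel
versions may not use.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one entry of pat_memtype_list. */
struct memtype_range {
    char type[32];            /* e.g. "uncached-minus" */
    unsigned long long start; /* first physical address of the range */
    unsigned long long end;   /* end address as printed in the list */
};

/*
 * Parse one "type @ 0xSTART-0xEND" line.
 * Returns 0 on success, -1 for header lines or malformed input.
 */
static int parse_memtype_line(const char *line, struct memtype_range *out)
{
    assert(line != NULL && out != NULL);
    /* %llx accepts the leading "0x" of each address. */
    if (sscanf(line, "%31s @ %llx-%llx",
               out->type, &out->start, &out->end) != 3)
        return -1;
    if (out->end <= out->start)
        return -1;
    return 0;
}

/* Size of the range in bytes, treating the printed end as exclusive. */
static unsigned long long memtype_range_size(const struct memtype_range *r)
{
    return r->end - r->start;
}
```

Applied to the first entry of the sample output, this gives a range of 0x1000
bytes (one 4 kB page). The "PAT memtype list:" header line is rejected.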
|
||||
|
||||
Another, more verbose way of getting PAT related debug messages is with the
"debugpat" boot parameter. With this parameter, various debug messages are
printed to the dmesg log.
|
||||
|
||||
PAT Initialization
|
||||
==================
|
||||
|
||||
The following table describes how PAT is initialized under various
|
||||
configurations. The PAT MSR must be updated by Linux in order to support WC
|
||||
and WT attributes. Otherwise, the PAT MSR has the value programmed in it
|
||||
by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests.
|
||||
|
||||
==== ===== ========================== ========= =======
MTRR PAT   Call Sequence              PAT State PAT MSR
==== ===== ========================== ========= =======
E    E     MTRR -> PAT init           Enabled   OS
E    D     MTRR -> PAT init           Disabled  -
D    E     MTRR -> PAT disable        Disabled  BIOS
D    D     MTRR -> PAT disable        Disabled  -
-    np/E  PAT -> PAT disable         Disabled  BIOS
-    np/D  PAT -> PAT disable         Disabled  -
E    !P/E  MTRR -> PAT init           Disabled  BIOS
D    !P/E  MTRR -> PAT disable        Disabled  BIOS
!M   !P/E  MTRR stub -> PAT disable   Disabled  BIOS
==== ===== ========================== ========= =======
|
||||
|
||||
Legend
|
||||
|
||||
========= =======================================
E         Feature enabled in CPU
D         Feature disabled/unsupported in CPU
np        "nopat" boot option specified
!P        CONFIG_X86_PAT option unset
!M        CONFIG_MTRR option unset
Enabled   PAT state set to enabled
Disabled  PAT state set to disabled
OS        PAT initializes PAT MSR with OS setting
BIOS      PAT keeps PAT MSR with BIOS setting
========= =======================================
|
||||
|
@ -1,230 +0,0 @@
|
||||
|
||||
PAT (Page Attribute Table)
|
||||
|
||||
x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
|
||||
page level granularity. PAT is complementary to the MTRR settings which allows
|
||||
for setting of memory types over physical address ranges. However, PAT is
|
||||
more flexible than MTRR due to its capability to set attributes at page level
|
||||
and also due to the fact that there are no hardware limitations on number of
|
||||
such attribute settings allowed. Added flexibility comes with guidelines for
|
||||
not having memory type aliasing for the same physical memory with multiple
|
||||
virtual addresses.
|
||||
|
||||
PAT allows for different types of memory attributes. The most commonly used
|
||||
ones that will be supported at this time are Write-back, Uncached,
|
||||
Write-combined, Write-through and Uncached Minus.
|
||||
|
||||
|
||||
PAT APIs
|
||||
--------
|
||||
|
||||
There are many different APIs in the kernel that allows setting of memory
|
||||
attributes at the page level. In order to avoid aliasing, these interfaces
|
||||
should be used thoughtfully. Below is a table of interfaces available,
|
||||
their intended usage and their memory attribute relationships. Internally,
|
||||
these APIs use a reserve_memtype()/free_memtype() interface on the physical
|
||||
address range to avoid any aliasing.
|
||||
|
||||
|
||||
-------------------------------------------------------------------
|
||||
API | RAM | ACPI,... | Reserved/Holes |
|
||||
-----------------------|----------|------------|------------------|
|
||||
| | | |
|
||||
ioremap | -- | UC- | UC- |
|
||||
| | | |
|
||||
ioremap_cache | -- | WB | WB |
|
||||
| | | |
|
||||
ioremap_uc | -- | UC | UC |
|
||||
| | | |
|
||||
ioremap_nocache | -- | UC- | UC- |
|
||||
| | | |
|
||||
ioremap_wc | -- | -- | WC |
|
||||
| | | |
|
||||
ioremap_wt | -- | -- | WT |
|
||||
| | | |
|
||||
set_memory_uc | UC- | -- | -- |
|
||||
set_memory_wb | | | |
|
||||
| | | |
|
||||
set_memory_wc | WC | -- | -- |
|
||||
set_memory_wb | | | |
|
||||
| | | |
|
||||
set_memory_wt | WT | -- | -- |
|
||||
set_memory_wb | | | |
|
||||
| | | |
|
||||
pci sysfs resource | -- | -- | UC- |
|
||||
| | | |
|
||||
pci sysfs resource_wc | -- | -- | WC |
|
||||
is IORESOURCE_PREFETCH| | | |
|
||||
| | | |
|
||||
pci proc | -- | -- | UC- |
|
||||
!PCIIOC_WRITE_COMBINE | | | |
|
||||
| | | |
|
||||
pci proc | -- | -- | WC |
|
||||
PCIIOC_WRITE_COMBINE | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB/WC/UC- | WB/WC/UC- |
|
||||
read-write | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | UC- | UC- |
|
||||
mmap SYNC flag | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB/WC/UC- | WB/WC/UC- |
|
||||
mmap !SYNC flag | |(from exist-| (from exist- |
|
||||
and | | ing alias)| ing alias) |
|
||||
any alias to this area| | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB | WB |
|
||||
mmap !SYNC flag | | | |
|
||||
no alias to this area | | | |
|
||||
and | | | |
|
||||
MTRR says WB | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | -- | UC- |
|
||||
mmap !SYNC flag | | | |
|
||||
no alias to this area | | | |
|
||||
and | | | |
|
||||
MTRR says !WB | | | |
|
||||
| | | |
|
||||
-------------------------------------------------------------------
|
||||
|
||||
Advanced APIs for drivers
|
||||
-------------------------
|
||||
A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
|
||||
vmf_insert_pfn
|
||||
|
||||
Drivers wanting to export some pages to userspace do it by using mmap
|
||||
interface and a combination of
|
||||
1) pgprot_noncached()
|
||||
2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
|
||||
|
||||
With PAT support, a new API pgprot_writecombine is being added. So, drivers can
|
||||
continue to use the above sequence, with either pgprot_noncached() or
|
||||
pgprot_writecombine() in step 1, followed by step 2.
|
||||
|
||||
In addition, step 2 internally tracks the region as UC or WC in memtype
|
||||
list in order to ensure no conflicting mapping.
|
||||
|
||||
Note that this set of APIs only works with IO (non RAM) regions. If driver
|
||||
wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
|
||||
as step 0 above and also track the usage of those pages and use set_memory_wb()
|
||||
before the page is freed to free pool.
|
||||
|
||||
MTRR effects on PAT / non-PAT systems
|
||||
-------------------------------------
|
||||
|
||||
The following table provides the effects of using write-combining MTRRs when
|
||||
using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally
|
||||
mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will
|
||||
be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add()
|
||||
is made, should already have been ioremapped with WC attributes or PAT entries,
|
||||
this can be done by using ioremap_wc() / set_memory_wc(). Devices which
|
||||
combine areas of IO memory desired to remain uncacheable with areas where
|
||||
write-combining is desirable should consider use of ioremap_uc() followed by
|
||||
set_memory_wc() to white-list effective write-combined areas. Such use is
|
||||
nevertheless discouraged as the effective memory type is considered
|
||||
implementation defined, yet this strategy can be used as last resort on devices
|
||||
with size-constrained regions where otherwise MTRR write-combining would
|
||||
otherwise not be effective.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
MTRR Non-PAT PAT Linux ioremap value Effective memory type
|
||||
----------------------------------------------------------------------
|
||||
Non-PAT | PAT
|
||||
PAT
|
||||
|PCD
|
||||
||PWT
|
||||
|||
|
||||
WC 000 WB _PAGE_CACHE_MODE_WB WC | WC
|
||||
WC 001 WC _PAGE_CACHE_MODE_WC WC* | WC
|
||||
WC 010 UC- _PAGE_CACHE_MODE_UC_MINUS WC* | UC
|
||||
WC 011 UC _PAGE_CACHE_MODE_UC UC | UC
|
||||
----------------------------------------------------------------------
|
||||
|
||||
(*) denotes implementation defined and is discouraged
|
||||
|
||||
Notes:
|
||||
|
||||
-- in the above table mean "Not suggested usage for the API". Some of the --'s
|
||||
are strictly enforced by the kernel. Some others are not really enforced
|
||||
today, but may be enforced in future.
|
||||
|
||||
For ioremap and pci access through /sys or /proc - The actual type returned
|
||||
can be more restrictive, in case of any existing aliasing for that address.
|
||||
For example: If there is an existing uncached mapping, a new ioremap_wc can
|
||||
return uncached mapping in place of write-combine requested.
|
||||
|
||||
set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver
|
||||
will first make a region uc, wc or wt and switch it back to wb after use.
|
||||
|
||||
Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
|
||||
interfaces. Users writing to /proc/mtrr are suggested to use above interfaces.
|
||||
|
||||
Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
|
||||
types.
|
||||
|
||||
Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges.
|
||||
|
||||
|
||||
PAT debugging
|
||||
-------------
|
||||
|
||||
With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by
|
||||
|
||||
# mount -t debugfs debugfs /sys/kernel/debug
|
||||
# cat /sys/kernel/debug/x86/pat_memtype_list
|
||||
PAT memtype list:
|
||||
uncached-minus @ 0x7fadf000-0x7fae0000
|
||||
uncached-minus @ 0x7fb19000-0x7fb1a000
|
||||
uncached-minus @ 0x7fb1a000-0x7fb1b000
|
||||
uncached-minus @ 0x7fb1b000-0x7fb1c000
|
||||
uncached-minus @ 0x7fb1c000-0x7fb1d000
|
||||
uncached-minus @ 0x7fb1d000-0x7fb1e000
|
||||
uncached-minus @ 0x7fb1e000-0x7fb25000
|
||||
uncached-minus @ 0x7fb25000-0x7fb26000
|
||||
uncached-minus @ 0x7fb26000-0x7fb27000
|
||||
uncached-minus @ 0x7fb27000-0x7fb28000
|
||||
uncached-minus @ 0x7fb28000-0x7fb2e000
|
||||
uncached-minus @ 0x7fb2e000-0x7fb2f000
|
||||
uncached-minus @ 0x7fb2f000-0x7fb30000
|
||||
uncached-minus @ 0x7fb31000-0x7fb32000
|
||||
uncached-minus @ 0x80000000-0x90000000
|
||||
|
||||
This list shows physical address ranges and various PAT settings used to
|
||||
access those physical address ranges.
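When inspecting a long pat_memtype_list it can help to compute the size of each range. The sketch below parses one line of the output shown above, assuming the "<type> @ 0x<start>-0x<end>" layout seen in that sample. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct memtype_range {
    char type[32];                  /* e.g. "uncached-minus" */
    unsigned long long start, end;  /* physical address range [start, end) */
};

/*
 * Parse one line of pat_memtype_list output.
 * Returns 1 on success, 0 for lines that do not describe a range
 * (such as the "PAT memtype list:" header).
 */
static int parse_memtype_line(const char *line, struct memtype_range *out)
{
    assert(line && out);
    if (sscanf(line, "%31s @ %llx-%llx",
               out->type, &out->start, &out->end) != 3)
        return 0;
    return out->end > out->start;
}

/* Size of the range in bytes. */
static unsigned long long memtype_range_size(const struct memtype_range *r)
{
    return r->end - r->start;
}
```

Applied to the last entry of the sample output, 0x80000000-0x90000000, this reports a 256 MB uncached-minus range.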
|
||||
|
||||
Another, more verbose way of getting PAT related debug messages is with
|
||||
"debugpat" boot parameter. With this parameter, various debug messages are
|
||||
printed to dmesg log.
|
||||
|
||||
PAT Initialization
|
||||
------------------
|
||||
|
||||
The following table describes how PAT is initialized under various
|
||||
configurations. The PAT MSR must be updated by Linux in order to support WC
|
||||
and WT attributes. Otherwise, the PAT MSR has the value programmed in it
|
||||
by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests.
|
||||
|
||||
MTRR PAT   Call Sequence            PAT State  PAT MSR
=========================================================
E    E     MTRR -> PAT init         Enabled    OS
E    D     MTRR -> PAT init         Disabled    -
D    E     MTRR -> PAT disable      Disabled   BIOS
D    D     MTRR -> PAT disable      Disabled    -
-    np/E  PAT  -> PAT disable      Disabled   BIOS
-    np/D  PAT  -> PAT disable      Disabled    -
E    !P/E  MTRR -> PAT init         Disabled   BIOS
D    !P/E  MTRR -> PAT disable      Disabled   BIOS
!M   !P/E  MTRR stub -> PAT disable Disabled   BIOS
|
||||
|
||||
Legend
------------------------------------------------
E         Feature enabled in CPU
D         Feature disabled/unsupported in CPU
np        "nopat" boot option specified
!P        CONFIG_X86_PAT option unset
!M        CONFIG_MTRR option unset
Enabled   PAT state set to enabled
Disabled  PAT state set to disabled
OS        PAT initializes PAT MSR with OS setting
BIOS      PAT keeps PAT MSR with BIOS setting
|
||||
|
@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
======================
|
||||
Memory Protection Keys
|
||||
======================
|
||||
|
||||
Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
|
||||
which is found on Intel's Skylake "Scalable Processor" Server CPUs.
|
||||
It will be available in future non-server parts.
|
||||
@ -23,9 +29,10 @@ even though there is theoretically space in the PAE PTEs. These
|
||||
permissions are enforced on data access only and have no effect on
|
||||
instruction fetches.
|
||||
|
||||
=========================== Syscalls ===========================
|
||||
Syscalls
|
||||
========
|
||||
|
||||
There are 3 system calls which directly interact with pkeys:
|
||||
There are 3 system calls which directly interact with pkeys::
|
||||
|
||||
int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
|
||||
int pkey_free(int pkey);
|
||||
@ -37,6 +44,7 @@ pkey_alloc(). An application calls the WRPKRU instruction
|
||||
directly in order to change access permissions to memory covered
|
||||
with a key. In this example WRPKRU is wrapped by a C function
|
||||
called pkey_set().
|
||||
::
|
||||
|
||||
int real_prot = PROT_READ|PROT_WRITE;
|
||||
pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
|
||||
@ -45,43 +53,44 @@ called pkey_set().
|
||||
... application runs here
|
||||
|
||||
Now, if the application needs to update the data at 'ptr', it can
|
||||
gain access, do the update, then remove its write access:
|
||||
gain access, do the update, then remove its write access::
|
||||
|
||||
pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
|
||||
*ptr = foo; // assign something
|
||||
pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again
|
||||
|
||||
Now when it frees the memory, it will also free the pkey since it
|
||||
is no longer in use:
|
||||
is no longer in use::
|
||||
|
||||
munmap(ptr, PAGE_SIZE);
|
||||
pkey_free(pkey);
|
||||
|
||||
(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
|
||||
.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
|
||||
An example implementation can be found in
|
||||
tools/testing/selftests/x86/protection_keys.c)
|
||||
tools/testing/selftests/x86/protection_keys.c.
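To make the effect of pkey_set() concrete, the sketch below models only the bit manipulation on the PKRU value: each key owns two bits, an access-disable bit at position 2*pkey and a write-disable bit at position 2*pkey+1. It does not execute RDPKRU or WRPKRU; a real wrapper would read the register, apply this transformation, and write the result back. The helper names are hypothetical, and the two PKEY_DISABLE_* values are defined locally to mirror the Linux uapi constants.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi values, defined here so the sketch is standalone. */
#define PKEY_DISABLE_ACCESS 0x1u
#define PKEY_DISABLE_WRITE  0x2u
#define PKRU_BITS_PER_PKEY  2
#define NR_PKEYS            16

/* Return the PKRU value with the two rights bits of 'pkey' replaced by 'rights'. */
static uint32_t pkru_with_rights(uint32_t pkru, int pkey, uint32_t rights)
{
    assert(pkey >= 0 && pkey < NR_PKEYS);
    assert((rights & ~(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) == 0);

    unsigned shift = (unsigned)pkey * PKRU_BITS_PER_PKEY;

    pkru &= ~(0x3u << shift);   /* clear the old rights for this key */
    pkru |= rights << shift;    /* install the new rights */
    return pkru;
}

/* Extract the rights bits of 'pkey' from a PKRU value. */
static uint32_t pkru_rights(uint32_t pkru, int pkey)
{
    assert(pkey >= 0 && pkey < NR_PKEYS);
    return (pkru >> ((unsigned)pkey * PKRU_BITS_PER_PKEY)) & 0x3u;
}
```

In terms of the example above, pkey_set(pkey, 0) corresponds to clearing both bits for that key, and pkey_set(pkey, PKEY_DISABLE_WRITE) to setting only the write-disable bit again.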
|
||||
|
||||
=========================== Behavior ===========================
|
||||
Behavior
|
||||
========
|
||||
|
||||
The kernel attempts to make protection keys consistent with the
|
||||
behavior of a plain mprotect(). For instance if you do this:
|
||||
behavior of a plain mprotect(). For instance if you do this::
|
||||
|
||||
mprotect(ptr, size, PROT_NONE);
|
||||
something(ptr);
|
||||
|
||||
you can expect the same effects with protection keys when doing this:
|
||||
you can expect the same effects with protection keys when doing this::
|
||||
|
||||
pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
|
||||
pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
|
||||
something(ptr);
|
||||
|
||||
That should be true whether something() is a direct access to 'ptr'
|
||||
like:
|
||||
like::
|
||||
|
||||
*ptr = foo;
|
||||
|
||||
or when the kernel does the access on the application's behalf like
|
||||
with a read():
|
||||
with a read()::
|
||||
|
||||
read(fd, ptr, 1);
|
||||
|
@ -1,9 +1,15 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
Page Table Isolation (PTI)
|
||||
==========================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
Page Table Isolation (pti, previously known as KAISER[1]) is a
|
||||
Page Table Isolation (pti, previously known as KAISER [1]_) is a
|
||||
countermeasure against attacks on the shared user/kernel address
|
||||
space such as the "Meltdown" approach[2].
|
||||
space such as the "Meltdown" approach [2]_.
|
||||
|
||||
To mitigate this class of attacks, we create an independent set of
|
||||
page tables for use only when running userspace applications. When
|
||||
@ -60,6 +66,7 @@ Protection against side-channel attacks is important. But,
|
||||
this protection comes at a cost:
|
||||
|
||||
1. Increased Memory Use
|
||||
|
||||
a. Each process now needs an order-1 PGD instead of order-0.
|
||||
(Consumes an additional 4k per process).
|
||||
b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
|
||||
@ -68,6 +75,7 @@ this protection comes at a cost:
|
||||
is decompressed, but no space in the kernel image itself.
|
||||
|
||||
2. Runtime Cost
|
||||
|
||||
a. CR3 manipulation to switch between the page table copies
|
||||
must be done at interrupt, syscall, and exception entry
|
||||
and exit (it can be skipped when the kernel is interrupted,
|
||||
@ -142,6 +150,7 @@ ideally doing all of these in parallel:
|
||||
interrupted, including nested NMIs. Using "-c" boosts the rate of
|
||||
NMIs, and using two -c with separate counters encourages nested NMIs
|
||||
and less deterministic behavior.
|
||||
::
|
||||
|
||||
while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
|
||||
|
||||
@ -182,5 +191,5 @@ that are worth noting here.
|
||||
tended to be TLB invalidation issues. Usually invalidating
|
||||
the wrong PCID, or otherwise missing an invalidation.
|
||||
|
||||
1. https://gruss.cc/files/kaiser.pdf
|
||||
2. https://meltdownattack.com/meltdown.pdf
|
||||
.. [1] https://gruss.cc/files/kaiser.pdf
|
||||
.. [2] https://meltdownattack.com/meltdown.pdf
|
@ -1,32 +1,43 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
.. include:: <isonum.txt>
|
||||
|
||||
===========================================
|
||||
User Interface for Resource Control feature
|
||||
===========================================
|
||||
|
||||
:Copyright: |copy| 2016 Intel Corporation
|
||||
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
|
||||
- Tony Luck <tony.luck@intel.com>
|
||||
- Vikas Shivappa <vikas.shivappa@intel.com>
|
||||
|
||||
|
||||
Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT).
|
||||
AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
|
||||
|
||||
Copyright (C) 2016 Intel Corporation
|
||||
|
||||
Fenghua Yu <fenghua.yu@intel.com>
|
||||
Tony Luck <tony.luck@intel.com>
|
||||
Vikas Shivappa <vikas.shivappa@intel.com>
|
||||
|
||||
This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
|
||||
flag bits:
|
||||
RDT (Resource Director Technology) Allocation - "rdt_a"
|
||||
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
|
||||
CDP (Code and Data Prioritization ) - "cdp_l3", "cdp_l2"
|
||||
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
|
||||
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
|
||||
MBA (Memory Bandwidth Allocation) - "mba"
|
||||
|
||||
To use the feature mount the file system:
|
||||
=============================================  ================================
RDT (Resource Director Technology) Allocation  "rdt_a"
CAT (Cache Allocation Technology)              "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)             "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                     "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)              "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)              "mba"
=============================================  ================================
|
||||
|
||||
To use the feature mount the file system::
|
||||
|
||||
# mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
|
||||
|
||||
mount options are:
|
||||
|
||||
"cdp": Enable code/data prioritization in L3 cache allocations.
|
||||
"cdpl2": Enable code/data prioritization in L2 cache allocations.
|
||||
"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA
|
||||
"cdp":
|
||||
Enable code/data prioritization in L3 cache allocations.
|
||||
"cdpl2":
|
||||
Enable code/data prioritization in L2 cache allocations.
|
||||
"mba_MBps":
|
||||
Enable the MBA Software Controller(mba_sc) to specify MBA
|
||||
bandwidth in MBps
|
||||
|
||||
L2 and L3 CDP are controlled separately.
|
||||
@ -44,7 +55,7 @@ For more details on the behavior of the interface during monitoring
|
||||
and allocation, see the "Resource alloc and monitor groups" section.
|
||||
|
||||
Info directory
|
||||
--------------
|
||||
==============
|
||||
|
||||
The 'info' directory contains information about the enabled
|
||||
resources. Each resource has its own subdirectory. The subdirectory
|
||||
@ -56,71 +67,87 @@ allocation:
|
||||
Cache resource(L3/L2) subdirectory contains the following files
|
||||
related to allocation:
|
||||
|
||||
"num_closids": The number of CLOSIDs which are valid for this
|
||||
"num_closids":
|
||||
The number of CLOSIDs which are valid for this
|
||||
resource. The kernel uses the smallest number of
|
||||
CLOSIDs of all enabled resources as limit.
|
||||
|
||||
"cbm_mask": The bitmask which is valid for this resource.
|
||||
"cbm_mask":
|
||||
The bitmask which is valid for this resource.
|
||||
This mask is equivalent to 100%.
|
||||
|
||||
"min_cbm_bits": The minimum number of consecutive bits which
|
||||
"min_cbm_bits":
|
||||
The minimum number of consecutive bits which
|
||||
must be set when writing a mask.
|
||||
|
||||
"shareable_bits": Bitmask of shareable resource with other executing
|
||||
"shareable_bits":
|
||||
Bitmask of shareable resource with other executing
|
||||
entities (e.g. I/O). User can use this when
|
||||
setting up exclusive cache partitions. Note that
|
||||
some platforms support devices that have their
|
||||
own settings for cache use which can over-ride
|
||||
these bits.
|
||||
"bit_usage": Annotated capacity bitmasks showing how all
|
||||
"bit_usage":
|
||||
Annotated capacity bitmasks showing how all
|
||||
instances of the resource are used. The legend is:
|
||||
"0" - Corresponding region is unused. When the system's
|
||||
|
||||
"0":
|
||||
Corresponding region is unused. When the system's
|
||||
resources have been allocated and a "0" is found
|
||||
in "bit_usage" it is a sign that resources are
|
||||
wasted.
|
||||
"H" - Corresponding region is used by hardware only
|
||||
|
||||
"H":
|
||||
Corresponding region is used by hardware only
|
||||
but available for software use. If a resource
|
||||
has bits set in "shareable_bits" but not all
|
||||
of these bits appear in the resource groups'
|
||||
schematas then the bits appearing in
|
||||
"shareable_bits" but no resource group will
|
||||
be marked as "H".
|
||||
"X" - Corresponding region is available for sharing and
|
||||
"X":
|
||||
Corresponding region is available for sharing and
|
||||
used by hardware and software. These are the
|
||||
bits that appear in "shareable_bits" as
|
||||
well as a resource group's allocation.
|
||||
"S" - Corresponding region is used by software
|
||||
"S":
|
||||
Corresponding region is used by software
|
||||
and available for sharing.
|
||||
"E" - Corresponding region is used exclusively by
|
||||
"E":
|
||||
Corresponding region is used exclusively by
|
||||
one resource group. No sharing allowed.
|
||||
"P" - Corresponding region is pseudo-locked. No
|
||||
"P":
|
||||
Corresponding region is pseudo-locked. No
|
||||
sharing allowed.
|
||||
|
||||
Memory bandwidth (MB) subdirectory contains the following files
|
||||
with respect to allocation:
|
||||
|
||||
"min_bandwidth": The minimum memory bandwidth percentage which
|
||||
"min_bandwidth":
|
||||
The minimum memory bandwidth percentage which
|
||||
user can request.
|
||||
|
||||
"bandwidth_gran": The granularity in which the memory bandwidth
|
||||
"bandwidth_gran":
|
||||
The granularity in which the memory bandwidth
|
||||
percentage is allocated. The allocated
|
||||
b/w percentage is rounded off to the next
|
||||
control step available on the hardware. The
|
||||
available bandwidth control steps are:
|
||||
min_bandwidth + N * bandwidth_gran.
|
||||
|
||||
"delay_linear": Indicates if the delay scale is linear or
|
||||
"delay_linear":
|
||||
Indicates if the delay scale is linear or
|
||||
non-linear. This field is purely informational
|
||||
only.
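The rounding described for "bandwidth_gran" can be sketched as follows. The function name is hypothetical, and two points are assumptions of this sketch rather than statements from the text: the request is rounded up to the next step of min_bandwidth + N * bandwidth_gran, and values below min_bandwidth or above 100 are rejected with -1.

```c
#include <assert.h>

/*
 * Round a requested bandwidth percentage up to the next available
 * control step, where steps are min_bandwidth + N * bandwidth_gran.
 * Returns the rounded percentage, or -1 if the request is out of range.
 */
static int mba_round_request(unsigned requested, unsigned min_bandwidth,
                             unsigned bandwidth_gran)
{
    assert(bandwidth_gran > 0);
    assert(min_bandwidth <= 100);

    if (requested < min_bandwidth || requested > 100)
        return -1;

    /* Number of granularity steps above the minimum, rounded up. */
    unsigned steps = (requested - min_bandwidth + bandwidth_gran - 1)
                     / bandwidth_gran;
    unsigned result = min_bandwidth + steps * bandwidth_gran;

    return result > 100 ? 100 : (int)result;
}
```

With min_bandwidth of 10 and bandwidth_gran of 10, a request of 25 becomes 30, while a request of 20 is already on a step and stays at 20.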
|
||||
|
||||
If RDT monitoring is available there will be an "L3_MON" directory
|
||||
with the following files:
|
||||
|
||||
"num_rmids": The number of RMIDs available. This is the
|
||||
"num_rmids":
|
||||
The number of RMIDs available. This is the
|
||||
upper bound for how many "CTRL_MON" + "MON"
|
||||
groups can be created.
|
||||
|
||||
"mon_features": Lists the monitoring events if
|
||||
"mon_features":
|
||||
Lists the monitoring events if
|
||||
monitoring is enabled for the resource.
|
||||
|
||||
"max_threshold_occupancy":
|
||||
@ -134,6 +161,7 @@ via the file system (making new directories or writing to any of the
|
||||
control files). If the command was successful, it will read as "ok".
|
||||
If the command failed, it will provide more information that can be
|
||||
conveyed in the error returns from file operations. E.g.
|
||||
::
|
||||
|
||||
# echo L3:0=f7 > schemata
|
||||
bash: echo: write error: Invalid argument
|
||||
@ -141,7 +169,7 @@ conveyed in the error returns from file operations. E.g.
|
||||
mask f7 has non-consecutive 1-bits
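The rejected mask f7 illustrates the rule that the set bits of a capacity bitmask must be consecutive. A minimal sketch of such a check, also covering the "cbm_mask" and "min_cbm_bits" limits from the info directory, might look like this. The function name is hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>

/*
 * Check a capacity bitmask against the limits exposed in info/:
 *  - it must be non-empty and fit inside cbm_mask,
 *  - its set bits must be consecutive,
 *  - it must have at least min_cbm_bits bits set.
 * Returns 1 if the mask is acceptable, 0 otherwise.
 */
static int cbm_is_valid(unsigned long cbm, unsigned long cbm_mask,
                        unsigned min_cbm_bits)
{
    assert(cbm_mask != 0);

    if (cbm == 0 || (cbm & ~cbm_mask))
        return 0;

    /* Shift out trailing zeros so the lowest set bit is bit 0. */
    unsigned long v = cbm;
    while ((v & 1) == 0)
        v >>= 1;

    /* A run of consecutive ones starting at bit 0 satisfies v & (v + 1) == 0. */
    if (v & (v + 1))
        return 0;

    unsigned bits = 0;
    while (v) {
        bits++;
        v >>= 1;
    }
    return bits >= min_cbm_bits;
}
```

For a 20-bit cbm_mask of fffff, the mask f7 fails because bit 3 is clear between two runs of ones, whereas 3c0 passes.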
|
||||
|
||||
Resource alloc and monitor groups
|
||||
---------------------------------
|
||||
=================================
|
||||
|
||||
Resource groups are represented as directories in the resctrl file
|
||||
system. The default group is the root directory which, immediately
|
||||
@ -226,6 +254,7 @@ When monitoring is enabled all MON groups will also contain:
|
||||
|
||||
Resource allocation rules
|
||||
-------------------------
|
||||
|
||||
When a task is running the following rules define which resources are
|
||||
available to it:
|
||||
|
||||
@ -252,7 +281,7 @@ Resource monitoring rules
|
||||
|
||||
|
||||
Notes on cache occupancy monitoring and control
|
||||
-----------------------------------------------
|
||||
===============================================
|
||||
When moving a task from one group to another you should remember that
|
||||
this only affects *new* cache allocations by the task. E.g. you may have
|
||||
a task in a monitor group showing 3 MB of cache occupancy. If you move
|
||||
@ -321,7 +350,7 @@ of the capacity of the cache. You could partition the cache into four
|
||||
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
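The four masks above follow from splitting a 20-bit capacity bitmask into equal, non-overlapping runs of five bits. A small sketch of that computation, using a hypothetical helper name:

```c
#include <assert.h>

/*
 * Return the mask for partition 'index' (0-based, counted from the low bits)
 * when a cbm_len-bit capacity bitmask is split into 'parts' equal pieces.
 */
static unsigned long cbm_partition(unsigned cbm_len, unsigned parts,
                                   unsigned index)
{
    assert(parts > 0 && index < parts);
    assert(cbm_len % parts == 0);
    assert(cbm_len < sizeof(unsigned long) * 8);

    unsigned width = cbm_len / parts;           /* bits per partition */
    return ((1UL << width) - 1) << (index * width);
}
```

Calling cbm_partition(20, 4, i) for i = 0..3 yields 0x1f, 0x3e0, 0x7c00 and 0xf8000.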
|
||||
|
||||
Memory bandwidth Allocation and monitoring
|
||||
------------------------------------------
|
||||
==========================================
|
||||
|
||||
For Memory bandwidth resource, by default the user controls the resource
|
||||
by indicating the percentage of total memory bandwidth.
|
||||
@ -369,7 +398,7 @@ In order to mitigate this and make the interface more user friendly,
|
||||
resctrl added support for specifying the bandwidth in MBps as well. The
|
||||
kernel underneath would use a software feedback mechanism or a "Software
|
||||
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
|
||||
and adjusts the memory bandwidth percentages to ensure
|
||||
and adjusts the memory bandwidth percentages to ensure::
|
||||
|
||||
"actual bandwidth < user specified bandwidth".
|
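One iteration of such a feedback loop could be sketched as below. This is a deliberately simplified illustration and not the kernel's mba_sc algorithm: it moves the percentage by one granularity step per iteration and has no hysteresis, so a real controller would need extra logic to avoid oscillating around the target. The function and parameter names are hypothetical.

```c
#include <assert.h>

/*
 * One step of a simplified software bandwidth controller.
 * cur_pct       current MBA percentage programmed in hardware
 * measured_mbps bandwidth read back from the MBM counters
 * target_mbps   bandwidth the user specified in MBps
 * Returns the percentage to program next.
 */
static unsigned mba_sc_step(unsigned cur_pct, unsigned long measured_mbps,
                            unsigned long target_mbps,
                            unsigned min_bandwidth, unsigned bandwidth_gran)
{
    assert(bandwidth_gran > 0);
    assert(cur_pct >= min_bandwidth && cur_pct <= 100);

    if (measured_mbps >= target_mbps) {
        /* At or over budget: throttle by one step, never below the minimum. */
        if (cur_pct >= min_bandwidth + bandwidth_gran)
            return cur_pct - bandwidth_gran;
        return min_bandwidth;
    }

    /* Under budget: relax by one step, never above 100%. */
    if (cur_pct + bandwidth_gran <= 100)
        return cur_pct + bandwidth_gran;
    return 100;
}
```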
||||
|
||||
@ -380,14 +409,14 @@ sections.
|
||||
|
||||
L3 schemata file details (code and data prioritization disabled)
|
||||
----------------------------------------------------------------
|
||||
With CDP disabled the L3 schemata format is:
|
||||
With CDP disabled the L3 schemata format is::
|
||||
|
||||
L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
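A tool that reads the schemata file needs to pick the bitmask for one cache id out of such a line. The following sketch does that for the "<resource>:<id>=<cbm>;..." layout shown above, assuming decimal cache ids and hexadecimal masks as in the examples in this document. The function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up the capacity bitmask for 'cache_id' in one schemata line,
 * e.g. schemata_get_cbm("L3:0=fffff;1=3c0", "L3", 1, &cbm).
 * Returns 1 and stores the mask on success, 0 if the resource name
 * does not match, the id is absent, or the line is malformed.
 */
static int schemata_get_cbm(const char *line, const char *resource,
                            unsigned long cache_id, unsigned long *cbm)
{
    assert(line && resource && cbm);

    size_t rlen = strlen(resource);
    if (strncmp(line, resource, rlen) != 0 || line[rlen] != ':')
        return 0;

    const char *p = line + rlen + 1;
    while (*p) {
        char *end;

        unsigned long id = strtoul(p, &end, 10);    /* cache id, decimal */
        if (end == p || *end != '=')
            return 0;
        p = end + 1;

        unsigned long mask = strtoul(p, &end, 16);  /* bitmask, hex */
        if (end == p)
            return 0;

        if (id == cache_id) {
            *cbm = mask;
            return 1;
        }
        if (*end != ';')
            return 0;       /* end of line reached without a match */
        p = end + 1;
    }
    return 0;
}
```

The resource name is compared up to the colon, so asking for "L3" does not accidentally match an "L3DATA" or "L3CODE" line.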
|
||||
|
||||
L3 schemata file details (CDP enabled via mount option to resctrl)
|
||||
------------------------------------------------------------------
|
||||
When CDP is enabled L3 control is split into two separate resources
|
||||
so you can specify independent masks for code and data like this:
|
||||
so you can specify independent masks for code and data like this::
|
||||
|
||||
L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
|
||||
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
|
||||
@ -395,7 +424,7 @@ so you can specify independent masks for code and data like this:
|
||||
L2 schemata file details
|
||||
------------------------
|
||||
L2 cache does not support code and data prioritization, so the
|
||||
schemata format is always:
|
||||
schemata format is always::
|
||||
|
||||
L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
|
||||
|
||||
@ -403,6 +432,7 @@ Memory bandwidth Allocation (default mode)
|
||||
------------------------------------------
|
||||
|
||||
Memory b/w domain is L3 cache.
|
||||
::
|
||||
|
||||
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
|
||||
|
||||
@ -410,6 +440,7 @@ Memory bandwidth Allocation specified in MBps
|
||||
---------------------------------------------
|
||||
|
||||
Memory bandwidth domain is L3 cache.
|
||||
::
|
||||
|
||||
MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
|
||||
|
||||
@ -418,6 +449,7 @@ Reading/writing the schemata file
|
||||
Reading the schemata file will show the state of all resources
|
||||
on all domains. When writing you only need to specify those values
|
||||
which you wish to change. E.g.
|
||||
::
|
||||
|
||||
# cat schemata
|
||||
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
|
||||
@ -428,7 +460,7 @@ L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
|
||||
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
|
||||
|
||||
Cache Pseudo-Locking
|
||||
--------------------
|
||||
====================
|
||||
CAT enables a user to specify the amount of cache space that an
|
||||
application can fill. Cache pseudo-locking builds on the fact that a
|
||||
CPU can still read and write data pre-allocated outside its current
|
||||
@ -442,6 +474,7 @@ a region of memory with reduced average read latency.
|
||||
The creation of a cache pseudo-locked region is triggered by a request
|
||||
from the user to do so that is accompanied by a schemata of the region
|
||||
to be pseudo-locked. The cache pseudo-locked region is created as follows:
|
||||
|
||||
- Create a CAT allocation CLOSNEW with a CBM matching the schemata
|
||||
from the user of the cache region that will contain the pseudo-locked
|
||||
memory. This region must not overlap with any current CAT allocation/CLOS
|
||||
@ -480,6 +513,7 @@ initial mmap() handling, there is no enforcement afterwards and the
|
||||
application itself needs to ensure it remains affine to the correct cores.
|
||||
|
||||
Pseudo-locking is accomplished in two stages:
|
||||
|
||||
1) During the first stage the system administrator allocates a portion
|
||||
of cache that should be dedicated to pseudo-locking. At this time an
|
||||
equivalent portion of memory is allocated, loaded into allocated
|
||||
@ -506,7 +540,7 @@ by user space in order to obtain access to the pseudo-locked memory region.
|
||||
An example of cache pseudo-locked region creation and usage can be found below.
|
||||
|
||||
Cache Pseudo-Locking Debugging Interface
|
||||
---------------------------------------
|
||||
----------------------------------------
|
||||
The pseudo-locking debugging interface is enabled by default (if
|
||||
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
|
||||
|
||||
@ -514,6 +548,7 @@ There is no explicit way for the kernel to test if a provided memory
|
||||
location is present in the cache. The pseudo-locking debugging interface uses
|
||||
the tracing infrastructure to provide two ways to measure cache residency of
|
||||
the pseudo-locked region:
|
||||
|
||||
1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
|
||||
from these measurements are best visualized using a hist trigger (see
|
||||
example below). In this test the pseudo-locked region is traversed at
|
||||
@ -529,24 +564,30 @@ it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
|
||||
write-only file, pseudo_lock_measure, is present in this directory. The
|
||||
measurement of the pseudo-locked region depends on the number written to this
|
||||
debugfs file:
|
||||
1 - writing "1" to the pseudo_lock_measure file will trigger the latency
|
||||
|
||||
1:
|
||||
writing "1" to the pseudo_lock_measure file will trigger the latency
|
||||
measurement captured in the pseudo_lock_mem_latency tracepoint. See
|
||||
example below.
|
||||
2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache
|
||||
2:
|
||||
writing "2" to the pseudo_lock_measure file will trigger the L2 cache
|
||||
residency (cache hits and misses) measurement captured in the
|
||||
pseudo_lock_l2 tracepoint. See example below.
|
||||
3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache
|
||||
3:
|
||||
writing "3" to the pseudo_lock_measure file will trigger the L3 cache
|
||||
residency (cache hits and misses) measurement captured in the
|
||||
pseudo_lock_l3 tracepoint.
|
||||
|
||||
All measurements are recorded with the tracing infrastructure. This requires
|
||||
the relevant tracepoints to be enabled before the measurement is triggered.
|
||||
|
||||
Example of latency debugging interface:
|
||||
Example of latency debugging interface
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
In this example a pseudo-locked region named "newlock" was created. Here is
|
||||
how we can measure the latency in cycles of reading from this region and
|
||||
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
|
||||
is set:
|
||||
is set::
|
||||
|
||||
# :> /sys/kernel/debug/tracing/trace
|
||||
# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
|
||||
@ -574,10 +615,12 @@ Totals:
|
||||
Entries: 9
|
||||
Dropped: 0
|
||||
|
||||
Example of cache hits/misses debugging:
|
||||
Example of cache hits/misses debugging
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
In this example a pseudo-locked region named "newlock" was created on the L2
|
||||
cache of a platform. Here is how we can obtain details of the cache hits
|
||||
and misses using the platform's precision counters.
|
||||
::
|
||||
|
||||
# :> /sys/kernel/debug/tracing/trace
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
|
||||
@ -597,13 +640,15 @@ and misses using the platform's precision counters.
|
||||
pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
|
||||
|
||||
|
||||
Examples for RDT allocation usage:
|
||||
Examples for RDT allocation usage
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
1) Example 1
|
||||
|
||||
Example 1
|
||||
---------
|
||||
On a two socket machine (one L3 cache per socket) with just four bits
|
||||
for cache bit masks, minimum b/w of 10% with a memory bandwidth
|
||||
granularity of 10%
|
||||
granularity of 10%.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
@ -628,6 +673,7 @@ the b/w accordingly.
|
||||
|
||||
If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB
|
||||
rather than the percentage values.
|
||||
::
|
||||
|
||||
# echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
|
||||
# echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
|
||||
@ -635,26 +681,28 @@ rather than the percentage values.
|
||||
In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
|
||||
of 1024MB whereas on socket 1 they would use 500MB.
|
||||
|
||||
Example 2
|
||||
---------
|
||||
2) Example 2
|
||||
|
||||
Again two sockets, but this time with a more realistic 20-bit mask.
|
||||
|
||||
Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
|
||||
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
|
||||
neighbors, each of the two real-time tasks exclusively occupies one quarter
|
||||
of L3 cache on socket 0.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
|
||||
First we reset the schemata for the default group so that the "upper"
|
||||
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
|
||||
ordinary tasks:
|
||||
ordinary tasks::
|
||||
|
||||
# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
|
||||
|
||||
Next we make a resource group for our first real time task and give
|
||||
it access to the "top" 25% of the cache on socket 0.
|
||||
::
|
||||
|
||||
# mkdir p0
|
||||
# echo "L3:0=f8000;1=fffff" > p0/schemata
|
||||
@ -663,11 +711,12 @@ Finally we move our first real time task into this resource group. We
|
||||
also use taskset(1) to ensure the task always runs on a dedicated CPU
|
||||
on socket 0. Most uses of resource groups will also constrain which
|
||||
processors tasks run on.
|
||||
::
|
||||
|
||||
# echo 1234 > p0/tasks
|
||||
# taskset -cp 1 1234
|
||||
|
||||
Ditto for the second real time task (with the remaining 25% of cache):
|
||||
Ditto for the second real time task (with the remaining 25% of cache)::
|
||||
|
||||
# mkdir p1
|
||||
# echo "L3:0=7c00;1=fffff" > p1/schemata
|
||||
@ -678,37 +727,39 @@ For the same 2 socket system with memory b/w resource and CAT L3 the
|
||||
schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is
|
||||
10):
|
||||
|
||||
For our first real time task this would request 20% memory b/w on socket
|
||||
0.
|
||||
For our first real time task this would request 20% memory b/w on socket 0.
|
||||
::
|
||||
|
||||
# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
|
||||
|
||||
For our second real time task this would request another 20% memory b/w
|
||||
on socket 0.
|
||||
::
|
||||
|
||||
# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
|
||||
|
||||
Example 3
|
||||
---------
|
||||
3) Example 3
|
||||
|
||||
A single socket system which has real-time tasks running on core 4-7 and
|
||||
non real-time workload assigned to core 0-3. The real-time tasks share text
|
||||
and data, so a per task association is not required and due to interaction
|
||||
with the kernel it's desired that the kernel on these cores shares L3 with
|
||||
the tasks.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
|
||||
First we reset the schemata for the default group so that the "upper"
|
||||
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
|
||||
cannot be used by ordinary tasks:
|
||||
cannot be used by ordinary tasks::
|
||||
|
||||
# echo -e "L3:0=3ff\nMB:0=50" > schemata
|
||||
|
||||
Next we make a resource group for our real time cores and give it access
|
||||
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
|
||||
socket 0.
|
||||
::
|
||||
|
||||
# mkdir p0
|
||||
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata
|
||||
@ -717,11 +768,11 @@ Finally we move core 4-7 over to the new group and make sure that the
|
||||
kernel and the tasks running there get 50% of the cache. They should
|
||||
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
|
||||
siblings and only the real time threads are scheduled on the cores 4-7.
|
||||
::
|
||||
|
||||
# echo F0 > p0/cpus
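The value F0 written to the cpus file is a hexadecimal CPU bitmask with bits 4 through 7 set, one bit per core. A minimal sketch of how such a mask can be derived from a core range, using a hypothetical helper name and assuming the range fits in one unsigned long:

```c
#include <assert.h>

/* Build a bitmask with bits 'first' through 'last' (inclusive) set. */
static unsigned long cpu_range_mask(unsigned first, unsigned last)
{
    assert(first <= last);
    assert(last < sizeof(unsigned long) * 8);

    unsigned n = last - first + 1;      /* number of CPUs in the range */
    unsigned long ones = (n == sizeof(unsigned long) * 8)
                         ? ~0UL
                         : ((1UL << n) - 1);
    return ones << first;
}
```

Here cpu_range_mask(4, 7) gives 0xf0 for the real-time cores, and cpu_range_mask(0, 3) gives 0x0f for the cores left to the non real-time workload.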
|
||||
|
||||
Example 4
|
||||
---------
|
||||
4) Example 4
|
||||
|
||||
The resource groups in previous examples were all in the default "shareable"
|
||||
mode allowing sharing of their cache allocations. If one resource group
|
||||
@ -732,18 +783,20 @@ In this example a new exclusive resource group will be created on a L2 CAT
|
||||
system with two L2 cache instances that can be configured with an 8-bit
|
||||
capacity bitmask. The new exclusive resource group will be configured to use
|
||||
25% of each cache instance.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl/
|
||||
# cd /sys/fs/resctrl
|
||||
|
||||
First, we observe that the default group is configured to allocate to all L2
|
||||
cache:
|
||||
cache::
|
||||
|
||||
# cat schemata
|
||||
L2:0=ff;1=ff
|
||||
|
||||
We could attempt to create the new resource group at this point, but it will
|
||||
fail because of the overlap with the schemata of the default group:
|
||||
fail because of the overlap with the schemata of the default group::
|
||||
|
||||
# mkdir p0
|
||||
# echo 'L2:0=0x3;1=0x3' > p0/schemata
|
||||
# cat p0/mode
|
||||
@ -756,6 +809,8 @@ schemata overlaps
|
||||
To ensure that there is no overlap with another resource group the default
|
||||
resource group's schemata has to change, making it possible for the new
|
||||
resource group to become exclusive.
|
||||
::
|
||||
|
||||
# echo 'L2:0=0xfc;1=0xfc' > schemata
|
||||
# echo exclusive > p0/mode
|
||||
# grep . p0/*
|
||||
@ -765,7 +820,8 @@ p0/schemata:L2:0=03;1=03
|
||||
p0/size:L2:0=262144;1=262144
|
||||
|
||||
A new resource group will on creation not overlap with an exclusive resource
|
||||
group:
|
||||
group::
|
||||
|
||||
# mkdir p1
|
||||
# grep . p1/*
|
||||
p1/cpus:0
|
||||
@ -773,28 +829,32 @@ p1/mode:shareable
|
||||
p1/schemata:L2:0=fc;1=fc
|
||||
p1/size:L2:0=786432;1=786432
|
||||
|
||||
The bit_usage will reflect how the cache is used:
|
||||
The bit_usage will reflect how the cache is used::
|
||||
|
||||
# cat info/L2/bit_usage
|
||||
0=SSSSSSEE;1=SSSSSSEE
|
||||
|
||||
A resource group cannot be forced to overlap with an exclusive resource group:
|
||||
A resource group cannot be forced to overlap with an exclusive resource group::
|
||||
|
||||
# echo 'L2:0=0x1;1=0x1' > p1/schemata
|
||||
-sh: echo: write error: Invalid argument
|
||||
# cat info/last_cmd_status
|
||||
overlaps with exclusive group
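The overlap rule in this example can be modelled in a few lines of Python. This is only an illustrative sketch, not the kernel's implementation: the function names are hypothetical, ``bit_usage`` renders only the ``S``, ``E`` and ``0`` states, and the 131072 bytes per mask bit is inferred from ``p0/size`` reporting 262144 bytes for the two-bit mask ``03``.

```python
def parse_schemata(line):
    """Parse 'L2:0=ff;1=ff' into ('L2', {0: 0xff, 1: 0xff})."""
    resource, domains = line.strip().split(":", 1)
    masks = {}
    for entry in domains.split(";"):
        cache_id, mask = entry.split("=")
        masks[int(cache_id)] = int(mask, 16)  # accepts 'ff' and '0xff'
    return resource, masks


def overlaps(a, b):
    """True if two per-domain mask dicts share a bit in any cache instance."""
    return any(a[d] & b.get(d, 0) for d in a)


def bit_usage(shareable, exclusive, cbm_len=8):
    """Render one cache instance, most significant bit first."""
    chars = []
    for bit in reversed(range(cbm_len)):
        if (exclusive >> bit) & 1:
            chars.append("E")
        elif (shareable >> bit) & 1:
            chars.append("S")
        else:
            chars.append("0")
    return "".join(chars)


def size_bytes(mask, bytes_per_bit):
    """Cache capacity covered by a mask, given the capacity of one bit."""
    return bin(mask).count("1") * bytes_per_bit
```

With the masks from the example, ``ff`` overlaps ``0x3`` while ``0xfc`` does not, which is why the default group's schemata has to shrink before ``p0`` can become exclusive.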
|
||||
|
||||
Example of Cache Pseudo-Locking
|
||||
-------------------------------
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
|
||||
region is exposed at /dev/pseudo_lock/newlock that can be provided to
|
||||
application for argument to mmap().
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl/
|
||||
# cd /sys/fs/resctrl
|
||||
|
||||
Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
|
||||
removed from the default resource group's schemata:
|
||||
removed from the default resource group's schemata::
|
||||
|
||||
# cat info/L2/bit_usage
|
||||
0=SSSSSSSS;1=SSSSSSSS
|
||||
# echo 'L2:1=0xfc' > schemata
|
||||
@ -803,7 +863,7 @@ removed from the default resource group's schemata:
|
||||
|
||||
Create a new resource group that will be associated with the pseudo-locked
|
||||
region, indicate that it will be used for a pseudo-locked region, and
|
||||
configure the requested pseudo-locked region capacity bitmask:
|
||||
configure the requested pseudo-locked region capacity bitmask::
|
||||
|
||||
# mkdir newlock
|
||||
# echo pseudo-locksetup > newlock/mode
|
||||
@ -811,7 +871,7 @@ configure the requested pseudo-locked region capacity bitmask:
|
||||
|
||||
On success the resource group's mode will change to pseudo-locked, the
|
||||
bit_usage will reflect the pseudo-locked region, and the character device
|
||||
exposing the pseudo-locked region will exist:
|
||||
exposing the pseudo-locked region will exist::
|
||||
|
||||
# cat newlock/mode
|
||||
pseudo-locked
|
||||
@ -820,6 +880,8 @@ pseudo-locked
|
||||
# ls -l /dev/pseudo_lock/newlock
|
||||
crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* Example code to access one page of pseudo-locked cache region
|
||||
* from user space.
|
||||
@ -921,7 +983,7 @@ Read lock:
|
||||
B) If success read the directory structure.
|
||||
C) funlock
|
||||
|
||||
Example with bash:
|
||||
Example with bash::
|
||||
|
||||
# Atomically read directory structure
|
||||
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
|
||||
@ -936,7 +998,7 @@ echo mask > /sys/fs/resctrl/newres/schemata
|
||||
|
||||
$ flock /sys/fs/resctrl/ ./create-dir.sh
|
||||
|
||||
Example with C:
|
||||
Example with C::
|
||||
|
||||
/*
|
||||
* Example code to take advisory locks
|
||||
@ -999,8 +1061,8 @@ void main(void)
|
||||
resctrl_release_lock(fd);
|
||||
}
|
||||
|
||||
Examples for RDT Monitoring along with allocation usage:
|
||||
|
||||
Examples for RDT Monitoring along with allocation usage
|
||||
=======================================================
|
||||
Reading monitored data
|
||||
----------------------
|
||||
Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
|
||||
@ -1009,9 +1071,9 @@ group or CTRL_MON group.
|
||||
|
||||
|
||||
Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
|
||||
---------
|
||||
------------------------------------------------------------------------
|
||||
On a two socket machine (one L3 cache per socket) with just four bits
|
||||
for cache bit masks
|
||||
for cache bit masks::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
@ -1029,6 +1091,7 @@ Tasks that are under the control of group "p0" may only allocate from the
|
||||
Tasks in group "p1" use the "lower" 50% of cache on both sockets.
|
||||
|
||||
Create monitor groups and assign a subset of tasks to each monitor group.
|
||||
::
|
||||
|
||||
# cd /sys/fs/resctrl/p1/mon_groups
|
||||
# mkdir m11 m12
|
||||
@ -1036,6 +1099,7 @@ Create monitor groups and assign a subset of tasks to each monitor group.
|
||||
# echo 5679 > m12/tasks
|
||||
|
||||
fetch data (data shown in bytes)
|
||||
::
|
||||
|
||||
# cat m11/mon_data/mon_L3_00/llc_occupancy
|
||||
16234000
|
||||
@ -1045,13 +1109,14 @@ fetch data (data shown in bytes)
|
||||
16789000
|
||||
|
||||
The parent ctrl_mon group shows the aggregated data.
|
||||
::
|
||||
|
||||
# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
|
||||
31234000
|
||||
|
||||
Example 2 (Monitor a task from its creation)
|
||||
---------
|
||||
On a two socket machine (one L3 cache per socket)
|
||||
--------------------------------------------
|
||||
On a two socket machine (one L3 cache per socket)::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
@ -1059,17 +1124,18 @@ On a two socket machine (one L3 cache per socket)
|
||||
|
||||
An RMID is allocated to the group once it is created and hence the <cmd>
|
||||
below is monitored from its creation.
|
||||
::
|
||||
|
||||
# echo $$ > /sys/fs/resctrl/p1/tasks
|
||||
# <cmd>
|
||||
|
||||
Fetch the data
|
||||
Fetch the data::
|
||||
|
||||
# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
|
||||
31789000
|
||||
|
||||
Example 3 (Monitor without CAT support or before creating CAT groups)
|
||||
---------
|
||||
---------------------------------------------------------------------
|
||||
|
||||
Assume a system like HSW has only CQM and no CAT support. In this case
|
||||
the resctrl will still mount but cannot create CTRL_MON directories.
|
||||
@ -1078,6 +1144,7 @@ able to monitor all tasks including kernel threads.
|
||||
|
||||
This can also be used to profile a job's cache size footprint before being
|
||||
able to allocate them to different allocation groups.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
@ -1090,6 +1157,7 @@ able to allocate them to different allocation groups.
|
||||
Monitor the groups separately and also get per domain data. From the
|
||||
below it is apparent that the tasks are mostly doing work on
|
||||
domain(socket) 0.
|
||||
::
|
||||
|
||||
# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
|
||||
31234000
|
||||
@ -1107,15 +1175,17 @@ Example 4 (Monitor real time tasks)
|
||||
A single socket system which has real time tasks running on cores 4-7
|
||||
and non real time tasks on other cpus. We want to monitor the cache
|
||||
occupancy of the real time threads on these cores.
|
||||
::
|
||||
|
||||
# mount -t resctrl resctrl /sys/fs/resctrl
|
||||
# cd /sys/fs/resctrl
|
||||
# mkdir p1
|
||||
|
||||
Move the cpus 4-7 over to p1
|
||||
Move the cpus 4-7 over to p1::
|
||||
|
||||
# echo f0 > p1/cpus
|
||||
|
||||
View the llc occupancy snapshot
|
||||
View the llc occupancy snapshot::
|
||||
|
||||
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
|
||||
11234000
|
@ -1,5 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======
|
||||
The TLB
|
||||
=======
|
||||
|
||||
When the kernel unmaps or modifies the attributes of a range of
|
||||
memory, it has two choices:
|
||||
|
||||
1. Flush the entire TLB with a two-instruction sequence. This is
|
||||
a quick operation, but it causes collateral damage: TLB entries
|
||||
from areas other than the one we are trying to flush will be
|
||||
@ -10,6 +17,7 @@ memory, it has two choices:
|
||||
damage to other TLB entries.
|
||||
|
||||
Which method to use depends on a few things:
|
||||
|
||||
1. The size of the flush being performed. A flush of the entire
|
||||
address space is obviously better performed by flushing the
|
||||
entire TLB than doing 2^48/PAGE_SIZE individual flushes.
|
||||
@ -33,7 +41,7 @@ well. There is essentially no "right" point to choose.
|
||||
You may be doing too many individual invalidations if you see the
|
||||
invlpg instruction (or instructions _near_ it) show up high in
|
||||
profiles. If you believe that individual invalidations are being
|
||||
called too often, you can lower the tunable:
|
||||
called too often, you can lower the tunable::
|
||||
|
||||
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
|
||||
|
||||
@ -43,7 +51,7 @@ Setting it to 1 is a very conservative setting and it should
|
||||
never need to be 0 under normal circumstances.
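The trade-off controlled by the tunable can be pictured with a short Python sketch. It assumes that the ceiling is the largest number of pages the kernel is willing to invalidate one by one, and that 4 KiB base pages are in use; the exact comparison in the kernel may differ, and ``use_full_flush`` is a hypothetical name.

```python
PAGE_SIZE = 4096  # assumed 4 KiB base pages


def use_full_flush(start, end, ceiling, page_size=PAGE_SIZE):
    """Return True when a full TLB flush is preferred over per-page invlpg."""
    pages = (end - start + page_size - 1) // page_size
    return pages > ceiling


if __name__ == "__main__":
    # Flushing a whole 48-bit address space page by page is impractical:
    print((1 << 48) // PAGE_SIZE, "individual flushes")
    print(use_full_flush(0, 2 * PAGE_SIZE, ceiling=1))
```

Under these assumptions a ceiling of 1 still allows single-page invalidations, while a ceiling of 0 turns every non-empty range into a full flush.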
|
||||
|
||||
Despite the fact that a single individual flush on x86 is
|
||||
guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
|
||||
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
|
||||
flushes. THP is treated exactly the same as normal memory.
|
||||
|
||||
You might see invlpg inside of flush_tlb_mm_range() show up in
|
||||
@ -54,7 +62,7 @@ Essentially, you are balancing the cycles you spend doing invlpg
|
||||
with the cycles that you spend refilling the TLB later.
|
||||
|
||||
You can measure how expensive TLB refills are by using
|
||||
performance counters and 'perf stat', like this:
|
||||
performance counters and 'perf stat', like this::
|
||||
|
||||
perf stat -e
|
||||
cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
|
||||
@ -70,6 +78,6 @@ be there in some form. You can use pmu-tools 'ocperf list'
|
||||
(https://github.com/andikleen/pmu-tools) to find the right
|
||||
counters for a given CPU.
|
||||
|
||||
1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
|
||||
.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
|
||||
says: "One execution of INVLPG is sufficient even for a page
|
||||
with size greater than 4 KBytes."
|
@ -1,3 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============
|
||||
x86 Topology
|
||||
============
|
||||
|
||||
@ -33,8 +36,8 @@ The topology of a system is described in the units of:
|
||||
- cores
|
||||
- threads
|
||||
|
||||
* Package:
|
||||
|
||||
Package
|
||||
=======
|
||||
Packages contain a number of cores plus shared resources, e.g. DRAM
|
||||
controller, shared caches etc.
|
||||
|
||||
@ -66,6 +69,7 @@ The topology of a system is described in the units of:
|
||||
- cpu_llc_id:
|
||||
|
||||
A per-CPU variable containing:
|
||||
|
||||
- On Intel, the first APIC ID of the list of CPUs sharing the Last Level
|
||||
Cache
|
||||
|
||||
@ -73,8 +77,8 @@ The topology of a system is described in the units of:
|
||||
Cache. In general, it is a number identifying an LLC uniquely on the
|
||||
system.
|
||||
|
||||
* Cores:
|
||||
|
||||
Cores
|
||||
=====
|
||||
A core consists of 1 or more threads. It does not matter whether the threads
|
||||
are SMT- or CMT-type threads.
|
||||
|
||||
@ -86,13 +90,13 @@ The topology of a system is described in the units of:
|
||||
- smp_num_siblings:
|
||||
|
||||
The number of threads in a core. The number of threads in a package can be
|
||||
calculated by:
|
||||
calculated by::
|
||||
|
||||
threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
|
||||
|
||||
|
||||
* Threads:
|
||||
|
||||
Threads
|
||||
=======
|
||||
A thread is a single scheduling unit. It is equivalent to a logical Linux
|
||||
CPU.
|
||||
|
||||
@ -129,41 +133,41 @@ The topology of a system is described in the units of:
|
||||
|
||||
|
||||
System topology examples
|
||||
========================
|
||||
|
||||
Note:
|
||||
|
||||
.. note::
|
||||
The alternative Linux CPU enumeration depends on how the BIOS enumerates the
|
||||
threads. Many BIOSes enumerate all threads 0 first and then all threads 1.
|
||||
That has the "advantage" that the logical Linux CPU numbers of threads 0 stay
|
||||
the same whether threads are enabled or not. That's merely an implementation
|
||||
detail and has no practical impact.
|
||||
|
||||
1) Single Package, Single Core
|
||||
1) Single Package, Single Core::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
|
||||
2) Single Package, Dual Core
|
||||
|
||||
a) One thread per core
|
||||
a) One thread per core::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 1
|
||||
|
||||
b) Two threads per core
|
||||
b) Two threads per core::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [thread 1] -> Linux CPU 1
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 2
|
||||
-> [thread 1] -> Linux CPU 3
|
||||
|
||||
Alternative enumeration:
|
||||
Alternative enumeration::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [thread 1] -> Linux CPU 2
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 1
|
||||
-> [thread 1] -> Linux CPU 3
|
||||
|
||||
AMD nomenclature for CMT systems:
|
||||
AMD nomenclature for CMT systems::
|
||||
|
||||
[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
|
||||
-> [Compute Unit Core 1] -> Linux CPU 1
|
||||
@ -172,7 +176,7 @@ detail and has no practical impact.
|
||||
|
||||
4) Dual Package, Dual Core
|
||||
|
||||
a) One thread per core
|
||||
a) One thread per core::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 1
|
||||
@ -180,7 +184,7 @@ detail and has no practical impact.
|
||||
[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 3
|
||||
|
||||
b) Two threads per core
|
||||
b) Two threads per core::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [thread 1] -> Linux CPU 1
|
||||
@ -192,7 +196,7 @@ detail and has no practical impact.
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 6
|
||||
-> [thread 1] -> Linux CPU 7
|
||||
|
||||
Alternative enumeration:
|
||||
Alternative enumeration::
|
||||
|
||||
[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
|
||||
-> [thread 1] -> Linux CPU 4
|
||||
@ -204,7 +208,7 @@ detail and has no practical impact.
|
||||
-> [core 1] -> [thread 0] -> Linux CPU 3
|
||||
-> [thread 1] -> Linux CPU 7
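The two numbering schemes shown in these examples can be reproduced with a small Python sketch. It is only a model of the diagrams, not of the kernel's enumeration code, and the function names are hypothetical. ``threads_per_package`` mirrors the ``x86_max_cores * smp_num_siblings`` formula given earlier.

```python
def threads_per_package(x86_max_cores, smp_num_siblings):
    """Number of threads in one package."""
    return x86_max_cores * smp_num_siblings


def enumerate_cpus(packages, cores, threads, alternative=False):
    """Map (package, core, thread) to a logical Linux CPU number."""
    mapping = {}
    for p in range(packages):
        for c in range(cores):
            for t in range(threads):
                if alternative:
                    # All threads 0 first, then all threads 1, and so on.
                    cpu = t * packages * cores + p * cores + c
                else:
                    # Threads of one core are numbered consecutively.
                    cpu = (p * cores + c) * threads + t
                mapping[(p, c, t)] = cpu
    return mapping
```

For the dual package, dual core, two threads per core case, the default scheme maps package 1, core 1, thread 0 to CPU 6, and the alternative scheme maps it to CPU 3, matching the diagrams.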
|
||||
|
||||
AMD nomenclature for CMT systems:
|
||||
AMD nomenclature for CMT systems::
|
||||
|
||||
[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
|
||||
-> [Compute Unit Core 1] -> Linux CPU 1
|
@ -1,7 +1,11 @@
|
||||
USB Legacy support
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Vojtech Pavlik <vojtech@suse.cz>, January 2004
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==================
|
||||
USB Legacy support
|
||||
==================
|
||||
|
||||
:Author: Vojtech Pavlik <vojtech@suse.cz>, January 2004
|
||||
|
||||
|
||||
Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
|
||||
@ -27,18 +31,20 @@ It has several drawbacks, though:
|
||||
|
||||
Solutions:
|
||||
|
||||
Problem 1) can be solved by loading the USB drivers prior to loading the
|
||||
Problem 1)
|
||||
can be solved by loading the USB drivers prior to loading the
|
||||
PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
|
||||
the kernel unconditionally, this means the USB drivers need to be
|
||||
compiled-in, too.
|
||||
|
||||
Problem 2) can currently only be solved by either disabling HIGHMEM64G
|
||||
Problem 2)
|
||||
can currently only be solved by either disabling HIGHMEM64G
|
||||
in the kernel config or USB Legacy support in the BIOS. A BIOS update
|
||||
could help, but so far no such update exists.
|
||||
|
||||
Problem 3) is usually fixed by a BIOS update. Check the board
|
||||
Problem 3)
|
||||
is usually fixed by a BIOS update. Check the board
|
||||
manufacturer's web site. If an update is not available, disable USB
|
||||
Legacy support in the BIOS. If this alone doesn't help, try also adding
|
||||
idle=poll on the kernel command line. The BIOS may be entering the SMM
|
||||
on the HLT instruction as well.
|
||||
|
@ -1,5 +1,11 @@
|
||||
== Overview ==
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============
|
||||
5-level paging
|
||||
==============
|
||||
|
||||
Overview
|
||||
========
|
||||
Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
|
||||
space and 64 TiB of physical address space. We are already bumping into
|
||||
this limit: some vendors offer servers with 64 TiB of memory today.
|
||||
@ -16,16 +22,17 @@ QEMU 2.9 and later support 5-level paging.
|
||||
Virtual memory layout for 5-level paging is described in
|
||||
Documentation/x86/x86_64/mm.txt
|
||||
|
||||
== Enabling 5-level paging ==
|
||||
|
||||
Enabling 5-level paging
|
||||
=======================
|
||||
CONFIG_X86_5LEVEL=y enables the feature.
|
||||
|
||||
A kernel with CONFIG_X86_5LEVEL=y is still able to boot on 4-level hardware.
In this case the additional page table level -- p4d -- will be folded at
|
||||
runtime.
|
||||
|
||||
== User-space and large virtual address space ==
|
||||
|
||||
User-space and large virtual address space
|
||||
==========================================
|
||||
On x86, 5-level paging enables 56-bit userspace virtual address space.
|
||||
Not all user space is ready to handle wide addresses. It's known that
|
||||
at least some JIT compilers use higher bits in pointers to encode their
|
||||
@ -58,4 +65,3 @@ One important case we need to handle here is interaction with MPX.
|
||||
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
|
||||
need to make sure that MPX cannot be enabled if we already have a VMA above
|
||||
the boundary and forbid creating such VMAs once MPX is enabled.
|
||||
|
335
Documentation/x86/x86_64/boot-options.rst
Normal file
@ -0,0 +1,335 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================
|
||||
AMD64 Specific Boot Options
|
||||
===========================
|
||||
|
||||
There are many others (usually documented in driver documentation), but
|
||||
only the AMD64 specific ones are listed here.
|
||||
|
||||
Machine check
|
||||
=============
|
||||
Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
|
||||
|
||||
mce=off
|
||||
Disable machine check
|
||||
mce=no_cmci
|
||||
Disable the CMCI (Corrected Machine Check Interrupt) that
Intel processors support. Usually this disablement is
|
||||
not recommended, but it might be handy if your hardware
|
||||
is misbehaving.
|
||||
Note that you'll get more problems without CMCI than with
|
||||
due to the shared banks, i.e. you might get duplicated
|
||||
error logs.
|
||||
mce=dont_log_ce
|
||||
Don't make logs for corrected errors. All events reported
|
||||
as corrected are silently cleared by OS.
|
||||
This option will be useful if you have no interest in any
of the corrected errors.
|
||||
mce=ignore_ce
|
||||
Disable features for corrected errors, e.g. polling timer
|
||||
and CMCI. Events reported as corrected are not cleared
by the OS and remain in their error banks.
|
||||
Usually this disablement is not recommended, however if
|
||||
there is an agent checking/clearing corrected errors
|
||||
(e.g. BIOS or hardware monitoring applications), conflicting
|
||||
with OS's error handling, and you cannot deactivate the agent,
|
||||
then this option will be a help.
|
||||
mce=no_lmce
|
||||
Do not opt-in to Local MCE delivery. Use legacy method
|
||||
to broadcast MCEs.
|
||||
mce=bootlog
|
||||
Enable logging of machine checks left over from booting.
|
||||
Disabled by default on AMD Fam10h and older because some BIOS
|
||||
leave bogus ones.
|
||||
If your BIOS doesn't do that it's a good idea to enable though
|
||||
to make sure you log even machine check events that result
|
||||
in a reboot. On Intel systems it is enabled by default.
|
||||
mce=nobootlog
|
||||
Disable boot machine check logging.
|
||||
mce=tolerancelevel[,monarchtimeout] (number,number)
|
||||
tolerance levels:
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
Default is 1
|
||||
Can also be set using sysfs, which is preferable.
|
||||
monarchtimeout:
|
||||
Sets the time in us to wait for other CPUs on machine checks. 0
|
||||
to disable.
|
||||
mce=bios_cmci_threshold
|
||||
Don't overwrite the bios-set CMCI threshold. This boot option
|
||||
prevents Linux from overwriting the CMCI threshold set by the
|
||||
bios. Without this option, Linux always sets the CMCI
|
||||
threshold to 1. Enabling this may make memory predictive failure
|
||||
analysis less effective if the bios sets thresholds for memory
|
||||
errors since we will not see details for all errors.
|
||||
mce=recovery
|
||||
Force-enable recoverable machine check code paths
|
||||
|
||||
nomce (for compatibility with i386)
|
||||
same as mce=off
|
||||
|
||||
Everything else is in sysfs now.
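As an illustration of the ``mce=`` syntax listed above, the following Python sketch classifies the text after ``mce=`` as either a named flag or a ``tolerancelevel[,monarchtimeout]`` pair. It is not the kernel's parser; ``parse_mce`` and the result keys are hypothetical names chosen for this example.

```python
TOLERANCE = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}

FLAGS = {
    "off", "no_cmci", "dont_log_ce", "ignore_ce", "no_lmce",
    "bootlog", "nobootlog", "bios_cmci_threshold", "recovery",
}


def parse_mce(value):
    """Classify the text after 'mce=' as a flag or a tolerance setting."""
    if value in FLAGS:
        return {"flag": value}
    level, _, timeout = value.partition(",")
    tolerance = int(level)  # raises ValueError for unknown words
    if tolerance not in TOLERANCE:
        raise ValueError(f"tolerance level out of range: {tolerance}")
    result = {"tolerance": tolerance}
    if timeout:
        # Time in microseconds to wait for other CPUs; 0 disables the wait.
        result["monarch_timeout_us"] = int(timeout)
    return result
```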
|
||||
|
||||
APICs
|
||||
=====
|
||||
|
||||
apic
|
||||
Use IO-APIC. Default
|
||||
|
||||
noapic
|
||||
Don't use the IO-APIC.
|
||||
|
||||
disableapic
|
||||
Don't use the local APIC
|
||||
|
||||
nolapic
|
||||
Don't use the local APIC (alias for i386 compatibility)
|
||||
|
||||
pirq=...
|
||||
See Documentation/x86/i386/IO-APIC.txt
|
||||
|
||||
noapictimer
|
||||
Don't set up the APIC timer
|
||||
|
||||
no_timer_check
|
||||
Don't check the IO-APIC timer. This can work around
|
||||
problems with incorrect timer initialization on some boards.
|
||||
|
||||
apicpmtimer
|
||||
Do APIC timer calibration using the pmtimer. Implies
|
||||
apicmaintimer. Useful when your PIT timer is totally broken.
|
||||
|
||||
Timing
|
||||
======
|
||||
|
||||
notsc
|
||||
Deprecated, use tsc=unstable instead.
|
||||
|
||||
nohpet
|
||||
Don't use the HPET timer.
|
||||
|
||||
Idle loop
|
||||
=========
|
||||
|
||||
idle=poll
|
||||
Don't do power saving in the idle loop using HLT, but poll for rescheduling
|
||||
event. This will make the CPUs eat a lot more power, but may be useful
|
||||
to get slightly better performance in multiprocessor benchmarks. It also
|
||||
makes some profiling using performance counters more accurate.
|
||||
Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
|
||||
CPUs) this option has no performance advantage over the normal idle loop.
|
||||
It may also interact badly with hyperthreading.
|
||||
|
||||
Rebooting
|
||||
=========
|
||||
|
||||
reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
|
||||
bios
|
||||
Use the CPU reboot vector for warm reset
|
||||
warm
|
||||
Don't set the cold reboot flag
|
||||
cold
|
||||
Set the cold reboot flag
|
||||
triple
|
||||
Force a triple fault (init)
|
||||
kbd
|
||||
Use the keyboard controller. Cold reset (default)
|
||||
acpi
|
||||
Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
|
||||
the ACPI reset does not work, the reboot path attempts the reset
|
||||
using the keyboard controller.
|
||||
efi
|
||||
Use efi reset_system runtime service. If EFI is not configured or
|
||||
the EFI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
|
||||
Using warm reset will be much faster especially on big memory
|
||||
systems because the BIOS will not go through the memory check.
|
||||
The disadvantage is that not all hardware will be completely reinitialized
|
||||
on reboot so there may be boot problems on some systems.
|
||||
|
||||
reboot=force
|
||||
Don't stop other CPUs on reboot. This can make reboot more reliable
|
||||
in some cases.
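The bracket notation ``b[ios] | t[riple] | ...`` indicates that only the first letter of each word is significant. A Python sketch of that reading follows. It is an assumption-laden model, not the kernel's parser: ``parse_reboot`` is a hypothetical name, accepting ``force`` as one of the comma-separated tokens is an assumption, and ``None`` stands for "left at the kernel default".

```python
REBOOT_TYPES = {"b": "bios", "t": "triple", "k": "kbd", "a": "acpi", "e": "efi"}
REBOOT_MODES = {"w": "warm", "c": "cold"}


def parse_reboot(value):
    """Interpret the text after 'reboot=' by the first letter of each token."""
    result = {"type": None, "mode": None, "force": False}
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if token == "force":
            result["force"] = True
        elif token[0] in REBOOT_TYPES:
            result["type"] = REBOOT_TYPES[token[0]]
        elif token[0] in REBOOT_MODES:
            result["mode"] = REBOOT_MODES[token[0]]
        else:
            raise ValueError(f"unknown reboot option: {token!r}")
    return result
```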
|
||||
|
||||
Non Executable Mappings
|
||||
=======================
|
||||
|
||||
noexec=on|off
|
||||
on
|
||||
Enable(default)
|
||||
off
|
||||
Disable
|
||||
|
||||
NUMA
|
||||
====
|
||||
|
||||
numa=off
|
||||
Only set up a single NUMA node spanning all memory.
|
||||
|
||||
numa=noacpi
|
||||
Don't parse the SRAT table for NUMA setup
|
||||
|
||||
numa=fake=<size>[MG]
|
||||
If given as a memory unit, fills all system RAM with nodes of
|
||||
size interleaved over physical nodes.
|
||||
|
||||
numa=fake=<N>
|
||||
If given as an integer, fills all system RAM with N fake nodes
|
||||
interleaved over physical nodes.
|
||||
|
||||
numa=fake=<N>U
|
||||
If given as an integer followed by 'U', it will divide each
|
||||
physical node into N emulated nodes.
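The three ``numa=fake=`` forms differ only in their suffix. The Python sketch below tells them apart and, where the text allows, computes the resulting node count. ``parse_numa_fake`` is a hypothetical helper, and treating ``M`` and ``G`` as binary megabytes and gigabytes is an assumption.

```python
import re


def parse_numa_fake(value, physical_nodes):
    """Classify the text after 'numa=fake=' and derive what it requests."""
    match = re.fullmatch(r"(\d+)([MGU]?)", value)
    if not match:
        raise ValueError(f"unrecognised numa=fake value: {value!r}")
    number, suffix = int(match.group(1)), match.group(2)
    if suffix in ("M", "G"):
        unit = 1 << 20 if suffix == "M" else 1 << 30
        # RAM is filled with nodes of this size, interleaved over physical nodes.
        return {"mode": "size", "node_bytes": number * unit}
    if suffix == "U":
        # Each physical node is split into N emulated nodes.
        return {"mode": "per-node", "total_nodes": number * physical_nodes}
    # Plain integer: N fake nodes interleaved over physical nodes.
    return {"mode": "count", "total_nodes": number}
```

On a machine with two physical nodes, ``numa=fake=4U`` therefore yields eight emulated nodes, whereas ``numa=fake=4`` yields four.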
|
||||
|
||||
ACPI
|
||||
====
|
||||
|
||||
acpi=off
|
||||
Don't enable ACPI
|
||||
acpi=ht
|
||||
Use ACPI boot table parsing, but don't enable ACPI interpreter
|
||||
acpi=force
|
||||
Force ACPI on (currently not needed)
|
||||
acpi=strict
|
||||
Disable out of spec ACPI workarounds.
|
||||
acpi_sci={edge,level,high,low}
|
||||
Set up ACPI SCI interrupt.
|
||||
acpi=noirq
|
||||
Don't route interrupts
|
||||
acpi=nocmcff
|
||||
Disable firmware first mode for corrected errors. This
|
||||
disables parsing the HEST CMC error source to check if
|
||||
firmware has set the FF flag. This may result in
|
||||
duplicate corrected error reports.
|
||||
|
||||
PCI
|
||||
===
|
||||
|
||||
pci=off
|
||||
Don't use PCI
|
||||
pci=conf1
|
||||
Use conf1 access.
|
||||
pci=conf2
|
||||
Use conf2 access.
|
||||
pci=rom
|
||||
Assign ROMs.
|
||||
pci=assign-busses
|
||||
Assign busses
|
||||
pci=irqmask=MASK
|
||||
Set PCI interrupt mask to MASK
|
||||
pci=lastbus=NUMBER
|
||||
Scan up to NUMBER busses, no matter what the mptable says.
|
||||
pci=noacpi
|
||||
Don't use ACPI to set up PCI interrupt routing.
|
||||
|
||||
IOMMU (input/output memory management unit)
|
||||
===========================================
|
||||
Multiple x86-64 PCI-DMA mapping implementations exist, for example:
|
||||
|
||||
1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
|
||||
(e.g. because you have < 3 GB memory).
|
||||
Kernel boot message: "PCI-DMA: Disabling IOMMU"
|
||||
|
||||
2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
|
||||
Kernel boot message: "PCI-DMA: using GART IOMMU"
|
||||
|
||||
3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
|
||||
e.g. if there is no hardware IOMMU in the system and it is needed because
you have >3GB memory or told the kernel to use it (iommu=soft).
|
||||
Kernel boot message: "PCI-DMA: Using software bounce buffering
|
||||
for IO (SWIOTLB)"
|
||||
|
||||
4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
|
||||
pSeries and xSeries servers. This hardware IOMMU supports DMA address
|
||||
mapping with memory protection, etc.
|
||||
Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
|
||||
|
||||
::
|
||||
|
||||
iommu=[<size>][,noagp][,off][,force][,noforce]
|
||||
[,memaper[=<order>]][,merge][,fullflush][,nomerge]
|
||||
[,noaperture][,calgary]
|
||||
|
||||
General iommu options:
|
||||
|
||||
off
|
||||
Don't initialize and use any kind of IOMMU.
|
||||
noforce
|
||||
Don't force hardware IOMMU usage when it is not needed. (default).
|
||||
force
|
||||
Force the use of the hardware IOMMU even when it is
|
||||
not actually needed (e.g. because < 3 GB memory).
|
||||
soft
|
||||
Use software bounce buffering (SWIOTLB) (default for
|
||||
Intel machines). This can be used to prevent the usage
|
||||
of an available hardware IOMMU.
|
||||
|
||||
iommu options only relevant to the AMD GART hardware IOMMU:
|
||||
|
||||
<size>
|
||||
Set the size of the remapping area in bytes.
|
||||
allowed
|
||||
Overwrite iommu off workarounds for specific chipsets.
|
||||
fullflush
|
||||
Flush IOMMU on each allocation (default).
|
||||
nofullflush
|
||||
Don't use IOMMU fullflush.
|
||||
memaper[=<order>]
|
||||
Allocate its own aperture over RAM with size 32MB<<order.
|
||||
(default: order=1, i.e. 64MB)
|
||||
merge
|
||||
Do scatter-gather (SG) merging. Implies "force" (experimental).
|
||||
nomerge
|
||||
Don't do scatter-gather (SG) merging.
|
||||
noaperture
|
||||
Ask the IOMMU not to touch the aperture for AGP.
|
||||
noagp
|
||||
Don't initialize the AGP driver and use full aperture.
|
||||
panic
|
||||
Always panic when IOMMU overflows.
|
||||
calgary
|
||||
Use the Calgary IOMMU if it is available
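The ``memaper`` size rule, 32MB shifted left by ``order``, is easy to check with a short Python sketch. ``memaper_bytes`` is a hypothetical helper name; the default of ``order=1`` is taken from the text above.

```python
MB = 1 << 20


def memaper_bytes(order=1):
    """Aperture size in bytes for memaper[=<order>]: 32MB << order."""
    return (32 * MB) << order


if __name__ == "__main__":
    for order in range(4):
        print(order, memaper_bytes(order) // MB, "MB")
```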
|
||||
|
||||
iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
|
||||
implementation:
|
||||
|
||||
swiotlb=<pages>[,force]
|
||||
<pages>
|
||||
Prereserve that many 128K pages for the software IO bounce buffering.
|
||||
force
|
||||
Force all IO through the software TLB.
|
||||
|
||||
Settings for the IBM Calgary hardware IOMMU currently found in IBM
|
||||
pSeries and xSeries machines
|
||||
|
||||
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
|
||||
Set the size of each PCI slot's translation table when using the
|
||||
Calgary IOMMU. This is the size of the translation table itself
|
||||
in main memory. The smallest table, 64k, covers an IO space of
|
||||
32MB; the largest, 8MB table, can cover an IO space of 4GB.
|
||||
Normally the kernel will make the right choice by itself.
|
||||
calgary=[translate_empty_slots]
|
||||
Enable translation even on slots that have no devices attached to
|
||||
them, in case a device will be hotplugged in the future.
|
||||
calgary=[disable=<PCI bus number>]
|
||||
Disable translation on a given PHB. For
|
||||
example, the built-in graphics adapter resides on the first bridge
|
||||
(PCI bus number 0); if translation (isolation) is enabled on this
|
||||
bridge, X servers that access the hardware directly from user
|
||||
space might stop working. Use this option if you have devices that
|
||||
are accessed from userspace directly on some PCI host bridge.
|
||||
panic
|
||||
Always panic when IOMMU overflows
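The two data points given for ``calgary=<size>`` (a 64k table covering 32MB, an 8MB table covering 4GB) imply a fixed ratio of 512 bytes of IO space per byte of translation table. The Python sketch below applies that ratio to every listed size. Assuming the ratio holds for the intermediate sizes is an inference from the text, and ``io_space_bytes`` is a hypothetical name.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

TABLE_SIZES = {
    "64k": 64 * KB, "128k": 128 * KB, "256k": 256 * KB, "512k": 512 * KB,
    "1M": 1 * MB, "2M": 2 * MB, "4M": 4 * MB, "8M": 8 * MB,
}

# Ratio derived from the documented 64k -> 32MB data point.
IO_BYTES_PER_TABLE_BYTE = (32 * MB) // (64 * KB)


def io_space_bytes(option):
    """IO space covered by a translation table of the given calgary= size."""
    return TABLE_SIZES[option] * IO_BYTES_PER_TABLE_BYTE


if __name__ == "__main__":
    for name in TABLE_SIZES:
        print(name, io_space_bytes(name) // MB, "MB of IO space")
```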
|
||||
|
||||
|
||||
Miscellaneous
|
||||
=============
|
||||
|
||||
nogbpages
|
||||
Do not use GB pages for kernel direct mappings.
|
||||
gbpages
|
||||
Use GB pages for kernel direct mappings.
|
@ -1,278 +0,0 @@
|
||||
AMD64 specific boot options
|
||||
|
||||
There are many others (usually documented in driver documentation), but
|
||||
only the AMD64 specific ones are listed here.
|
||||
|
||||
Machine check
|
||||
|
||||
Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
|
||||
|
||||
mce=off
|
||||
Disable machine check
|
||||
mce=no_cmci
|
||||
Disable CMCI(Corrected Machine Check Interrupt) that
|
||||
Intel processor supports. Usually this disablement is
|
||||
not recommended, but it might be handy if your hardware
|
||||
is misbehaving.
|
||||
Note that you'll get more problems without CMCI than with
|
||||
due to the shared banks, i.e. you might get duplicated
|
||||
error logs.
|
||||
mce=dont_log_ce
|
||||
Don't make logs for corrected errors. All events reported
|
||||
as corrected are silently cleared by OS.
|
||||
This option will be useful if you have no interest in any
|
||||
of corrected errors.
|
||||
mce=ignore_ce
|
||||
Disable features for corrected errors, e.g. polling timer
|
||||
and CMCI. All events reported as corrected are not cleared
|
||||
by OS and remained in its error banks.
|
||||
Usually this disablement is not recommended, however if
|
||||
there is an agent checking/clearing corrected errors
|
||||
(e.g. BIOS or hardware monitoring applications), conflicting
|
||||
with OS's error handling, and you cannot deactivate the agent,
|
||||
then this option will be a help.
|
||||
mce=no_lmce
|
||||
Do not opt-in to Local MCE delivery. Use legacy method
|
||||
to broadcast MCEs.
|
||||
mce=bootlog
|
||||
Enable logging of machine checks left over from booting.
|
||||
Disabled by default on AMD Fam10h and older because some BIOS
|
||||
leave bogus ones.
|
||||
If your BIOS doesn't do that it's a good idea to enable though
|
||||
to make sure you log even machine check events that result
|
||||
in a reboot. On Intel systems it is enabled by default.
|
||||
mce=nobootlog
|
||||
Disable boot machine check logging.
|
||||
mce=tolerancelevel[,monarchtimeout] (number,number)
|
||||
tolerance levels:
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
Default is 1
|
||||
Can be also set using sysfs which is preferable.
|
||||
monarchtimeout:
|
||||
Sets the time in us to wait for other CPUs on machine checks. 0
|
||||
to disable.
|
||||
mce=bios_cmci_threshold
|
||||
Don't overwrite the bios-set CMCI threshold. This boot option
|
||||
prevents Linux from overwriting the CMCI threshold set by the
|
||||
bios. Without this option, Linux always sets the CMCI
|
||||
threshold to 1. Enabling this may make memory predictive failure
|
||||
analysis less effective if the bios sets thresholds for memory
|
||||
errors since we will not see details for all errors.
|
||||
mce=recovery
|
||||
Force-enable recoverable machine check code paths
|
||||
|
||||
nomce (for compatibility with i386): same as mce=off
|
||||
|
||||
Everything else is in sysfs now.
|
||||
|
||||
APICs
|
||||
|
||||
apic Use IO-APIC. Default
|
||||
|
||||
noapic Don't use the IO-APIC.
|
||||
|
||||
disableapic Don't use the local APIC
|
||||
|
||||
nolapic Don't use the local APIC (alias for i386 compatibility)
|
||||
|
||||
pirq=... See Documentation/x86/i386/IO-APIC.txt
|
||||
|
||||
noapictimer Don't set up the APIC timer
|
||||
|
||||
no_timer_check Don't check the IO-APIC timer. This can work around
|
||||
problems with incorrect timer initialization on some boards.
|
||||
apicpmtimer
|
||||
Do APIC timer calibration using the pmtimer. Implies
|
||||
apicmaintimer. Useful when your PIT timer is totally
|
||||
broken.
|
||||
|
||||
Timing
|
||||
|
||||
notsc
|
||||
Deprecated, use tsc=unstable instead.
|
||||
|
||||
nohpet
|
||||
Don't use the HPET timer.
|
||||
|
||||
Idle loop
|
||||
|
||||
idle=poll
|
||||
Don't do power saving in the idle loop using HLT, but poll for rescheduling
|
||||
event. This will make the CPUs eat a lot more power, but may be useful
|
||||
to get slightly better performance in multiprocessor benchmarks. It also
|
||||
makes some profiling using performance counters more accurate.
|
||||
Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
|
||||
CPUs) this option has no performance advantage over the normal idle loop.
|
||||
It may also interact badly with hyperthreading.
|
||||
|
||||
Rebooting
|
||||
|
||||
reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
|
||||
bios Use the CPU reboot vector for warm reset
|
||||
warm Don't set the cold reboot flag
|
||||
cold Set the cold reboot flag
|
||||
triple Force a triple fault (init)
|
||||
kbd Use the keyboard controller. cold reset (default)
|
||||
acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
|
||||
ACPI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
efi Use efi reset_system runtime service. If EFI is not configured or the
|
||||
EFI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
|
||||
Using warm reset will be much faster especially on big memory
|
||||
systems because the BIOS will not go through the memory check.
|
||||
Disadvantage is that not all hardware will be completely reinitialized
|
||||
on reboot so there may be boot problems on some systems.
|
||||
|
||||
reboot=force
|
||||
|
||||
Don't stop other CPUs on reboot. This can make reboot more reliable
|
||||
in some cases.
|
||||
|
||||
Non Executable Mappings
|
||||
|
||||
noexec=on|off
|
||||
|
||||
on Enable(default)
|
||||
off Disable
|
||||
|
||||
NUMA
|
||||
|
||||
numa=off Only set up a single NUMA node spanning all memory.
|
||||
|
||||
numa=noacpi Don't parse the SRAT table for NUMA setup
|
||||
|
||||
numa=fake=<size>[MG]
|
||||
If given as a memory unit, fills all system RAM with nodes of
|
||||
size interleaved over physical nodes.
|
||||
|
||||
numa=fake=<N>
|
||||
If given as an integer, fills all system RAM with N fake nodes
|
||||
interleaved over physical nodes.
|
||||
|
||||
numa=fake=<N>U
|
||||
If given as an integer followed by 'U', it will divide each
|
||||
physical node into N emulated nodes.
|
||||
|
||||
ACPI
|
||||
|
||||
acpi=off Don't enable ACPI
|
||||
acpi=ht Use ACPI boot table parsing, but don't enable ACPI
|
||||
interpreter
|
||||
acpi=force Force ACPI on (currently not needed)
|
||||
|
||||
acpi=strict Disable out of spec ACPI workarounds.
|
||||
|
||||
acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt.
|
||||
|
||||
acpi=noirq Don't route interrupts
|
||||
|
||||
acpi=nocmcff Disable firmware first mode for corrected errors. This
|
||||
disables parsing the HEST CMC error source to check if
|
||||
firmware has set the FF flag. This may result in
|
||||
duplicate corrected error reports.
|
||||
|
||||
PCI
|
||||
|
||||
pci=off Don't use PCI
|
||||
pci=conf1 Use conf1 access.
|
||||
pci=conf2 Use conf2 access.
|
||||
pci=rom Assign ROMs.
|
||||
pci=assign-busses Assign busses
|
||||
pci=irqmask=MASK Set PCI interrupt mask to MASK
|
||||
pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says.
|
||||
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
|
||||
|
||||
IOMMU (input/output memory management unit)
|
||||
|
||||
Multiple x86-64 PCI-DMA mapping implementations exist, for example:
|
||||
|
||||
1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
|
||||
(e.g. because you have < 3 GB memory).
|
||||
Kernel boot message: "PCI-DMA: Disabling IOMMU"
|
||||
|
||||
2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
|
||||
Kernel boot message: "PCI-DMA: using GART IOMMU"
|
||||
|
||||
3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
|
||||
e.g. if there is no hardware IOMMU in the system and it is needed because
|
||||
you have >3GB memory or told the kernel to use it (iommu=soft).
|
||||
Kernel boot message: "PCI-DMA: Using software bounce buffering
|
||||
for IO (SWIOTLB)"
|
||||
|
||||
4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
|
||||
pSeries and xSeries servers. This hardware IOMMU supports DMA address
|
||||
mapping with memory protection, etc.
|
||||
Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
|
||||
|
||||
iommu=[<size>][,noagp][,off][,force][,noforce]
|
||||
[,memaper[=<order>]][,merge][,fullflush][,nomerge]
|
||||
[,noaperture][,calgary]
|
||||
|
||||
General iommu options:
|
||||
off Don't initialize and use any kind of IOMMU.
|
||||
noforce Don't force hardware IOMMU usage when it is not needed.
|
||||
(default).
|
||||
force Force the use of the hardware IOMMU even when it is
|
||||
not actually needed (e.g. because < 3 GB memory).
|
||||
soft Use software bounce buffering (SWIOTLB) (default for
|
||||
Intel machines). This can be used to prevent the usage
|
||||
of an available hardware IOMMU.
|
||||
|
||||
iommu options only relevant to the AMD GART hardware IOMMU:
|
||||
<size> Set the size of the remapping area in bytes.
|
||||
allowed Overwrite iommu off workarounds for specific chipsets.
|
||||
fullflush Flush IOMMU on each allocation (default).
|
||||
nofullflush Don't use IOMMU fullflush.
|
||||
memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
|
||||
(default: order=1, i.e. 64MB)
|
||||
merge Do scatter-gather (SG) merging. Implies "force"
|
||||
(experimental).
|
||||
nomerge Don't do scatter-gather (SG) merging.
|
||||
noaperture Ask the IOMMU not to touch the aperture for AGP.
|
||||
noagp Don't initialize the AGP driver and use full aperture.
|
||||
panic Always panic when IOMMU overflows.
|
||||
calgary Use the Calgary IOMMU if it is available
|
||||
|
||||
iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
|
||||
implementation:
|
||||
swiotlb=<pages>[,force]
|
||||
<pages> Prereserve that many 128K pages for the software IO
|
||||
bounce buffering.
|
||||
force Force all IO through the software TLB.
|
||||
|
||||
Settings for the IBM Calgary hardware IOMMU currently found in IBM
|
||||
pSeries and xSeries machines:
|
||||
|
||||
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
|
||||
calgary=[translate_empty_slots]
|
||||
calgary=[disable=<PCI bus number>]
|
||||
panic Always panic when IOMMU overflows
|
||||
|
||||
64k,...,8M - Set the size of each PCI slot's translation table
|
||||
when using the Calgary IOMMU. This is the size of the translation
|
||||
table itself in main memory. The smallest table, 64k, covers an IO
|
||||
space of 32MB; the largest, 8MB table, can cover an IO space of
|
||||
4GB. Normally the kernel will make the right choice by itself.
|
||||
|
||||
translate_empty_slots - Enable translation even on slots that have
|
||||
no devices attached to them, in case a device will be hotplugged
|
||||
in the future.
|
||||
|
||||
disable=<PCI bus number> - Disable translation on a given PHB. For
|
||||
example, the built-in graphics adapter resides on the first bridge
|
||||
(PCI bus number 0); if translation (isolation) is enabled on this
|
||||
bridge, X servers that access the hardware directly from user
|
||||
space might stop working. Use this option if you have devices that
|
||||
are accessed from userspace directly on some PCI host bridge.
|
||||
|
||||
Miscellaneous
|
||||
|
||||
nogbpages
|
||||
Do not use GB pages for kernel direct mappings.
|
||||
gbpages
|
||||
Use GB pages for kernel direct mappings.
|
@ -1,5 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===================================================
|
||||
Firmware support for CPU hotplug under Linux/x86-64
|
||||
---------------------------------------------------
|
||||
===================================================
|
||||
|
||||
Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
|
||||
know in advance of boot time the maximum number of CPUs that could be plugged
|
@ -1,5 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=====================
|
||||
Fake NUMA For CPUSets
|
||||
=====================
|
||||
|
||||
:Author: David Rientjes <rientjes@cs.washington.edu>
|
||||
|
||||
Using numa=fake and CPUSets for Resource Management
|
||||
Written by David Rientjes <rientjes@cs.washington.edu>
|
||||
|
||||
This document describes how the numa=fake x86_64 command-line option can be used
|
||||
in conjunction with cpusets for coarse memory management. Using this feature,
|
||||
@ -20,7 +27,7 @@ you become more familiar with using this combination for resource control,
|
||||
you'll determine a better setup to minimize the number of nodes you have to deal
|
||||
with.
|
||||
|
||||
A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
|
||||
A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
|
||||
|
||||
Faking node 0 at 0000000000000000-0000000020000000 (512MB)
|
||||
Faking node 1 at 0000000020000000-0000000040000000 (512MB)
|
||||
@ -34,7 +41,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
|
||||
|
||||
Now following the instructions for mounting the cpusets filesystem from
|
||||
Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
|
||||
address spaces) to individual cpusets:
|
||||
address spaces) to individual cpusets::
|
||||
|
||||
[root@xroads /]# mkdir exampleset
|
||||
[root@xroads /]# mount -t cpuset none exampleset
|
||||
@ -47,7 +54,7 @@ Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
|
||||
memory allocations (1G).
|
||||
|
||||
You can now assign tasks to these cpusets to limit the memory resources
|
||||
available to them according to the fake nodes assigned as mems:
|
||||
available to them according to the fake nodes assigned as mems::
|
||||
|
||||
[root@xroads /exampleset/ddset]# echo $$ > tasks
|
||||
[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
|
||||
@ -57,9 +64,13 @@ Notice the difference between the system memory usage as reported by
|
||||
/proc/meminfo between the restricted cpuset case above and the unrestricted
|
||||
case (i.e. running the same 'dd' command without assigning it to a fake NUMA
|
||||
cpuset):
|
||||
Unrestricted Restricted
|
||||
MemTotal: 3091900 kB 3091900 kB
|
||||
MemFree: 42113 kB 1513236 kB
|
||||
|
||||
======== ============ ==========
|
||||
Name Unrestricted Restricted
|
||||
======== ============ ==========
|
||||
MemTotal 3091900 kB 3091900 kB
|
||||
MemFree 42113 kB 1513236 kB
|
||||
======== ============ ==========
|
||||
|
||||
This allows for coarse memory management for the tasks you assign to particular
|
||||
cpusets. Since cpusets can form a hierarchy, you can create some pretty
|
16
Documentation/x86/x86_64/index.rst
Normal file
@ -0,0 +1,16 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============
|
||||
x86_64 Support
|
||||
==============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
boot-options
|
||||
uefi
|
||||
mm
|
||||
5level-paging
|
||||
fake-numa-for-cpusets
|
||||
cpu-hotplug-spec
|
||||
machinecheck
|
@ -1,5 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Configurable sysfs parameters for the x86-64 machine check code.
|
||||
===============================================================
|
||||
Configurable sysfs parameters for the x86-64 machine check code
|
||||
===============================================================
|
||||
|
||||
Machine checks report internal hardware error conditions detected
|
||||
by the CPU. Uncorrected errors typically cause a machine check
|
||||
@ -16,14 +19,13 @@ log then mcelog should run to collect and decode machine check entries
|
||||
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
|
||||
|
||||
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
|
||||
(N = CPU number)
|
||||
(N = CPU number).
|
||||
|
||||
The directory contains some configurable entries:
|
||||
|
||||
Entries:
|
||||
|
||||
bankNctl
|
||||
(N bank number)
|
||||
|
||||
64bit Hex bitmask enabling/disabling specific subevents for bank N
|
||||
When a bit in the bitmask is zero then the respective
|
||||
subevent will not be reported.
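
The bitmask semantics described above can be sketched as follows. This is an illustration only: the helper names and the sample mask values are hypothetical, the code does not read or write sysfs, and it assumes the mask is handled as a plain hexadecimal string.

```python
# Sketch of the bankNctl semantics: a 64-bit hex bitmask where a zero bit
# means the corresponding subevent is not reported.
# The helper names and sample values are hypothetical.

MASK_64 = (1 << 64) - 1


def subevent_enabled(mask_hex: str, bit: int) -> bool:
    """Return True if the given bit is set in the hex bitmask."""
    mask = int(mask_hex, 16)
    return bool((mask >> bit) & 1)


def clear_subevent(mask_hex: str, bit: int) -> str:
    """Return a new hex bitmask with the given bit cleared."""
    mask = int(mask_hex, 16) & ~(1 << bit) & MASK_64
    return format(mask, "x")


if __name__ == "__main__":
    mask = "ff"                          # invented example value
    print(subevent_enabled(mask, 3))     # True
    mask = clear_subevent(mask, 3)
    print(mask)                          # f7
    print(subevent_enabled(mask, 3))     # False
```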
|
161
Documentation/x86/x86_64/mm.rst
Normal file
@ -0,0 +1,161 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
Memory Management
|
||||
=================
|
||||
|
||||
Complete virtual memory map with 4-level page tables
|
||||
====================================================
|
||||
|
||||
.. note::
|
||||
|
||||
- Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
|
||||
from the top of the 64-bit address space. It's easier to understand the layout
|
||||
when seen both in absolute addresses and in distance-from-top notation.
|
||||
|
||||
For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
|
||||
64-bit address space (ffffffffffffffff).
|
||||
|
||||
Note that as we get closer to the top of the address space, the notation changes
|
||||
from TB to GB and then MB/KB.
|
||||
|
||||
- "16M TB" might look weird at first sight, but it's an easier to visualize size
|
||||
notation than "16 EB", which few will recognize at first sight as 16 exabytes.
|
||||
It also shows nicely how incredibly large the 64-bit address space is.
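
The distance-from-top notation in the note above can be reproduced with a little arithmetic. The following is a small sketch, not kernel code; the function name is hypothetical, and the "top" is taken as 2^64, one past ffffffffffffffff.

```python
# Convert an absolute 64-bit address into the "distance from top" notation
# used in the memory map tables. The function name is hypothetical.

TOP = 1 << 64   # one past the highest 64-bit address (ffffffffffffffff)
TB = 1 << 40


def offset_from_top_tb(addr: int) -> float:
    """Return the signed distance of addr from the top of the address space, in TB."""
    return (addr - TOP) / TB


if __name__ == "__main__":
    print(offset_from_top_tb(0xffffe90000000000))  # -23.0
    print(offset_from_top_tb(0xffff800000000000))  # -128.0
```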
|
||||
|
||||
::
|
||||
|
||||
========================================================================================================================
|
||||
Start addr | Offset | End addr | Size | VM area description
|
||||
========================================================================================================================
|
||||
| | | |
|
||||
0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
| | | |
|
||||
0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
|
||||
| | | | virtual memory addresses up to the -128 TB
|
||||
| | | | starting offset of kernel mappings.
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
|
||||
| Kernel-space virtual memory, shared between all processes:
|
||||
____________________________________________________________|___________________________________________________________
|
||||
| | | |
|
||||
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
|
||||
ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
|
||||
ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
|
||||
ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
|
||||
ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
|
||||
ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
|
||||
ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
|
||||
ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
|
||||
ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
|
||||
__________________|____________|__________________|_________|____________________________________________________________
|
||||
|
|
||||
| Identical layout to the 56-bit one from here on:
|
||||
____________________________________________________________|____________________________________________________________
|
||||
| | | |
|
||||
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
|
||||
| | | | vaddr_end for KASLR
|
||||
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
|
||||
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
|
||||
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
|
||||
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
|
||||
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
|
||||
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
|
||||
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
|
||||
ffffffff80000000 |-2048 MB | | |
|
||||
ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
|
||||
ffffffffff000000 | -16 MB | | |
|
||||
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
|
||||
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
|
||||
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
||||
|
||||
Complete virtual memory map with 5-level page tables
|
||||
====================================================
|
||||
|
||||
.. note::
|
||||
|
||||
- With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
|
||||
from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
|
||||
offset and many of the regions expand to support the much larger physical
|
||||
memory supported.
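
The 512x figure in the note above follows directly from the user-space address widths. As a quick check, assuming 47 bits of user-space addresses with 4-level paging and 56 bits with 5-level paging (as the tables show):

```python
# Check the user-space sizes quoted for 4-level and 5-level paging.
# Variable names are illustrative only.

PB = 1 << 50

user_4level = 1 << 47   # 0000000000000000 .. 00007fffffffffff
user_5level = 1 << 56   # 0000000000000000 .. 00ffffffffffffff

if __name__ == "__main__":
    print(user_4level / PB)             # 0.125
    print(user_5level / PB)             # 64.0
    print(user_5level // user_4level)   # 512
```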
|
||||
|
||||
::
|
||||
|
||||
========================================================================================================================
|
||||
Start addr | Offset | End addr | Size | VM area description
|
||||
========================================================================================================================
|
||||
| | | |
|
||||
0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
| | | |
|
||||
0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
|
||||
| | | | virtual memory addresses up to the -64 PB
|
||||
| | | | starting offset of kernel mappings.
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
|
||||
| Kernel-space virtual memory, shared between all processes:
|
||||
____________________________________________________________|___________________________________________________________
|
||||
| | | |
|
||||
ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
|
||||
ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
|
||||
ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
|
||||
ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
|
||||
ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
|
||||
ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
|
||||
ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
|
||||
ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
|
||||
ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
|
||||
__________________|____________|__________________|_________|____________________________________________________________
|
||||
|
|
||||
| Identical layout to the 47-bit one from here on:
|
||||
____________________________________________________________|____________________________________________________________
|
||||
| | | |
|
||||
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
|
||||
| | | | vaddr_end for KASLR
|
||||
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
|
||||
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
|
||||
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
|
||||
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
|
||||
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
|
||||
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
|
||||
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
|
||||
ffffffff80000000 |-2048 MB | | |
|
||||
ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
|
||||
ffffffffff000000 | -16 MB | | |
|
||||
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
|
||||
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
|
||||
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
||||
Architecture defines a 64-bit virtual address. Implementations can support
|
||||
less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
|
||||
through to the most-significant implemented bit are sign extended.
|
||||
This causes a hole between user-space and kernel addresses if you interpret them
|
||||
as unsigned.
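
The sign-extension rule can be illustrated with a short check for canonical addresses. This is a sketch under the assumption that an address is canonical when bits 63 down to the most-significant implemented bit are all equal; the function name is hypothetical.

```python
# Canonical-address check for a given virtual address width (48 or 57 bits).
# The function name is hypothetical; this is not kernel code.


def is_canonical(addr: int, va_bits: int) -> bool:
    """True if bits 63..(va_bits - 1) of addr are all zeros or all ones."""
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    # 48-bit virtual addresses (4-level paging)
    print(is_canonical(0x00007fffffffffff, 48))  # True: last user-space address
    print(is_canonical(0x0000800000000000, 48))  # False: start of the hole
    print(is_canonical(0xffff800000000000, 48))  # True: start of kernel space
    # 57-bit virtual addresses (5-level paging)
    print(is_canonical(0x0100000000000000, 57))  # False: start of the hole
    print(is_canonical(0xff00000000000000, 57))  # True: start of kernel space
```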
|
||||
|
||||
The direct mapping covers all memory in the system up to the highest
|
||||
memory address (this means in some cases it can also include PCI memory
|
||||
holes).
|
||||
|
||||
vmalloc space is lazily synchronized into the different PML4/PML5 pages of
|
||||
the processes using the page fault handler, with init_top_pgt as
|
||||
reference.
|
||||
|
||||
We map EFI runtime services in the 'efi_pgd' PGD in a 64 GB large virtual
|
||||
memory window (this size is arbitrary, it can be raised later if needed).
|
||||
The mappings are not part of any other kernel PGD and are only available
|
||||
during EFI runtime calls.
|
||||
|
||||
Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
|
||||
physical memory, vmalloc/ioremap space and virtual memory map are randomized.
|
||||
Their order is preserved but their base will be offset early at boot time.
|
||||
|
||||
Be very careful vs. KASLR when changing anything here. The KASLR address
|
||||
range must not overlap with anything except the KASAN shadow area, which is
|
||||
correct as KASAN disables KASLR.
|
||||
|
||||
For both 4- and 5-level layouts, the STACKLEAK_POISON value lies in the last 2 MB
|
||||
hole: ffffffffffff4111
|
@ -1,153 +0,0 @@
|
||||
====================================================
|
||||
Complete virtual memory map with 4-level page tables
|
||||
====================================================
|
||||
|
||||
Notes:
|
||||
|
||||
- Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
|
||||
from the top of the 64-bit address space. It's easier to understand the layout
|
||||
when seen both in absolute addresses and in distance-from-top notation.
|
||||
|
||||
For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
|
||||
64-bit address space (ffffffffffffffff).
|
||||
|
||||
Note that as we get closer to the top of the address space, the notation changes
|
||||
from TB to GB and then MB/KB.
|
||||
|
||||
- "16M TB" might look weird at first sight, but it's an easier to visualize size
|
||||
notation than "16 EB", which few will recognize at first sight as 16 exabytes.
|
||||
It also shows it nicely how incredibly large 64-bit address space is.
|
||||
|
||||
========================================================================================================================
|
||||
Start addr | Offset | End addr | Size | VM area description
|
||||
========================================================================================================================
|
||||
| | | |
|
||||
0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
| | | |
|
||||
0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
|
||||
| | | | virtual memory addresses up to the -128 TB
|
||||
| | | | starting offset of kernel mappings.
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
|
||||
| Kernel-space virtual memory, shared between all processes:
|
||||
____________________________________________________________|___________________________________________________________
|
||||
| | | |
|
||||
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
|
||||
ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
|
||||
ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
|
||||
ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
|
||||
ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
|
||||
ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
|
||||
ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
|
||||
ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
|
||||
ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
|
||||
__________________|____________|__________________|_________|____________________________________________________________
|
||||
|
|
||||
| Identical layout to the 56-bit one from here on:
|
||||
____________________________________________________________|____________________________________________________________
|
||||
| | | |
|
||||
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
|
||||
| | | | vaddr_end for KASLR
|
||||
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
|
||||
fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
|
||||
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
|
||||
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
|
||||
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
|
||||
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
|
||||
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
|
||||
ffffffff80000000 |-2048 MB | | |
|
||||
ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
|
||||
ffffffffff000000 | -16 MB | | |
|
||||
FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
|
||||
ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
|
||||
ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
|
||||
__________________|____________|__________________|_________|___________________________________________________________
|
||||
|
||||
|
||||
====================================================
|
||||
Complete virtual memory map with 5-level page tables
|
||||
====================================================
|
||||
|
||||
Notes:
|
||||
|
||||
- With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
|
||||
from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
|
||||
offset and many of the regions expand to support the much larger physical
|
||||
memory supported.
|
||||
|
||||
========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0100000000000000 |  +64    PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -64 PB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
 ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
 ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
 ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
 ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
 ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
 ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
 ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
 ffdf000000000000 |   -8.25 PB | fffffbffffffffff |   ~8 PB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 47-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________
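
The Offset and Size columns of the table above follow from the Start addr and End addr columns: the offset is the start address read as a signed 64-bit integer, and the size is the inclusive span between start and end. The Python sketch below recomputes a few rows under that reading. The helper names (`signed64`, `offset_in`, `size_in`) and the `UNITS` table are hypothetical and exist only for this illustration; binary units (1 PB = 2^50 bytes) are assumed.

```python
# Binary unit sizes in bytes (assumption: the table uses powers of two).
UNITS = {"kB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30, "TB": 1 << 40, "PB": 1 << 50}


def signed64(addr):
    """Interpret a 64-bit unsigned address as a two's-complement signed value."""
    return addr - (1 << 64) if addr & (1 << 63) else addr


def offset_in(addr, unit):
    """Offset column: the signed start address expressed in the given unit."""
    return signed64(addr) / UNITS[unit]


def size_in(start, end, unit):
    """Size column: inclusive span from start to end in the given unit."""
    return (end - start + 1) / UNITS[unit]


if __name__ == "__main__":
    # Direct mapping of all physical memory (page_offset_base) row.
    print(offset_in(0xFF11000000000000, "PB"))
    print(size_in(0xFF11000000000000, 0xFF90FFFFFFFFFFFF, "PB"))
```

Reading the start address as signed is what makes the kernel mappings appear as negative offsets counted back from the top of the address space.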

Architecture defines a 64-bit virtual address. Implementations can support
less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
through to the most-significant implemented bit are sign extended.
This causes hole between user space and kernel addresses if you interpret them
as unsigned.
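
To make the sign-extension rule concrete, here is a minimal Python sketch that tests whether a 64-bit value is canonical for a given number of implemented virtual address bits, and that sign-extends a truncated address. The names `is_canonical` and `sign_extend` are illustrative assumptions, not kernel interfaces.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr, va_bits):
    """True if bits 63..(va_bits - 1) of addr are all zeros or all ones."""
    top = (addr & MASK64) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def sign_extend(addr, va_bits):
    """Copy bit (va_bits - 1) of addr into every higher bit up to bit 63."""
    low_mask = (1 << va_bits) - 1
    addr &= low_mask
    if addr & (1 << (va_bits - 1)):
        addr |= MASK64 ^ low_mask
    return addr


if __name__ == "__main__":
    # The first address past user space in the 57-bit layout is not canonical.
    print(is_canonical(0x0100000000000000, 57))
    # Sign-extending it yields the start of the kernel half.
    print(hex(sign_extend(0x0100000000000000, 57)))
```

With 57 implemented bits, every value from 0100000000000000 through feffffffffffffff fails the check, which is the hole between user-space and kernel addresses described above.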

The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).

vmalloc space is lazily synchronized into the different PML4/PML5 pages of
the processes using the page fault handler, with init_top_pgt as
reference.

We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
memory window (this size is arbitrary, it can be raised later if needed).
The mappings are not part of any other kernel PGD and are only available
during EFI runtime calls.

Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
physical memory, vmalloc/ioremap space and virtual memory map are randomized.
Their order is preserved but their base will be offset early at boot time.

Be very careful vs. KASLR when changing anything here. The KASLR address
range must not overlap with anything except the KASAN shadow area, which is
correct as KASAN disables KASLR.

For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
hole: ffffffffffff4111
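
A quick way to see why this value sits in the last 2 MB hole is to compare it against the bounds of the final table row (ffffffffffe00000 to ffffffffffffffff). The Python sketch below does that check; the constant and function names are illustrative only. As plain arithmetic, ffffffffffff4111 is also the 64-bit two's-complement encoding of -0xBEEF.

```python
MASK64 = (1 << 64) - 1

# Value quoted in the text above.
STACKLEAK_POISON = 0xFFFFFFFFFFFF4111

# Bounds of the last "unused hole" row in the layout table.
HOLE_START = 0xFFFFFFFFFFE00000
HOLE_END = 0xFFFFFFFFFFFFFFFF


def in_last_hole(addr):
    """True if addr falls inside the final 2 MB unused hole."""
    return HOLE_START <= addr <= HOLE_END


if __name__ == "__main__":
    print(in_last_hole(STACKLEAK_POISON))
    print(hex((-0xBEEF) & MASK64))
```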
@@ -1,5 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0

=====================================
General note on [U]EFI x86_64 support
-------------------------------------
=====================================

The nomenclature EFI and UEFI are used interchangeably in this document.

@@ -14,29 +17,42 @@ with EFI firmware and specifications are listed below.

3. x86_64 platform with EFI/UEFI firmware.

Mechanics:
Mechanics
---------
- Build the kernel with the following configuration.

- Build the kernel with the following configuration::

        CONFIG_FB_EFI=y
        CONFIG_FRAMEBUFFER_CONSOLE=y

  If EFI runtime services are expected, the following configuration should
  be selected.
  be selected::

        CONFIG_EFI=y
        CONFIG_EFI_VARS=y or m          # optional

- Create a VFAT partition on the disk
- Copy the following to the VFAT partition:

        elilo bootloader with x86_64 support, elilo configuration file,
        kernel image built in first step and corresponding
        initrd. Instructions on building elilo and its dependencies
        can be found in the elilo sourceforge project.

- Boot to EFI shell and invoke elilo choosing the kernel image built
  in first step.
- If some or all EFI runtime services don't work, you can try following
  kernel command line parameters to turn off some or all EFI runtime
  services.
        noefi           turn off all EFI runtime services
        reboot_type=k   turn off EFI reboot runtime service

        noefi
                turn off all EFI runtime services
        reboot_type=k
                turn off EFI reboot runtime service

- If the EFI memory map has additional entries not in the E820 map,
  you can include those entries in the kernels memory map of available
  physical RAM by using the following kernel command line parameter.
        add_efi_memmap  include EFI memory map of available physical RAM

        add_efi_memmap
                include EFI memory map of available physical RAM
45
Documentation/x86/zero-page.rst
Normal file
@@ -0,0 +1,45 @@
.. SPDX-License-Identifier: GPL-2.0

=========
Zero Page
=========
The additional fields in struct boot_params as a part of 32-bit boot
protocol of kernel. These should be filled by bootloader or 16-bit
real-mode setup code of the kernel. References/settings to it mainly
are in::

  arch/x86/include/uapi/asm/bootparam.h

=========== ===== ======================= =================================================
Offset/Size Proto Name                    Meaning

000/040     ALL   screen_info             Text mode or frame buffer information
                                          (struct screen_info)
040/014     ALL   apm_bios_info           APM BIOS information (struct apm_bios_info)
058/008     ALL   tboot_addr              Physical address of tboot shared page
060/010     ALL   ist_info                Intel SpeedStep (IST) BIOS support information
                                          (struct ist_info)
080/010     ALL   hd0_info                hd0 disk parameter, OBSOLETE!!
090/010     ALL   hd1_info                hd1 disk parameter, OBSOLETE!!
0A0/010     ALL   sys_desc_table          System description table (struct sys_desc_table),
                                          OBSOLETE!!
0B0/010     ALL   olpc_ofw_header         OLPC's OpenFirmware CIF and friends
0C0/004     ALL   ext_ramdisk_image       ramdisk_image high 32bits
0C4/004     ALL   ext_ramdisk_size        ramdisk_size high 32bits
0C8/004     ALL   ext_cmd_line_ptr        cmd_line_ptr high 32bits
140/080     ALL   edid_info               Video mode setup (struct edid_info)
1C0/020     ALL   efi_info                EFI 32 information (struct efi_info)
1E0/004     ALL   alt_mem_k               Alternative mem check, in KB
1E4/004     ALL   scratch                 Scratch field for the kernel setup code
1E8/001     ALL   e820_entries            Number of entries in e820_table (below)
1E9/001     ALL   eddbuf_entries          Number of entries in eddbuf (below)
1EA/001     ALL   edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer
                                          (below)
1EB/001     ALL   kbd_status              Numlock is enabled
1EC/001     ALL   secure_boot             Secure boot is enabled in the firmware
1EF/001     ALL   sentinel                Used to detect broken bootloaders
290/040     ALL   edd_mbr_sig_buffer      EDD MBR signatures
2D0/A00     ALL   e820_table              E820 memory map table
                                          (array of struct e820_entry)
D00/1EC     ALL   eddbuf                  EDD data (array of struct edd_info)
=========== ===== ======================= =================================================
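
The Offset/Size column in the table above is a pair of hexadecimal numbers: the byte offset of the field inside struct boot_params and its length in bytes. The Python sketch below parses those pairs for the listed fields and checks that no two fields overlap and that all of them fit in one 4096-byte page. The page-size bound and the helper names (`parse_field`, `layout`, `find_overlaps`) are assumptions made for this illustration, not something the table itself states.

```python
# (Offset/Size, Name) pairs copied from the table above.
FIELDS = [
    ("000/040", "screen_info"), ("040/014", "apm_bios_info"),
    ("058/008", "tboot_addr"), ("060/010", "ist_info"),
    ("080/010", "hd0_info"), ("090/010", "hd1_info"),
    ("0A0/010", "sys_desc_table"), ("0B0/010", "olpc_ofw_header"),
    ("0C0/004", "ext_ramdisk_image"), ("0C4/004", "ext_ramdisk_size"),
    ("0C8/004", "ext_cmd_line_ptr"), ("140/080", "edid_info"),
    ("1C0/020", "efi_info"), ("1E0/004", "alt_mem_k"),
    ("1E4/004", "scratch"), ("1E8/001", "e820_entries"),
    ("1E9/001", "eddbuf_entries"), ("1EA/001", "edd_mbr_sig_buf_entries"),
    ("1EB/001", "kbd_status"), ("1EC/001", "secure_boot"),
    ("1EF/001", "sentinel"), ("290/040", "edd_mbr_sig_buffer"),
    ("2D0/A00", "e820_table"), ("D00/1EC", "eddbuf"),
]

PAGE_SIZE = 4096  # assumed size of the zero page


def parse_field(spec):
    """Split 'OFFSET/SIZE' (both hexadecimal) into two integers."""
    offset_hex, size_hex = spec.split("/")
    return int(offset_hex, 16), int(size_hex, 16)


def layout(fields):
    """Return (start, end, name) tuples sorted by start; end is exclusive."""
    rows = []
    for spec, name in fields:
        offset, size = parse_field(spec)
        rows.append((offset, offset + size, name))
    return sorted(rows)


def find_overlaps(rows):
    """Return name pairs of adjacent fields whose byte ranges overlap."""
    return [(a[2], b[2]) for a, b in zip(rows, rows[1:]) if a[1] > b[0]]


if __name__ == "__main__":
    rows = layout(FIELDS)
    print(find_overlaps(rows))
    print(hex(max(end for _, end, _ in rows)))
```

The gaps between listed fields (for example between 0CC and 140) are simply not covered by this table, so the check only looks at the fields that are named.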
@@ -1,40 +0,0 @@
The additional fields in struct boot_params as a part of 32-bit boot
protocol of kernel. These should be filled by bootloader or 16-bit
real-mode setup code of the kernel. References/settings to it mainly
are in:

  arch/x86/include/uapi/asm/bootparam.h


Offset  Proto  Name                    Meaning
/Size

000/040 ALL    screen_info             Text mode or frame buffer information
                                       (struct screen_info)
040/014 ALL    apm_bios_info           APM BIOS information (struct apm_bios_info)
058/008 ALL    tboot_addr              Physical address of tboot shared page
060/010 ALL    ist_info                Intel SpeedStep (IST) BIOS support information
                                       (struct ist_info)
080/010 ALL    hd0_info                hd0 disk parameter, OBSOLETE!!
090/010 ALL    hd1_info                hd1 disk parameter, OBSOLETE!!
0A0/010 ALL    sys_desc_table          System description table (struct sys_desc_table),
                                       OBSOLETE!!
0B0/010 ALL    olpc_ofw_header         OLPC's OpenFirmware CIF and friends
0C0/004 ALL    ext_ramdisk_image       ramdisk_image high 32bits
0C4/004 ALL    ext_ramdisk_size        ramdisk_size high 32bits
0C8/004 ALL    ext_cmd_line_ptr        cmd_line_ptr high 32bits
140/080 ALL    edid_info               Video mode setup (struct edid_info)
1C0/020 ALL    efi_info                EFI 32 information (struct efi_info)
1E0/004 ALL    alt_mem_k               Alternative mem check, in KB
1E4/004 ALL    scratch                 Scratch field for the kernel setup code
1E8/001 ALL    e820_entries            Number of entries in e820_table (below)
1E9/001 ALL    eddbuf_entries          Number of entries in eddbuf (below)
1EA/001 ALL    edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer
                                       (below)
1EB/001 ALL    kbd_status              Numlock is enabled
1EC/001 ALL    secure_boot             Secure boot is enabled in the firmware
1EF/001 ALL    sentinel                Used to detect broken bootloaders
290/040 ALL    edd_mbr_sig_buffer      EDD MBR signatures
2D0/A00 ALL    e820_table              E820 memory map table
                                       (array of struct e820_entry)
D00/1EC ALL    eddbuf                  EDD data (array of struct edd_info)