xen: Core Xen implementation
This patch is a rollup of all the core pieces of the Xen
implementation, including:
- booting and setup
- pagetable setup
- privileged instructions
- segmentation
- interrupt flags
- upcalls
- multicall batching
BOOTING AND SETUP
The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.
Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.
Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper. The main
steps are:
1. Install the Xen paravirt_ops, which is simply a matter of a
structure assignment.
2. Set init_mm to use the Xen-supplied pagetables (analogous to the
head.S generated pagetables in a native boot).
3. Reserve address space for Xen, since it takes a chunk at the top
of the address space for its own use.
4. Call start_kernel()
PAGETABLE SETUP
Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state. One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable. Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind. It turns out that the easiest way to do this is
to use the initial Xen-provided pagetable as a template, and then just
insert new mappings for memory where a mapping doesn't already exist.
This means that during pagetable setup, the kernel uses a special version of
xen_set_pte which ignores any attempt to remap a read-only page as
read-write (since Xen will map its own initial pagetable as RO), but
lets other changes to the ptes happen, so that things like NX are set
properly.
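As a rough illustration of the rule described above, the following is a
minimal sketch and not the code from this patch. The PTE layout, the bit
macros and the name set_pte_init_sketch are all hypothetical; it only
shows the idea of refusing an RO to RW upgrade while still accepting
other bit changes such as NX.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified PTE layout used only for this illustration. */
typedef uint64_t pte_t;

#define PTE_PRESENT (1ULL << 0)
#define PTE_RW      (1ULL << 1)
#define PTE_NX      (1ULL << 63)

/*
 * Install 'newval' into '*ptep'. If the existing entry is present and
 * read-only, the RW bit of the new value is dropped, so an RO mapping
 * is never upgraded to RW. All other bit changes (NX, for example)
 * are accepted as given.
 */
static void set_pte_init_sketch(pte_t *ptep, pte_t newval)
{
	pte_t old = *ptep;

	if ((old & PTE_PRESENT) && !(old & PTE_RW))
		newval &= ~PTE_RW;

	*ptep = newval;
}
```

A caller that tries to remap one of the initial read-only pagetable
pages as writable ends up with the page still read-only, while an entry
that was previously empty is installed unchanged.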
PRIVILEGED INSTRUCTIONS AND SEGMENTATION
When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly. Non-performance-critical
instructions are dealt with by taking a privilege exception
and trapping into the hypervisor and emulating the instruction, but
more performance-critical instructions have their own specific
paravirt_ops. In many cases we can avoid having to do any hypercalls
for these instructions, or the Xen implementation is quite different
from the normal native version.
The privileged instructions fall into the broad classes of:
Segmentation: setting up the GDT and the GDT entries, LDT,
TLS and so on. Xen doesn't allow the GDT to be directly
modified; all GDT updates are done via hypercalls where the new
entries can be validated. This is important because Xen uses
segment limits to prevent the guest kernel from damaging the
hypervisor itself.
Traps and exceptions: Xen uses a special format for trap entrypoints,
so when the kernel wants to set an IDT entry, it needs to be
converted to the form Xen expects. Xen sets int 0x80 up specially
so that the trap goes straight from userspace into the guest kernel
without going via the hypervisor. sysenter isn't supported.
Kernel stack: The esp0 entry is extracted from the tss and provided to
Xen.
TLB operations: the various TLB calls are mapped into corresponding
Xen hypercalls.
Control registers: all the control registers are privileged. The most
important is cr3, which points to the base of the current pagetable,
and we handle it specially.
Another instruction we treat specially is CPUID, even though it's not
privileged. We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
Xen implements this mainly to disable the ACPI and APIC subsystems.
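The CPUID filtering can be pictured with a small sketch. This is an
assumption-laden illustration rather than the patch's implementation:
the function name cpuid_filter_sketch is made up, and it only handles
the EDX word of leaf 1, where bit 9 is the APIC feature bit and bit 22
is the bit Linux calls X86_FEATURE_ACPI.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 1, EDX feature bits (architectural bit positions). */
#define CPUID1_EDX_APIC (1u << 9)
#define CPUID1_EDX_ACPI (1u << 22)

/*
 * Filter raw CPUID output before the rest of the kernel sees it.
 * Only leaf 1 is touched; every other leaf passes through unchanged.
 */
static void cpuid_filter_sketch(uint32_t leaf, uint32_t *edx)
{
	if (leaf == 1)
		*edx &= ~(CPUID1_EDX_APIC | CPUID1_EDX_ACPI);
}
```

Routing CPUID through a hook like this means feature detection code
elsewhere in the kernel needs no Xen-specific checks; it simply never
sees the masked bits.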
INTERRUPT FLAGS
Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure. Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).
(A note on terminology: "events" and interrupts are effectively
synonymous. However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)
There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state. The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.
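The unmask-then-check pattern can be sketched as follows. Everything
here is hypothetical: the structure is a cut-down stand-in for
vcpu_info, the field and function names are invented for the example,
and the "hypercall" is simulated by a counter. The save_fl stand-in
returns the raw mask value, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the per-cpu vcpu_info; names are assumptions. */
struct vcpu_info_sketch {
	uint8_t upcall_pending;  /* non-zero: an event is waiting */
	uint8_t upcall_mask;     /* non-zero: event delivery is blocked */
};

/* Counts how often we would have called into the hypervisor. */
static int forced_callbacks;

/* Hypothetical stand-in for the hypercall that forces event delivery. */
static void force_evtchn_callback_sketch(struct vcpu_info_sketch *v)
{
	forced_callbacks++;
	v->upcall_pending = 0;
}

/* "cli": block event delivery by setting the mask flag. */
static void irq_disable_sketch(struct vcpu_info_sketch *v)
{
	v->upcall_mask = 1;
}

/* "sti": unmask, then check for an event that arrived while masked. */
static void irq_enable_sketch(struct vcpu_info_sketch *v)
{
	v->upcall_mask = 0;
	/* Real code would need a barrier between the store and the load. */
	if (v->upcall_pending)
		force_evtchn_callback_sketch(v);
}

/* "save_fl": report the mask state (non-zero means masked). */
static unsigned long save_fl_sketch(const struct vcpu_info_sketch *v)
{
	return v->upcall_mask;
}

/* "restore_fl": restore a saved mask state, delivering if now unmasked. */
static void restore_fl_sketch(struct vcpu_info_sketch *v, unsigned long flags)
{
	v->upcall_mask = (flags != 0);
	if (!v->upcall_mask && v->upcall_pending)
		force_evtchn_callback_sketch(v);
}
```

Note that masking never needs to enter the hypervisor in this sketch;
only the unmask path may, and only when an event is actually pending.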
UPCALLS
Xen needs a couple of upcall (or callback) functions to be implemented
by each guest. One is the event upcall, which is how events
(interrupts, effectively) are delivered to the guests. The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret. These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.
MULTICALL BATCHING
Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor. This is particularly useful for context switches,
since the 4-5 hypercalls they would normally need (reload cr3, update
TLS, maybe update LDT) can be reduced to one. This patch implements a
generic batching mechanism for hypercalls, which gets used in many
places in the Xen code.
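To make the batching idea concrete, here is a minimal sketch under
stated assumptions. It is not the interface added by this patch: the
structure layouts, BATCH_MAX and the mc_*_sketch names are invented,
and the multicall itself is simulated by counters so the example can
run outside a guest.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 8

/* Hypothetical queued call: an operation number and one argument. */
struct mc_entry_sketch {
	unsigned long op;
	unsigned long arg;
};

struct mc_batch_sketch {
	struct mc_entry_sketch entries[BATCH_MAX];
	size_t count;          /* calls currently queued */
	unsigned long traps;   /* simulated entries into the hypervisor */
	unsigned long calls;   /* total queued calls handed over so far */
};

/* Stand-in for the single multicall hypercall: one trap per flush. */
static void mc_flush_sketch(struct mc_batch_sketch *b)
{
	if (b->count == 0)
		return;
	b->traps++;
	b->calls += b->count;
	b->count = 0;
}

/* Queue one call; flush first if the buffer is already full. */
static void mc_queue_sketch(struct mc_batch_sketch *b,
			    unsigned long op, unsigned long arg)
{
	if (b->count == BATCH_MAX)
		mc_flush_sketch(b);
	assert(b->count < BATCH_MAX);
	b->entries[b->count].op = op;
	b->entries[b->count].arg = arg;
	b->count++;
}
```

A context switch in this model would queue its cr3, TLS and LDT updates
and then flush once, paying for a single trap instead of one per call.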
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>
2007-07-18 09:37:04 +08:00
/*
 * Machine specific setup for xen
 *
 * Jeremy Fitzhardinge <jeremy@xensource.com>, XenSource Inc, 2007
 */

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/pm.h>
2010-08-26 04:39:17 +08:00
#include <linux/memblock.h>
2011-04-02 06:28:35 +08:00
#include <linux/cpuidle.h>
2012-03-14 08:06:57 +08:00
#include <linux/cpufreq.h>

#include <asm/elf.h>
2008-01-30 20:30:42 +08:00
#include <asm/vdso.h>
#include <asm/e820.h>
#include <asm/setup.h>
2008-06-17 05:54:49 +08:00
#include <asm/acpi.h>
2012-08-17 22:22:37 +08:00
#include <asm/numa.h>
#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>

2010-10-26 07:32:29 +08:00
#include <xen/xen.h>
2008-05-27 06:31:19 +08:00
#include <xen/page.h>
2008-03-18 07:37:17 +08:00
#include <xen/interface/callback.h>
2009-02-07 11:09:48 +08:00
#include <xen/interface/memory.h>
#include <xen/interface/physdev.h>
#include <xen/features.h>
#include "xen-ops.h"
2007-07-20 15:31:43 +08:00
#include "vdso.h"
2014-08-12 02:57:57 +08:00
#include "p2m.h"
2014-11-28 18:53:53 +08:00
#include "mmu.h"

/* Amount of extra memory space we add to the e820 ranges */
struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;

/* Number of pages released from the initial allocation. */
unsigned long xen_released_pages;

/* E820 map used during setting up memory. */
static struct e820entry xen_e820_map[E820MAX] __initdata;
static u32 xen_e820_map_entries __initdata;

/*
 * Buffer used to remap identity mapped pages. We only need the virtual space.
 * The physical page behind this address is remapped as needed to different
 * buffer pages.
 */
#define REMAP_SIZE	(P2M_PER_PAGE - 3)
static struct {
	unsigned long	next_area_mfn;
	unsigned long	target_pfn;
	unsigned long	size;
	unsigned long	mfns[REMAP_SIZE];
} xen_remap_buf __initdata __aligned(PAGE_SIZE);
static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;

/*
 * The maximum amount of extra memory compared to the base size. The
 * main scaling factor is the size of struct page. At extreme ratios
 * of base:extra, all the base memory can be filled with page
 * structures for the extra memory, leaving no space for anything
 * else.
 *
 * 10x seems like a reasonable balance between scaling flexibility and
 * leaving a practically usable system.
 */
#define EXTRA_MEM_RATIO		(10)
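
To make the scaling argument in the comment above concrete, here is a minimal user-space sketch. It assumes 4 KiB pages and a 64-byte struct page (common on x86-64, but an assumption here), and both `cap_extra_pages()` and `page_struct_overhead_pct()` are hypothetical helpers for illustration, not functions from this file.

```c
#include <assert.h>
#include <stdint.h>

#define SK_EXTRA_MEM_RATIO   10    /* mirrors EXTRA_MEM_RATIO above */
#define SK_PAGE_BYTES        4096  /* assumed page size */
#define SK_STRUCT_PAGE_BYTES 64    /* assumed sizeof(struct page) */

/* Clamp the requested extra pages to at most ratio * base pages. */
static uint64_t cap_extra_pages(uint64_t base_pages, uint64_t extra_pages)
{
	uint64_t limit = (uint64_t)SK_EXTRA_MEM_RATIO * base_pages;

	return extra_pages < limit ? extra_pages : limit;
}

/*
 * Percentage of base memory that would be consumed by the page
 * structures describing the extra memory (integer division, rounds down).
 */
static uint64_t page_struct_overhead_pct(uint64_t base_pages,
					 uint64_t extra_pages)
{
	assert(base_pages > 0);
	return extra_pages * SK_STRUCT_PAGE_BYTES * 100 /
	       (base_pages * SK_PAGE_BYTES);
}
```

Under these assumed sizes, a 10:1 ratio spends roughly 15% of base memory on page structures, while a 64:1 ratio would consume all of it.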

static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
{
	int i;

	for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
		/* Add new region. */
		if (xen_extra_mem[i].size == 0) {
			xen_extra_mem[i].start = start;
			xen_extra_mem[i].size  = size;
			break;
		}
		/* Append to existing region. */
		if (xen_extra_mem[i].start + xen_extra_mem[i].size == start) {
			xen_extra_mem[i].size += size;
			break;
		}
	}
	if (i == XEN_EXTRA_MEM_MAX_REGIONS)
		printk(KERN_WARNING "Warning: not enough extra memory regions\n");

	memblock_reserve(start, size);
}
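
The add-or-append logic of xen_add_extra_mem() can be tried outside the kernel with a small sketch. The names `mem_region`, `mem_regions`, `mem_region_add()` and `SK_MAX_REGIONS` are hypothetical stand-ins, and the memblock_reserve() call is deliberately left out, so this only illustrates the table handling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SK_MAX_REGIONS 4	/* stand-in for XEN_EXTRA_MEM_MAX_REGIONS */

struct mem_region {
	uint64_t start;
	uint64_t size;		/* size == 0 marks an unused slot */
};

static struct mem_region mem_regions[SK_MAX_REGIONS];

/* Returns the slot that took the range, or -1 if the table is full. */
static int mem_region_add(uint64_t start, uint64_t size)
{
	int i;

	assert(size > 0);	/* a zero size would look like a free slot */

	for (i = 0; i < SK_MAX_REGIONS; i++) {
		/* First unused slot: start a new region. */
		if (mem_regions[i].size == 0) {
			mem_regions[i].start = start;
			mem_regions[i].size = size;
			return i;
		}
		/* Range begins exactly where this region ends: grow it. */
		if (mem_regions[i].start + mem_regions[i].size == start) {
			mem_regions[i].size += size;
			return i;
		}
	}
	fprintf(stderr, "not enough extra memory regions\n");
	return -1;
}
```

Note that only forward adjacency is merged: a range that ends where an existing region starts gets its own slot, which matches the two cases handled in the function above.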

static void __init xen_del_extra_mem(phys_addr_t start, phys_addr_t size)
{
	int i;
	phys_addr_t start_r, size_r;

	for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
		start_r = xen_extra_mem[i].start;
		size_r = xen_extra_mem[i].size;

		/* Start of region. */
		if (start_r == start) {
			BUG_ON(size > size_r);
			xen_extra_mem[i].start += size;
			xen_extra_mem[i].size -= size;
			break;
		}
		/* End of region. */
		if (start_r + size_r == start + size) {
			BUG_ON(size > size_r);
			xen_extra_mem[i].size -= size;
			break;
		}
		/* Mid of region. */
		if (start > start_r && start < start_r + size_r) {
			BUG_ON(start + size > start_r + size_r);
			xen_extra_mem[i].size = start - start_r;
			/* Calling memblock_reserve() again is okay. */
			xen_add_extra_mem(start + size, start_r + size_r -
					  (start + size));
			break;
		}
	}
	memblock_free(start, size);
}
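
The three cases in xen_del_extra_mem() (trim the head, trim the tail, split in the middle) can be sketched in user space as follows. All names here (`extent`, `extents`, `extent_put()`, `extent_del()`, `SK_MAX_EXTENTS`) are hypothetical; BUG_ON() is replaced by assert(), the memblock calls are omitted, and unlike the kernel loop this sketch skips empty slots explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MAX_EXTENTS 4	/* stand-in for XEN_EXTRA_MEM_MAX_REGIONS */

struct extent {
	uint64_t start;
	uint64_t size;		/* size == 0 marks an unused slot */
};

static struct extent extents[SK_MAX_EXTENTS];

/* Store a range in the first unused slot; returns the slot or -1. */
static int extent_put(uint64_t start, uint64_t size)
{
	int i;

	for (i = 0; i < SK_MAX_EXTENTS; i++) {
		if (extents[i].size == 0) {
			extents[i].start = start;
			extents[i].size = size;
			return i;
		}
	}
	return -1;
}

/* Remove [start, start + size); returns the slot touched or -1. */
static int extent_del(uint64_t start, uint64_t size)
{
	int i;

	for (i = 0; i < SK_MAX_EXTENTS; i++) {
		uint64_t s = extents[i].start;
		uint64_t n = extents[i].size;

		if (n == 0)
			continue;	/* sketch-only: ignore empty slots */

		if (s == start) {			/* head of region */
			assert(size <= n);
			extents[i].start += size;
			extents[i].size -= size;
			return i;
		}
		if (s + n == start + size) {		/* tail of region */
			assert(size <= n);
			extents[i].size -= size;
			return i;
		}
		if (start > s && start < s + n) {	/* middle: split */
			assert(start + size <= s + n);
			extents[i].size = start - s;
			extent_put(start + size, s + n - (start + size));
			return i;
		}
	}
	return -1;
}
```

A middle deletion needs a second slot for the remainder, which is why the kernel function calls back into xen_add_extra_mem() for the tail piece.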

/*
 * Called during boot before the p2m list can take entries beyond the
 * hypervisor supplied p2m list. Entries in extra mem are to be regarded as
 * invalid.
 */
unsigned long __ref xen_chk_extra_mem(unsigned long pfn)
{
	int i;
	phys_addr_t addr = PFN_PHYS(pfn);

	for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
		if (addr >= xen_extra_mem[i].start &&
		    addr < xen_extra_mem[i].start + xen_extra_mem[i].size)
			return INVALID_P2M_ENTRY;
	}

	return IDENTITY_FRAME(pfn);
}
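
The lookup above is a half-open interval test on the physical address of the pfn. A minimal sketch, assuming 4 KiB pages and using hypothetical stand-ins (`SK_INVALID`, `SK_IDENTITY_BIT`, `sk_chk_extra()`) for INVALID_P2M_ENTRY and IDENTITY_FRAME(), whose real encodings are defined elsewhere in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT   12			/* assumed 4 KiB pages */
#define SK_INVALID      (~UINT64_C(0))		/* stand-in for INVALID_P2M_ENTRY */
#define SK_IDENTITY_BIT (UINT64_C(1) << 62)	/* stand-in identity marker */

struct extra_range {
	uint64_t start;		/* physical start address */
	uint64_t size;		/* length in bytes */
};

/* One example range: three pages starting at pfn 0x100. */
static const struct extra_range sk_extra[] = {
	{ 0x100000, 0x3000 },
};

/* Invalid if the pfn lies in [start, start + size), identity otherwise. */
static uint64_t sk_chk_extra(uint64_t pfn)
{
	uint64_t addr;
	size_t i;

	assert(pfn < (UINT64_C(1) << 52));	/* keep the shift in range */
	addr = pfn << SK_PAGE_SHIFT;

	for (i = 0; i < sizeof(sk_extra) / sizeof(sk_extra[0]); i++) {
		if (addr >= sk_extra[i].start &&
		    addr < sk_extra[i].start + sk_extra[i].size)
			return SK_INVALID;
	}
	return pfn | SK_IDENTITY_BIT;
}
```

The first pfn past the end of a range (0x103 in this example) is not part of it, so it comes back as an identity entry.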

/*
 * Mark all pfns of extra mem as invalid in p2m list.
 */
void __init xen_inv_extra_mem(void)
{
	unsigned long pfn, pfn_s, pfn_e;
	int i;

	for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
		if (!xen_extra_mem[i].size)
			continue;
		pfn_s = PFN_DOWN(xen_extra_mem[i].start);
		pfn_e = PFN_UP(xen_extra_mem[i].start + xen_extra_mem[i].size);
		for (pfn = pfn_s; pfn < pfn_e; pfn++)
			set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
	}
}
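
xen_inv_extra_mem() rounds its range outward (PFN_DOWN on the start, PFN_UP on the end) so that every page the region touches is invalidated, whereas xen_find_pfn_range() below rounds inward (PFN_UP on the start, PFN_DOWN on the end) so that only whole RAM pages are counted. A small sketch of the difference, assuming 4 KiB pages; the `SK_` macros mirror the kernel's PFN_DOWN/PFN_UP, and `pfn_cover()` and `pfn_within()` are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT  12
#define SK_PAGE_SIZE   (UINT64_C(1) << SK_PAGE_SHIFT)
#define SK_PFN_DOWN(x) ((x) >> SK_PAGE_SHIFT)
#define SK_PFN_UP(x)   (((x) + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT)

struct pfn_range {
	uint64_t start;		/* first pfn in the range */
	uint64_t end;		/* one past the last pfn */
};

/* Every page that overlaps [addr, addr + size), partial pages included. */
static struct pfn_range pfn_cover(uint64_t addr, uint64_t size)
{
	struct pfn_range r;

	assert(addr + size >= addr);	/* no wrap-around */
	r.start = SK_PFN_DOWN(addr);
	r.end = SK_PFN_UP(addr + size);
	return r;
}

/* Only pages that lie entirely inside [addr, addr + size). */
static struct pfn_range pfn_within(uint64_t addr, uint64_t size)
{
	struct pfn_range r;

	assert(addr + size >= addr);	/* no wrap-around */
	r.start = SK_PFN_UP(addr);
	r.end = SK_PFN_DOWN(addr + size);
	return r;
}
```

For a page-aligned region the two helpers agree; they differ only when the start or the end falls inside a page.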

/*
 * Finds the next RAM pfn available in the E820 map after min_pfn.
 * This function updates min_pfn with the pfn found and returns
 * the size of that range or zero if not found.
 */
static unsigned long __init xen_find_pfn_range(unsigned long *min_pfn)
xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
When the Xen hypervisor boots a PV kernel it hands it two pieces
of information: nr_pages and a made up E820 entry.
The nr_pages value defines the range from zero to nr_pages of PFNs
which have a valid Machine Frame Number (MFN) underneath it. The
E820 mirrors that (with the VGA hole):
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 00000000000a0000 (usable)
Xen: 00000000000a0000 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000080800000 (usable)
The fun comes when a PV guest is run with a machine E820 - that
can either be the initial domain or a PCI PV guest, where the E820
looks like the normal thing:
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 000000000009e000 (usable)
Xen: 000000000009ec00 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000020000000 (usable)
Xen: 0000000020000000 - 0000000020200000 (reserved)
Xen: 0000000020200000 - 0000000040000000 (usable)
Xen: 0000000040000000 - 0000000040200000 (reserved)
Xen: 0000000040200000 - 00000000bad80000 (usable)
Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS)
..
With that, overlaying the nr_pages directly on the E820 does not
work as there are gaps and non-RAM regions that won't be used
by the memory allocator. The 'xen_release_chunk' helps with that
by punching holes in the P2M (PFN to MFN lookup tree) for those
regions and tells us that:
Freeing 20000-20200 pfn range: 512 pages freed
Freeing 40000-40200 pfn range: 512 pages freed
Freeing bad80-badf4 pfn range: 116 pages freed
Freeing badf6-bae7f pfn range: 137 pages freed
Freeing bb000-100000 pfn range: 282624 pages freed
Released 283999 pages of unused memory
Those 283999 pages are subtracted from the nr_pages and are returned
to the hypervisor. The end result is that the initial domain
boots with 1GB less memory as the nr_pages has been subtracted by
the amount of pages residing within the PCI hole. It can balloon up
to that if desired using 'xl mem-set 0 8092', but the balloon driver
is not always compiled in for the initial domain.
This patch implements the populate hypercall (XENMEM_populate_physmap),
which grows the domain by the same number of pages that
were released.
The other solution (that did not work) was to transplant the MFN in
the P2M tree - the ones that were going to be freed were put in
the E820_RAM regions past the nr_pages. But the modifications to the
M2P array (the other side of creating PTEs) were not carried away.
As the hypervisor is the only one capable of modifying that and the
only two hypercalls that would do this are: the update_va_mapping
(which won't work, as during initial bootup only PFNs up to nr_pages
are mapped in the guest) or via the populate hypercall.
The end result is that the kernel can now boot with the
nr_pages without having to subtract the 283999 pages.
On an 8GB machine, with various dom0_mem= parameters this is what we get:
no dom0_mem
-Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init)
+Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init)
dom0_mem=3G
-Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init)
+Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init)
dom0_mem=max:3G
-Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init)
+Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init)
And the 'xm list' or 'xl list' now reflect what the dom0_mem=
argument is.
Acked-by: David Vrabel <david.vrabel@citrix.com>
[v2: Use populate hypercall]
[v3: Remove debug printks]
[v4: Simplify code]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 22:07:11 +08:00
{
	const struct e820entry *entry = xen_e820_map;
	unsigned int i;
	unsigned long done = 0;

	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
		unsigned long s_pfn;
		unsigned long e_pfn;

		if (entry->type != E820_RAM)
			continue;

		e_pfn = PFN_DOWN(entry->addr + entry->size);

		/* We only care about E820 after this */
		if (e_pfn < *min_pfn)
xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
When the Xen hypervisor boots a PV kernel it hands it two pieces
of information: nr_pages and a made up E820 entry.
The nr_pages value defines the range from zero to nr_pages of PFNs
which have a valid Machine Frame Number (MFN) underneath it. The
E820 mirrors that (with the VGA hole):
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 00000000000a0000 (usable)
Xen: 00000000000a0000 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000080800000 (usable)
The fun comes when a PV guest that is run with a machine E820 - that
can either be the initial domain or a PCI PV guest, where the E820
looks like the normal thing:
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 000000000009e000 (usable)
Xen: 000000000009ec00 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000020000000 (usable)
Xen: 0000000020000000 - 0000000020200000 (reserved)
Xen: 0000000020200000 - 0000000040000000 (usable)
Xen: 0000000040000000 - 0000000040200000 (reserved)
Xen: 0000000040200000 - 00000000bad80000 (usable)
Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS)
..
With that overlaying the nr_pages directly on the E820 does not
work as there are gaps and non-RAM regions that won't be used
by the memory allocator. The 'xen_release_chunk' helps with that
by punching holes in the P2M (PFN to MFN lookup tree) for those
regions and tells us that:
Freeing 20000-20200 pfn range: 512 pages freed
Freeing 40000-40200 pfn range: 512 pages freed
Freeing bad80-badf4 pfn range: 116 pages freed
Freeing badf6-bae7f pfn range: 137 pages freed
Freeing bb000-100000 pfn range: 282624 pages freed
Released 283999 pages of unused memory
Those 283999 pages are subtracted from the nr_pages and are returned
to the hypervisor. The end result is that the initial domain
boots with 1GB less memory as the nr_pages has been subtracted by
the amount of pages residing within the PCI hole. It can balloon up
to that if desired using 'xl mem-set 0 8092', but the balloon driver
is not always compiled in for the initial domain.
This patch, implements the populate hypercall (XENMEM_populate_physmap)
which increases the the domain with the same amount of pages that
were released.
The other solution (that did not work) was to transplant the MFN in
the P2M tree - the ones that were going to be freed were put in
the E820_RAM regions past the nr_pages. But the modifications to the
M2P array (the other side of creating PTEs) were not carried away.
As the hypervisor is the only one capable of modifying that and the
only two hypercalls that would do this are: the update_va_mapping
(which won't work, as during initial bootup only PFNs up to nr_pages
are mapped in the guest) or via the populate hypercall.
The end result is that the kernel can now boot with the
nr_pages without having to subtract the 283999 pages.
On a 8GB machine, with various dom0_mem= parameters this is what we get:
no dom0_mem
-Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init)
+Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init)
dom0_mem=3G
-Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init)
+Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init)
dom0_mem=max:3G
-Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init)
+Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init)
And the 'xm list' or 'xl list' now reflect what the dom0_mem=
argument is.
Acked-by: David Vrabel <david.vrabel@citrix.com>
[v2: Use populate hypercall]
[v3: Remove debug printks]
[v4: Simplify code]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 22:07:11 +08:00
			continue;

2012-07-18 13:06:39 +08:00
		s_pfn = PFN_UP(entry->addr);
2014-08-12 02:57:57 +08:00

		/* If min_pfn falls within the E820 entry, we want to start
		 * at the min_pfn PFN.
xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
When the Xen hypervisor boots a PV kernel it hands it two pieces
of information: nr_pages and a made-up E820 entry.
The nr_pages value defines the range from zero to nr_pages of PFNs
which have a valid Machine Frame Number (MFN) underneath them. The
E820 mirrors that (with the VGA hole):
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 00000000000a0000 (usable)
Xen: 00000000000a0000 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000080800000 (usable)
The fun comes when a PV guest is run with a machine E820 - that
can either be the initial domain or a PCI PV guest - where the E820
looks like the normal thing:
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 000000000009e000 (usable)
Xen: 000000000009ec00 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000020000000 (usable)
Xen: 0000000020000000 - 0000000020200000 (reserved)
Xen: 0000000020200000 - 0000000040000000 (usable)
Xen: 0000000040000000 - 0000000040200000 (reserved)
Xen: 0000000040200000 - 00000000bad80000 (usable)
Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS)
..
With that, overlaying the nr_pages directly on the E820 does not
work, as there are gaps and non-RAM regions that won't be used
by the memory allocator. The 'xen_release_chunk' helps with that
by punching holes in the P2M (PFN to MFN lookup tree) for those
regions and tells us that:
Freeing 20000-20200 pfn range: 512 pages freed
Freeing 40000-40200 pfn range: 512 pages freed
Freeing bad80-badf4 pfn range: 116 pages freed
Freeing badf6-bae7f pfn range: 137 pages freed
Freeing bb000-100000 pfn range: 282624 pages freed
Released 283999 pages of unused memory
Those 283999 pages are subtracted from nr_pages and returned
to the hypervisor. The end result is that the initial domain
boots with about 1GB less memory, as nr_pages has been reduced by
the number of pages residing within the PCI hole. It can balloon up
to that if desired using 'xl mem-set 0 8092', but the balloon driver
is not always compiled in for the initial domain.
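The page arithmetic behind the "Freeing ..." lines and the "1GB less memory" remark can be checked with a small user-space sketch. It assumes 4 KiB pages (PAGE_SHIFT of 12, as on x86); the helper names pfn_range_pages and pages_to_kib are made up for illustration and are not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages, as on x86 */

/* Number of pages in the half-open PFN range [start_pfn, end_pfn). */
static unsigned long pfn_range_pages(unsigned long start_pfn,
				     unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return end_pfn - start_pfn;
}

/* Size of nr_pages pages, expressed in KiB. */
static unsigned long long pages_to_kib(unsigned long nr_pages)
{
	return ((unsigned long long)nr_pages << PAGE_SHIFT) >> 10;
}
```

For instance, pfn_range_pages(0x20000, 0x20200) gives the 512 pages reported for the first range, and pages_to_kib(283999) gives 1135996 KiB, a little over 1 GiB.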
This patch implements the populate hypercall (XENMEM_populate_physmap),
which increases the domain's memory by the same number of pages that
were released.
The other solution (that did not work) was to transplant the MFNs in
the P2M tree - the ones that were going to be freed were put in
the E820_RAM regions past the nr_pages. But the modifications to the
M2P array (the other side of creating PTEs) were not carried over,
as the hypervisor is the only one capable of modifying that, and the
only two hypercalls that would do this are update_va_mapping
(which won't work, as during initial bootup only PFNs up to nr_pages
are mapped in the guest) and the populate hypercall.
The end result is that the kernel can now boot with the
nr_pages without having to subtract the 283999 pages.
On an 8GB machine, with various dom0_mem= parameters, this is what we get:
no dom0_mem
-Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init)
+Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init)
dom0_mem=3G
-Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init)
+Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init)
dom0_mem=max:3G
-Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init)
+Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init)
And the 'xm list' or 'xl list' now reflect what the dom0_mem=
argument is.
Acked-by: David Vrabel <david.vrabel@citrix.com>
[v2: Use populate hypercall]
[v3: Remove debug printks]
[v4: Simplify code]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 22:07:11 +08:00
		 */
2014-08-12 02:57:57 +08:00
		if (s_pfn <= *min_pfn) {
			done = e_pfn - *min_pfn;
2012-04-06 22:07:11 +08:00
		} else {
2014-08-12 02:57:57 +08:00
			done = e_pfn - s_pfn;
			*min_pfn = s_pfn;
2012-04-06 22:07:11 +08:00
		}
2014-08-12 02:57:57 +08:00
		break;
	}
2012-04-06 22:07:11 +08:00

2014-08-12 02:57:57 +08:00
	return done;
}
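The range search that ends above can be modelled in user space as follows. This is only a sketch: the start of the loop is not visible in this excerpt, so the RAM-type filter, the e_pfn computation and the "skip entries ending at or before min_pfn" test are assumptions, and struct mock_e820_entry, MOCK_E820_RAM and mock_find_pfn_range are made-up stand-ins rather than kernel names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define PFN_UP(x)   (((x) + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

enum { MOCK_E820_RAM = 1, MOCK_E820_RESERVED = 2 };

struct mock_e820_entry {
	unsigned long long addr;
	unsigned long long size;
	int type;
};

/*
 * Find the first RAM range that still has pages at or after *min_pfn.
 * Returns the number of pages available and, if the range starts after
 * *min_pfn, advances *min_pfn to the start of that range.
 */
static unsigned long mock_find_pfn_range(const struct mock_e820_entry *map,
					 size_t nr_entries,
					 unsigned long *min_pfn)
{
	unsigned long done = 0;
	size_t i;

	assert(map != NULL && min_pfn != NULL);

	for (i = 0; i < nr_entries; i++) {
		unsigned long s_pfn, e_pfn;

		if (map[i].type != MOCK_E820_RAM)	/* assumed filter */
			continue;

		e_pfn = PFN_DOWN(map[i].addr + map[i].size);
		if (e_pfn <= *min_pfn)			/* assumed test */
			continue;

		s_pfn = PFN_UP(map[i].addr);

		/* Same branch structure as the excerpt above. */
		if (s_pfn <= *min_pfn) {
			done = e_pfn - *min_pfn;
		} else {
			done = e_pfn - s_pfn;
			*min_pfn = s_pfn;
		}
		break;
	}

	return done;
}
```

With a map taken from the E820 listing in the commit message (RAM at 0x100000-0x20000000, reserved at 0x20000000-0x20200000, RAM at 0x20200000-0x40000000), a min_pfn inside the reserved hole is moved up to 0x20200, while a min_pfn inside a RAM entry is left unchanged.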
2012-04-06 22:07:11 +08:00

2014-11-28 18:53:53 +08:00
static int __init xen_free_mfn(unsigned long mfn)
{
	struct xen_memory_reservation reservation = {
		.address_bits = 0,
		.extent_order = 0,
		.domid = DOMID_SELF
	};

	set_xen_guest_handle(reservation.extent_start, &mfn);
	reservation.nr_extents = 1;

	return HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
}

2014-08-12 02:57:57 +08:00
/*
2014-11-28 18:53:53 +08:00
 * This releases a chunk of memory and then does the identity map. It's used
2014-08-12 02:57:57 +08:00
 * as a fallback if the remapping fails.
 */
static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
2015-01-07 19:01:08 +08:00
	unsigned long end_pfn, unsigned long nr_pages, unsigned long *released)
2014-08-12 02:57:57 +08:00
{
2014-11-28 18:53:53 +08:00
	unsigned long pfn, end;
	int ret;

2014-08-12 02:57:57 +08:00
	WARN_ON(start_pfn > end_pfn);

2015-01-07 19:01:08 +08:00
	/* Release pages first. */
2014-11-28 18:53:53 +08:00
	end = min(end_pfn, nr_pages);
	for (pfn = start_pfn; pfn < end; pfn++) {
		unsigned long mfn = pfn_to_mfn(pfn);

		/* Make sure pfn exists to start with */
		if (mfn == INVALID_P2M_ENTRY || mfn_to_pfn(mfn) != pfn)
			continue;

		ret = xen_free_mfn(mfn);
		WARN(ret != 1, "Failed to release pfn %lx err=%d\n", pfn, ret);

		if (ret == 1) {
2015-01-07 19:01:08 +08:00
			(*released)++;
2014-11-28 18:53:53 +08:00
			if (!__set_phys_to_machine(pfn, INVALID_P2M_ENTRY))
				break;
		} else
			break;
	}

2015-01-07 19:01:08 +08:00
	set_phys_range_identity(start_pfn, end_pfn);
2014-08-12 02:57:57 +08:00
}

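The release loop above can be exercised outside the kernel with a mocked P2M array. The sketch below is a simplification under stated assumptions: mock_p2m, mock_free_mfn and mock_release_range are invented names, the hypercall is replaced by a stub that always reports one extent released, and the M2P consistency check and the final identity mapping are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_INVALID_P2M_ENTRY (~0UL)
#define MOCK_NR_PFNS 16

/* Stand-in for the guest's PFN -> MFN table. */
static unsigned long mock_p2m[MOCK_NR_PFNS];

/*
 * Stub for the decrease-reservation hypercall: like the real call it
 * reports the number of extents released, so 1 means success here.
 */
static int mock_free_mfn(unsigned long mfn)
{
	(void)mfn;
	return 1;
}

/*
 * Release every populated PFN in [start_pfn, min(end_pfn, nr_pages))
 * and invalidate its P2M entry. Returns the number of pages released.
 */
static unsigned long mock_release_range(unsigned long start_pfn,
					unsigned long end_pfn,
					unsigned long nr_pages)
{
	unsigned long end = end_pfn < nr_pages ? end_pfn : nr_pages;
	unsigned long released = 0;
	unsigned long pfn;

	assert(start_pfn <= end_pfn);
	assert(end <= MOCK_NR_PFNS);

	for (pfn = start_pfn; pfn < end; pfn++) {
		unsigned long mfn = mock_p2m[pfn];

		/* Skip holes: nothing to give back for this PFN. */
		if (mfn == MOCK_INVALID_P2M_ENTRY)
			continue;

		/* Stop at the first failure, as the loop above does. */
		if (mock_free_mfn(mfn) != 1)
			break;

		released++;
		mock_p2m[pfn] = MOCK_INVALID_P2M_ENTRY;
	}

	return released;
}
```

Note that only PFNs below nr_pages are released, since PFNs past nr_pages have no MFN behind them to return.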
/*
2014-11-28 18:53:53 +08:00
 * Helper function to update the p2m and m2p tables and kernel mapping.
2014-08-12 02:57:57 +08:00
 */
2014-11-28 18:53:53 +08:00
static void __init xen_update_mem_tables(unsigned long pfn, unsigned long mfn)
2014-08-12 02:57:57 +08:00
{
	struct mmu_update update = {
2015-01-28 14:44:22 +08:00
		.ptr = ((uint64_t)mfn << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE,
2014-08-12 02:57:57 +08:00
		.val = pfn
	};

	/* Update p2m */
2014-11-28 18:53:53 +08:00
	if (!set_phys_to_machine(pfn, mfn)) {
2014-08-12 02:57:57 +08:00
		WARN(1, "Failed to set p2m mapping for pfn=%ld mfn=%ld\n",
		     pfn, mfn);
2014-11-28 18:53:53 +08:00
		BUG();
2012-04-06 22:07:11 +08:00
	}
2014-08-12 02:57:57 +08:00

	/* Update m2p */
	if (HYPERVISOR_mmu_update(&update, 1, NULL, DOMID_SELF) < 0) {
		WARN(1, "Failed to set m2p mapping for mfn=%ld pfn=%ld\n",
		     mfn, pfn);
2014-11-28 18:53:53 +08:00
		BUG();
2014-08-12 02:57:57 +08:00
	}

2014-11-28 18:53:53 +08:00
	/* Update kernel mapping, but not for highmem. */
2015-01-12 13:05:09 +08:00
	if (pfn >= PFN_UP(__pa(high_memory - 1)))
2014-11-28 18:53:53 +08:00
		return;

	if (HYPERVISOR_update_va_mapping((unsigned long)__va(pfn << PAGE_SHIFT),
					 mfn_pte(mfn, PAGE_KERNEL), 0)) {
		WARN(1, "Failed to update kernel mapping for mfn=%ld pfn=%ld\n",
		     mfn, pfn);
		BUG();
	}
2012-04-06 22:07:11 +08:00
|
|
|
}

/*
 * This function updates the p2m and m2p tables with an identity map from
 * start_pfn to start_pfn+size and prepares remapping the underlying RAM of the
 * original allocation at remap_pfn. The information needed for remapping is
 * saved in the memory itself to avoid the need for allocating buffers. The
 * complete remap information is contained in a list of MFNs each containing
 * up to REMAP_SIZE MFNs and the start target PFN for doing the remap.
 * This enables us to preserve the original mfn sequence while doing the
 * remapping at a time when the memory management is capable of allocating
 * virtual and physical memory in arbitrary amounts, see 'xen_remap_memory' and
 * its callers.
 */
static void __init xen_do_set_identity_and_remap_chunk(
	unsigned long start_pfn, unsigned long size, unsigned long remap_pfn)
{
	unsigned long buf = (unsigned long)&xen_remap_buf;
	unsigned long mfn_save, mfn;
	unsigned long ident_pfn_iter, remap_pfn_iter;
	unsigned long ident_end_pfn = start_pfn + size;
	unsigned long left = size;
	unsigned int i, chunk;

	WARN_ON(size == 0);

	BUG_ON(xen_feature(XENFEAT_auto_translated_physmap));

	mfn_save = virt_to_mfn(buf);

	for (ident_pfn_iter = start_pfn, remap_pfn_iter = remap_pfn;
	     ident_pfn_iter < ident_end_pfn;
	     ident_pfn_iter += REMAP_SIZE, remap_pfn_iter += REMAP_SIZE) {
		chunk = (left < REMAP_SIZE) ? left : REMAP_SIZE;

		/* Map first pfn to xen_remap_buf */
		mfn = pfn_to_mfn(ident_pfn_iter);
		set_pte_mfn(buf, mfn, PAGE_KERNEL);

		/* Save mapping information in page */
		xen_remap_buf.next_area_mfn = xen_remap_mfn;
		xen_remap_buf.target_pfn = remap_pfn_iter;
		xen_remap_buf.size = chunk;
		for (i = 0; i < chunk; i++)
			xen_remap_buf.mfns[i] = pfn_to_mfn(ident_pfn_iter + i);

		/* Put remap buf into list. */
		xen_remap_mfn = mfn;

		/* Set identity map */
		set_phys_range_identity(ident_pfn_iter, ident_pfn_iter + chunk);

		left -= chunk;
	}

	/* Restore old xen_remap_buf mapping */
	set_pte_mfn(buf, mfn_save, PAGE_KERNEL);
}
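
The idea of keeping the remap information inside the memory that is about to be remapped can be modelled in user space. The sketch below is only an illustration under stated assumptions: every `toy_` name, the tiny `TOY_REMAP_SIZE`, the fixed-size `toy_frames` array standing in for machine memory, and the `pfn + 8` translation are hypothetical and are not part of the kernel source. It shows the same pattern as the function above: split a PFN range into chunks, write one descriptor into the first frame of each chunk, and push that frame onto a singly linked list anchored at a head MFN.

```c
#include <stddef.h>

#define TOY_REMAP_SIZE 4        /* hypothetical; much smaller than the kernel's REMAP_SIZE */
#define TOY_NR_FRAMES  32       /* size of the toy "machine memory" */
#define TOY_INVALID    (~0UL)   /* end-of-list marker */

/* Descriptor stored inside the first frame of each chunk. */
struct toy_remap_info {
	unsigned long next_area_mfn;        /* previous list head, or TOY_INVALID */
	unsigned long target_pfn;           /* where this chunk is to be remapped */
	unsigned long size;                 /* number of valid entries in mfns[] */
	unsigned long mfns[TOY_REMAP_SIZE]; /* MFNs of the chunk, in original order */
};

/* Toy machine memory: one descriptor-sized slot per frame, indexed by MFN. */
struct toy_remap_info toy_frames[TOY_NR_FRAMES];
unsigned long toy_list_head = TOY_INVALID;

/* Toy p2m lookup: PFN n is backed by MFN n + 8 (an arbitrary assumption). */
static unsigned long toy_pfn_to_mfn(unsigned long pfn)
{
	return pfn + 8;
}

/*
 * Record remap information for [start_pfn, start_pfn + size) in chunks of
 * at most TOY_REMAP_SIZE frames. The caller must keep all MFNs below
 * TOY_NR_FRAMES.
 */
void toy_prepare_remap(unsigned long start_pfn, unsigned long size,
		       unsigned long remap_pfn)
{
	unsigned long left = size;
	unsigned long pfn, target;

	for (pfn = start_pfn, target = remap_pfn; left > 0;
	     pfn += TOY_REMAP_SIZE, target += TOY_REMAP_SIZE) {
		unsigned long chunk = left < TOY_REMAP_SIZE ? left : TOY_REMAP_SIZE;
		unsigned long mfn = toy_pfn_to_mfn(pfn);
		struct toy_remap_info *info = &toy_frames[mfn];
		unsigned long i;

		/* Save the mapping information in the chunk's first frame. */
		info->next_area_mfn = toy_list_head;
		info->target_pfn = target;
		info->size = chunk;
		for (i = 0; i < chunk; i++)
			info->mfns[i] = toy_pfn_to_mfn(pfn + i);

		/* Push the frame onto the list. */
		toy_list_head = mfn;
		left -= chunk;
	}
}

/* Walk the list and count how many frames are queued for remapping. */
unsigned long toy_count_queued(void)
{
	unsigned long total = 0;
	unsigned long mfn;

	for (mfn = toy_list_head; mfn != TOY_INVALID;
	     mfn = toy_frames[mfn].next_area_mfn)
		total += toy_frames[mfn].size;
	return total;
}
```

Because each descriptor lives in the first frame of its own chunk, `mfns[0]` always equals the MFN through which the descriptor is reached. No separate buffer has to be allocated while the list is being built.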

/*
 * This function takes a contiguous pfn range that needs to be identity mapped
 * and:
 *
 *  1) Finds a new range of pfns to use to remap based on E820 and remap_pfn.
 *  2) Calls the do_ function to actually do the mapping/remapping work.
 *
 * The goal is to not allocate additional memory but to remap the existing
 * pages. In the case of an error the underlying memory is simply released back
 * to Xen and not remapped.
 */
static unsigned long __init xen_set_identity_and_remap_chunk(
	unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
	unsigned long remap_pfn, unsigned long *released,
	unsigned long *remapped)
{
	unsigned long pfn;
	unsigned long i = 0;
	unsigned long n = end_pfn - start_pfn;

	while (i < n) {
		unsigned long cur_pfn = start_pfn + i;
		unsigned long left = n - i;
		unsigned long size = left;
		unsigned long remap_range_size;

		/* Do not remap pages beyond the current allocation */
		if (cur_pfn >= nr_pages) {
			/* Identity map remaining pages */
			set_phys_range_identity(cur_pfn, cur_pfn + size);
			break;
		}
		if (cur_pfn + size > nr_pages)
			size = nr_pages - cur_pfn;

		remap_range_size = xen_find_pfn_range(&remap_pfn);
		if (!remap_range_size) {
			pr_warning("Unable to find available pfn range, not remapping identity pages\n");
			xen_set_identity_and_release_chunk(cur_pfn,
						cur_pfn + left, nr_pages, released);
			break;
		}
		/* Adjust size to fit in current e820 RAM region */
		if (size > remap_range_size)
			size = remap_range_size;

		xen_do_set_identity_and_remap_chunk(cur_pfn, size, remap_pfn);

		/* Update variables to reflect new mappings. */
		i += size;
		remap_pfn += size;
		*remapped += size;
	}

	/*
	 * If the PFNs are currently mapped, the VA mapping also needs
	 * to be updated to be 1:1.
	 */
	for (pfn = start_pfn; pfn <= max_pfn_mapped && pfn < end_pfn; pfn++)
		(void)HYPERVISOR_update_va_mapping(
			(unsigned long)__va(pfn << PAGE_SHIFT),
			mfn_pte(pfn, PAGE_KERNEL_IO), 0);

	return remap_pfn;
}

static void __init xen_set_identity_and_remap(unsigned long nr_pages,
	unsigned long *released, unsigned long *remapped)
{
	phys_addr_t start = 0;
	unsigned long last_pfn = nr_pages;
	const struct e820entry *entry = xen_e820_map;
	unsigned long num_released = 0;
	unsigned long num_remapped = 0;
	int i;

	/*
	 * Combine non-RAM regions and gaps until a RAM region (or the
	 * end of the map) is reached, then set the 1:1 map and
	 * remap the memory in those non-RAM regions.
	 *
	 * The combined non-RAM regions are rounded to a whole number
	 * of pages so any partial pages are accessible via the 1:1
	 * mapping. This is needed for some BIOSes that put (for
	 * example) the DMI tables in a reserved region that begins on
	 * a non-page boundary.
	 */
	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
		phys_addr_t end = entry->addr + entry->size;
		if (entry->type == E820_RAM || i == xen_e820_map_entries - 1) {
			unsigned long start_pfn = PFN_DOWN(start);
			unsigned long end_pfn = PFN_UP(end);

			if (entry->type == E820_RAM)
				end_pfn = PFN_UP(entry->addr);

			if (start_pfn < end_pfn)
				last_pfn = xen_set_identity_and_remap_chunk(
						start_pfn, end_pfn, nr_pages,
						last_pfn, &num_released,
						&num_remapped);
			start = end;
		}
	}

	*released = num_released;
	*remapped = num_remapped;

	pr_info("Released %ld page(s)\n", num_released);
}
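
The way the loop above merges non-RAM entries and gaps into page-rounded PFN ranges can be tried out in isolation. The following is a minimal user-space sketch, not the kernel code: the `toy_` types and macros are hypothetical stand-ins for `struct e820entry`, `PFN_DOWN` and `PFN_UP`, a 4 KiB page size is assumed, and the ranges are collected into an array instead of being passed on for remapping.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define TOY_PFN_DOWN(x) ((unsigned long)((x) >> TOY_PAGE_SHIFT))
#define TOY_PFN_UP(x) \
	((unsigned long)(((x) + (1ULL << TOY_PAGE_SHIFT) - 1) >> TOY_PAGE_SHIFT))

enum toy_type { TOY_RAM = 1, TOY_RESERVED = 2 };

struct toy_e820entry {
	uint64_t addr;
	uint64_t size;
	enum toy_type type;
};

struct toy_pfn_range {
	unsigned long start_pfn;   /* inclusive */
	unsigned long end_pfn;     /* exclusive */
};

/*
 * Walk a sorted map and report every run of non-RAM entries and gaps as a
 * PFN range. The start is rounded down and the end rounded up, so partial
 * pages at either edge are included. Returns the number of ranges stored.
 */
size_t toy_collect_holes(const struct toy_e820entry *map, size_t n,
			 struct toy_pfn_range *out, size_t max_out)
{
	uint64_t start = 0;   /* end of the last RAM region seen so far */
	size_t found = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t end = map[i].addr + map[i].size;

		if (map[i].type == TOY_RAM || i == n - 1) {
			unsigned long start_pfn = TOY_PFN_DOWN(start);
			unsigned long end_pfn = TOY_PFN_UP(end);

			/* A RAM entry terminates the hole at its own start. */
			if (map[i].type == TOY_RAM)
				end_pfn = TOY_PFN_UP(map[i].addr);

			if (start_pfn < end_pfn && found < max_out) {
				out[found].start_pfn = start_pfn;
				out[found].end_pfn = end_pfn;
				found++;
			}
			start = end;
		}
	}
	return found;
}
```

For a map with RAM at 0x0-0x9e000, a reserved entry at 0x9ec00-0x100000 and RAM starting at 0x100000, the first range reported covers PFNs 0x9e up to (but not including) 0x100. The unlisted gap between 0x9e000 and 0x9ec00 is folded into the same range as the reserved entry.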

/*
 * Remap the memory prepared in xen_do_set_identity_and_remap_chunk().
 * The remap information (which mfn is remapped to which pfn) is contained in
 * the to-be-remapped memory itself in a linked list anchored at xen_remap_mfn.
 * This scheme allows remapping the different chunks in arbitrary order while
 * the resulting mapping will be independent of the order.
 */
void __init xen_remap_memory(void)
{
	unsigned long buf = (unsigned long)&xen_remap_buf;
	unsigned long mfn_save, mfn, pfn;
	unsigned long remapped = 0;
	unsigned int i;
	unsigned long pfn_s = ~0UL;
	unsigned long len = 0;

	mfn_save = virt_to_mfn(buf);

	while (xen_remap_mfn != INVALID_P2M_ENTRY) {
		/* Map the remap information */
		set_pte_mfn(buf, xen_remap_mfn, PAGE_KERNEL);

		BUG_ON(xen_remap_mfn != xen_remap_buf.mfns[0]);

		pfn = xen_remap_buf.target_pfn;
		for (i = 0; i < xen_remap_buf.size; i++) {
			mfn = xen_remap_buf.mfns[i];
			xen_update_mem_tables(pfn, mfn);
			remapped++;
			pfn++;
		}
		if (pfn_s == ~0UL || pfn == pfn_s) {
			pfn_s = xen_remap_buf.target_pfn;
			len += xen_remap_buf.size;
		} else if (pfn_s + len == xen_remap_buf.target_pfn) {
			len += xen_remap_buf.size;
		} else {
			xen_del_extra_mem(PFN_PHYS(pfn_s), PFN_PHYS(len));
			pfn_s = xen_remap_buf.target_pfn;
			len = xen_remap_buf.size;
		}

		mfn = xen_remap_mfn;
		xen_remap_mfn = xen_remap_buf.next_area_mfn;
	}

	if (pfn_s != ~0UL && len)
		xen_del_extra_mem(PFN_PHYS(pfn_s), PFN_PHYS(len));

	set_pte_mfn(buf, mfn_save, PAGE_KERNEL);

	pr_info("Remapped %ld page(s)\n", remapped);
}
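
The `pfn_s`/`len` bookkeeping in xen_remap_memory() merges adjacent chunks so that one call covers each contiguous PFN range. A small user-space sketch of that merging step follows. It assumes the chunks arrive as plain (target PFN, size) pairs, and the `toy_` names are hypothetical: instead of calling xen_del_extra_mem() it writes each finished range into an output array.

```c
#include <stddef.h>

#define TOY_NO_RANGE (~0UL)   /* marks "no range pending yet" */

struct toy_chunk {
	unsigned long target_pfn;   /* first PFN the chunk is remapped to */
	unsigned long size;         /* number of frames in the chunk */
};

struct toy_range {
	unsigned long start_pfn;
	unsigned long len;
};

/*
 * Merge chunks that touch the pending range at either end. A chunk that
 * is not adjacent flushes the pending range and starts a new one.
 * Returns the number of ranges written to out.
 */
size_t toy_coalesce(const struct toy_chunk *chunks, size_t n,
		    struct toy_range *out, size_t max_out)
{
	unsigned long pfn_s = TOY_NO_RANGE;
	unsigned long len = 0;
	size_t found = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long first = chunks[i].target_pfn;
		unsigned long end = first + chunks[i].size;

		if (pfn_s == TOY_NO_RANGE || end == pfn_s) {
			/* First chunk, or chunk ends where the range begins. */
			pfn_s = first;
			len += chunks[i].size;
		} else if (pfn_s + len == first) {
			/* Chunk begins where the range ends. */
			len += chunks[i].size;
		} else {
			/* Not adjacent: emit the pending range, start afresh. */
			if (found < max_out) {
				out[found].start_pfn = pfn_s;
				out[found].len = len;
				found++;
			}
			pfn_s = first;
			len = chunks[i].size;
		}
	}

	/* Emit whatever is still pending after the last chunk. */
	if (pfn_s != TOY_NO_RANGE && len && found < max_out) {
		out[found].start_pfn = pfn_s;
		out[found].len = len;
		found++;
	}
	return found;
}
```

Since the remap list is built by pushing onto its head, chunks of one range tend to come back in descending PFN order. That is why the sketch, like the loop above, also checks whether a chunk ends exactly where the pending range starts.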

static unsigned long __init xen_get_max_pages(void)
{
	unsigned long max_pages = MAX_DOMAIN_PAGES;
	domid_t domid = DOMID_SELF;
	int ret;

	/*
	 * For the initial domain we use the maximum reservation as
	 * the maximum page.
	 *
	 * For guest domains the current maximum reservation reflects
	 * the current maximum rather than the static maximum. In this
	 * case the e820 map provided to us will cover the static
	 * maximum region.
	 */
	if (xen_initial_domain()) {
		ret = HYPERVISOR_memory_op(XENMEM_maximum_reservation, &domid);
		if (ret > 0)
			max_pages = ret;
	}

	return min(max_pages, MAX_DOMAIN_PAGES);
}

static void __init xen_align_and_add_e820_region(phys_addr_t start,
						 phys_addr_t size, int type)
{
	phys_addr_t end = start + size;

	/* Align RAM regions to page boundaries. */
	if (type == E820_RAM) {
		start = PAGE_ALIGN(start);
		end &= ~((phys_addr_t)PAGE_SIZE - 1);
	}

	e820_add_region(start, end - start, type);
}
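
For RAM regions the function above shrinks the region inward: the start is rounded up to a page boundary and the end is rounded down, so only whole pages are reported as RAM. A minimal sketch of that arithmetic, assuming 4 KiB pages, is shown below. The `toy_` names are hypothetical, and the clamp to zero for regions smaller than one page is an addition of this sketch rather than something the kernel function does.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL   /* assumes 4 KiB pages */

struct toy_region {
	uint64_t start;
	uint64_t size;
};

/* Shrink [start, start + size) to the whole pages it fully contains. */
struct toy_region toy_align_ram_region(uint64_t start, uint64_t size)
{
	uint64_t end = start + size;
	struct toy_region r;

	/* Round the start up to the next page boundary. */
	start = (start + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
	/* Round the end down to the previous page boundary. */
	end &= ~(TOY_PAGE_SIZE - 1);

	r.start = start;
	/* Sketch-only guard: a region without a full page becomes empty. */
	r.size = end > start ? end - start : 0;
	return r;
}
```

A region starting at 0x9ec00 and ending at 0x100000 becomes 0x9f000-0x100000, so the partial page at the front is no longer claimed as RAM.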

static void __init xen_ignore_unusable(void)
{
	struct e820entry *entry = xen_e820_map;
	unsigned int i;

	for (i = 0; i < xen_e820_map_entries; i++, entry++) {
		if (entry->type == E820_UNUSABLE)
			entry->type = E820_RAM;
	}
}

/*
 * Reserve Xen mfn_list.
 * See comment above "struct start_info" in <xen/interface/xen.h>
 * We tried to make the memblock_reserve more selective so
 * that it would be clear what region is reserved. Sadly we ran
 * into the problem wherein on a 64-bit hypervisor with a 32-bit
 * initial domain, the pt_base has the cr3 value which is not
 * necessarily where the pagetable starts! As Jan put it: "
 * Actually, the adjustment turns out to be correct: The page
 * tables for a 32-on-64 dom0 get allocated in the order "first L1",
 * "first L2", "first L3", so the offset to the page table base is
 * indeed 2. When reading xen/include/public/xen.h's comment
 * very strictly, this is not a violation (since there nothing is said
 * that the first thing in the page table space is pointed to by
 * pt_base; I admit that this seems to be implied though, namely
 * do I think that it is implied that the page table space is the
 * range [pt_base, pt_base + nr_pt_frames), whereas that
 * range here indeed is [pt_base - 2, pt_base - 2 + nr_pt_frames),
 * which - without a priori knowledge - the kernel would have
 * difficulty to figure out)." - so let's just fall back to the
 * easy way and reserve the whole region.
 */
static void __init xen_reserve_xen_mfnlist(void)
{
	if (xen_start_info->mfn_list >= __START_KERNEL_map) {
		memblock_reserve(__pa(xen_start_info->mfn_list),
				 xen_start_info->pt_base -
				 xen_start_info->mfn_list);
		return;
	}

	memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
			 PFN_PHYS(xen_start_info->nr_p2m_frames));
}

xen: Core Xen implementation
This patch is a rollup of all the core pieces of the Xen
implementation, including:
- booting and setup
- pagetable setup
- privileged instructions
- segmentation
- interrupt flags
- upcalls
- multicall batching
BOOTING AND SETUP
The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.
Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.
Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper. The main
steps are:
1. Install the Xen paravirt_ops, which is simply a matter of a
structure assignment.
2. Set init_mm to use the Xen-supplied pagetables (analogous to the
head.S generated pagetables in a native boot).
3. Reserve address space for Xen, since it takes a chunk at the top
of the address space for its own use.
4. Call start_kernel()
PAGETABLE SETUP
Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state. One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable. Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind. It turns out that the easiest way to do this is
use the initial Xen-provided pagetable as a template, and then just
insert new mappings for memory where a mapping doesn't already exist.
This means that during pagetable setup, it uses a special version of
xen_set_pte which ignores any attempt to remap a read-only page as
read-write (since Xen will map its own initial pagetable as RO), but
lets other changes to the ptes happen, so that things like NX are set
properly.
PRIVILEGED INSTRUCTIONS AND SEGMENTATION
When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly. Non-performance
critical instructions are dealt with by taking a privilege exception
and trapping into the hypervisor and emulating the instruction, but
more performance-critical instructions have their own specific
paravirt_ops. In many cases we can avoid having to do any hypercalls
for these instructions, or the Xen implementation is quite different
from the normal native version.
The privileged instructions fall into the broad classes of:
Segmentation: setting up the GDT and the GDT entries, LDT,
TLS and so on. Xen doesn't allow the GDT to be directly
modified; all GDT updates are done via hypercalls where the new
entries can be validated. This is important because Xen uses
segment limits to prevent the guest kernel from damaging the
hypervisor itself.
Traps and exceptions: Xen uses a special format for trap entrypoints,
so when the kernel wants to set an IDT entry, it needs to be
converted to the form Xen expects. Xen sets int 0x80 up specially
so that the trap goes straight from userspace into the guest kernel
without going via the hypervisor. sysenter isn't supported.
Kernel stack: The esp0 entry is extracted from the tss and provided to
Xen.
TLB operations: the various TLB calls are mapped into corresponding
Xen hypercalls.
Control registers: all the control registers are privileged. The most
important is cr3, which points to the base of the current pagetable,
and we handle it specially.
Another instruction we treat specially is CPUID, even though it's not
privileged. We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
Xen implements this mainly to disable the ACPI and APIC subsystems.
INTERRUPT FLAGS
Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure. Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).
(A note on terminology: "events" and interrupts are effectively
synonymous. However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)
There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state. The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.
UPCALLS
Xen needs a couple of upcall (or callback) functions to be implemented
by each guest. One is the event upcalls, which is how events
(interrupts, effectively) are delivered to the guests. The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret. These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.
MULTICALL BATCHING
Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor. This is particularly useful for context switches,
since the 4-5 hypercalls they would normally need (reload cr3, update
TLS, maybe update LDT) can be reduced to one. This patch implements a
generic batching mechanism for hypercalls, which gets used in many
places in the Xen code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>
2007-07-18 09:37:04 +08:00
/**
 * machine_specific_memory_setup - Hook for machine specific memory setup.
 **/
char * __init xen_memory_setup(void)
{
	unsigned long max_pfn = xen_start_info->nr_pages;
	phys_addr_t mem_end;
	int rc;
	struct xen_memory_map memmap;
	unsigned long max_pages;
	unsigned long extra_pages = 0;
	unsigned long remapped_pages;
	int i;
	int op;

	max_pfn = min(MAX_DOMAIN_PAGES, max_pfn);
	mem_end = PFN_PHYS(max_pfn);

	memmap.nr_entries = E820MAX;
	set_xen_guest_handle(memmap.buffer, xen_e820_map);

	op = xen_initial_domain() ?
		XENMEM_machine_memory_map :
		XENMEM_memory_map;
	rc = HYPERVISOR_memory_op(op, &memmap);
	if (rc == -ENOSYS) {
		BUG_ON(xen_initial_domain());
		memmap.nr_entries = 1;
		xen_e820_map[0].addr = 0ULL;
		xen_e820_map[0].size = mem_end;
		/* 8MB slack (to balance backend allocations). */
		xen_e820_map[0].size += 8ULL << 20;
		xen_e820_map[0].type = E820_RAM;
		rc = 0;
	}
	BUG_ON(rc);
	BUG_ON(memmap.nr_entries == 0);
	xen_e820_map_entries = memmap.nr_entries;

	/*
	 * Xen won't allow a 1:1 mapping to be created to UNUSABLE
	 * regions, so if we're using the machine memory map leave the
	 * region as RAM as it is in the pseudo-physical map.
	 *
	 * UNUSABLE regions in domUs are not handled and will need
	 * a patch in the future.
	 */
	if (xen_initial_domain())
		xen_ignore_unusable();

	/* Make sure the Xen-supplied memory map is well-ordered. */
	sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
			  &xen_e820_map_entries);

	max_pages = xen_get_max_pages();
	if (max_pages > max_pfn)
		extra_pages += max_pages - max_pfn;

	/*
	 * Set identity map on non-RAM pages and prepare remapping the
	 * underlying RAM.
	 */
	xen_set_identity_and_remap(max_pfn, &xen_released_pages,
				   &remapped_pages);
xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
When the Xen hypervisor boots a PV kernel it hands it two pieces
of information: nr_pages and a made up E820 entry.
The nr_pages value defines the range from zero to nr_pages of PFNs
which have a valid Machine Frame Number (MFN) underneath it. The
E820 mirrors that (with the VGA hole):
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 00000000000a0000 (usable)
Xen: 00000000000a0000 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000080800000 (usable)
The fun comes when a PV guest that is run with a machine E820 - that
can either be the initial domain or a PCI PV guest, where the E820
looks like the normal thing:
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 000000000009e000 (usable)
Xen: 000000000009ec00 - 0000000000100000 (reserved)
Xen: 0000000000100000 - 0000000020000000 (usable)
Xen: 0000000020000000 - 0000000020200000 (reserved)
Xen: 0000000020200000 - 0000000040000000 (usable)
Xen: 0000000040000000 - 0000000040200000 (reserved)
Xen: 0000000040200000 - 00000000bad80000 (usable)
Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS)
..
With that overlaying the nr_pages directly on the E820 does not
work as there are gaps and non-RAM regions that won't be used
by the memory allocator. The 'xen_release_chunk' helps with that
by punching holes in the P2M (PFN to MFN lookup tree) for those
regions and tells us that:
Freeing 20000-20200 pfn range: 512 pages freed
Freeing 40000-40200 pfn range: 512 pages freed
Freeing bad80-badf4 pfn range: 116 pages freed
Freeing badf6-bae7f pfn range: 137 pages freed
Freeing bb000-100000 pfn range: 282624 pages freed
Released 283999 pages of unused memory
Those 283999 pages are subtracted from the nr_pages and are returned
to the hypervisor. The end result is that the initial domain
boots with 1GB less memory as the nr_pages has been subtracted by
the amount of pages residing within the PCI hole. It can balloon up
to that if desired using 'xl mem-set 0 8092', but the balloon driver
is not always compiled in for the initial domain.
This patch, implements the populate hypercall (XENMEM_populate_physmap)
which increases the the domain with the same amount of pages that
were released.
The other solution (that did not work) was to transplant the MFNs in
the P2M tree: the ones that were going to be freed were put in
the E820_RAM regions past nr_pages. But the matching modifications to
the M2P array (the other side of creating PTEs) were not carried out,
as the hypervisor is the only one capable of modifying it. The only
two hypercalls that would do this are update_va_mapping (which won't
work, as during initial bootup only PFNs up to nr_pages are mapped in
the guest) and the populate hypercall.
The end result is that the kernel can now boot with the
nr_pages without having to subtract the 283999 pages.
On an 8GB machine, with various dom0_mem= parameters, this is what we get:
no dom0_mem
-Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init)
+Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init)
dom0_mem=3G
-Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init)
+Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init)
dom0_mem=max:3G
-Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init)
+Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init)
And 'xm list' or 'xl list' now reflects what the dom0_mem=
argument is.
Acked-by: David Vrabel <david.vrabel@citrix.com>
[v2: Use populate hypercall]
[v3: Remove debug printks]
[v4: Simplify code]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 22:07:11 +08:00
|
|
|
|
2012-05-30 00:36:43 +08:00
|
|
|
	extra_pages += xen_released_pages;
|
2015-01-07 19:21:50 +08:00
|
|
|
	extra_pages += remapped_pages;
|
2012-04-06 22:07:11 +08:00
|
|
|
|
2011-09-29 19:26:19 +08:00
|
|
|
	/*
	 * Clamp the amount of extra memory to an EXTRA_MEM_RATIO
	 * factor of the base size. On non-highmem systems, the base
	 * size is the full initial memory allocation; on highmem it
	 * is limited to the max size of lowmem, so that it doesn't
	 * get completely filled.
	 *
	 * In principle there could be a problem in lowmem systems if
	 * the initial memory is also very large with respect to
	 * lowmem, but we won't try to deal with that here.
	 */
	extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
			  extra_pages);
	i = 0;
|
2015-07-17 12:51:26 +08:00
|
|
|
	while (i < xen_e820_map_entries) {
		phys_addr_t addr = xen_e820_map[i].addr;
		phys_addr_t size = xen_e820_map[i].size;
		u32 type = xen_e820_map[i].type;
|
2011-09-29 19:26:19 +08:00
|
|
|
|
|
|
|
		if (type == E820_RAM) {
			if (addr < mem_end) {
				size = min(size, mem_end - addr);
			} else if (extra_pages) {
|
2015-01-28 14:44:22 +08:00
|
|
|
				size = min(size, PFN_PHYS(extra_pages));
				extra_pages -= PFN_DOWN(size);
|
2011-09-29 19:26:19 +08:00
|
|
|
				xen_add_extra_mem(addr, size);
|
2014-11-28 18:53:55 +08:00
|
|
|
				xen_max_p2m_pfn = PFN_DOWN(addr + size);
|
2011-09-29 19:26:19 +08:00
|
|
|
			} else
				type = E820_UNUSABLE;
|
2010-09-30 07:54:33 +08:00
|
|
|
		}
|
|
|
|
|
2011-09-29 19:26:19 +08:00
|
|
|
		xen_align_and_add_e820_region(addr, size, type);
|
2010-09-03 08:10:12 +08:00
|
|
|
|
2015-07-17 12:51:26 +08:00
|
|
|
		xen_e820_map[i].addr += size;
		xen_e820_map[i].size -= size;
		if (xen_e820_map[i].size == 0)
|
2011-09-29 19:26:19 +08:00
|
|
|
			i++;
|
2009-02-07 11:09:48 +08:00
|
|
|
	}
|
2008-06-17 05:54:49 +08:00
|
|
|
|
2014-01-03 23:46:10 +08:00
|
|
|
	/*
	 * Set the rest as identity mapped, in case PCI BARs are
	 * located here.
	 *
	 * PFNs above MAX_P2M_PFN are considered identity mapped as
	 * well.
	 */
|
2015-07-17 12:51:26 +08:00
|
|
|
	set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);
|
2014-01-03 23:46:10 +08:00
|
|
|
|
2008-06-17 05:54:49 +08:00
|
|
|
	/*
|
2010-10-29 02:32:29 +08:00
|
|
|
	 * In domU, the ISA region is normal, usable memory, but we
	 * reserve ISA memory anyway because too many things poke
|
2008-06-17 05:54:49 +08:00
|
|
|
	 * about in there.
	 */
	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
			E820_RESERVED);
|
2007-07-18 09:37:04 +08:00
|
|
|
|
2008-06-17 05:54:46 +08:00
|
|
|
	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
|
|
|
|
|
2015-07-17 12:51:25 +08:00
|
|
|
	xen_reserve_xen_mfnlist();
|
|
|
|
|
2007-07-18 09:37:04 +08:00
|
|
|
	return "Xen";
}
|
|
|
|
|
2014-06-03 00:58:01 +08:00
|
|
|
/*
 * Machine specific memory setup for auto-translated guests.
 */
char * __init xen_auto_xlated_memory_setup(void)
{
	struct xen_memory_map memmap;
	int i;
	int rc;

	memmap.nr_entries = E820MAX;
|
2015-07-17 12:51:26 +08:00
|
|
|
	set_xen_guest_handle(memmap.buffer, xen_e820_map);
|
2014-06-03 00:58:01 +08:00
|
|
|
|
|
|
|
	rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
	if (rc < 0)
		panic("No memory map (%d)\n", rc);
|
|
|
|
|
2015-07-17 12:51:26 +08:00
|
|
|
	xen_e820_map_entries = memmap.nr_entries;

	sanitize_e820_map(xen_e820_map, ARRAY_SIZE(xen_e820_map),
			  &xen_e820_map_entries);
|
2014-06-03 00:58:01 +08:00
|
|
|
|
2015-07-17 12:51:26 +08:00
|
|
|
	for (i = 0; i < xen_e820_map_entries; i++)
		e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
				xen_e820_map[i].type);
|
2014-06-03 00:58:01 +08:00
|
|
|
|
2015-07-17 12:51:25 +08:00
|
|
|
	xen_reserve_xen_mfnlist();
|
2014-06-03 00:58:01 +08:00
|
|
|
|
|
|
|
	return "Xen";
}
|
|
|
|
|
2007-07-20 15:31:43 +08:00
|
|
|
/*
 * Set the bit indicating "nosegneg" library variants should be used.
|
2008-07-12 17:22:00 +08:00
|
|
|
 * We only need to bother in pure 32-bit mode; compat 32-bit processes
 * can have un-truncated segments, so wrapping around is allowed.
|
2007-07-20 15:31:43 +08:00
|
|
|
 */
|
2008-01-30 20:33:25 +08:00
|
|
|
static void __init fiddle_vdso(void)
|
2007-07-20 15:31:43 +08:00
|
|
|
{
|
2008-07-12 17:22:00 +08:00
|
|
|
#ifdef CONFIG_X86_32
|
x86, vdso: Reimplement vdso.so preparation in build-time C
Currently, vdso.so files are prepared and analyzed by a combination
of objcopy, nm, some linker script tricks, and some simple ELF
parsers in the kernel. Replace all of that with plain C code that
runs at build time.
All five vdso images now generate .c files that are compiled and
linked in to the kernel image.
This should cause only one userspace-visible change: the loaded vDSO
images are stripped more heavily than they used to be. Everything
outside the loadable segment is dropped. In particular, this causes
the section table and section name strings to be missing. This
should be fine: real dynamic loaders don't load or inspect these
tables anyway. The result is roughly equivalent to eu-strip's
--strip-sections option.
The purpose of this change is to enable the vvar and hpet mappings
to be moved to the page following the vDSO load segment. Currently,
it is possible for the section table to extend into the page after
the load segment, so, if we map it, it risks overlapping the vvar or
hpet page. This happens whenever the load segment is just under a
multiple of PAGE_SIZE.
The only real subtlety here is that the old code had a C file with
inline assembler that did 'call VDSO32_vsyscall' and a linker script
that defined 'VDSO32_vsyscall = __kernel_vsyscall'. This most
likely worked by accident: the linker script entry defines a symbol
associated with an address as opposed to an alias for the real
dynamic symbol __kernel_vsyscall. That caused ld to relocate the
reference at link time instead of leaving an interposable dynamic
relocation. Since the VDSO32_vsyscall hack is no longer needed, I
now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it
work. vdso2c will generate an error and abort the build if the
resulting image contains any dynamic relocations, so we won't
silently generate bad vdso images.
(Dynamic relocations are a problem because nothing will even attempt
to relocate the vdso.)
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-06 03:19:34 +08:00
|
|
|
	/*
	 * This could be called before selected_vdso32 is initialized, so
	 * just fiddle with both possible images. vdso_image_32_syscall
	 * can't be selected, since it only exists on 64-bit systems.
	 */
|
2008-07-12 17:22:00 +08:00
|
|
|
	u32 *mask;
|
2014-05-06 03:19:34 +08:00
|
|
|
	mask = vdso_image_32_int80.data +
		vdso_image_32_int80.sym_VDSO32_NOTE_MASK;
|
2008-07-12 17:22:00 +08:00
|
|
|
	*mask |= 1 << VDSO_NOTE_NONEGSEG_BIT;
|
2014-05-06 03:19:34 +08:00
|
|
|
	mask = vdso_image_32_sysenter.data +
		vdso_image_32_sysenter.sym_VDSO32_NOTE_MASK;
|
2007-07-20 15:31:43 +08:00
|
|
|
	*mask |= 1 << VDSO_NOTE_NONEGSEG_BIT;
|
2008-07-09 06:07:14 +08:00
|
|
|
#endif
|
2007-07-20 15:31:43 +08:00
|
|
|
}
|
|
|
|
|
x86: delete __cpuinit usage from all x86 files
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up(), which are arch independent
(kernel/cpu.c), are flagged as __cpuinit -- so if we remove the
__cpuinit from arch specific callers, we will also get section
mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings. In any case, they are temporary and harmless.
This removes all the arch/x86 uses of the __cpuinit macros from
all C files. x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-06-19 06:23:59 +08:00
|
|
|
static int register_callback(unsigned type, const void *func)
|
2008-03-18 07:37:17 +08:00
|
|
|
{
|
2008-07-09 06:07:02 +08:00
|
|
|
	struct callback_register callback = {
		.type = type,
		.address = XEN_CALLBACK(__KERNEL_CS, func),
|
2008-03-18 07:37:17 +08:00
|
|
|
		.flags = CALLBACKF_mask_events,
	};
|
|
|
|
|
2008-07-09 06:07:02 +08:00
|
|
|
	return HYPERVISOR_callback_op(CALLBACKOP_register, &callback);
}
|
|
|
|
|
2013-06-19 06:23:59 +08:00
void xen_enable_sysenter(void)
{
	int ret;
	unsigned sysenter_feature;

#ifdef CONFIG_X86_32
	sysenter_feature = X86_FEATURE_SEP;
#else
	sysenter_feature = X86_FEATURE_SYSENTER32;
#endif

	if (!boot_cpu_has(sysenter_feature))
		return;

	ret = register_callback(CALLBACKTYPE_sysenter, xen_sysenter_target);
	if (ret != 0)
		setup_clear_cpu_cap(sysenter_feature);
}

void xen_enable_syscall(void)
{
#ifdef CONFIG_X86_64
	int ret;

	ret = register_callback(CALLBACKTYPE_syscall, xen_syscall_target);
	if (ret != 0) {
		printk(KERN_ERR "Failed to set syscall callback: %d\n", ret);
		/* Pretty fatal; 64-bit userspace has no other
		   mechanism for syscalls. */
	}

	if (boot_cpu_has(X86_FEATURE_SYSCALL32)) {
		ret = register_callback(CALLBACKTYPE_syscall32,
					xen_syscall32_target);
		if (ret != 0)
			setup_clear_cpu_cap(X86_FEATURE_SYSCALL32);
	}
#endif /* CONFIG_X86_64 */
}

void __init xen_pvmmu_arch_setup(void)
{
	HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_4gb_segments);
	HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables);

	HYPERVISOR_vm_assist(VMASST_CMD_enable,
			     VMASST_TYPE_pae_extended_cr3);

	if (register_callback(CALLBACKTYPE_event, xen_hypervisor_callback) ||
	    register_callback(CALLBACKTYPE_failsafe, xen_failsafe_callback))
		BUG();

	xen_enable_sysenter();
	xen_enable_syscall();
}

/* This function is not called for HVM domains */
void __init xen_arch_setup(void)
{
	xen_panic_handler_init();
	if (!xen_feature(XENFEAT_auto_translated_physmap))
		xen_pvmmu_arch_setup();

#ifdef CONFIG_ACPI
	if (!(xen_start_info->flags & SIF_INITDOMAIN)) {
		printk(KERN_INFO "ACPI in unprivileged domain disabled\n");
		disable_acpi();
	}
#endif

	memcpy(boot_command_line, xen_start_info->cmd_line,
	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);

	/* Set up idle, making sure it calls safe_halt() pvop */
	disable_cpuidle();
	disable_cpufreq();
	WARN_ON(xen_set_default_idle());
	fiddle_vdso();
#ifdef CONFIG_NUMA
	numa_off = 1;
#endif
2007-07-18 09:37:04 +08:00
}