License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
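The heuristics above can be sketched in C roughly as follows. This is only an illustration, not the tooling that was actually used: the function names, the use of NULL for "no license traces found", and the substring test for */uapi/* paths are all assumptions, and the sketch does not cover the Linux-syscall-note decoration of uapi files that already carry a license.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when the path lies under a uapi directory. */
static int is_uapi_path(const char *path)
{
	assert(path != NULL);
	return strstr(path, "/uapi/") != NULL;
}

/*
 * Hypothetical helper: return the concluded SPDX identifier, or NULL when
 * the two scanner results disagree and the file needs manual inspection.
 * A NULL scanner result stands for "no license traces found".
 */
static const char *conclude_license(const char *path,
				    const char *scan_a, const char *scan_b)
{
	/* Neither scanner found anything: the top level COPYING applies. */
	if (!scan_a && !scan_b)
		return is_uapi_path(path) ?
			"GPL-2.0 WITH Linux-syscall-note" : "GPL-2.0";

	/* Both scanners agree: that becomes the concluded license. */
	if (scan_a && scan_b && strcmp(scan_a, scan_b) == 0)
		return scan_a;

	/* One-sided or conflicting detections: manual review. */
	return NULL;
}
```

A NULL return corresponds to the cases that were inspected by hand or confirmed with lawyers.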
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
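As a rough illustration of the header versus source distinction mentioned above, the following C sketch formats the SPDX line for a .c or .h file. It is not the script that was used; the helper names are hypothetical, and only these two file types are handled (the real script detected more).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true when s ends with suffix. */
static int has_suffix(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && strcmp(s + n - m, suffix) == 0;
}

/*
 * Hypothetical helper: write the SPDX tag line for path into buf.
 * Headers get a block comment, everything else is treated as a .c source
 * file and gets a line comment. Returns the snprintf() result.
 */
static int format_spdx_line(const char *path, const char *id,
			    char *buf, size_t size)
{
	assert(path && id && buf);

	if (has_suffix(path, ".h"))
		return snprintf(buf, size,
				"/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, size, "// SPDX-License-Identifier: %s", id);
}
```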
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2005-04-17 06:20:36 +08:00
/*
 * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds
 * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
 * Copyright (C) 2002 Andi Kleen
2008-01-30 20:30:13 +08:00
 *
2005-04-17 06:20:36 +08:00
 * This handles calls from both 32bit and 64bit mode.
2017-12-14 19:27:30 +08:00
 *
 * Lock order:
 *	context.ldt_usr_sem
2020-06-09 12:33:54 +08:00
 *	  mmap_lock
2017-12-14 19:27:30 +08:00
 *	    context.lock
2005-04-17 06:20:36 +08:00
 */

#include <linux/errno.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
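The include selection rule from the first bullet above can be sketched in C as follows. The real script is written in Python and its matching rules are not given here, so the token list and the function name are assumptions chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decide which header a source text needs.
 * Returns "linux/slab.h" if slab APIs are used, "linux/gfp.h" if only gfp
 * flags are used, or NULL if neither is needed. slab.h pulls in gfp.h, so
 * slab.h alone is enough when both are used.
 */
static const char *needed_header(const char *src)
{
	/* Assumed token list, far from complete. */
	static const char *const slab_tokens[] = {
		"kmalloc(", "kzalloc(", "kfree(",
	};
	size_t i;

	assert(src != NULL);

	for (i = 0; i < sizeof(slab_tokens) / sizeof(slab_tokens[0]); i++)
		if (strstr(src, slab_tokens[i]))
			return "linux/slab.h";

	if (strstr(src, "GFP_"))
		return "linux/gfp.h";

	return NULL;
}
```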
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/gfp.h>
2005-04-17 06:20:36 +08:00
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/smp.h>
2017-10-19 01:21:07 +08:00
#include <linux/syscalls.h>
2015-07-31 05:31:32 +08:00
#include <linux/slab.h>
2005-04-17 06:20:36 +08:00
#include <linux/vmalloc.h>
2008-12-31 19:12:20 +08:00
#include <linux/uaccess.h>
2005-04-17 06:20:36 +08:00

#include <asm/ldt.h>
x86/pti: Put the LDT in its own PGD if PTI is on
With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
The LDT is per process, i.e. per mm.
An earlier approach mapped the LDT on context switch into a fixmap area,
but that's a big overhead and exhausted the fixmap space when NR_CPUS got
big.
Take advantage of the fact that there is an address space hole which
provides a completely unused pgd. Use this pgd to manage per-mm LDT
mappings.
This has a down side: the LDT isn't (currently) randomized, and an attack
that can write the LDT is instant root due to call gates (thanks, AMD, for
leaving call gates in AMD64 but designing them wrong so they're only useful
for exploits). This can be mitigated by making the LDT read-only or
randomizing the mapping, either of which is straightforward on top of this
patch.
This will significantly slow down LDT users, but that shouldn't matter for
important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
old libc implementations.
[ tglx: Cleaned it up. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-12 23:56:45 +08:00
#include <asm/tlb.h>
2005-04-17 06:20:36 +08:00
#include <asm/desc.h>
2008-01-30 20:30:13 +08:00
#include <asm/mmu_context.h>
2019-11-29 15:17:25 +08:00
#include <asm/pgtable_areas.h>

2020-07-04 01:02:57 +08:00
#include <xen/xen.h>

2019-11-29 15:17:25 +08:00
/* This is a multiple of PAGE_SIZE. */
#define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)

static inline void *ldt_slot_va(int slot)
{
	return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
}
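
The slot address arithmetic above can be checked in user space with a small sketch. LDT_ENTRIES (8192) and LDT_ENTRY_SIZE (8) are the values from the x86 uapi asm/ldt.h header; the 4 KiB page size and the base address are assumptions made for the demonstration, and DEMO_LDT_BASE in particular is a hypothetical stand-in for LDT_BASE_ADDR.

```c
#include <stdint.h>

#define LDT_ENTRIES		8192u	/* x86 uapi asm/ldt.h */
#define LDT_ENTRY_SIZE		8u	/* x86 uapi asm/ldt.h */
#define DEMO_PAGE_SIZE		4096u	/* assumed page size */
#define LDT_SLOT_STRIDE		(LDT_ENTRIES * LDT_ENTRY_SIZE)

/* Hypothetical base address, standing in for LDT_BASE_ADDR. */
#define DEMO_LDT_BASE		UINT64_C(0xffff880000000000)

/* Address of the alias mapping for the given slot (0 or 1). */
static uint64_t demo_ldt_slot_va(int slot)
{
	return DEMO_LDT_BASE + (uint64_t)LDT_SLOT_STRIDE * (uint64_t)slot;
}
```

With these values each slot spans 64 KiB, which is a whole number of 4 KiB pages, so consecutive slots never share a page.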

void load_mm_ldt(struct mm_struct *mm)
{
	struct ldt_struct *ldt;

	/* READ_ONCE synchronizes with smp_store_release */
	ldt = READ_ONCE(mm->context.ldt);

	/*
	 * Any change to mm->context.ldt is followed by an IPI to all
	 * CPUs with the mm active. The LDT will not be freed until
	 * after the IPI is handled by all such CPUs. This means that,
	 * if the ldt_struct changes before we return, the values we see
	 * will be safe, and the new values will be loaded before we run
	 * any user code.
	 *
	 * NB: don't try to convert this to use RCU without extreme care.
	 * We would still need IRQs off, because we don't want to change
	 * the local LDT after an IPI loaded a newer value than the one
	 * that we can see.
	 */

	if (unlikely(ldt)) {
		if (static_cpu_has(X86_FEATURE_PTI)) {
			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
				/*
				 * Whoops -- either the new LDT isn't mapped
				 * (if slot == -1) or is mapped into a bogus
				 * slot (if slot > 1).
				 */
				clear_LDT();
				return;
			}

			/*
			 * If page table isolation is enabled, ldt->entries
			 * will not be mapped in the userspace pagetables.
			 * Tell the CPU to access the LDT through the alias
			 * at ldt_slot_va(ldt->slot).
			 */
			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
		} else {
			set_ldt(ldt->entries, ldt->nr_entries);
		}
	} else {
		clear_LDT();
	}
}
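
The slot check in load_mm_ldt() relies on a cast to unsigned long so that a single comparison rejects both -1 and anything above 1. A minimal user-space sketch of that idiom follows; the function name is hypothetical.

```c
/*
 * Hypothetical helper: true only for slot 0 or slot 1.
 * Converting a negative int to unsigned long yields a very large value,
 * so -1 fails the same comparison that rejects slots greater than 1.
 */
static int ldt_slot_is_valid(int slot)
{
	return (unsigned long)slot <= 1;
}
```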

void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
{
	/*
	 * Load the LDT if either the old or new mm had an LDT.
	 *
	 * An mm will never go from having an LDT to not having an LDT. Two
	 * mms never share an LDT, so we don't gain anything by checking to
	 * see whether the LDT changed. There's also no guarantee that
	 * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
	 * then prev->context.ldt will also be non-NULL.
	 *
	 * If we really cared, we could optimize the case where prev == next
	 * and we're exiting lazy mode. Most of the time, if this happens,
	 * we don't actually need to reload LDTR, but modify_ldt() is mostly
	 * used by legacy code and emulators where we don't need this level of
	 * performance.
	 *
	 * This uses | instead of || because it generates better code.
	 */
	if (unlikely((unsigned long)prev->context.ldt |
		     (unsigned long)next->context.ldt))
		load_mm_ldt(next);

	DEBUG_LOCKS_WARN_ON(preemptible());
}
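
The `|` versus `||` remark in switch_ldt() can be illustrated outside the kernel. The sketch below ORs the two pointer values together and tests the result once, which avoids the second conditional branch that short-circuit evaluation implies. The function name is hypothetical, and the sketch assumes that a null pointer converts to the integer 0, which holds on mainstream platforms.

```c
#include <stdint.h>

/*
 * Hypothetical helper: true when at least one pointer is non-NULL.
 * Both operands are always evaluated; there is no short circuit.
 */
static int either_nonnull(const void *a, const void *b)
{
	return ((uintptr_t)a | (uintptr_t)b) != 0;
}
```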
2005-04-17 06:20:36 +08:00
2017-07-26 22:16:30 +08:00
static void refresh_ldt_segments(void)
{
#ifdef CONFIG_X86_64
	unsigned short sel;

	/*
	 * Make sure that the cached DS and ES descriptors match the updated
	 * LDT.
	 */
	savesegment(ds, sel);
	if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
		loadsegment(ds, sel);

	savesegment(es, sel);
	if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
		loadsegment(es, sel);
#endif
}

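The selector test used by refresh_ldt_segments() looks at the table indicator (TI) bit, bit 2 of an x86 segment selector. A small user-space sketch, with the constants written out as they are defined in the x86 asm/segment.h header and a hypothetical function name:

```c
#define SEGMENT_TI_MASK	0x4	/* table indicator bit of a selector */
#define SEGMENT_LDT	0x4	/* TI set: selector refers to the LDT */
#define SEGMENT_GDT	0x0	/* TI clear: selector refers to the GDT */

/* Hypothetical helper: true when the selector indexes the LDT. */
static int selector_uses_ldt(unsigned short sel)
{
	return (sel & SEGMENT_TI_MASK) == SEGMENT_LDT;
}
```

Only selectors that point into the LDT need to be reloaded after the LDT changes, which is why GDT selectors are skipped.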
2017-12-14 19:27:30 +08:00
/* context.lock is held by the task which issued the smp function call */
x86/mm: Rework lazy TLB to track the actual loaded mm
Lazy TLB state is currently managed in a rather baroque manner.
AFAICT, there are three possible states:
- Non-lazy. This means that we're running a user thread or a
kernel thread that has called use_mm(). current->mm ==
current->active_mm == cpu_tlbstate.active_mm and
cpu_tlbstate.state == TLBSTATE_OK.
- Lazy with user mm. We're running a kernel thread without an mm
and we're borrowing an mm_struct. We have current->mm == NULL,
current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
!= TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set
in mm_cpumask(current->active_mm). CR3 points to
current->active_mm->pgd. The TLB is up to date.
- Lazy with init_mm. This happens when we call leave_mm(). We
have current->mm == NULL, current->active_mm ==
cpu_tlbstate.active_mm, but that mm is only relevant insofar as
the scheduler is tracking it for refcounting. cpu_tlbstate.state
!= TLBSTATE_OK. The current cpu is clear in
mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir,
i.e. init_mm->pgd.
This patch simplifies the situation. Other than perf, x86 stops
caring about current->active_mm at all. We have
cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The
TLB is always up to date for that mm. leave_mm() just switches us
to init_mm. There are no longer any special cases for mm_cpumask,
and switch_mm() switches mms without worrying about laziness.
After this patch, cpu_tlbstate.state serves only to tell the TLB
flush code whether it may switch to init_mm instead of doing a
normal flush.
This makes fairly extensive changes to xen_exit_mmap(), which used
to look a bit like black magic.
Perf is unchanged. With or without this change, perf may behave a bit
erratically if it tries to read user memory in kernel thread context.
We should build on this patch to teach perf to never look at user
memory when cpu_tlbstate.loaded_mm != current->mm.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-29 01:00:15 +08:00
static void flush_ldt(void *__mm)
2005-04-17 06:20:36 +08:00
{
2017-05-29 01:00:15 +08:00
	struct mm_struct *mm = __mm;
2015-07-31 05:31:32 +08:00
2017-05-29 01:00:15 +08:00
	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
2015-07-31 05:31:32 +08:00
		return;

2017-12-12 23:56:45 +08:00
	load_mm_ldt(mm);
2017-07-26 22:16:30 +08:00

	refresh_ldt_segments();
2005-04-17 06:20:36 +08:00
}

2015-07-31 05:31:32 +08:00
/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
2017-06-07 01:31:16 +08:00
static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
2005-04-17 06:20:36 +08:00
{
2015-07-31 05:31:32 +08:00
	struct ldt_struct *new_ldt;
2016-12-10 07:13:51 +08:00
	unsigned int alloc_size;
2015-07-31 05:31:32 +08:00
2017-06-07 01:31:16 +08:00
	if (num_entries > LDT_ENTRIES)
2015-07-31 05:31:32 +08:00
		return NULL;

	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
	if (!new_ldt)
		return NULL;

	BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
2017-06-07 01:31:16 +08:00
	alloc_size = num_entries * LDT_ENTRY_SIZE;
2015-07-31 05:31:32 +08:00

	/*
	 * Xen is very picky: it requires a page-aligned LDT that has no
	 * trailing nonzero bytes in any page that contains LDT descriptors.
	 * Keep it simple: zero the whole allocation and never allocate less
	 * than PAGE_SIZE.
	 */
	if (alloc_size > PAGE_SIZE)
		new_ldt->entries = vzalloc(alloc_size);
2005-04-17 06:20:36 +08:00
	else
2015-09-02 23:45:58 +08:00
		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
2005-04-17 06:20:36 +08:00
2015-07-31 05:31:32 +08:00
	if (!new_ldt->entries) {
		kfree(new_ldt);
		return NULL;
	}
2008-01-30 20:30:14 +08:00
2017-12-12 23:56:45 +08:00
	/* The new LDT isn't aliased for PTI yet. */
	new_ldt->slot = -1;

2017-06-07 01:31:16 +08:00
	new_ldt->nr_entries = num_entries;
2015-07-31 05:31:32 +08:00
	return new_ldt;
}
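
The sizing decision made by alloc_ldt_struct() can be reproduced as plain arithmetic. The sketch below only models which backing allocation would be chosen and allocates nothing; the enum, the function name and the 4 KiB page size are assumptions, while LDT_ENTRIES and LDT_ENTRY_SIZE carry the values from the x86 uapi asm/ldt.h header.

```c
#define LDT_ENTRIES	8192u	/* x86 uapi asm/ldt.h */
#define LDT_ENTRY_SIZE	8u	/* x86 uapi asm/ldt.h */
#define DEMO_PAGE_SIZE	4096u	/* assumed page size */

/* Hypothetical labels for the three outcomes. */
enum ldt_backing {
	LDT_REJECTED,		/* too many entries, NULL is returned */
	LDT_SINGLE_PAGE,	/* get_zeroed_page() path */
	LDT_VMALLOC,		/* vzalloc() path */
};

/* Hypothetical helper mirroring the size checks only. */
static enum ldt_backing choose_ldt_backing(unsigned int num_entries)
{
	unsigned int alloc_size;

	if (num_entries > LDT_ENTRIES)
		return LDT_REJECTED;

	alloc_size = num_entries * LDT_ENTRY_SIZE;

	/* Anything that fits in one page still gets a whole zeroed page. */
	return alloc_size > DEMO_PAGE_SIZE ? LDT_VMALLOC : LDT_SINGLE_PAGE;
}
```

Under these assumptions up to 512 descriptors fit in the single zeroed page, and larger tables take the vzalloc() path.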
2008-07-24 05:21:18 +08:00
2018-07-18 17:41:12 +08:00
#ifdef CONFIG_PAGE_TABLE_ISOLATION

static void do_sanity_check(struct mm_struct *mm,
			    bool had_kernel_mapping,
			    bool had_user_mapping)
{
	if (mm->context.ldt) {
		/*
		 * We already had an LDT. The top-level entry should already
		 * have been allocated and synchronized with the usermode
		 * tables.
		 */
		WARN_ON(!had_kernel_mapping);
2019-03-30 02:52:59 +08:00
		if (boot_cpu_has(X86_FEATURE_PTI))
2018-07-18 17:41:12 +08:00
			WARN_ON(!had_user_mapping);
	} else {
		/*
		 * This is the first time we're mapping an LDT for this process.
		 * Sync the pgd to the usermode tables.
		 */
		WARN_ON(had_kernel_mapping);
2019-03-30 02:52:59 +08:00
		if (boot_cpu_has(X86_FEATURE_PTI))
2018-07-18 17:41:12 +08:00
			WARN_ON(had_user_mapping);
	}
}

2018-07-18 17:41:13 +08:00
#ifdef CONFIG_X86_PAE

static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
{
	p4d_t *p4d;
	pud_t *pud;

	if (pgd->pgd == 0)
		return NULL;

	p4d = p4d_offset(pgd, va);
	if (p4d_none(*p4d))
		return NULL;

	pud = pud_offset(p4d, va);
	if (pud_none(*pud))
		return NULL;

	return pmd_offset(pud, va);
}

static void map_ldt_struct_to_user(struct mm_struct *mm)
{
	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
	pmd_t *k_pmd, *u_pmd;

	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);

2019-03-30 02:52:59 +08:00
|
|
|
if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
|
2018-07-18 17:41:13 +08:00
|
|
|
set_pmd(u_pmd, *k_pmd);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void sanity_check_ldt_mapping(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
|
|
|
|
pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
|
|
|
|
bool had_kernel, had_user;
|
|
|
|
pmd_t *k_pmd, *u_pmd;
|
|
|
|
|
|
|
|
k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
|
|
|
|
u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
|
|
|
|
had_kernel = (k_pmd->pmd != 0);
|
|
|
|
had_user = (u_pmd->pmd != 0);
|
|
|
|
|
|
|
|
do_sanity_check(mm, had_kernel, had_user);
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* !CONFIG_X86_PAE */
|
|
|
|
|
2018-07-18 17:41:12 +08:00
|
|
|
static void map_ldt_struct_to_user(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
|
|
|
|
|
2019-03-30 02:52:59 +08:00
|
|
|
if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
|
2018-07-18 17:41:12 +08:00
|
|
|
set_pgd(kernel_to_user_pgdp(pgd), *pgd);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void sanity_check_ldt_mapping(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
|
|
|
|
bool had_kernel = (pgd->pgd != 0);
|
|
|
|
bool had_user = (kernel_to_user_pgdp(pgd)->pgd != 0);
|
|
|
|
|
|
|
|
do_sanity_check(mm, had_kernel, had_user);
|
|
|
|
}
|
|
|
|
|
2018-07-18 17:41:13 +08:00
|
|
|
#endif /* CONFIG_X86_PAE */
|
|
|
|
|
x86/pti: Put the LDT in its own PGD if PTI is on
With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
The LDT is per process, i.e. per mm.
An earlier approach mapped the LDT on context switch into a fixmap area,
but that's a big overhead and exhausted the fixmap space when NR_CPUS got
big.
Take advantage of the fact that there is an address space hole which
provides a completely unused pgd. Use this pgd to manage per-mm LDT
mappings.
This has a down side: the LDT isn't (currently) randomized, and an attack
that can write the LDT is instant root due to call gates (thanks, AMD, for
leaving call gates in AMD64 but designing them wrong so they're only useful
for exploits). This can be mitigated by making the LDT read-only or
randomizing the mapping, either of which is straightforward on top of this
patch.
This will significantly slow down LDT users, but that shouldn't matter for
important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
old libc implementations.
[ tglx: Cleaned it up. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-12 23:56:45 +08:00
|
|
|
/*
|
|
|
|
* If PTI is enabled, this maps the LDT into the kernelmode and
|
|
|
|
* usermode tables for the given mm.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
|
|
|
|
{
|
|
|
|
unsigned long va;
|
2018-07-18 17:41:12 +08:00
|
|
|
bool is_vmalloc;
|
2017-12-12 23:56:45 +08:00
|
|
|
spinlock_t *ptl;
|
2018-10-26 20:28:55 +08:00
|
|
|
int i, nr_pages;
|
2017-12-12 23:56:45 +08:00
|
|
|
|
2019-03-30 02:52:59 +08:00
|
|
|
if (!boot_cpu_has(X86_FEATURE_PTI))
|
2017-12-12 23:56:45 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Any given ldt_struct should have map_ldt_struct() called at most
|
|
|
|
* once.
|
|
|
|
*/
|
|
|
|
WARN_ON(ldt->slot != -1);
|
|
|
|
|
2018-07-18 17:41:12 +08:00
|
|
|
/* Check if the current mappings are sane */
|
|
|
|
sanity_check_ldt_mapping(mm);
|
|
|
|
|
2017-12-12 23:56:45 +08:00
|
|
|
is_vmalloc = is_vmalloc_addr(ldt->entries);
|
|
|
|
|
2018-10-26 20:28:55 +08:00
|
|
|
nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
2017-12-12 23:56:45 +08:00
|
|
|
unsigned long offset = i << PAGE_SHIFT;
|
|
|
|
const void *src = (char *)ldt->entries + offset;
|
|
|
|
unsigned long pfn;
|
2018-04-07 04:55:09 +08:00
|
|
|
pgprot_t pte_prot;
|
2017-12-12 23:56:45 +08:00
|
|
|
pte_t pte, *ptep;
|
|
|
|
|
|
|
|
va = (unsigned long)ldt_slot_va(slot) + offset;
|
|
|
|
pfn = is_vmalloc ? vmalloc_to_pfn(src) :
|
|
|
|
page_to_pfn(virt_to_page(src));
|
|
|
|
/*
|
|
|
|
* Treat the PTI LDT range as a *userspace* range.
|
|
|
|
* get_locked_pte() will allocate all needed pagetables
|
|
|
|
* and account for them in this mm.
|
|
|
|
*/
|
|
|
|
ptep = get_locked_pte(mm, va, &ptl);
|
|
|
|
if (!ptep)
|
|
|
|
return -ENOMEM;
|
2017-12-16 03:35:11 +08:00
|
|
|
/*
|
|
|
|
* Map it RO so the easy to find address is not a primary
|
|
|
|
* target via some kernel interface which misses a
|
|
|
|
* permission check.
|
|
|
|
*/
|
2018-04-07 04:55:09 +08:00
|
|
|
pte_prot = __pgprot(__PAGE_KERNEL_RO & ~_PAGE_GLOBAL);
|
|
|
|
/* Filter out unsupported __PAGE_KERNEL* bits: */
|
2018-04-16 17:43:57 +08:00
|
|
|
pgprot_val(pte_prot) &= __supported_pte_mask;
|
2018-04-07 04:55:09 +08:00
|
|
|
pte = pfn_pte(pfn, pte_prot);
|
2017-12-12 23:56:45 +08:00
|
|
|
set_pte_at(mm, va, ptep, pte);
|
|
|
|
pte_unmap_unlock(ptep, ptl);
|
|
|
|
}
|
|
|
|
|
2018-07-18 17:41:12 +08:00
|
|
|
/* Propagate LDT mapping to the user page-table */
|
|
|
|
map_ldt_struct_to_user(mm);
|
2017-12-12 23:56:45 +08:00
|
|
|
|
|
|
|
ldt->slot = slot;
|
|
|
|
return 0;
|
|
|
|
}
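The page arithmetic in map_ldt_struct() can be checked in isolation. The following is a user-space sketch under stated assumptions: the SKETCH_* constants (4 KiB pages, 8-byte descriptors, at most 8192 entries) are illustrative stand-ins for PAGE_SIZE, LDT_ENTRY_SIZE and LDT_ENTRIES, and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumed values, chosen to mirror a typical x86 configuration. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_PAGE_SIZE      (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_LDT_ENTRY_SIZE 8UL
#define SKETCH_LDT_ENTRIES    8192UL

/* Same rounding as the kernel's DIV_ROUND_UP() for positive divisors. */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of pages needed to back nr_entries descriptors. */
static unsigned long sketch_ldt_nr_pages(unsigned long nr_entries)
{
	assert(nr_entries <= SKETCH_LDT_ENTRIES);
	return SKETCH_DIV_ROUND_UP(nr_entries * SKETCH_LDT_ENTRY_SIZE,
				   SKETCH_PAGE_SIZE);
}

/* Byte offset of page i within the LDT, as computed in the mapping loop. */
static unsigned long sketch_ldt_page_offset(unsigned long i)
{
	return i << SKETCH_PAGE_SHIFT;
}
```

Under these assumptions 512 entries fit exactly in one page, 513 entries need two, and a full 8192-entry table spans 16 pages.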
|
|
|
|
|
2018-10-26 20:28:55 +08:00
|
|
|
static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
|
|
|
|
{
|
|
|
|
unsigned long va;
|
|
|
|
int i, nr_pages;
|
|
|
|
|
|
|
|
if (!ldt)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* LDT map/unmap is only required for PTI */
|
2019-03-30 02:52:59 +08:00
|
|
|
if (!boot_cpu_has(X86_FEATURE_PTI))
|
2018-10-26 20:28:55 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
unsigned long offset = i << PAGE_SHIFT;
|
|
|
|
spinlock_t *ptl;
|
|
|
|
pte_t *ptep;
|
|
|
|
|
|
|
|
va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
|
|
|
|
ptep = get_locked_pte(mm, va, &ptl);
|
|
|
|
pte_clear(mm, va, ptep);
|
|
|
|
pte_unmap_unlock(ptep, ptl);
|
|
|
|
}
|
|
|
|
|
|
|
|
va = (unsigned long)ldt_slot_va(ldt->slot);
|
|
|
|
flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
|
|
|
|
}
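Both map_ldt_struct() and unmap_ldt_struct() derive their virtual addresses from ldt_slot_va(), which is defined outside this file. A plausible reading, assuming each slot is sized for a maximal LDT (8192 entries of 8 bytes each), is a fixed stride from the base of the LDT range. The sketch below takes the base as a parameter because the real LDT_BASE_ADDR depends on the kernel configuration; all SKETCH_* names and the function itself are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed slot geometry: one slot holds a maximal LDT. */
#define SKETCH_LDT_ENTRIES     8192u
#define SKETCH_LDT_ENTRY_SIZE  8u
#define SKETCH_LDT_SLOT_STRIDE \
	((uint64_t)SKETCH_LDT_ENTRIES * SKETCH_LDT_ENTRY_SIZE)

/* Virtual address at which the given slot starts. */
static uint64_t sketch_ldt_slot_va(uint64_t ldt_base, int slot)
{
	assert(slot >= 0);
	return ldt_base + SKETCH_LDT_SLOT_STRIDE * (uint64_t)slot;
}
```

With this geometry consecutive slots sit 64 KiB apart, so a new LDT can be mapped in one slot while the old one is still live in another.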
|
|
|
|
|
2018-07-18 17:41:12 +08:00
|
|
|
#else /* !CONFIG_PAGE_TABLE_ISOLATION */
|
|
|
|
|
|
|
|
static int
|
|
|
|
map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-10-26 20:28:55 +08:00
|
|
|
|
|
|
|
static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
|
|
|
|
{
|
|
|
|
}
|
2018-07-18 17:41:12 +08:00
|
|
|
#endif /* CONFIG_PAGE_TABLE_ISOLATION */
|
|
|
|
|
2017-12-12 23:56:45 +08:00
|
|
|
static void free_ldt_pgtables(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_PAGE_TABLE_ISOLATION
|
|
|
|
struct mmu_gather tlb;
|
|
|
|
unsigned long start = LDT_BASE_ADDR;
|
2018-07-18 17:41:11 +08:00
|
|
|
unsigned long end = LDT_END_ADDR;
|
2017-12-12 23:56:45 +08:00
|
|
|
|
2019-03-30 02:52:59 +08:00
|
|
|
if (!boot_cpu_has(X86_FEATURE_PTI))
|
2017-12-12 23:56:45 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
tlb_gather_mmu(&tlb, mm, start, end);
|
|
|
|
free_pgd_range(&tlb, start, end, start, end);
|
|
|
|
tlb_finish_mmu(&tlb, start, end);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2015-07-31 05:31:32 +08:00
|
|
|
/* After calling this, the LDT is immutable. */
|
|
|
|
static void finalize_ldt_struct(struct ldt_struct *ldt)
|
|
|
|
{
|
2017-06-07 01:31:16 +08:00
|
|
|
paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2017-12-14 19:27:30 +08:00
|
|
|
static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2017-12-14 19:27:30 +08:00
|
|
|
mutex_lock(&mm->context.lock);
|
|
|
|
|
2017-10-24 18:22:48 +08:00
|
|
|
/* Synchronizes with READ_ONCE in load_mm_ldt. */
|
2017-12-14 19:27:30 +08:00
|
|
|
smp_store_release(&mm->context.ldt, ldt);
|
2015-07-31 05:31:32 +08:00
|
|
|
|
2017-12-14 19:27:30 +08:00
|
|
|
/* Activate the LDT for all CPUs using current's mm. */
|
|
|
|
on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
|
|
|
|
|
|
|
|
mutex_unlock(&mm->context.lock);
|
2015-07-31 05:31:32 +08:00
|
|
|
}
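The publication step in install_ldt() follows a common pattern: fully initialize the structure, then publish the pointer with release semantics so that a reader who sees the pointer also sees the contents. A user-space analogue using C11 atomics is sketched below. The sketch_* types and functions are hypothetical, and the reader uses an acquire load, which is stronger than the READ_ONCE() dependency ordering the kernel comment refers to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_ldt {
	unsigned int nr_entries;
};

struct sketch_mm {
	_Atomic(struct sketch_ldt *) ldt;
};

/* Writer: ldt must be fully initialized before this call. */
static void sketch_publish_ldt(struct sketch_mm *mm, struct sketch_ldt *ldt)
{
	assert(mm != NULL);
	atomic_store_explicit(&mm->ldt, ldt, memory_order_release);
}

/* Reader: returns 0 when no LDT has been published yet. */
static unsigned int sketch_read_nr_entries(struct sketch_mm *mm)
{
	struct sketch_ldt *ldt;

	assert(mm != NULL);
	ldt = atomic_load_explicit(&mm->ldt, memory_order_acquire);
	return ldt ? ldt->nr_entries : 0;
}
```

The sketch covers only the pointer hand-off. In the kernel, the mutex serializes writers and the cross-CPU call makes running CPUs reload the LDT, neither of which is modeled here.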
|
2008-01-30 20:30:13 +08:00
|
|
|
|
2015-07-31 05:31:32 +08:00
|
|
|
static void free_ldt_struct(struct ldt_struct *ldt)
|
|
|
|
{
|
|
|
|
if (likely(!ldt))
|
|
|
|
return;
|
2008-07-24 05:21:18 +08:00
|
|
|
|
2017-06-07 01:31:16 +08:00
|
|
|
paravirt_free_ldt(ldt->entries, ldt->nr_entries);
|
|
|
|
if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
|
2016-12-13 08:44:17 +08:00
|
|
|
vfree_atomic(ldt->entries);
|
2015-07-31 05:31:32 +08:00
|
|
|
else
|
2015-09-02 23:45:58 +08:00
|
|
|
free_page((unsigned long)ldt->entries);
|
2015-07-31 05:31:32 +08:00
|
|
|
kfree(ldt);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-12-14 19:27:31 +08:00
|
|
|
* Called on fork from arch_dup_mmap(). Just copy the current LDT state,
|
|
|
|
* the new task is not running, so nothing can be installed.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2017-12-14 19:27:31 +08:00
|
|
|
int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2015-07-31 05:31:32 +08:00
|
|
|
struct ldt_struct *new_ldt;
|
2005-04-17 06:20:36 +08:00
|
|
|
int retval = 0;
|
|
|
|
|
2017-12-14 19:27:31 +08:00
|
|
|
if (!old_mm)
|
2015-07-31 05:31:32 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
mutex_lock(&old_mm->context.lock);
|
2017-12-14 19:27:31 +08:00
|
|
|
if (!old_mm->context.ldt)
|
2015-07-31 05:31:32 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
2017-06-07 01:31:16 +08:00
|
|
|
new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
|
2015-07-31 05:31:32 +08:00
|
|
|
if (!new_ldt) {
|
|
|
|
retval = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(new_ldt->entries, old_mm->context.ldt->entries,
|
2017-06-07 01:31:16 +08:00
|
|
|
new_ldt->nr_entries * LDT_ENTRY_SIZE);
|
2015-07-31 05:31:32 +08:00
|
|
|
finalize_ldt_struct(new_ldt);
|
|
|
|
|
2017-12-12 23:56:45 +08:00
	retval = map_ldt_struct(mm, new_ldt, 0);
	if (retval) {
		free_ldt_pgtables(mm);
		free_ldt_struct(new_ldt);
		goto out_unlock;
	}
2015-07-31 05:31:32 +08:00
	mm->context.ldt = new_ldt;

out_unlock:
	mutex_unlock(&old_mm->context.lock);
2005-04-17 06:20:36 +08:00
	return retval;
}

/*
2008-01-30 20:30:14 +08:00
 * No need to lock the MM as we are the last user
 *
 * 64bit: Don't touch the LDT register - we're already in the next thread.
2005-04-17 06:20:36 +08:00
 */
2016-02-13 05:02:34 +08:00
void destroy_context_ldt(struct mm_struct *mm)
2005-04-17 06:20:36 +08:00
{
2015-07-31 05:31:32 +08:00
	free_ldt_struct(mm->context.ldt);
	mm->context.ldt = NULL;
2005-04-17 06:20:36 +08:00
}

2017-12-12 23:56:45 +08:00
void ldt_arch_exit_mmap(struct mm_struct *mm)
{
	free_ldt_pgtables(mm);
}

2008-01-30 20:30:13 +08:00
static int read_ldt(void __user *ptr, unsigned long bytecount)
2005-04-17 06:20:36 +08:00
{
2008-01-30 20:30:13 +08:00
	struct mm_struct *mm = current->mm;
2017-06-07 01:31:16 +08:00
	unsigned long entries_size;
	int retval;
2005-04-17 06:20:36 +08:00

2017-12-14 19:27:30 +08:00
	down_read(&mm->context.ldt_usr_sem);
2015-07-31 05:31:32 +08:00

	if (!mm->context.ldt) {
		retval = 0;
		goto out_unlock;
	}

2008-01-30 20:30:13 +08:00
	if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
		bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
2005-04-17 06:20:36 +08:00

2017-06-07 01:31:16 +08:00
	entries_size = mm->context.ldt->nr_entries * LDT_ENTRY_SIZE;
	if (entries_size > bytecount)
		entries_size = bytecount;
2005-04-17 06:20:36 +08:00

2017-06-07 01:31:16 +08:00
	if (copy_to_user(ptr, mm->context.ldt->entries, entries_size)) {
2015-07-31 05:31:32 +08:00
		retval = -EFAULT;
		goto out_unlock;
	}

2017-06-07 01:31:16 +08:00
	if (entries_size != bytecount) {
2015-07-31 05:31:32 +08:00
		/* Zero-fill the rest and pretend we read bytecount bytes. */
2017-06-07 01:31:16 +08:00
		if (clear_user(ptr + entries_size, bytecount - entries_size)) {
2015-07-31 05:31:32 +08:00
			retval = -EFAULT;
			goto out_unlock;
2005-04-17 06:20:36 +08:00
		}
	}
2015-07-31 05:31:32 +08:00
	retval = bytecount;

out_unlock:
2017-12-14 19:27:30 +08:00
	up_read(&mm->context.ldt_usr_sem);
2015-07-31 05:31:32 +08:00
	return retval;
2005-04-17 06:20:36 +08:00
}

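The size handling in read_ldt() can be modeled in user space. What follows is a minimal sketch, not kernel code: model_read_ldt() and the MODEL_* constants are hypothetical names, the values 8 and 8192 are assumed to match the x86 LDT_ENTRY_SIZE and LDT_ENTRIES definitions, and memcpy()/memset() stand in for copy_to_user()/clear_user(). Locking and fault handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed to mirror the x86 UAPI values; the names are hypothetical. */
#define MODEL_LDT_ENTRY_SIZE 8UL
#define MODEL_LDT_ENTRIES    8192UL

/*
 * User-space model of the clamping in read_ldt(): copy at most
 * nr_entries descriptors, zero-fill the rest of the (clamped) request,
 * and report the clamped request size. A NULL table models "no LDT
 * installed" and yields 0 without touching dst.
 */
static long model_read_ldt(const unsigned char *table, unsigned long nr_entries,
			   unsigned char *dst, unsigned long bytecount)
{
	unsigned long entries_size;

	assert(dst != NULL);
	assert(nr_entries <= MODEL_LDT_ENTRIES);

	if (!table)
		return 0;

	/* Never hand out more than a full-sized LDT. */
	if (bytecount > MODEL_LDT_ENTRY_SIZE * MODEL_LDT_ENTRIES)
		bytecount = MODEL_LDT_ENTRY_SIZE * MODEL_LDT_ENTRIES;

	/* Copy only the entries that exist, bounded by the request. */
	entries_size = nr_entries * MODEL_LDT_ENTRY_SIZE;
	if (entries_size > bytecount)
		entries_size = bytecount;

	memcpy(dst, table, entries_size);	/* stands in for copy_to_user() */
	memset(dst + entries_size, 0,		/* stands in for clear_user() */
	       bytecount - entries_size);

	return (long)bytecount;
}
```

With a two-entry table and a 24-byte request, the model copies 16 bytes, zeroes the next 8 and returns 24, which is the "pretend we read bytecount bytes" behavior named in the comment.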
2008-01-30 20:30:13 +08:00
static int read_default_ldt(void __user *ptr, unsigned long bytecount)
2005-04-17 06:20:36 +08:00
{
2008-01-30 20:30:14 +08:00
	/* CHECKME: Can we use _one_ random number ? */
#ifdef CONFIG_X86_32
	unsigned long size = 5 * sizeof(struct desc_struct);
#else
	unsigned long size = 128;
#endif
	if (bytecount > size)
		bytecount = size;
2005-04-17 06:20:36 +08:00
	if (clear_user(ptr, bytecount))
		return -EFAULT;
2008-01-30 20:30:13 +08:00
	return bytecount;
2005-04-17 06:20:36 +08:00
}

2020-07-04 01:02:57 +08:00
static bool allow_16bit_segments(void)
{
	if (!IS_ENABLED(CONFIG_X86_16BIT))
		return false;

#ifdef CONFIG_XEN_PV
	/*
	 * Xen PV does not implement ESPFIX64, which means that 16-bit
	 * segments will not work correctly. Until either Xen PV implements
	 * ESPFIX64 and can signal this fact to the guest or unless someone
	 * provides compelling evidence that allowing broken 16-bit segments
	 * is worthwhile, disallow 16-bit segments under Xen PV.
	 */
	if (xen_pv_domain()) {
2020-07-06 03:50:20 +08:00
		pr_info_once("Warning: 16-bit segments do not work correctly in a Xen PV guest\n");
2020-07-04 01:02:57 +08:00
		return false;
	}
#endif

	return true;
}

2008-01-30 20:30:13 +08:00
static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
2005-04-17 06:20:36 +08:00
{
2008-01-30 20:30:13 +08:00
	struct mm_struct *mm = current->mm;
2016-12-10 07:13:51 +08:00
	struct ldt_struct *new_ldt, *old_ldt;
2017-06-07 01:31:16 +08:00
	unsigned int old_nr_entries, new_nr_entries;
2016-12-10 07:13:51 +08:00
	struct user_desc ldt_info;
2008-01-30 20:31:13 +08:00
	struct desc_struct ldt;
2005-04-17 06:20:36 +08:00
	int error;

	error = -EINVAL;
	if (bytecount != sizeof(ldt_info))
		goto out;
2008-01-30 20:30:13 +08:00
	error = -EFAULT;
2008-01-30 20:30:13 +08:00
	if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info)))
2005-04-17 06:20:36 +08:00
		goto out;

	error = -EINVAL;
	if (ldt_info.entry_number >= LDT_ENTRIES)
		goto out;
	if (ldt_info.contents == 3) {
		if (oldmode)
			goto out;
		if (ldt_info.seg_not_present == 0)
			goto out;
	}

2015-07-31 05:31:32 +08:00
	if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
	    LDT_empty(&ldt_info)) {
		/* The user wants to clear the entry. */
		memset(&ldt, 0, sizeof(ldt));
	} else {
2020-07-04 01:02:57 +08:00
		if (!ldt_info.seg_32bit && !allow_16bit_segments()) {
2015-07-31 05:31:32 +08:00
			error = -EINVAL;
			goto out;
2005-04-17 06:20:36 +08:00
		}
2015-07-31 05:31:32 +08:00

		fill_ldt(&ldt, &ldt_info);
		if (oldmode)
			ldt.avl = 0;
2005-04-17 06:20:36 +08:00
	}

2017-12-14 19:27:30 +08:00
	if (down_write_killable(&mm->context.ldt_usr_sem))
		return -EINTR;
2015-07-31 05:31:32 +08:00

2017-06-07 01:31:16 +08:00
	old_ldt = mm->context.ldt;
	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
	new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
2015-07-31 05:31:32 +08:00

	error = -ENOMEM;
2017-06-07 01:31:16 +08:00
	new_ldt = alloc_ldt_struct(new_nr_entries);
2015-07-31 05:31:32 +08:00
	if (!new_ldt)
2014-05-05 01:36:22 +08:00
		goto out_unlock;

2015-07-31 05:31:32 +08:00
	if (old_ldt)
2017-06-07 01:31:16 +08:00
		memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);

2015-07-31 05:31:32 +08:00
	new_ldt->entries[ldt_info.entry_number] = ldt;
	finalize_ldt_struct(new_ldt);
2005-04-17 06:20:36 +08:00

2017-12-12 23:56:45 +08:00
	/*
	 * If we are using PTI, map the new LDT into the userspace pagetables.
	 * If there is already an LDT, use the other slot so that other CPUs
	 * will continue to use the old LDT until install_ldt() switches
	 * them over to the new LDT.
	 */
	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
	if (error) {
2017-12-31 18:24:34 +08:00
		/*
		 * This only can fail for the first LDT setup. If an LDT is
		 * already installed then the PTE page is already
		 * populated. Mop up a half populated page table.
		 */
2017-12-31 23:52:15 +08:00
		if (!WARN_ON_ONCE(old_ldt))
			free_ldt_pgtables(mm);
2017-12-31 18:24:34 +08:00
		free_ldt_struct(new_ldt);
2017-12-12 23:56:45 +08:00
		goto out_unlock;
	}

2015-07-31 05:31:32 +08:00
	install_ldt(mm, new_ldt);
2018-10-26 20:28:55 +08:00
	unmap_ldt_struct(mm, old_ldt);
2015-07-31 05:31:32 +08:00
	free_ldt_struct(old_ldt);
2005-04-17 06:20:36 +08:00
	error = 0;

out_unlock:
2017-12-14 19:27:30 +08:00
	up_write(&mm->context.ldt_usr_sem);
2005-04-17 06:20:36 +08:00
out:
	return error;
}

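write_ldt() never edits the live table in place: it builds a replacement sized max(entry_number + 1, old_nr_entries), copies the old entries, writes the one new descriptor, and maps the result in the slot the old table is not using. The user-space sketch below models only that copy-and-replace step. All names (struct model_ldt, model_replace_entry(), model_free()) are hypothetical, a descriptor is represented as a plain 64-bit value, and calloc() is assumed to stand in for the kernel's zeroed allocation; mapping, install_ldt() and locking are not modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_LDT_ENTRIES 8192U	/* assumed to mirror the x86 LDT_ENTRIES */

/* Hypothetical stand-in for struct ldt_struct. */
struct model_ldt {
	unsigned int nr_entries;
	int slot;
	unsigned long long *entries;	/* one 8-byte descriptor per entry */
};

/*
 * Build a replacement table that holds the old entries plus the entry
 * being written, placed in the slot the old table does not use. The old
 * table is left untouched, so the caller decides when to switch over
 * and free it. Returns NULL on allocation failure.
 */
static struct model_ldt *model_replace_entry(const struct model_ldt *old,
					     unsigned int entry_number,
					     unsigned long long desc)
{
	unsigned int old_nr = old ? old->nr_entries : 0;
	unsigned int new_nr = entry_number + 1 > old_nr ? entry_number + 1 : old_nr;
	struct model_ldt *new_ldt;

	assert(entry_number < MODEL_LDT_ENTRIES);

	new_ldt = malloc(sizeof(*new_ldt));
	if (!new_ldt)
		return NULL;

	new_ldt->entries = calloc(new_nr, sizeof(*new_ldt->entries));
	if (!new_ldt->entries) {
		free(new_ldt);
		return NULL;
	}

	if (old)
		memcpy(new_ldt->entries, old->entries,
		       old_nr * sizeof(*old->entries));

	new_ldt->entries[entry_number] = desc;
	new_ldt->nr_entries = new_nr;
	new_ldt->slot = old ? !old->slot : 0;	/* first table uses slot 0 */
	return new_ldt;
}

static void model_free(struct model_ldt *ldt)
{
	if (ldt) {
		free(ldt->entries);
		free(ldt);
	}
}
```

Successive calls alternate between slot 0 and slot 1, so a reader still using the old table keeps seeing consistent contents until the caller retires it.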
2017-10-19 01:21:07 +08:00
SYSCALL_DEFINE3(modify_ldt, int, func, void __user *, ptr,
		unsigned long, bytecount)
2005-04-17 06:20:36 +08:00
{
	int ret = -ENOSYS;

	switch (func) {
	case 0:
		ret = read_ldt(ptr, bytecount);
		break;
	case 1:
		ret = write_ldt(ptr, bytecount, 1);
		break;
	case 2:
		ret = read_default_ldt(ptr, bytecount);
		break;
	case 0x11:
		ret = write_ldt(ptr, bytecount, 0);
		break;
	}
2017-10-19 01:21:07 +08:00
	/*
	 * The SYSCALL_DEFINE() macros give us an 'unsigned long'
	 * return type, but the ABI for sys_modify_ldt() expects
	 * 'int'. This cast gives us an int-sized value in %rax
	 * for the return code. The 'unsigned' is necessary so
	 * the compiler does not try to sign-extend the negative
	 * return codes into the high half of the register when
	 * taking the value from int->long.
	 */
	return (unsigned int)ret;
2005-04-17 06:20:36 +08:00
}
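The effect of the final cast can be seen in isolation. The sketch below is a user-space illustration with hypothetical helper names, not kernel code: it contrasts widening a negative int through long (sign extension) with widening it through unsigned int (zero extension). The checks are written against UINT_MAX and ULONG_MAX so they hold on any C implementation; on an LP64 target such as x86-64 Linux the two results differ in the upper 32 bits.

```c
#include <assert.h>
#include <limits.h>

/* int -> long sign-extends, so a negative value fills the high half. */
static unsigned long widen_sign_extended(int ret)
{
	return (unsigned long)(long)ret;
}

/* int -> unsigned int -> unsigned long zero-extends the high half. */
static unsigned long widen_zero_extended(int ret)
{
	return (unsigned int)ret;
}

/* Portable sanity checks for the two conversions. */
static void check_widening(void)
{
	assert(widen_sign_extended(-1) == ULONG_MAX);
	assert(widen_zero_extended(-1) == UINT_MAX);
	assert(widen_sign_extended(7) == 7UL);
	assert(widen_zero_extended(7) == 7UL);
}
```

Where unsigned long is 64 bits wide, widen_zero_extended(-1) leaves the upper half of the result clear, which is the int-sized value the comment says the modify_ldt() ABI expects.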