2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-05 04:04:01 +08:00
Commit Graph

37346 Commits

Author SHA1 Message Date
Giovanni Gherdovich
3149cd5530 x86: Print ratio freq_max/freq_base used in frequency invariance calculations
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.

 # dmesg | grep 'Estimated ratio of average max'
 [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
2020-12-11 10:30:23 +01:00
Giovanni Gherdovich
976df7e573 x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
Frequency invariant accounting calculations need the ratio
freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
Three candidates are considered to estimate this value:

- maximum non-boost frequency
- maximum boost frequency
- the mid point between the above two

Experimental data on an AMD EPYC Zen2 machine slightly favors the third
option, which is applied with this patch.

The analysis uses the ondemand cpufreq governor as baseline, and compares
it with schedutil in a number of configurations. Using the freq_max value
described above offers a moderate advantage in performance and efficiency:

sugov-max (freq_max=max_boost) performs the worst on tbench: less
throughput and reduced efficiency than the other invariant-schedutil
options (see "Data Overview" below). Consider that tbench is generally a
problematic case as no schedutil version currently is better than ondemand.

sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's
can surpass ondemand with less filesystem latency and slightly increased
efficiency.

1. DATA OVERVIEW
2. DETAILED PERFORMANCE TABLES
3. POWER CONSUMPTION TABLE

1. DATA OVERVIEW
================

sugov-noinv : non-invariant schedutil governor
sugov-max   : invariant schedutil, freq_max=max_boost
sugov-mid   : invariant schedutil, freq_max=midpoint
sugov-P0    : invariant schedutil, freq_max=max_P
perfgov     : performance governor

driver      : acpi_cpufreq
machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
              128 cores / 256 threads, SATA SSD storage, 250G of memory,
	      XFS filesystem

Benchmarks are described in the next section.
Tilde (~) means the value is the same as baseline.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                        PERFORMANCE RATIOS
tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
kernbench     1.00       0.93       ~          ~          ~          0.97      lower
gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
kernbench     1.00       1.05       ~          ~          ~          ~         higher
gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher

2. DETAILED PERFORMANCE TABLES
==============================

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)

Benchmark          : dbench (filesystem stressor)
Varying parameter  : number of clients
Unit               : seconds (lower is better)

NOTE-1: This dbench version measures the average latency of a set of filesystem
        operations, as we found the traditional dbench metric (throughput) to be
	misleading.
NOTE-2: Due to high variability, we partition the original dataset and apply
        statistical bootrapping (a resampling method). Accuracy is reported in the
	form of 95% confidence intervals.

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
Amean  4  	217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
Amean  8  	108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
Amean  16 	 58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
Amean  32 	 34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
Amean  64 	 22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
Amean  128	 17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
Amean  256	 16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
Amean  512	 16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)

3. POWER CONSUMPTION TABLE
==========================

Average power consumption (watts).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
tbench4     227.25     281.83     244.17     236.76     241.50     247.99
dbench4     151.97     161.87     157.08     158.10     158.06     153.73
kernbench   162.78     167.22     162.90     164.19     164.65     164.72
gitsource   133.65     139.00     133.04     134.43     134.18     134.32

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
2020-12-11 10:29:55 +01:00
Nathan Fontenot
41ea667227 x86, sched: Calculate frequency invariance for AMD systems
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.

On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.

Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.

Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.

[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]

Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
2020-12-11 10:26:00 +01:00
Thomas Gleixner
058df195c2 x86/ioapic: Cleanup the timer_works() irqflags mess
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:

        local_irq_save(flags);
	local_irq_enable();
        local_irq_restore(flags);

which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.

That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/87k0tpju47.fsf@nanos.tec.linutronix.de
2020-12-10 23:02:31 +01:00
Thomas Gleixner
190113b4c6 x86/apic/vector: Fix ordering in vector assignment
Prarit reported that depending on the affinity setting the

 ' irq $N: Affinity broken due to vector space exhaustion.'

message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.

Shung-Hsi provided traces and analysis which pinpoints the problem:

The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:

 1) Try the intersection of affinity mask and node mask
 2) Try the node mask
 3) Try the full affinity mask
 4) Try the full online mask

Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.

In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.

Revert the order of #2 and #3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.

If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.

Fixes: d6ffc6ac83 ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
2020-12-10 23:00:54 +01:00
Xiaochen Shen
06c5fe9b12 x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().

The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:

  rdtgroup_mondata_show()
    mon_event_read()
      mon_event_count()
        __mon_event_count()

In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.

When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.

Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.

Test steps:
  # Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
  git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
  && make
  ./tools/membw/membw -c 0 -b 10000 --read

  # Enable MBA software controller
  mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

  # Create control group c1
  mkdir /sys/fs/resctrl/c1

  # Set MB throttle to 6 GB/s
  echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata

  # Write PID of the workload to tasks file
  echo `pidof membw` > /sys/fs/resctrl/c1/tasks

  # Read local bytes counters twice with 1s interval, the calculated
  # local bandwidth is not as expected (approaching to 6 GB/s):
  local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
  sleep 1
  local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
  echo "local b/w (bytes/s):" `expr $local_2 - $local_1`

Before fix:
  local b/w (bytes/s): 11076796416

After fix:
  local b/w (bytes/s): 5465014272

Fixes: ba0f26d852 (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
2020-12-10 17:52:37 +01:00
Arvind Sankar
29ac40cbed x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
The PAT bit is in different locations for 4k and 2M/1G page table
entries.

Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.

Fixes: 6ebcb06071 ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
2020-12-10 12:28:06 +01:00
Maxim Levitsky
f57ad63a83 KVM: x86: ignore SIPIs that are received while not in wait-for-sipi state
In the commit 1c96dcceae
("KVM: x86: fix apic_accept_events vs check_nested_events"),

we accidently started latching SIPIs that are received while the cpu is not
waiting for them.

This causes vCPUs to never enter a halted state.

Fixes: 1c96dcceae ("KVM: x86: fix apic_accept_events vs check_nested_events")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201203143319.159394-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-09 16:18:30 -05:00
Kan Liang
c2208046bb perf/x86/intel: Add Tremont Topdown support
Tremont has four L1 Topdown events, TOPDOWN_FE_BOUND.ALL,
TOPDOWN_BAD_SPECULATION.ALL, TOPDOWN_BE_BOUND.ALL and
TOPDOWN_RETIRING.ALL. They are available on GP counters.

Export them to sysfs and facilitate the perf stat tool.

 $perf stat --topdown -- sleep 1

 Performance counter stats for 'sleep 1':

            retiring      bad speculation       frontend bound
backend bound
               24.9%                16.8%                31.7%
26.6%

       1.001224610 seconds time elapsed

       0.001150000 seconds user
       0.000000000 seconds sys

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1607457952-3519-1-git-send-email-kan.liang@linux.intel.com
2020-12-09 17:08:59 +01:00
Gustavo A. R. Silva
bd11952b40 uprobes/x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
2020-12-09 17:08:59 +01:00
Gustavo A. R. Silva
b645957545 perf/x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a fallthrough pseudo-keyword as a replacement for
a /* fall through */ comment, instead of letting the code fall through
to the next case.

Notice that Clang doesn't recognize /* fall through */ comments as
implicit fall-through markings.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
2020-12-09 17:08:59 +01:00
Gustavo A. R. Silva
e689b300c9 kprobes/x86: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
2020-12-09 17:08:58 +01:00
Kan Liang
f8129cd958 perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
The cycle count of a timed LBR is always 1 in perf record -D.

The cycle count is stored in the first 16 bits of the IA32_LBR_x_INFO
register, but the get_lbr_cycles() return Boolean type.

Use u16 to replace the Boolean type.

Fixes: 47125db27e ("perf/x86/intel/lbr: Support Architectural LBR")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201125213720.15692-2-kan.liang@linux.intel.com
2020-12-09 17:08:58 +01:00
Kan Liang
46b72e1bf4 perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
According to the event list from icelake_core_v1.09.json, the encoding
of the RTM_RETIRED.ABORTED event on Ice Lake should be,
    "EventCode": "0xc9",
    "UMask": "0x04",
    "EventName": "RTM_RETIRED.ABORTED",

Correct the wrong encoding.

Fixes: 6017608936 ("perf/x86/intel: Add Icelake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201125213720.15692-1-kan.liang@linux.intel.com
2020-12-09 17:08:57 +01:00
Masami Hiramatsu
78ff2733ff x86/kprobes: Restore BTF if the single-stepping is cancelled
Fix to restore BTF if single-stepping causes a page fault and
it is cancelled.

Usually the BTF flag was restored when the single stepping is done
(in resume_execution()). However, if a page fault happens on the
single stepping instruction, the fault handler is invoked and
the single stepping is cancelled. Thus, the BTF flag is not
restored.

Fixes: 1ecc798c67 ("x86: debugctlmsr kprobes")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
2020-12-09 17:08:57 +01:00
Andy Lutomirski
a493d1ca1a x86/membarrier: Get rid of a dubious optimization
sync_core_before_usermode() had an incorrect optimization.  If the kernel
returns from an interrupt, it can get to usermode without IRET. It just has
to schedule to a different task in the same mm and do SYSRET.  Fortunately,
there were no callers of sync_core_before_usermode() that could have had
in_irq() or in_nmi() equal to true, because it's only ever called from the
scheduler.

While at it, clarify a related comment.

Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org
2020-12-09 09:37:42 +01:00
Arvind Sankar
262bd5724a x86/cpu/amd: Remove dead code for TSEG region remapping
Commit

  26bfa5f894 ("x86, amd: Cleanup init_amd")

moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().

However, bsp_init_amd() is executed well before the direct mapping is
actually created:

  setup_arch()
    -> early_cpu_init()
      -> early_identify_cpu()
        -> this_cpu->c_bsp_init()
	  -> bsp_init_amd()
    ...
    -> init_mem_mapping()

So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.

It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.

Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.

Committer notes:

This workaround is incomplete anyway:

1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.

2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.

3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after

  66520ebc2d ("x86, mm: Only direct map addresses that are marked as E820_RAM")

which means, there's nothing to be worked around either.

So let's rip it out.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
2020-12-08 18:45:21 +01:00
Borislav Petkov
f77f420d34 x86/msr: Add a pointer to an URL which contains further details
After having collected the majority of reports about MSRs being written
by userspace tools and what tools those are, and all newer reports
mostly repeating, add an URL where detailed information is gathered and
kept up-to-date.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201205002825.19107-1-bp@alien8.de
2020-12-08 10:17:08 +01:00
Mike Travis
148c277165 x86/platform/uv: Add deprecated messages to /proc info leaves
Add "deprecated" message to any access to old /proc/sgi_uv/* leaves.

 [ bp: Do not have a trailing function opening brace and the arguments
   continuing on the next line and align them on the opening brace. ]

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-5-mike.travis@hpe.com
2020-12-07 20:03:09 +01:00
Mike Travis
a67fffb017 x86/platform/uv: Add kernel interfaces for obtaining system info
Add kernel interfaces used to obtain info for the uv_sysfs driver
to display.

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-2-mike.travis@hpe.com
2020-12-07 19:44:54 +01:00
Qiujun Huang
72ebb5ff80 x86/alternative: Update text_poke_bp() kernel-doc comment
Update kernel-doc parameter name after

  c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")

changed the last parameter from @handler to @emulate.

 [ bp: Make commit message more precise. ]

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201203145020.2441-1-hqjagain@gmail.com
2020-12-07 12:44:22 +01:00
Linus Torvalds
8100a58044 A set of fixes for x86:
- Make the AMD L3 QoS code and data priorization enable/disable mechanism
    work correctly. The control bit was only set/cleared on one of the CPUs
    in a L3 domain, but it has to be modified on all CPUs in the domain. The
    initial documentation was not clear about this, but the updated one from
    Oct 2020 spells it out.
 
  - Fix an off by one in the UV platform detection code which causes the UV
    hubs to be identified wrongly. The chip revisions start at 1 not at 0.
 
  - Fix a long standing bug in the evaluation of prefixes in the uprobes
    code which fails to handle repeated prefixes properly. The aggregate
    size of the prefixes can be larger than the bytes array but the code
    blindly iterated over the aggregate size beyond the array boundary.
    Add a macro to handle this case properly and use it at the affected
    places.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M2GoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUGeD/0TxQyVIZTJjY/ExDYvQZeSWgvFVon9
 A/QcwkgtkWzYKI3YThgxBtSNKhHPP8eQ9QK8ErcaQKEHU5TdXVEOwHgTm1A6OGat
 0+M9/EWEe7tTu+cLpu7esQ+1VbSvEcFXZbljbhCKrShlBlsIjFG4eCPAprSw9yjI
 RSgfXKZu4NmHVS6nfJTwmIIaTeLQ6U3b7b1D5s/66slBFScqnLbRNhABVbHbos4F
 pl/lxDCFOddy2YbEojHjjGqMA7oxPav7c0nYFOM/zG+wAqfEjbqOxReT31bGQPi2
 XT9K4JEqDqILo0KnhV4GsYoWAhes3BtmsJ9IoZ7IijsMriYl80mD9URAORidJ4PX
 28Ckk9V/DlE8uDrAnBDcWDSoKlg78mhVV7V9L6v43teg/gJfSZNROtNDBmqRmwG4
 Op2NJfzJITtaxVQuSZRkSs8rzGv+QUfaM1sBUQ+Oz4KYeIjjA7G2MAOECrzIAWKB
 GWc5toYRVS6oGT+RbZhSxZYoh8ASoGJ2MrL8K4OV4RqEqHHcXcih0WmmljtsDIFI
 td4FHHH6fghIb9S6iYKiApd6k2qKa33mwJwa/xZOoIrv0w5xT0WDJnxT60gu/Mec
 YDkqhmA009CNSD2G4oNRNF5MH7gp34UII+25jOGatbVh+5DDPYs+5Jnh/DR7jssR
 PryAG9ER7UUb6w==
 =rNBq
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for x86:

   - Make the AMD L3 QoS code and data priorization enable/disable
     mechanism work correctly.

     The control bit was only set/cleared on one of the CPUs in a L3
     domain, but it has to be modified on all CPUs in the domain. The
     initial documentation was not clear about this, but the updated one
     from Oct 2020 spells it out.

   - Fix an off by one in the UV platform detection code which causes
     the UV hubs to be identified wrongly.

     The chip revisions start at 1 not at 0.

   - Fix a long standing bug in the evaluation of prefixes in the
     uprobes code which fails to handle repeated prefixes properly.

     The aggregate size of the prefixes can be larger than the bytes
     array but the code blindly iterated over the aggregate size beyond
     the array boundary. Add a macro to handle this case properly and
     use it at the affected places"

* tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
  x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
  x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
  x86/platform/uv: Fix UV4 hub revision adjustment
  x86/resctrl: Fix AMD L3 QOS CDP enable/disable
2020-12-06 11:22:39 -08:00
Linus Torvalds
9f6b28d498 Two fixes for performance monitoring on X86:
- Add recursion protection to another callchain invoked from
       x86_pmu_stop() which can recurse back into x86_pmu_stop(). The first
       attempt to fix this missed this extra code path.
 
     - Use the already filtered status variable to check for PEBS counter
       overflow bits and not the unfiltered full status read from
       IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which
       would be evaluated incorrectly.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M1lwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeKPEACo/xfvFgp/rlSIjh12OW7XTyKqKMVq
 C8TvMFbzn3Pzx89RyJctyyi1d5xYz6ic3uXkV0yGJ1g9fkJdzhgCY/byoTd/nWor
 /r6RRtw5Ah3werXlESbZWTb6691eqGO4TpWZMAwdDeV1eRAc7ZSoOLJiY0o4HVWd
 AE2aX9/psikHBAeMzMHn5W4pELl0s6s6aX9OJLFFZeroA3UW3bWOHu8o/iylmld9
 wnQPvGe+kk35ZBza8/kbBJv3ZasUBPXy1+9Z1p1aE8dmd+HrYnw9vS3El9dgtX1Y
 ADKttAD3S3j/PbBOXOnIRoSTM7Vvc5gviLyPfw6erBIMtw/OBFSrRD7IbVJpyZmn
 yysBgacu6pzeLXRZ4ZktQ2GPeAgdiZwsqwuvuQGd5t1zfdNkZEEI7aK87CrHCYOk
 zMt7vNK/q5xF9wt9C3cYzka/Sp7UA10zqP31/fv2fOZb4dcANTtm9reS/fwqd67X
 HYhYiks4z1wxqTOf76wBpXau9ZtQLb20MmC3S6IknDF39pODO1sGcWMJSB0AgYUE
 3125ZTxrmlgv+U7S5qBHKB5ot5wvNUexiDD0g0o2N9x4dCQqErmvwUDWWUoqq35A
 g4tqODudwM4aU05leWO/piZEBr4xxAiRke9r8WJZNHAo0C7yHjjmXRtZYDkSf2pV
 U0gyy4dWTsXKKw==
 =YbCo
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "Two fixes for performance monitoring on X86:

   - Add recursion protection to another callchain invoked from
     x86_pmu_stop() which can recurse back into x86_pmu_stop(). The
     first attempt to fix this missed this extra code path.

   - Use the already filtered status variable to check for PEBS counter
     overflow bits and not the unfiltered full status read from
     IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which
     would be evaluated incorrectly"

* tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Check PEBS status correctly
  perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
2020-12-06 11:20:18 -08:00
Linus Torvalds
e6585a4939 Kbuild fixes for v5.10 (2nd)
- Move -Wcast-align to W=3, which tends to be false-positive and there
    is no tree-wide solution.
 
  - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a preprocessor
    option and makes sense for .S files as well.
 
  - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.
 
  - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.
 
  - Fix undesirable line breaks in *.mod files.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl/MzyMVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGKJ8P/2kLq296XAPjqC90/LWMja8dsXO/
 Wgaq8zC819x0JFuGdBKlwlFe3AvFYRtts9V5+mzjxvsOjH/6+xzyrXjRPCwZYqlj
 XKC3ZwuS2SGDPFCriI1edwTUp5tyDnG/VBjqbf3ybQnz0LAShidXBD9IlM/XX9Rz
 BlWqd7Uib50Pq8AfM2JVokrSmkkvhqxocIsmjTa0wvRjRAw7+aVkGNCWXqnTho7y
 YuHmTWbmUQIROF3Bzs1fkGp+qaQofPRfA1tTwaTVvgmt8rEqyzXi11y6kj56INfg
 /pq4O1KrplKtJFdrcjj4/eptqHG3I+Jq56qCHVescF6+bH6cc6BUL8qDdAzFZQai
 e/pWCzREqFDKchEmT2d0Uzik8Zfxi5Cw68Otpzb4LqTUUxXSoRx1R9Of/Ei5QZum
 6b6s9Q41UwH983UQCOOSGjXGZYP6fZG1a0XejbduYo7TL4KEECAO/FlLBWGttYH3
 0i3aKz3aDKb/fo7hDbbqg+o6F0mShEraqxMmWgIvgGt+k76j0O0wS2KryqpTd7Vv
 xg72suGM7f9QBA50lZ0r32fm86XnlqwQAm9ZMaSXR1Ii7j4F9UNRmR/FUYq7dPwa
 COkuHr+9LqzV/tkluWi2rjLIGPaCuEVeSCcQ/wIDdp2iOyb54CbozwK0Yi2dxxus
 jVFKwSaMUDHrkSj6
 =/ysh
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Move -Wcast-align to W=3, which tends to be false-positive and there
   is no tree-wide solution.

 - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a
   preprocessor option and makes sense for .S files as well.

 - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.

 - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.

 - Fix undesirable line breaks in *.mod files.

* tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: avoid split lines in .mod files
  kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1
  kbuild: Hoist '--orphan-handling' into Kconfig
  Kbuild: do not emit debug info for assembly with LLVM_IAS=1
  kbuild: use -fmacro-prefix-map for .S sources
  Makefile.extrawarn: move -Wcast-align to W=3
2020-12-06 10:31:39 -08:00
Masami Hiramatsu
84da009f06 x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper
check must be:

  insn.prefixes.bytes[i] != 0 and i < 4

instead of using insn.prefixes.nbytes. Use the new
for_each_insn_prefix() macro which does it correctly.

Debugged by Kees Cook <keescook@chromium.org>.

 [ bp: Massage commit message. ]

Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/160697106089.3146288.2052422845039649176.stgit@devnote2
2020-12-06 10:03:08 +01:00
Masami Hiramatsu
12cb908a11 x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be

  insn.prefixes.bytes[i] != 0 and i < 4

instead of using insn.prefixes.nbytes. Use the new
for_each_insn_prefix() macro which does it correctly.

Debugged by Kees Cook <keescook@chromium.org>.

 [ bp: Massage commit message. ]

Fixes: 32d0b95300 ("x86/insn-eval: Add utility functions to get segment selector")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697104969.3146288.16329307586428270032.stgit@devnote2
2020-12-06 10:03:08 +01:00
Masami Hiramatsu
4e9a5ae8df x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be

  insn.prefixes.bytes[i] != 0 and i < 4

instead of using insn.prefixes.nbytes.

Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
Kees Cook <keescook@chromium.org>.

 [ bp: Massage commit message, sync with the respective header in tools/
   and drop "we". ]

Fixes: 2b14449835 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2
2020-12-06 09:58:13 +01:00
Rick Edgecombe
339f5a7fb2 kvm: x86/mmu: Use cpuid to determine max gfn
In the TDP MMU, use shadow_phys_bits to dermine the maximum possible GFN
mapped in the guest for zapping operations. boot_cpu_data.x86_phys_bits
may be reduced in the case of HW features that steal HPA bits for other
purposes. However, this doesn't necessarily reduce GPA space that can be
accessed via TDP. So zap based on a maximum gfn calculated with MAXPHYADDR
retrieved from CPUID. This is already stored in shadow_phys_bits, so use
it instead of x86_phys_bits.

Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-Id: <20201203231120.27307-1-rick.p.edgecombe@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-04 03:48:33 -05:00
Jacob Xu
a2b2d4bf50 kvm: svm: de-allocate svm_cpu_data for all cpus in svm_cpu_uninit()
The cpu arg for svm_cpu_uninit() was previously ignored resulting in the
per cpu structure svm_cpu_data not being de-allocated for all cpus.

Signed-off-by: Jacob Xu <jacobhxu@google.com>
Message-Id: <20201203205939.1783969-1-jacobhxu@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-04 03:47:58 -05:00
Uros Bizjak
be169fe3ce crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-12-04 18:13:16 +11:00
Uros Bizjak
0b837f1ef8 crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-12-04 18:13:15 +11:00
Uros Bizjak
032d049ea0 crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-12-04 18:13:15 +11:00
Jarkko Sakkinen
a4b9c48b96 x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
The sgx_enclave_add_pages.length field is documented as

 * @length:     length of the data (multiple of the page size)

Fail with -EINVAL, when the caller gives a zero length buffer of data
to be added as pages to an enclave. Right now 'ret' is returned as
uninitialized in that case.

 [ bp: Flesh out commit message. ]

Fixes: c6d26d3707 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-sgx/X8ehQssnslm194ld@mwanda/
Link: https://lkml.kernel.org/r/20201203183527.139317-1-jarkko@kernel.org
2020-12-03 19:54:40 +01:00
Paolo Bonzini
dee734a7de KVM: x86: adjust SEV for commit 7e8e6eed75
Since the ASID is now stored in svm->asid, pre_sev_run should also place
it there and not directly in the VMCB control area.

Reported-by: Ashish Kalra <Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-03 12:38:39 -05:00
Mike Travis
8dcc0e19df x86/platform/uv: Fix UV4 hub revision adjustment
Currently, UV4 is incorrectly identified as UV4A and UV4A as UV5. Hub
chip starts with revision 1, fix it.

 [ bp: Massage commit message. ]

Fixes: 647128f153 ("x86/platform/uv: Update UV MMRs for UV5")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20201203152252.371199-1-mike.travis@hpe.com
2020-12-03 18:09:18 +01:00
Stephane Eranian
fc17db8aa4 perf/x86/intel: Check PEBS status correctly
The kernel cannot disambiguate when 2+ PEBS counters overflow at the
same time. This is what the comment for this code suggests.  However,
I see the comparison is done with the unfiltered p->status which is a
copy of IA32_PERF_GLOBAL_STATUS at the time of the sample. This
register contains more than the PEBS counter overflow bits. It also
includes many other bits which could also be set.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126110922.317681-2-namhyung@kernel.org
2020-12-03 10:00:26 +01:00
Namhyung Kim
5debf02131 perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
The commit 3966c3feca ("x86/perf/amd: Remove need to check "running"
bit in NMI handler") introduced this.  It seems x86_pmu_stop can be
called recursively (like when it losts some samples) like below:

  x86_pmu_stop
    intel_pmu_disable_event  (x86_pmu_disable)
      intel_pmu_pebs_disable
        intel_pmu_drain_pebs_nhm  (x86_pmu_drain_pebs_buffer)
          x86_pmu_stop

While commit 35d1ce6bec ("perf/x86/intel/ds: Fix x86_pmu_stop
warning for large PEBS") fixed it for the normal cases, there's
another path to call x86_pmu_stop() recursively when a PEBS error was
detected (like two or more counters overflowed at the same time).

Like in the Kan's previous fix, we can skip the interrupt accounting
for large PEBS, so check the iregs which is set for PMI only.

Fixes: 3966c3feca ("x86/perf/amd: Remove need to check "running" bit in NMI handler")
Reported-by: John Sperbeck <jsperbeck@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126110922.317681-1-namhyung@kernel.org
2020-12-03 10:00:26 +01:00
Mauro Carvalho Chehab
bab8c183d1 x86/sgx: Fix a typo in kernel-doc markup
Fix the following kernel-doc warning:

  arch/x86/include/uapi/asm/sgx.h:19: warning: expecting prototype \
    for enum sgx_epage_flags. Prototype was for enum sgx_page_flags instead

 [ bp: Launder the commit message. ]

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/ca11a4540d981cbd5f026b6cbc8931aa55654e00.1606897462.git.mchehab+huawei@kernel.org
2020-12-02 12:54:47 +01:00
Gabriel Krisman Bertazi
1d7637d89c signal: Expose SYS_USER_DISPATCH si_code type
SYS_USER_DISPATCH will be triggered when a syscall is sent to userspace
by the Syscall User Dispatch mechanism.  This adjusts eventual
BUILD_BUG_ON around the tree.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-3-krisman@collabora.com
2020-12-02 10:32:16 +01:00
Gabriel Krisman Bertazi
c5c878125a x86: vdso: Expose sigreturn address on vdso to the kernel
Syscall user redirection requires the signal trampoline code to not be
captured, in order to support returning with a locked selector while
avoiding recursion back into the signal handler.  For ia-32, which has
the trampoline in the vDSO, expose the entry points to the kernel, such
that it can avoid dispatching syscalls from that region to userspace.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-2-krisman@collabora.com
2020-12-02 10:32:16 +01:00
Gabriele Paoloni
e1c06d2366 x86/mce: Rename kill_it to kill_current_task
Currently, if an MCE happens in user-mode or while the kernel is copying
data from user space, 'kill_it' is used to check if execution of the
interrupted task can be recovered or not; the flag name however is not
very meaningful, hence rename it to match its goal.

 [ bp: Massage commit message, rename the queue_task_work() arg too. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
2020-12-01 18:58:50 +01:00
Gabriele Paoloni
d5b38e3d0f x86/mce: Remove redundant call to irq_work_queue()
Currently, __mc_scan_banks() in do_machine_check() does the following
callchain:

  __mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work).

Hence, the call to irq_work_queue() below after __mc_scan_banks()
seems redundant. Just remove it.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
2020-12-01 18:54:32 +01:00
Gabriele Paoloni
3a866b16fd x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3
Right now for LMCE, if no_way_out is set, mce_panic() is called
regardless of mca_cfg.tolerant. This is not correct as, if
mca_cfg.tolerant = 3, the code should never panic.

Add that check.

 [ bp: use local ptr 'cfg'. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
2020-12-01 18:49:29 +01:00
Gabriele Paoloni
e273e6e12a x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places
Right now, for local MCEs the machine calls panic(), if needed, right
after lmce is set. For MCE broadcasting, mce_reign() takes care of
calling mce_panic().

Hence:
- improve readability by moving the conditional evaluation of
tolerant up to when kill_it is set first;
- move the mce_panic() call up into the statement where mce_end()
fails.

 [ bp: Massage, remove comment in the mce_end() failure case because it
   is superfluous; use local ptr 'cfg' in both tests. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
2020-12-01 18:45:56 +01:00
Borislav Petkov
15936ca13d Linux 5.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
 8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
 u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
 wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
 ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
 s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
 VIjrjNw=
 =4ZaP
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc6' into ras/core

Merge the -rc6 tag to pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01 18:22:55 +01:00
Xiaochen Shen
19eb86a72d x86/resctrl: Clean up unused function parameter in rmdir path
Commit

  fd8d9db355 ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")

removed superfluous kernfs_get() calls in rdtgroup_ctrl_remove() and
rdtgroup_rmdir_ctrl(). That change resulted in an unused function
parameter to these two functions.

Clean up the unused function parameter in rdtgroup_ctrl_remove(),
rdtgroup_rmdir_mon() and their callers rdtgroup_rmdir_ctrl() and
rdtgroup_rmdir().

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1606759618-13181-1-git-send-email-xiaochen.shen@intel.com
2020-12-01 18:06:35 +01:00
Borislav Petkov
87314fb181 Linux 5.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
 8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
 u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
 wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
 ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
 s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
 VIjrjNw=
 =4ZaP
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc6' into x86/cache

Merge -rc6 tag to pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01 18:01:27 +01:00
Babu Moger
fae3a13d2a x86/resctrl: Fix AMD L3 QOS CDP enable/disable
When the AMD QoS feature CDP (code and data prioritization) is enabled
or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs
in an L3 domain (core complex). That is not correct - the CDP bit needs
to be updated on all the logical CPUs in the domain.

This was not spelled out clearly in the spec earlier. The specification
has been updated and the updated document, "AMD64 Technology Platform
Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue
Date: October 2020" is available now. Refer the section: Code and Data
Prioritization.

Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache
data structure.

The documentation can be obtained at:
https://developer.amd.com/wp-content/resources/56375.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

 [ bp: Massage commit message. ]

Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
2020-12-01 17:53:31 +01:00
Nick Desaulniers
2838307b01 x86/build: Remove -m16 workaround for unsupported versions of GCC
Revert the following two commits:

  de3accdaec ("x86, build: Build 16-bit code with -m16 where possible")
  a9cfccee66 ("x86, build: Change code16gcc.h from a C header to an assembly header")

Since

  0bddd227f3 ("Documentation: update for gcc 4.9 requirement")

the minimum supported version of GCC is gcc-4.9. It's now safe to remove
this code.

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20201201011307.3676986-1-ndesaulniers@google.com
2020-12-01 17:17:18 +01:00
Nathan Chancellor
59612b24f7 kbuild: Hoist '--orphan-handling' into Kconfig
Currently, '--orphan-handling=warn' is spread out across four different
architectures in their respective Makefiles, which makes it a little
unruly to deal with in case it needs to be disabled for a specific
linker version (in this case, ld.lld 10.0.1).

To make it easier to control this, hoist this warning into Kconfig and
the main Makefile so that disabling it is simpler, as the warning will
only be enabled in a couple places (main Makefile and a couple of
compressed boot folders that blow away LDFLAGS_vmlinx) and making it
conditional is easier due to Kconfig syntax. One small additional
benefit of this is saving a call to ld-option on incremental builds
because we will have already evaluated it for CONFIG_LD_ORPHAN_WARN.

To keep the list of supported architectures the same, introduce
CONFIG_ARCH_WANT_LD_ORPHAN_WARN, which an architecture can select to
gain this automatically after all of the sections are specified and size
asserted. A special thanks to Kees Cook for the help text on this
config.

Link: https://github.com/ClangBuiltLinux/linux/issues/1187
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-12-01 22:45:36 +09:00
Sami Tolvanen
83321c335d x86/pci: Fix the function type for check_reserved_t
e820__mapped_all() is passed as a callback to is_mmconf_reserved(),
which expects a function of type:

  typedef bool (*check_reserved_t)(u64 start, u64 end, unsigned type);

However, e820__mapped_all() accepts enum e820_type as the last argument
and this type mismatch trips indirect call checking with Clang's
Control-Flow Integrity (CFI).

As is_mmconf_reserved() only passes enum e820_type values for the
type argument, change the typedef and the unused type argument in
is_acpi_reserved() to enum e820_type to fix the type mismatch.

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201130193900.456726-1-samitolvanen@google.com
2020-12-01 14:22:52 +01:00
Linus Torvalds
f91a3aa6bc Yet two more places which invoke tracing from RCU disabled regions in the
idle path. Similar to the entry path the low level idle functions have to
 be non-instrumentable.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
 bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
 xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
 NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
 LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
 ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
 9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
 KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
 am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
 SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
 lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
 cT1B/lA1PHXj6Q==
 =raED
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two more places which invoke tracing from RCU disabled regions in the
  idle path.

  Similar to the entry path the low level idle functions have to be
  non-instrumentable"

* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intel_idle: Fix intel_idle() vs tracing
  sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-29 11:19:26 -08:00
Linus Torvalds
7255a39d24 - Two resctrl fixes to prevent refcount leaks when manipulating the
resctrl fs (Xiaochen Shen)
 
 - Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)
 
 - A fix to not lose already seen MCE severity which determines whether
   the machine can recover (Gabriele Paoloni)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/DeEcACgkQEsHwGGHe
 VUrAlg/+KO2GMYiz5mNslI3tg0tUf0tDEw7bsfsibGL9XkkRRre0zn2pTN3GoutG
 ZeM9d+epFNfv1msyTYXjhhMeO5XLGnuBzyeCv4wQkIH0zdNJSencPlr6tKnaMrAo
 Alk/BxmvBmOOWHayBOhgNBVYWXBoACkIzIUFi0d3lRbgiGDhOqowYbYkLWIu/tvm
 pI8anovuAfkluhh+tD96TyBrSa+GfWWiOYItNbJbmKAbGjGbJU5IEVmlXeMEziGO
 7zmGNwl4Di2V3lU5oDJNcYvF2Ngok/L81QqOMKeVu7EYDHLBjHxp0rxwuE8QqlX8
 bk40EEHyI1Aiwx/7WxB/ChnDOfpIpXljWEvropZeNOCY1LqDMnMnAJFrygQAmYeo
 t9GxbhBYvtXSxTkkfTWzowt20Vmm2+r5j09kpq3tfMrZXwk7VrSF8TjQ/3iM/5iM
 eV/wPaigqX0xZwdzF1o3fuXeM+RXHhMrnX+cKlFemUX9uFZi0/ZXk3TlJwCG/jKL
 QWbpTBuxcDwXR3K1RNPjBJG1OA1cupZQ9ePK8h95H1844rEwglreypCOf3LwDJzU
 9XEZo94V/J+gxNzg5iakidtdBMSst+SYY+vd/SUlLHXzkwcykOzwPgvZwVA7U2HM
 74Eh7eQhIht3Lin35i9h1Ez/bxE/Ss6Fi9wnObp8ij5ImOaS5eU=
 =JrT9
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "A couple of urgent fixes which accumulated this last week:

   - Two resctrl fixes to prevent refcount leaks when manipulating the
     resctrl fs (Xiaochen Shen)

   - Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry)

   - A fix to not lose already seen MCE severity which determines
     whether the machine can recover (Gabriele Paoloni)"

* tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Do not overwrite no_way_out if mce_end() fails
  x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
  x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
  x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
2020-11-29 10:08:17 -08:00
Arnd Bergmann
718e43b5f8 Linux 5.10-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+fOigeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGoQ0H/RLJU2FMIjO0mzLX
 9LqePQ9QmNWG4KeqxwWaKq90MinIbnSG3CDPKruu8RNh2Rr6nsEJmqg1DWyEiFRB
 8gzsBXMAC1i2aPfOrOnCJEfP+L+svKlbSii475tNdZw2DhP+/FBT0RVCt3rRhrRs
 atc8+dM7ViGLnlvRJ4LlVqA3d1kjOr5bsPYcIcnGIHY8mYWBLFzTSVgDdrcB9+3l
 7lZud/zMhJ3dS0bcnbIUS1YpBxHCsgEaMFQYmcv3RruIaaFbh5THkfQUSmbmrAru
 /EeVjwVMuvpvb2jxS1ofLx2in7t4tsNgItu4AfMmV0BurM5NhpqKo7mo/1nmR/X9
 Q4tjPRc=
 =cUbb
 -----END PGP SIGNATURE-----

Backmerge tag 'v5.10-rc2' into arm/drivers

The SCMI pull request for the arm/drivers branch requires v5.10-rc2
because of dependencies with other git trees, so merge that in here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-11-27 21:04:53 +01:00
Linus Torvalds
3913a2bc81 ARM:
- Fix alignment of the new HYP sections
 - Fix GICR_TYPER access from userspace
 
 S390:
 - do not reset the global diag318 data for per-cpu reset
 - do not mark memory as protected too early
 - fix for destroy page ultravisor call
 
 x86:
 - fix for SEV debugging
 - fix incorrect return code
 - fix for "noapic" with PIC in userspace and LAPIC in kernel
 - fix for 5-level paging
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/BKpQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPrZgf+Jdw1ONU5hFLl5Xz2YneVppqMr3nh
 X/Nr/dGzP+ve2FPNgkMotwqOWb/6jwKYKbliB2Q6fS51/7MiV7TDizna8ZpzEn12
 M0/NMWtW7Luq7yTTnXUhClG4QfRvn90EaflxUYxCBSRRhDleJ9sCl4Ga5b1fDIdQ
 AeDdqJV4ElCGUrPM1my4vemrbFeiiEeDeWZvb6TP5LlJS+EDZeehk9zEAB7PFwAu
 oX3O8WUbRxRYakZR1PPIn8e0qh2zaVDFUk/sZKJLOCCPx2UnOErf3jV6rQEMeSPC
 5aOspfq+gI3jukufdyNxcKxRSj8Jw63f0vDaUgd4H71dsG390gM6onQiQg==
 =IyC5
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:
   - Fix alignment of the new HYP sections
   - Fix GICR_TYPER access from userspace

  S390:
   - do not reset the global diag318 data for per-cpu reset
   - do not mark memory as protected too early
   - fix for destroy page ultravisor call

  x86:
   - fix for SEV debugging
   - fix incorrect return code
   - fix for 'noapic' with PIC in userspace and LAPIC in kernel
   - fix for 5-level paging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT
  KVM: x86: Fix split-irqchip vs interrupt injection window request
  KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint
  MAINTAINERS: Update email address for Sean Christopherson
  MAINTAINERS: add uv.c also to KVM/s390
  s390/uv: handle destroy page legacy interface
  KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace
  KVM: SVM: fix error return code in svm_create_vcpu()
  KVM: SVM: Fix offset computation bug in __sev_dbg_decrypt().
  KVM: arm64: Correctly align nVHE percpu data
  KVM: s390: remove diag318 reset code
  KVM: s390: pv: Mark mm as protected after the set secure parameters and improve cleanup
2020-11-27 11:04:13 -08:00
Linus Torvalds
6adf33a5e4 iommu fixes for -rc6
- Fix intel iommu driver when running on devices without VCCAP_REG
 
 - Fix swiotlb and "iommu=pt" interaction under TXT (tboot)
 
 - Fix missing return value check during device probe()
 
 - Fix probe ordering for Qualcomm SMMU implementation
 
 - Ensure page-sized mappings are used for AMD IOMMU buffers with SNP RMP
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl/A3msQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAI6B/9/MLjurPmQrSusq9Y7dhnGR7aahtICLAwR
 UDZHebGTeVxNqtTklQVl/1qgcY7DxU40LAO1777MM7eHOW1FnlCAWrkTo6BBQsGx
 U1FKegOJ/0eVHFPtFDfM7IA2skeLwlZW+hywNLAksme5mtd6iZG9yQLlDFqAjxL7
 v8uXgfHFn6Z2MvMc2O+IeKTtflIwPek/6rYuaEf7UknA6ZYPAD3hnu9i1RTEuUAA
 h2PoVrrJ/KefEsCMUIq2jwMTsSvxohDH8ClGK9b6h74J2CLKKuhALSgABRAmsEL9
 7w5TVdtMQ85n3ccnXyT4RBQ+O/eVtmsKSdfeAbFI4nG9e9j7YE1L
 =BI1u
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull iommu fixes from Will Deacon:
 "Here's another round of IOMMU fixes for -rc6 consisting mainly of a
  bunch of independent driver fixes. Thomas agreed for me to take the
  x86 'tboot' fix here, as it fixes a regression introduced by a vt-d
  change.

   - Fix intel iommu driver when running on devices without VCCAP_REG

   - Fix swiotlb and "iommu=pt" interaction under TXT (tboot)

   - Fix missing return value check during device probe()

   - Fix probe ordering for Qualcomm SMMU implementation

   - Ensure page-sized mappings are used for AMD IOMMU buffers with SNP
     RMP"

* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  iommu/vt-d: Don't read VCCAP register unless it exists
  x86/tboot: Don't disable swiotlb when iommu is forced on
  iommu: Check return of __iommu_attach_device()
  arm-smmu-qcom: Ensure the qcom_scm driver has finished probing
  iommu/amd: Enforce 4k mapping for certain IOMMU data structures
2020-11-27 10:41:19 -08:00
Paolo Bonzini
8cce12b3c8 KVM: nSVM: set fixed bits by hand
SVM generally ignores fixed-1 bits.  Set them manually so that we
do not end up by mistake without those bits set in struct kvm_vcpu;
it is part of userspace API that KVM always returns value with the
bits set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-27 12:46:36 -05:00
Gabriele Paoloni
25bc65d8dd x86/mce: Do not overwrite no_way_out if mce_end() fails
Currently, if mce_end() fails, no_way_out - the variable denoting
whether the machine can recover from this MCE - is determined by whether
the worst severity that was found across the MCA banks associated with
the current CPU, is of panic severity.

However, at this point no_way_out could have been already set by
mca_start() after looking at all severities of all CPUs that entered the
MCE handler. If mce_end() fails, check first if no_way_out is already
set and, if so, stick to it, otherwise use the local worst value.

 [ bp: Massage. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
2020-11-27 17:38:36 +01:00
Vitaly Kuznetsov
9a2a0d3ca1 kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT
Commit 95fb5b0258 ("kvm: x86/mmu: Support MMIO in the TDP MMU") caused
the following WARNING on an Intel Ice Lake CPU:

 get_mmio_spte: detect reserved bits on spte, addr 0xb80a0, dump hierarchy:
 ------ spte 0xb80a0 level 5.
 ------ spte 0xfcd210107 level 4.
 ------ spte 0x1004c40107 level 3.
 ------ spte 0x1004c41107 level 2.
 ------ spte 0x1db00000000b83b6 level 1.
 WARNING: CPU: 109 PID: 10254 at arch/x86/kvm/mmu/mmu.c:3569 kvm_mmu_page_fault.cold.150+0x54/0x22f [kvm]
...
 Call Trace:
  ? kvm_io_bus_get_first_dev+0x55/0x110 [kvm]
  vcpu_enter_guest+0xaa1/0x16a0 [kvm]
  ? vmx_get_cs_db_l_bits+0x17/0x30 [kvm_intel]
  ? skip_emulated_instruction+0xaa/0x150 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0xca/0x520 [kvm]

The guest triggering this crashes. Note, this happens with the traditional
MMU and EPT enabled, not with the newly introduced TDP MMU. Turns out,
there was a subtle change in the above mentioned commit. Previously,
walk_shadow_page_get_mmio_spte() was setting 'root' to 'iterator.level'
which is returned by shadow_walk_init() and this equals to
'vcpu->arch.mmu->shadow_root_level'. Now, get_mmio_spte() sets it to
'int root = vcpu->arch.mmu->root_level'.

The difference between 'root_level' and 'shadow_root_level' on CPUs
supporting 5-level page tables is that in some case we don't want to
use 5-level, in particular when 'cpuid_maxphyaddr(vcpu) <= 48'
kvm_mmu_get_tdp_level() returns '4'. In case upper layer is not used,
the corresponding SPTE will fail '__is_rsvd_bits_set()' check.

Revert to using 'shadow_root_level'.

Fixes: 95fb5b0258 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201126110206.2118959-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-27 11:14:27 -05:00
Paolo Bonzini
71cc849b70 KVM: x86: Fix split-irqchip vs interrupt injection window request
kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
a hodge-podge of conditions, hacked together to get something that
more or less works.  But what is actually needed is much simpler;
in both cases the fundamental question is, do we have a place to stash
an interrupt if userspace does KVM_INTERRUPT?

In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
unnecessarily restrictive.

In split irqchip mode it's a bit more complicated, we need to check
kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
cycle and thus requires ExtINTs not to be masked) as well as
!pending_userspace_extint(vcpu).  However, there is no need to
check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
pending ExtINT state separate from event injection state, and checking
kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
priority than APIC interrupts.  In fact the latter fixes a bug:
when userspace requests an IRQ window vmexit, an interrupt in the
local APIC can cause kvm_cpu_has_interrupt() to be true and thus
kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
happens, vcpu_run does not exit to userspace but the interrupt window
vmexits keep occurring.  The VM loops without any hope of making progress.

Once we try to fix these with something like

     return kvm_arch_interrupt_allowed(vcpu) &&
-        !kvm_cpu_has_interrupt(vcpu) &&
-        !kvm_event_needs_reinjection(vcpu) &&
-        kvm_cpu_accept_dm_intr(vcpu);
+        (!lapic_in_kernel(vcpu)
+         ? !vcpu->arch.interrupt.injected
+         : (kvm_apic_accept_pic_intr(vcpu)
+            && !pending_userspace_extint(v)));

we realize two things.  First, thanks to the previous patch the complex
conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
window request in vcpu_enter_guest()

        bool req_int_win =
                dm_request_for_irq_injection(vcpu) &&
                kvm_cpu_accept_dm_intr(vcpu);

should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
it is unnecessary to ask the processor for an interrupt window
if we would not be able to return to userspace.  Therefore,
kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
ANDed with the existing check for masked ExtINT.  It all makes sense:

- we can accept an interrupt from userspace if there is a place
  to stash it (and, for irqchip split, ExtINTs are not masked).
  Interrupts from userspace _can_ be accepted even if right now
  EFLAGS.IF=0.

- in order to tell userspace we will inject its interrupt ("IRQ
  window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
  KVM and the vCPU need to be ready to accept the interrupt.

... and this is what the patch implements.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
2020-11-27 09:27:28 -05:00
Paolo Bonzini
72c3bcdcda KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint
Centralize handling of interrupts from the userspace APIC
in kvm_cpu_has_extint and kvm_cpu_get_extint, since
userspace APIC interrupts are handled more or less the
same as ExtINTs are with split irqchip.  This removes
duplicated code from kvm_cpu_has_injectable_intr and
kvm_cpu_has_interrupt, and makes the code more similar
between kvm_cpu_has_{extint,interrupt} on one side
and kvm_cpu_get_{extint,interrupt} on the other.

Cc: stable@vger.kernel.org
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-27 09:19:52 -05:00
Alex Shi
638920a66a x86/PCI: Make a kernel-doc comment a normal one
The comment is using kernel-doc markup but that comment isn't a
kernel-doc comment so make it a normal one to avoid:

  arch/x86/pci/i386.c:373: warning: Function parameter or member \
	  'pcibios_assign_resources' not described in 'fs_initcall'

 [ bp: Massage and fixup comment while at it. ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1605257895-5536-5-git-send-email-alex.shi@linux.alibaba.com
2020-11-27 13:43:09 +01:00
Ingo Molnar
a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Peter Zijlstra
20c7775aec Merge remote-tracking branch 'origin/master' into perf/core
Further perf/core patches will depend on:

  d3f7b1bb20 ("mm/gup: fix gup_fast with dynamic page table folding")

which is already in Linus' tree.
2020-11-26 13:16:55 +01:00
Sean Christopherson
8539d3f067 x86/asm: Drop unused RDPID macro
Drop the GAS-compatible RDPID macro. RDPID is unsafe in the kernel
because KVM loads guest's TSC_AUX on VM-entry and may not restore the
host's value until the CPU returns to userspace.

See

  6a3ea3e68b ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM")

for details.

It can always be resurrected from git history, if needed.

 [ bp: Massage commit message. ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201027214532.1792-1-sean.j.christopherson@intel.com
2020-11-26 12:58:56 +01:00
Justin Ernst
9a3c425cfd x86/platform/uv: Add and export uv_bios_* functions
Add additional uv_bios_call() variant functions to expose information
needed by the new uv_sysfs driver. This includes the addition of several
new data types defined by UV BIOS and used in the new functions.

Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201125175444.279074-3-justin.ernst@hpe.com
2020-11-26 12:50:44 +01:00
Justin Ernst
8f061abbf5 x86/platform/uv: Remove existing /sys/firmware/sgi_uv/ interface
Remove existing interface at /sys/firmware/sgi_uv/, created by
arch/x86/platform/uv/uv_sysfs.c

This interface includes:
/sys/firmware/sgi_uv/coherence_id
/sys/firmware/sgi_uv/partition_id

Both coherence_id and partition_id will be re-introduced via a
new uv_sysfs driver.

Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201125175444.279074-2-justin.ernst@hpe.com
2020-11-26 12:08:17 +01:00
Anand K Mistry
33fc379df7 x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command
line, IBPB is force-enabled and STIPB is conditionally-enabled (or not
available).

However, since

  21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")

the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP}
instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour.
Because the issuing of IBPB relies on the switch_mm_*_ibpb static
branches, the mitigations behave as expected.

Since

  1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")

this discrepency caused the misreporting of IB speculation via prctl().

On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb,
prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL |
PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are
always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB
speculation mode, even though the flag is ignored.

Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should
also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not
available.

 [ bp: Massage commit message. ]

Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Fixes: 1978b3a53a ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
2020-11-25 20:17:09 +01:00
Lu Baolu
e2be2a833a x86/tboot: Don't disable swiotlb when iommu is forced on
After commit 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA
domain"), swiotlb could also be used for direct memory access if IOMMU
is enabled but a device is configured to pass through the DMA translation.
Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work
if "iommu=pt" kernel parameter is used.

Fixes: 327d5b2fee ("iommu/vt-d: Allow 32bit devices to uses DMA domain")
Reported-and-tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201125014124.4070776-1-baolu.lu@linux.intel.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210237
Signed-off-by: Will Deacon <will@kernel.org>
2020-11-25 12:07:32 +00:00
Peter Zijlstra
545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Peter Zijlstra
58c644ba51 sched/idle: Fix arch_cpu_idle() vs tracing
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.

(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)

Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
2020-11-24 16:47:35 +01:00
Thomas Gleixner
7e015a2798 x86/crashdump/32: Simplify copy_oldmem_page()
Replace kmap_atomic_pfn() with kmap_local_pfn() which is preemptible and
can take page faults.

Remove the indirection of the dump page and the related cruft which is not
longer required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.670851839@linutronix.de
2020-11-24 14:42:09 +01:00
Thomas Gleixner
14df326702 x86: Support kmap_local() forced debugging
kmap_local() and related interfaces are NOOPs on 64bit and only create
temporary fixmaps for highmem pages on 32bit. That means the test coverage
for this code is pretty small.

CONFIG_KMAP_LOCAL can be enabled independent from CONFIG_HIGHMEM, which
allows to provide support for enforced kmap_local() debugging even on
64bit.

For 32bit the support is unconditional, for 64bit it's only supported when
CONFIG_NR_CPUS <= 4096 as supporting it for 8192 CPUs would require to set
up yet another fixmap PGT.

If CONFIG_KMAP_LOCAL_FORCE_DEBUG is enabled then kmap_local()/kmap_atomic()
will use the temporary fixmap mapping path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.169209557@linutronix.de
2020-11-24 14:42:09 +01:00
Xiaochen Shen
7589992469 x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
On resource group creation via a mkdir an extra kernfs_node reference is
obtained by kernfs_get() to ensure that the rdtgroup structure remains
accessible for the rdtgroup_kn_unlock() calls where it is removed on
deletion. Currently the extra kernfs_node reference count is only
dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup
structure is removed in a few other locations that lack the matching
reference drop.

In call paths of rmdir and umount, when a control group is removed,
kernfs_remove() is called to remove the whole kernfs nodes tree of the
control group (including the kernfs nodes trees of all child monitoring
groups), and then rdtgroup structure is freed by kfree(). The rdtgroup
structures of all child monitoring groups under the control group are
freed by kfree() in free_all_child_rdtgrp().

Before calling kfree() to free the rdtgroup structures, the kernfs node
of the control group itself as well as the kernfs nodes of all child
monitoring groups still take the extra references which will never be
dropped to 0 and the kernfs nodes will never be freed. It leads to
reference count leak and kernfs_node_cache memory leak.

For example, reference count leak is observed in these two cases:
  (1) mount -t resctrl resctrl /sys/fs/resctrl
      mkdir /sys/fs/resctrl/c1
      mkdir /sys/fs/resctrl/c1/mon_groups/m1
      umount /sys/fs/resctrl

  (2) mkdir /sys/fs/resctrl/c1
      mkdir /sys/fs/resctrl/c1/mon_groups/m1
      rmdir /sys/fs/resctrl/c1

The same reference count leak issue also exists in the error exit paths
of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon().

Fix this issue by following changes to make sure the extra kernfs_node
reference on rdtgroup is dropped before freeing the rdtgroup structure.
  (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up
  kernfs_put() and kfree().

  (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup
  structure is about to be freed by kfree().

  (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error
  exit paths of mkdir where an extra reference is taken by kernfs_get().

Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
2020-11-24 12:13:37 +01:00
Xiaochen Shen
fd8d9db355 x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
Willem reported growing of kernfs_node_cache entries in slabtop when
repeatedly creating and removing resctrl subdirectories as well as when
repeatedly mounting and unmounting the resctrl filesystem.

On resource group (control as well as monitoring) creation via a mkdir
an extra kernfs_node reference is obtained to ensure that the rdtgroup
structure remains accessible for the rdtgroup_kn_unlock() calls where it
is removed on deletion. The kernfs_node reference count is dropped by
kernfs_put() in rdtgroup_kn_unlock().

With the above explaining the need for one kernfs_get()/kernfs_put()
pair in resctrl there are more places where a kernfs_node reference is
obtained without a corresponding release. The excessive amount of
reference count on kernfs nodes will never be dropped to 0 and the
kernfs nodes will never be freed in the call paths of rmdir and umount.
It leads to reference count leak and kernfs_node_cache memory leak.

Remove the superfluous kernfs_get() calls and expand the existing
comments surrounding the remaining kernfs_get()/kernfs_put() pair that
remains in use.

Superfluous kernfs_get() calls are removed from two areas:

  (1) In call paths of mount and mkdir, when kernfs nodes for "info",
  "mon_groups" and "mon_data" directories and sub-directories are
  created, the reference count of newly created kernfs node is set to 1.
  But after kernfs_create_dir() returns, superfluous kernfs_get() are
  called to take an additional reference.

  (2) kernfs_get() calls in rmdir call paths.

Fixes: 17eafd0762 ("x86/intel_rdt: Split resource group removal in two")
Fixes: 4af4a88e0c ("x86/intel_rdt/cqm: Add mount,umount support")
Fixes: f3cbeacaa0 ("x86/intel_rdt/cqm: Add rmdir support")
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Fixes: c7d9aac613 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring")
Fixes: 5dc1d5c6ba ("x86/intel_rdt: Simplify info and base file lists")
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Fixes: 4e978d06de ("x86/intel_rdt: Add "info" files to resctrl file system")
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
2020-11-24 12:03:04 +01:00
Borislav Petkov
afe76eca86 x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
Fix

  ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Function parameter or member \
	  'encl' not described in 'sgx_ioc_enclave_provision'
  ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Excess function parameter \
	  'enclave' description in 'sgx_ioc_enclave_provision'

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201123181922.0c009406@canb.auug.org.au
2020-11-24 10:46:01 +01:00
Peter Collingbourne
23acdc76f1 signal: clear non-uapi flag bits when passing/returning sa_flags
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.

This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.

This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.

Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:05 -06:00
Peter Collingbourne
1d82b7898f arch: move SA_* definitions to generic headers
Most architectures with the exception of alpha, mips, parisc and
sparc use the same values for these flags. Move their definitions into
asm-generic/signal-defs.h and allow the architectures with non-standard
values to override them. Also, document the non-standard flag values
in order to make it easier to add new generic flags in the future.

A consequence of this change is that on powerpc and x86, the constants'
values aside from SA_RESETHAND change signedness from unsigned
to signed. This is not expected to impact realistic use of these
constants. In particular the typical use of the constants where they
are or'ed together and assigned to sa_flags (or another int variable)
would not be affected.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ia3849f18b8009bf41faca374e701cdca36974528
Link: https://lkml.kernel.org/r/b6d0d1ec34f9ee93e1105f14f288fba5f89d1f24.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:05 -06:00
Linus Torvalds
48da330589 A single fix for the x86 perf sysfs interfaces which used kobject
attributes instead of device attributes and therefore making Clangs control
 flow integrity checker upset.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+6diATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaVKD/9pbxJEopP0ojZx8MbGVxPn1dzgii6k
 YgWOemMPGV/hi87pCSah1RcNVY8aMwMUBnzMJEhx9bCYFMxCummM7wcoN7bJneWU
 f548CesgeKP662vFYjAXrsASX0DZAUUQm/MTnkLqVlD11yaqudxKEjEYNi2jVehq
 7tTip8Eb88w5dyQJLN7sjZuV9AdZCiXeiyGgYb3jVjfdfTuq4KRp6XdL9SghEfwV
 3qDOrcBuLKgQbz+eE2C1tVX+/kSNfG1stNAWkENHBWfrnmbeCt7YdT6Gd6bays0b
 1Ul0LgZwXbDdkQrtmoEGtI8j/Q+yiozeJY7+eeh/Z9nw0eviZ8ir36y96wRukOqw
 od+EjENiF1hIJk21r6mIuQoMyezHraIMgg0WF4/YT1XRGqwR1wqgHpmwqMjmem0Y
 Kuab7aYC4g8vGk9ah2/FmTnTHooEyk8GyxnVXO+aSdLbFI04L7HBDILvgvpHPyCS
 BnUDZE785W8yLydhKslfy92kVQGKuGaSqQUQcHD+c/L7m2AexoKrp1RJgA0gXtDj
 ilmE7KNCCbeNXt3cDep6BQzymw9YStPNSPBSv4dK73DM/Y2qKBk3RD0BfMqdCkOy
 sBqUCWRY1/q6Z+64zwDlER5Fn+KflE0ihobTxhA6m0yx7AOn9BoMDWfo3xGKw6Mn
 nhVDpafF8YuVIw==
 =xsma
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Thomas Gleixner:
 "A single fix for the x86 perf sysfs interfaces which used kobject
  attributes instead of device attributes and therefore making clang's
  control flow integrity checker upset"

* tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: fix sysfs type mismatches
2020-11-22 13:23:43 -08:00
Linus Torvalds
68d3fa235f Couple of EFI fixes for v5.10:
- fix memory leak in efivarfs driver
 - fix HYP mode issue in 32-bit ARM version of the EFI stub when built in
   Thumb2 mode
 - avoid leaking EFI pgd pages on allocation failure
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl+tRcoACgkQwjcgfpV0
 +n1LZQf/af1p9A0zT1nC3IrcaABceJDnD3dJuV9SD6QhFuD2Dw/Mshr+MVzDsO+3
 btvuu0r4CzQ5ajfpfcGcvBFFWbbPTwKvWQe++9Unwoz5acw7hpV5yxNwMivdaJEh
 3o4pkgpCmwtliTwiroDficO7Vlqefqf4LZd7/iQYQTnuPK3waYQBjwp9t2D9tlx7
 kXiEQDP2BDRCUrKEjlR7AHTZ156mw+UsiquAuxMCGTKBqwiELEEV6aPseqa5MmNV
 RDV1IXWdhOQyQfzg0s6vTzwGeN0JubSxHng6O9UbE+tctz4EqaaHIEsRuMBq8oLD
 Y8JeGp1ovypTJxeLE5t6eEzsfvTRsg==
 =GnmM
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Borislav Petkov:
 "Forwarded EFI fixes from Ard Biesheuvel:

   - fix memory leak in efivarfs driver

   - fix HYP mode issue in 32-bit ARM version of the EFI stub when built
     in Thumb2 mode

   - avoid leaking EFI pgd pages on allocation failure"

* tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/x86: Free efi_pgd with free_pages()
  efivarfs: fix memory leak in efivarfs_create()
  efi/arm: set HSCTLR Thumb2 bit correctly for HVC calls from HYP
2020-11-22 13:05:48 -08:00
Linus Torvalds
7d53be55c9 * An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
same because the proper one is going through the IOMMU tree. (Thomas Gleixner)
 
 * An Intel microcode loader fix to save the correct microcode patch to
   apply during resume. (Chen Yu)
 
 * A fix to not access user memory of other processes when dumping opcode
   bytes. (Thomas Gleixner)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+6QnYACgkQEsHwGGHe
 VUpMgg//cE4GvK7dFihq5iNYS16yKKgyVBLwHiHoCxQkOu4CPSf0EejTMLKPSKpA
 hS7pJm6dMYLJ7PhIKRwlQD1NYet3SHCOi1h0Y2/3rVay82Vz/As2nZRd7paEkP9k
 kiNd/Ib2uAQ9ZwCKhmLanOWDoXXFo5Fwmpw+O/zhn31Y3kBx05mc2ojXnEwCERK/
 j6GwOBkfOq+q6D7HzsfEehlX0iy2rlPSq6OUN1H/Z4w+W9zfpubLX9bhhr9RvDmS
 6Hsmwos8fM/cQXZC2oxoKUP6zeYk5lDF/Qf7hOFt3WgdCpTZP2z6uLwUL6+NXu/C
 Ms+77Cx+Loe8UaDkyxXXruiv7mZIZCnl0phZulE4ibOjzL7W7w5cEf0LHdDS7uDL
 JzZwTJlbK9RjrSp4pXR+rn+U/PdluZJYpjh/Gj8+V23QszvbDsVd55SV5+CLSTM/
 qwrNp5ffLEnIFHmW98/OQKMvNTTzBFdlltnJm7mMW2JucgfCZMx4zwY0+dZ+ocDq
 /Kt/HNCzez1jHWiS5jdZCM15KuIhPTNuTpq0ipU+UAXxZOaNtTo1st1lhKoPvTz/
 k4y7S/EibX6PK4lxBQMM3fBG/T1KK7ZMgNEN0jbhWegqEmdWtgwc57p1gVE70XUU
 f6elyUupD65TMwgio2I8sQuiNBCKfF2Wj144dxXGJnGnOoZKey8=
 =z/Gq
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
   same because the proper one is going through the IOMMU tree (Thomas
   Gleixner)

 - An Intel microcode loader fix to save the correct microcode patch to
   apply during resume (Chen Yu)

 - A fix to not access user memory of other processes when dumping
   opcode bytes (Thomas Gleixner)

* tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
  x86/dumpstack: Do not try to access user space code of other tasks
  x86/microcode/intel: Check patch signature before saving microcode for early loading
  iommu/vt-d: Take CONFIG_PCI_ATS into account
2020-11-22 12:55:50 -08:00
Dan Williams
a927bd6ba9 mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid().  That
symbol is exported for modules.  However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:

	CONFIG_NUMA_KEEP_MEMINFO=y
	CONFIG_MEMORY_HOTPLUG=y

...and:

	CONFIG_NUMA_KEEP_MEMINFO=n
	CONFIG_MEMORY_HOTPLUG=y

...it failed to export the symbol in the case of:

	CONFIG_NUMA_KEEP_MEMINFO=y
	CONFIG_MEMORY_HOTPLUG=n

Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.

Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols.  Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.

The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h.  In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h.  An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.

The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

[dan.j.williams@intel.com: v4]
  Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
  Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

Fixes: a035b6bf86 ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22 10:48:22 -08:00
Smita Koralahalli
4a24d80b8c x86/mce, cper: Pass x86 CPER through the MCA handling chain
The kernel uses ACPI Boot Error Record Table (BERT) to report fatal
errors that occurred in a previous boot. The MCA errors in the BERT are
reported using the x86 Processor Error Common Platform Error Record
(CPER) format. Currently, the record prints out the raw MSR values and
AMD relies on the raw record to provide MCA information.

Extract the raw MSR values of MCA registers from the BERT and feed them
into mce_log() to decode them properly.

The implementation is SMCA-specific as the raw MCA register values are
given in the register offset order of the SMCA address space.

 [ bp: Massage. ]

[ Fix a build breakage in patch v1. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20201119182938.151155-1-Smita.KoralahalliChannabasappa@amd.com
2020-11-21 12:05:41 +01:00
Uros Bizjak
ab09b58e4b x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
Use TEST %reg,%reg which sets the zero flag in the same way as CMP
$0,%reg, but the encoding uses one byte less.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20201029160258.139216-1-ubizjak@gmail.com
2020-11-21 10:26:25 +01:00
Kees Cook
25db91209a x86: Enable seccomp architecture tracking
Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/da58c3733d95c4f2115dd94225dfbe2573ba4d87.1602431034.git.yifeifz2@illinois.edu
2020-11-20 11:16:34 -08:00
Linus Torvalds
4ccf7a01e8 xen: branch for v5.10-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX7dYpAAKCRCAXGG7T9hj
 vqRpAQDiOASJZyPTwbB3kW7EPIyMz5kZgeJUyX4BEBUOR0g3CAEA/Qz9ZE9h9h3l
 YIXs9LDGQJ0pOl04wOzjte57AjzGoAM=
 =B9Bu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.10b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fix from Juergen Gross:
 "A single fix for avoiding WARN splats when booting a Xen guest with
  nosmt"

* tag 'for-linus-5.10b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: don't unbind uninitialized lock_kicker_irq
2020-11-20 10:30:48 -08:00
Linus Torvalds
fc8299f9f3 iommu fixes for -rc5
- Fix boot when intel iommu initialisation fails under TXT (tboot)
 
 - Fix intel iommu compilation error when DMAR is enabled without ATS
 
 - Temporarily update IOMMU MAINTAINERs entry
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+3p6gQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNPowCADBKq5PBRwqIM1tmntRu5rePf/WuB1BotlJ
 cUUR8dIxFDvKMeGvN4mDbn/5jjL1+TlD7sQqFh+gcWJ9tImpAMzp1ENbg71isl5o
 Ej21X9YhjfiIsKpTrNGxcetCkYaR2tp8Z4/WzSQaO+IGB578dQ9VwiwSnDyFdWUb
 1Qcj712ainYvKjbq4NyL80lPm9v43OV8QZ73WeelzBjE2UcDFAV0Xl4BIX8uuGtv
 I1x03JfgzmzhVBmpj5HUGPsSopMDdo5K4vlcqscqSqvBY90iHD5fgRF4ZatTjR1t
 PVM/1+c9vdJIWSxSt1hsB8DPtqXVOvGjzrLW/h63rxyWsISwkAq0
 =DNX/
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull iommu fixes from Will Deacon:
 "Two straightforward vt-d fixes:

   - Fix boot when intel iommu initialisation fails under TXT (tboot)

   - Fix intel iommu compilation error when DMAR is enabled without ATS

  and temporarily update IOMMU MAINTAINERs entry"

* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  MAINTAINERS: Temporarily add myself to the IOMMU entry
  iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set
  iommu/vt-d: Avoid panic if iommu init fails in tboot system
2020-11-20 10:20:16 -08:00
Wang Qing
61b39ad9a7 x86/head64: Remove duplicate include
Remove duplicate header include.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1604893542-20961-1-git-send-email-wangqing@vivo.com
2020-11-20 17:43:15 +01:00
Lukas Bulwahn
bab202ab87 x86/mm: Declare 'start' variable where it is used
It is not required to initialize the local variable start in
memory_map_top_down(), as the variable will be initialized in any path
before it is used.

make clang-analyzer on x86_64 tinyconfig reports:

  arch/x86/mm/init.c:612:15: warning: Although the value stored to 'start' \
  is used in the enclosing expression, the value is never actually read \
  from 'start' [clang-analyzer-deadcode.DeadStores]

Move the variable declaration into the loop, where it is used.

No code changed:

  # arch/x86/mm/init.o:

   text    data     bss     dec     hex filename
   7105    1424   26768   35297    89e1 init.o.before
   7105    1424   26768   35297    89e1 init.o.after

md5:
   a8d76c1bb5fce9cae251780a7ee7730f  init.o.before.asm
   a8d76c1bb5fce9cae251780a7ee7730f  init.o.after.asm

 [ bp: Massage. ]

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200928100004.25674-1-lukas.bulwahn@gmail.com
2020-11-20 12:49:00 +01:00
Eric Biggers
a24d22b225 crypto: sha - split sha.h into sha1.h and sha2.h
Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.

This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure.  So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.

Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.

This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1.  It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-11-20 14:45:33 +11:00
Paul E. McKenney
7fc91fc845 Merge branches 'cpuinfo.2020.11.06a', 'doc.2020.11.06a', 'fixes.2020.11.19b', 'lockdep.2020.11.02a', 'tasks.2020.11.06a' and 'torture.2020.11.06a' into HEAD
cpuinfo.2020.11.06a: Speedups for /proc/cpuinfo.
doc.2020.11.06a: Documentation updates.
fixes.2020.11.19b: Miscellaneous fixes.
lockdep.2020.11.02a: Lockdep-RCU updates to avoid "unused variable".
tasks.2020.11.06a: Tasks-RCU updates.
torture.2020.11.06a': Torture-test updates.
2020-11-19 19:37:47 -08:00
Paul E. McKenney
29368e0939 x86/smpboot: Move rcu_cpu_starting() earlier
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:

=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x77/0x97
 __is_insn_slot_addr+0x15d/0x170
 kernel_text_address+0xba/0xe0
 ? get_stack_info+0x22/0xa0
 __kernel_text_address+0x9/0x30
 show_trace_log_lvl+0x17d/0x380
 ? dump_stack+0x77/0x97
 dump_stack+0x77/0x97
 __lock_acquire+0xdf7/0x1bf0
 lock_acquire+0x258/0x3d0
 ? vprintk_emit+0x6d/0x2c0
 _raw_spin_lock+0x27/0x40
 ? vprintk_emit+0x6d/0x2c0
 vprintk_emit+0x6d/0x2c0
 printk+0x4d/0x69
 start_secondary+0x1c/0x100
 secondary_startup_64_no_verify+0xb8/0xbb

This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function.  Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Rikard Falkeborn
2002d29513 x86/resctrl: Constify kernfs_ops
The only usage of the kf_ops field in the rftype struct is to pass
it as argument to __kernfs_create_file(), which accepts a pointer to
const. Make it a pointer to const. This makes it possible to make
rdtgroup_kf_single_ops and kf_mondata_ops const, which allows the
compiler to put them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201110230228.801785-1-rikard.falkeborn@gmail.com
2020-11-19 18:23:45 +01:00
Ben Gardon
b9a98c3437 kvm: x86/mmu: Add TDP MMU SPTE changed trace point
Add an extremely verbose trace point to the TDP MMU to log all SPTE
changes, regardless of callstack / motivation. This is useful when a
complete picture of the paging structure is needed or a change cannot be
explained with the other, existing trace points.

Tested: ran the demand paging selftest on an Intel Skylake machine with
	all the trace points used by the TDP MMU enabled and observed
	them firing with expected values.

This patch can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/3813

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20201027175944.1183301-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-19 10:57:16 -05:00
Ben Gardon
33dd3574f5 kvm: x86/mmu: Add existing trace points to TDP MMU
The TDP MMU was initially implemented without some of the usual
tracepoints found in mmu.c. Correct this discrepancy by adding the
missing trace points to the TDP MMU.

Tested: ran the demand paging selftest on an Intel Skylake machine with
	all the trace points used by the TDP MMU enabled and observed
	them firing with expected values.

This patch can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/3812

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20201027175944.1183301-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-19 10:57:16 -05:00
Yazen Ghannam
cb09a37972 x86/topology: Set cpu_die_id only if DIE_TYPE found
CPUID Leaf 0x1F defines a DIE_TYPE level (nb: ECX[8:15] level type == 0x5),
but CPUID Leaf 0xB does not. However, detect_extended_topology() will
set struct cpuinfo_x86.cpu_die_id regardless of whether a valid Die ID
was found.

Only set cpu_die_id if a DIE_TYPE level is found. CPU topology code may
use another value for cpu_die_id, e.g. the AMD NodeId on AMD-based
systems. Code ordering should be maintained so that the CPUID Leaf 0x1F
Die ID value will take precedence on systems that may use another value.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-5-Yazen.Ghannam@amd.com
2020-11-19 11:43:25 +01:00
Yazen Ghannam
db970bd231 x86/CPU/AMD: Remove amd_get_nb_id()
The Last Level Cache ID is returned by amd_get_nb_id(). In practice,
this value is the same as the AMD NodeId for callers of this function.
The NodeId is saved in struct cpuinfo_x86.cpu_die_id.

Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and
remove the function.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
2020-11-19 11:43:17 +01:00
Yazen Ghannam
028c221ed1 x86/CPU/AMD: Save AMD NodeId as cpu_die_id
AMD systems provide a "NodeId" value that represents a global ID
indicating to which "Node" a logical CPU belongs. The "Node" is a
physical structure equivalent to a Die, and it should not be confused
with logical structures like NUMA nodes. Logical nodes can be adjusted
based on firmware or other settings whereas the physical nodes/dies are
fixed based on hardware topology.

The NodeId value can be used when a physical ID is needed by software.

Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value
from CPUID or MSR as appropriate. Default to phys_proc_id otherwise.
Do so for both AMD and Hygon systems.

Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no
longer needed.

Update the x86 topology documentation.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com
2020-11-19 11:43:13 +01:00
Frederic Weisbecker
d1f250e220 x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
A lot of ground work has been performed on x86 entry code. Fragile path
between user_enter() and user_exit() have IRQs disabled. Uses of RCU and
intrumentation in these fragile areas have been explicitly annotated
and protected.

This architecture doesn't need exception_enter()/exception_exit()
anymore and has therefore earned CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-6-frederic@kernel.org
2020-11-19 11:25:42 +01:00
Jarkko Sakkinen
14132a5b80 x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
Return -ERESTARTSYS instead of -EINTR in sgx_ioc_enclave_add_pages()
when interrupted before any pages have been processed. At this point
ioctl can be obviously safely restarted.

Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201118213932.63341-1-jarkko@kernel.org
2020-11-19 10:51:24 +01:00
Will Deacon
388255ce95 A small set of fixes for x86:
- Cure the fallout from the MSI irqdomain overhaul which missed that the
    Intel IOMMU does not register virtual function devices and therefore
    never reaches the point where the MSI interrupt domain is assigned. This
    makes the VF devices use the non-remapped MSI domain which is trapped by
    the IOMMU/remap unit.
 
  - Remove an extra space in the SGI_UV architecture type procfs output for
    UV5.
 
  - Remove a unused function which was missed when removing the UV BAU TLB
    shootdown handler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
 MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
 JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
 SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
 jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
 Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
 XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
 A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
 Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
 xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
 bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
 ZqkZbqCZm3vfHA==
 =X/P+
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/iommu/fixes

Pull in x86 fixes from Thomas, as they include a change to the Intel DMAR
code on which we depend:

* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iommu/vt-d: Cure VF irqdomain hickup
  x86/platform/uv: Fix copied UV5 output archtype
  x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-19 09:50:06 +00:00
Borislav Petkov
b023fd5f74 x86/msr: Downgrade unrecognized MSR message
It is a warning and not an error so use pr_warn().

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201118123806.19672-1-bp@alien8.de
2020-11-19 10:46:54 +01:00
David Woodhouse
d1adcfbb52 iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
The AMD IOMMU has two modes for generating its own interrupts.

The first is very much based on PCI MSI, and can be configured by Linux
precisely that way. But like legacy unmapped PCI MSI it's limited to
8 bits of APIC ID.

The second method does not use PCI MSI at all in hardawre, and instead
configures the INTCAPXT registers in the IOMMU directly with the APIC ID
and vector.

In the latter case, the IOMMU driver would still use pci_enable_msi(),
read back (through MMIO) the MSI message that Linux wrote to the PCI MSI
table, then swizzle those bits into the appropriate register.

Historically, this worked because__irq_compose_msi_msg() would silently
generate an invalid MSI message with the high bits of the APIC ID in the
high bits of the MSI address. That hack was intended only for the Intel
IOMMU, and I recently enforced that, introducing a warning in
__irq_msi_compose_msg() if it was invoked with an APIC ID above 255.

Fix the AMD IOMMU not to depend on that hack any more, by having its own
irqdomain and directly putting the bits from the irq_cfg into the right
place in its ->activate() method.

Fixes: 47bea873cf "x86/msi: Only use high bits of MSI address for DMAR unit")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/05e3a5ba317f5ff48d2f8356f19e617f8b9d23a4.camel@infradead.org
2020-11-18 20:55:59 +01:00
Dave Hansen
67655b57f8 x86/sgx: Clarify 'laundry_list' locking
Short Version:

The SGX section->laundry_list structure is effectively thread-local, but
declared next to some shared structures. Its semantics are clear as mud.
Fix that. No functional changes. Compile tested only.

Long Version:

The SGX hardware keeps per-page metadata. This can provide things like
permissions, integrity and replay protection. It also prevents things
like having an enclave page mapped multiple times or shared between
enclaves.

But, that presents a problem for kexec()'d kernels (or any other kernel
that does not run immediately after a hardware reset). This is because
the last kernel may have been rude and forgotten to reset pages, which
would trigger the "shared page" sanity check.

To fix this, the SGX code "launders" the pages by running the EREMOVE
instruction on all pages at boot. This is slow and can take a long
time, so it is performed off in the SGX-specific ksgxd instead of being
synchronous at boot. The init code hands the list of pages to launder in
a per-SGX-section list: ->laundry_list. The only code to touch this list
is the init code and ksgxd. This means that no locking is necessary for
->laundry_list.

However, a lock is required for section->page_list, which is accessed
while creating enclaves and by ksgxd. This lock (section->lock) is
acquired by ksgxd while also processing ->laundry_list. It is easy to
confuse the purpose of the locking as being for ->laundry_list and
->page_list.

Rename ->laundry_list to ->init_laundry_list to make it clear that this
is not normally used at runtime. Also add some comments clarifying the
locking, and reorganize 'sgx_epc_section' to put 'lock' near the things
it protects.

Note: init_laundry_list is 128 bytes of wasted space at runtime. It
could theoretically be dynamically allocated and then freed after
the laundering process. But it would take nearly 128 bytes of extra
instructions to do that.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201116222531.4834-1-dave.hansen@intel.com
2020-11-18 18:16:28 +01:00
Arvind Sankar
31d8546033 x86/head/64: Remove unused GET_CR2_INTO() macro
Commit

  4b47cdbda6 ("x86/head/64: Move early exception dispatch to C code")

removed the usage of GET_CR2_INTO().

Drop the definition as well, and related definitions in paravirt.h and
asm-offsets.h

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151208.2212886-3-nivedita@alum.mit.edu
2020-11-18 18:09:38 +01:00
Jarkko Sakkinen
947c6e11fa x86/sgx: Add ptrace() support for the SGX driver
Enclave memory is normally inaccessible from outside the enclave. This
makes enclaves hard to debug. However, enclaves can be put in a debug
mode when they are being built. In that mode, enclave data *can* be read
and/or written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions.

This is obviously only for debugging and destroys all the protections
present with normal enclaves. But, enclaves know their own debug status
and can adjust their behavior appropriately.

Add a vm_ops->access() implementation which can be used to read and write
memory inside debug enclaves.  This is typically used via ptrace() APIs.

 [ bp: Massage. ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-23-jarkko@kernel.org
2020-11-18 18:04:11 +01:00
Jarkko Sakkinen
1728ab54b4 x86/sgx: Add a page reclaimer
Just like normal RAM, there is a limited amount of enclave memory available
and overcommitting it is a very valuable tool to reduce resource use.
Introduce a simple reclaim mechanism for enclave pages.

In contrast to normal page reclaim, the kernel cannot directly access
enclave memory.  To get around this, the SGX architecture provides a set of
functions to help.  Among other things, these functions copy enclave memory
to and from normal memory, encrypting it and protecting its integrity in
the process.

Implement a page reclaimer by using these functions. Picks victim pages in
LRU fashion from all the enclaves running in the system.  A new kernel
thread (ksgxswapd) reclaims pages in the background based on watermarks,
similar to normal kswapd.

All enclave pages can be reclaimed, architecturally.  But, there are some
limits to this, such as the special SECS metadata page which must be
reclaimed last.  The page version array (used to mitigate replaying old
reclaimed pages) is also architecturally reclaimable, but not yet
implemented.  The end result is that the vast majority of enclave pages are
currently reclaimable.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-22-jarkko@kernel.org
2020-11-18 18:04:11 +01:00
Sean Christopherson
8466436952 x86/vdso: Implement a vDSO for Intel SGX enclave call
Enclaves encounter exceptions for lots of reasons: everything from enclave
page faults to NULL pointer dereferences, to system calls that must be
“proxied” to the kernel from outside the enclave.

In addition to the code contained inside an enclave, there is also
supporting code outside the enclave called an “SGX runtime”, which is
virtually always implemented inside a shared library.  The runtime helps
build the enclave and handles things like *re*building the enclave if it
got destroyed by something like a suspend/resume cycle.

The rebuilding has traditionally been handled in SIGSEGV handlers,
registered by the library.  But, being process-wide, shared state, signal
handling and shared libraries do not mix well.

Introduce a vDSO function call that wraps the enclave entry functions
(EENTER/ERESUME functions of the ENCLU instruciton) and returns information
about any exceptions to the caller in the SGX runtime.

Instead of generating a signal, the kernel places exception information in
RDI, RSI and RDX. The kernel-provided userspace portion of the vDSO handler
will place this information in a user-provided buffer or trigger a
user-provided callback at the time of the exception.

The vDSO function calling convention uses the standard RDI RSI, RDX, RCX,
R8 and R9 registers.  This makes it possible to declare the vDSO as a C
prototype, but other than that there is no specific support for SystemV
ABI. Things like storing XSAVE are the responsibility of the enclave and
the runtime.

 [ bp: Change vsgx.o build dependency to CONFIG_X86_SGX. ]

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Cedric Xing <cedric.xing@intel.com>
Signed-off-by: Cedric Xing <cedric.xing@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-20-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Sean Christopherson
334872a091 x86/traps: Attempt to fixup exceptions in vDSO before signaling
vDSO functions can now leverage an exception fixup mechanism similar to
kernel exception fixup.  For vDSO exception fixup, the initial user is
Intel's Software Guard Extensions (SGX), which will wrap the low-level
transitions to/from the enclave, i.e. EENTER and ERESUME instructions,
in a vDSO function and leverage fixup to intercept exceptions that would
otherwise generate a signal.  This allows the vDSO wrapper to return the
fault information directly to its caller, obviating the need for SGX
applications and libraries to juggle signal handlers.

Attempt to fixup vDSO exceptions immediately prior to populating and
sending signal information.  Except for the delivery mechanism, an
exception in a vDSO function should be treated like any other exception
in userspace, e.g. any fault that is successfully handled by the kernel
should not be directly visible to userspace.

Although it's debatable whether or not all exceptions are of interest to
enclaves, defer to the vDSO fixup to decide whether to do fixup or
generate a signal.  Future users of vDSO fixup, if there ever are any,
will undoubtedly have different requirements than SGX enclaves, e.g. the
fixup vs. signal logic can be made function specific if/when necessary.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-19-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Sean Christopherson
cd072dab45 x86/fault: Add a helper function to sanitize error code
vDSO exception fixup is a replacement for signals in limited situations.
Signals and vDSO exception fixup need to provide similar information to
userspace, including the hardware error code.

That hardware error code needs to be sanitized.  For instance, if userspace
accesses a kernel address, the error code could indicate to userspace
whether the address had a Present=1 PTE.  That can leak information about
the kernel layout to userspace, which is bad.

The existing signal code does this sanitization, but fairly late in the
signal process.  The vDSO exception code runs before the sanitization
happens.

Move error code sanitization out of the signal code and into a helper.
Call the helper in the signal code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-18-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Sean Christopherson
8382c668ce x86/vdso: Add support for exception fixup in vDSO functions
Signals are a horrid little mechanism.  They are especially nasty in
multi-threaded environments because signal state like handlers is global
across the entire process.  But, signals are basically the only way that
userspace can “gracefully” handle and recover from exceptions.

The kernel generally does not like exceptions to occur during execution.
But, exceptions are a fact of life and must be handled in some
circumstances.  The kernel handles them by keeping a list of individual
instructions which may cause exceptions.  Instead of truly handling the
exception and returning to the instruction that caused it, the kernel
instead restarts execution at a *different* instruction.  This makes it
obvious to that thread of execution that the exception occurred and lets
*that* code handle the exception instead of the handler.

This is not dissimilar to the try/catch exceptions mechanisms that some
programming languages have, but applied *very* surgically to single
instructions.  It effectively changes the visible architecture of the
instruction.

Problem
=======

SGX generates a lot of signals, and the code to enter and exit enclaves and
muck with signal handling is truly horrid.  At the same time, an approach
like kernel exception fixup can not be easily applied to userspace
instructions because it changes the visible instruction architecture.

Solution
========

The vDSO is a special page of kernel-provided instructions that run in
userspace.  Any userspace calling into the vDSO knows that it is special.
This allows the kernel a place to legitimately rewrite the user/kernel
contract and change instruction behavior.

Add support for fixing up exceptions that occur while executing in the
vDSO.  This replaces what could traditionally only be done with signal
handling.

This new mechanism will be used to replace previously direct use of SGX
instructions by userspace.

Just introduce the vDSO infrastructure.  Later patches will actually
replace signal generation with vDSO exception fixup.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-17-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Jarkko Sakkinen
c82c618650 x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
The whole point of SGX is to create a hardware protected place to do
“stuff”. But, before someone is willing to hand over the keys to
the castle , an enclave must often prove that it is running on an
SGX-protected processor. Provisioning enclaves play a key role in
providing proof.

There are actually three different enclaves in play in order to make this
happen:

1. The application enclave.  The familiar one we know and love that runs
   the actual code that’s doing real work.  There can be many of these on
   a single system, or even in a single application.
2. The quoting enclave  (QE).  The QE is mentioned in lots of silly
   whitepapers, but, for the purposes of kernel enabling, just pretend they
   do not exist.
3. The provisioning enclave.  There is typically only one of these
   enclaves per system.  Provisioning enclaves have access to a special
   hardware key.

   They can use this key to help to generate certificates which serve as
   proof that enclaves are running on trusted SGX hardware.  These
   certificates can be passed around without revealing the special key.

Any user who can create a provisioning enclave can access the
processor-unique Provisioning Certificate Key which has privacy and
fingerprinting implications. Even if a user is permitted to create
normal application enclaves (via /dev/sgx_enclave), they should not be
able to create provisioning enclaves. That means a separate permissions
scheme is needed to control provisioning enclave privileges.

Implement a separate device file (/dev/sgx_provision) which allows
creating provisioning enclaves. This device will typically have more
strict permissions than the plain enclave device.

The actual device “driver” is an empty stub.  Open file descriptors for
this device will represent a token which allows provisioning enclave duty.
This file descriptor can be passed around and ultimately given as an
argument to the /dev/sgx_enclave driver ioctl().

 [ bp: Touchups. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-security-module@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112220135.165028-16-jarkko@kernel.org
2020-11-18 18:02:50 +01:00
Jarkko Sakkinen
9d0c151b41 x86/sgx: Add SGX_IOC_ENCLAVE_INIT
Enclaves have two basic states. They are either being built and are
malleable and can be modified by doing things like adding pages. Or,
they are locked down and not accepting changes. They can only be run
after they have been locked down. The ENCLS[EINIT] function induces the
transition from being malleable to locked-down.

Add an ioctl() that performs ENCLS[EINIT]. After this, new pages can
no longer be added with ENCLS[EADD]. This is also the time where the
enclave can be measured to verify its integrity.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-15-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
c6d26d3707 x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
SGX enclave pages are inaccessible to normal software. They must be
populated with data by copying from normal memory with the help of the
EADD and EEXTEND functions of the ENCLS instruction.

Add an ioctl() which performs EADD that adds new data to an enclave, and
optionally EEXTEND functions that hash the page contents and use the
hash as part of enclave “measurement” to ensure enclave integrity.

The enclave author gets to decide which pages will be included in the
enclave measurement with EEXTEND. Measurement is very slow and has
sometimes has very little value. For instance, an enclave _could_
measure every page of data and code, but would be slow to initialize.
Or, it might just measure its code and then trust that code to
initialize the bulk of its data after it starts running.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-14-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
888d249117 x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
Add an ioctl() that performs the ECREATE function of the ENCLS
instruction, which creates an SGX Enclave Control Structure (SECS).

Although the SECS is an in-memory data structure, it is present in
enclave memory and is not directly accessible by software.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-13-jarkko@kernel.org
2020-11-18 18:02:49 +01:00
Jarkko Sakkinen
3fe0778eda x86/sgx: Add an SGX misc driver interface
Intel(R) SGX is a new hardware functionality that can be used by
applications to set aside private regions of code and data called
enclaves. New hardware protects enclave code and data from outside
access and modification.

Add a driver that presents a device file and ioctl API to build and
manage enclaves.

 [ bp: Small touchups, remove unused encl variable in sgx_encl_find() as
   Reported-by: kernel test robot <lkp@intel.com> ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-12-jarkko@kernel.org
2020-11-18 18:01:16 +01:00
Arvind Sankar
0ac317e897 x86/boot: Remove unused finalize_identity_maps()
Commit

  8570978ea0 ("x86/boot/compressed/64: Don't pre-map memory in KASLR code")

removed all the references to finalize_identity_maps(), but neglected to
delete the actual function. Remove it.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151208.2212886-2-nivedita@alum.mit.edu
2020-11-18 16:04:23 +01:00
Zhenzhong Duan
4d213e76a3 iommu/vt-d: Avoid panic if iommu init fails in tboot system
"intel_iommu=off" command line is used to disable iommu but iommu is force
enabled in a tboot system for security reason.

However for better performance on high speed network device, a new option
"intel_iommu=tboot_noforce" is introduced to disable the force on.

By default kernel should panic if iommu init fail in tboot for security
reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off".

Fix the code setting force_on and move intel_iommu_tboot_noforce
from tboot code to intel iommu code.

Fixes: 7304e8f28b ("iommu/vt-d: Correctly disable Intel IOMMU force on")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Tested-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-11-18 13:09:07 +00:00
Thomas Gleixner
907f8eb8e0 x86/uaccess: Document copy_from_user_nmi()
Document the functionality of copy_from_user_nmi() to avoid further
confusion. Fix the typo in the existing comment while at it.

Requested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201117202753.806376613@linutronix.de
2020-11-18 13:19:01 +01:00
Thomas Gleixner
860aaabac8 x86/dumpstack: Do not try to access user space code of other tasks
sysrq-t ends up invoking show_opcodes() for each task which tries to access
the user space code of other processes, which is obviously bogus.

It either manages to dump where the foreign task's regs->ip points to in a
valid mapping of the current task or triggers a pagefault and prints "Code:
Bad RIP value.". Both is just wrong.

Add a safeguard in copy_code() and check whether the @regs pointer matches
currents pt_regs. If not, do not even try to access it.

While at it, add commentary why using copy_from_user_nmi() is safe in
copy_code() even if the function name suggests otherwise.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201117202753.667274723@linutronix.de
2020-11-18 12:56:29 +01:00
Christoph Hellwig
16fee29b07
dma-mapping: remove the dma_direct_set_offset export
Drop the dma_direct_set_offset export and move the declaration to
dma-map-ops.h now that the Allwinner drivers have stopped calling it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
2020-11-18 09:11:38 +01:00
Hui Su
09a217c105 x86/dumpstack: Make show_trace_log_lvl() static
show_trace_log_lvl() is not used by other compilation units so make it
static and remove the declaration from the header file.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201113133943.GA136221@rlk
2020-11-17 19:05:32 +01:00
Ard Biesheuvel
b283477d39 efi: x86/xen: switch to efi_get_secureboot_mode helper
Now that we have a static inline helper to discover the platform's secure
boot mode that can be shared between the EFI stub and the kernel proper,
switch to it, and drop some comments about keeping them in sync manually.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-17 15:09:32 +01:00
Jarkko Sakkinen
d2285493be x86/sgx: Add SGX page allocator functions
Add functions for runtime allocation and free.

This allocator and its algorithms are as simple as it gets.  They do a
linear search across all EPC sections and find the first free page.  They
are not NUMA-aware and only hand out individual pages.  The SGX hardware
does not support large pages, so something more complicated like a buddy
allocator is unwarranted.

The free function (sgx_free_epc_page()) implicitly calls ENCLS[EREMOVE],
which returns the page to the uninitialized state.  This ensures that the
page is ready for use at the next allocation.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-10-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Jarkko Sakkinen
38853a3039 x86/cpu/intel: Add a nosgx kernel parameter
Add a kernel parameter to disable SGX kernel support and document it.

 [ bp: Massage. ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-9-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
224ab3527f x86/cpu/intel: Detect SGX support
Kernel support for SGX is ultimately decided by the state of the launch
control bits in the feature control MSR (MSR_IA32_FEAT_CTL).  If the
hardware supports SGX, but neglects to support flexible launch control, the
kernel will not enable SGX.

Enable SGX at feature control MSR initialization and update the associated
X86_FEATURE flags accordingly.  Disable X86_FEATURE_SGX (and all
derivatives) if the kernel is not able to establish itself as the authority
over SGX Launch Control.

All checks are performed for each logical CPU (not just boot CPU) in order
to verify that MSR_IA32_FEATURE_CONTROL is correctly configured on all
CPUs. All SGX code in this series expects the same configuration from all
CPUs.

This differs from VMX where X86_FEATURE_VMX is intentionally cleared only
for the current CPU so that KVM can provide additional information if KVM
fails to load like which CPU doesn't support VMX.  There’s not much the
kernel or an administrator can do to fix the situation, so SGX neglects to
convey additional details about these kinds of failures if they occur.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-8-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
74faeee06d x86/mm: Signal SIGSEGV with PF_SGX
The x86 architecture has a set of page fault error codes.  These indicate
things like whether the fault occurred from a write, or whether it
originated in userspace.

The SGX hardware architecture has its own per-page memory management
metadata (EPCM) [*] and hardware which is separate from the normal x86 MMU.
The architecture has a new page fault error code: PF_SGX.  This new error
code bit is set whenever a page fault occurs as the result of the SGX MMU.

These faults occur for a variety of reasons.  For instance, an access
attempt to enclave memory from outside the enclave causes a PF_SGX fault.
PF_SGX would also be set for permission conflicts, such as if a write to an
enclave page occurs and the page is marked read-write in the x86 page
tables but is read-only in the EPCM.

These faults do not always indicate errors, though.  SGX pages are
encrypted with a key that is destroyed at hardware reset, including
suspend. Throwing a SIGSEGV allows user space software to react and recover
when these events occur.

Include PF_SGX in the PF error codes list and throw SIGSEGV when it is
encountered.

[*] Intel SDM: 36.5.1 Enclave Page Cache Map (EPCM)

 [ bp: Add bit 15 to the comment above enum x86_pf_error_code too. ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-7-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
e7e0545299 x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections
Although carved out of normal DRAM, enclave memory is marked in the
system memory map as reserved and is not managed by the core mm.  There
may be several regions spread across the system.  Each contiguous region
is called an Enclave Page Cache (EPC) section.  EPC sections are
enumerated via CPUID

Enclave pages can only be accessed when they are mapped as part of an
enclave, by a hardware thread running inside the enclave.

Parse CPUID data, create metadata for EPC pages and populate a simple
EPC page allocator.  Although much smaller, ‘struct sgx_epc_page’
metadata is the SGX analog of the core mm ‘struct page’.

Similar to how the core mm’s page->flags encode zone and NUMA
information, embed the EPC section index to the first eight bits of
sgx_epc_page->desc.  This allows a quick reverse lookup from EPC page to
EPC section.  Existing client hardware supports only a single section,
while upcoming server hardware will support at most eight sections.
Thus, eight bits should be enough for long term needs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-6-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
d205e0f142 x86/{cpufeatures,msr}: Add Intel SGX Launch Control hardware bits
The SGX Launch Control hardware helps restrict which enclaves the
hardware will run.  Launch control is intended to restrict what software
can run with enclave protections, which helps protect the overall system
from bad enclaves.

For the kernel's purposes, there are effectively two modes in which the
launch control hardware can operate: rigid and flexible. In its rigid
mode, an entity other than the kernel has ultimate authority over which
enclaves can be run (firmware, Intel, etc...). In its flexible mode, the
kernel has ultimate authority over which enclaves can run.

Enable X86_FEATURE_SGX_LC to enumerate when the CPU supports SGX Launch
Control in general.

Add MSR_IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}, which when combined contain a
SHA256 hash of a 3072-bit RSA public key. The hardware allows SGX enclaves
signed with this public key to initialize and run [*]. Enclaves not signed
with this key can not initialize and run.

Add FEAT_CTL_SGX_LC_ENABLED, which informs whether the SGXLEPUBKEYHASH MSRs
can be written by the kernel.

If the MSRs do not exist or are read-only, the launch control hardware is
operating in rigid mode. Linux does not and will not support creating
enclaves when hardware is configured in rigid mode because it takes away
the authority for launch decisions from the kernel. Note, this does not
preclude KVM from virtualizing/exposing SGX to a KVM guest when launch
control hardware is operating in rigid mode.

[*] Intel SDM: 38.1.4 Intel SGX Launch Control Configuration

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-5-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Sean Christopherson
e7b6385b01 x86/cpufeatures: Add Intel SGX hardware bits
Populate X86_FEATURE_SGX feature from CPUID and tie it to the Kconfig
option with disabled-features.h.

IA32_FEATURE_CONTROL.SGX_ENABLE must be examined in addition to the CPUID
bits to enable full SGX support.  The BIOS must both set this bit and lock
IA32_FEATURE_CONTROL for SGX to be supported (Intel SDM section 36.7.1).
The setting or clearing of this bit has no impact on the CPUID bits above,
which is why it needs to be detected separately.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-4-jarkko@kernel.org
2020-11-17 14:36:13 +01:00
Jarkko Sakkinen
2c273671d0 x86/sgx: Add wrappers for ENCLS functions
ENCLS is the userspace instruction which wraps virtually all
unprivileged SGX functionality for managing enclaves.  It is essentially
the ioctl() of instructions with each function implementing different
SGX-related functionality.

Add macros to wrap the ENCLS functionality. There are two main groups,
one for functions which do not return error codes and a “ret_” set for
those that do.

ENCLS functions are documented in Intel SDM section 36.6.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-3-jarkko@kernel.org
2020-11-17 14:36:12 +01:00
Jarkko Sakkinen
70d3b8ddcd x86/sgx: Add SGX architectural data structures
Define the SGX architectural data structures used by various SGX
functions. This is not an exhaustive representation of all SGX data
structures but only those needed by the kernel.

The goal is to sequester hardware structures in "sgx/arch.h" and keep
them separate from kernel-internal or uapi structures.

The data structures are described in Intel SDM section 37.6.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-2-jarkko@kernel.org
2020-11-17 14:36:12 +01:00
Sami Tolvanen
ebd19fc372 perf/x86: fix sysfs type mismatches
This change switches rapl to use PMU_FORMAT_ATTR, and fixes two other
macros to use device_attribute instead of kobj_attribute to avoid
callback type mismatches that trip indirect call checking with Clang's
Control-Flow Integrity (CFI).

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20201113183126.1239404-1-samitolvanen@google.com
2020-11-17 13:15:38 +01:00
Chen Yu
1a371e67dc x86/microcode/intel: Check patch signature before saving microcode for early loading
Currently, scan_microcode() leverages microcode_matches() to check
if the microcode matches the CPU by comparing the family and model.
However, the processor stepping and flags of the microcode signature
should also be considered when saving a microcode patch for early
update.

Use find_matching_signature() in scan_microcode() and get rid of the
now-unused microcode_matches() which is a good cleanup in itself.

Complete the verification of the patch being saved for early loading in
save_microcode_patch() directly. This needs to be done there too because
save_mc_for_early() will call save_microcode_patch() too.

The second reason why this needs to be done is because the loader still
tries to support, at least hypothetically, mixed-steppings systems and
thus adds all patches to the cache that belong to the same CPU model
albeit with different steppings.

For example:

  microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6
  microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23
  microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
  microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23
  microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
  microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23

The patch which is being saved for early loading, however, can only be
the one which fits the CPU this runs on so do the signature verification
before saving.

 [ bp: Do signature verification in save_microcode_patch()
       and rewrite commit message. ]

Fixes: ec400ddeff ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535
Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.chen@intel.com
2020-11-17 10:33:18 +01:00
Chen Zhou
054409ab25 KVM: SVM: fix error return code in svm_create_vcpu()
Fix to return a negative error code from the error handling case
instead of 0 in function svm_create_vcpu(), as done elsewhere in this
function.

Fixes: f4c847a956 ("KVM: SVM: refactor msr permission bitmap allocation")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Message-Id: <20201117025426.167824-1-chenzhou10@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-17 02:40:08 -05:00
Gabriel Krisman Bertazi
51af3f2306 x86: Reclaim unused x86 TI flags
Reclaim TI flags that were migrated to syscall_work flags.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-11-krisman@collabora.com
2020-11-16 21:53:16 +01:00
Gabriel Krisman Bertazi
b4581a52ca x86: Expose syscall_work field in thread_info
This field will be used by SYSCALL_WORK flags, migrated from TI flags.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-2-krisman@collabora.com
2020-11-16 21:53:15 +01:00
Thomas Gleixner
4cffe21d4a Merge branch 'x86/entry' into core/entry
Prepare for the merging of the syscall_work series which conflicts with the
TIF bits overhaul in X86.
2020-11-16 20:51:59 +01:00
Ashish Kalra
854c57f02b KVM: SVM: Fix offset computation bug in __sev_dbg_decrypt().
Fix offset computation in __sev_dbg_decrypt() to include the
source paddr before it is rounded down to be aligned to 16 bytes
as required by SEV API. This fixes incorrect guest memory dumps
observed when using qemu monitor.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-Id: <20201110224205.29444-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-16 13:59:09 -05:00
Paolo Bonzini
dc924b0624 KVM: SVM: check CR4 changes against vcpu->arch
Similarly to what vmx/vmx.c does, use vcpu->arch.cr4 to check if CR4
bits PGE, PKE and OSXSAVE have changed.  When switching between VMCB01
and VMCB02, CPUID has to be adjusted every time if CR4.PKE or CR4.OSXSAVE
change; without this patch, instead, CR4 would be checked against the
previous value for L2 on vmentry, and against the previous value for
L1 on vmexit, and CPUID would not be updated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-16 13:14:22 -05:00
Cathy Avery
7e8e6eed75 KVM: SVM: Move asid to vcpu_svm
KVM does not have separate ASIDs for L1 and L2; either the nested
hypervisor and nested guests share a single ASID, or on older processor
the ASID is used only to implement TLB flushing.

Either way, ASIDs are handled at the VM level.  In preparation
for having different VMCBs passed to VMLOAD/VMRUN/VMSAVE for L1 and
L2, store the current ASID to struct vcpu_svm and only move it to
the VMCB in svm_vcpu_run.  This way, TLB flushes can be applied
no matter which VMCB will be active during the next svm_vcpu_run.

Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20201011184818.3609-2-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-16 13:14:21 -05:00
Alex Shi
789f52c071 x86/kvm: remove unused macro HV_CLOCK_SIZE
This macro is useless, and could cause gcc warning:
arch/x86/kernel/kvmclock.c:47:0: warning: macro "HV_CLOCK_SIZE" is not
used [-Wunused-macros]
Let's remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <1604651963-10067-1-git-send-email-alex.shi@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-16 13:14:21 -05:00
Borislav Petkov
18741a5251 x86/msr: Do not allow writes to MSR_IA32_ENERGY_PERF_BIAS
Now that all in-kernel-tree users are converted to using the sysfs file,
remove the MSR from the "allowlist".

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20201029190259.3476-5-bp@alien8.de
2020-11-16 17:44:04 +01:00
Tony Luck
098416e698 x86/mce: Use "safe" MSR functions when enabling additional error logging
Booting as a guest under KVM results in error messages about
unchecked MSR access:

  unchecked MSR access error: RDMSR from 0x17f at rIP: 0xffffffff84483f16 (mce_intel_feature_init+0x156/0x270)

because KVM doesn't provide emulation for random model specific
registers.

Switch to using rdmsrl_safe()/wrmsrl_safe() to avoid the message.

Fixes: 68299a42f8 ("x86/mce: Enable additional error logging on certain Intel CPUs")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201111003954.GA11878@agluck-desk2.amr.corp.intel.com
2020-11-16 17:34:08 +01:00
Linus Torvalds
0062442ecf Fixes for ARM and x86, the latter especially for old processors
without two-dimensional paging (EPT/NPT).
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+xQ54UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMzeQf+JP9NpXgeB7dhiODhmO5SyLdw0u9j
 kVOM6+kHcEvG6o0yU1uUZr2ZPh9vIAwIjXi8Luiodcazdp6jvxvJ32CeMYJz2lel
 y+3Gjp3WS2+FExOjBephBztaMHLihlWQt3E0EKuCc7StyfMhaZooiTRMpvrmiLWe
 HQ/epM9oLMyrCqG9MKkvTwH0lDyB5CprV1BNt6YyKjt7d5swEqC75A6lOXnmdAah
 utgx1agSIVQPv6vDF9HLaQaoelHT7ucudx+zIkvOAmoQ56AJMPfCr0+Af3ZVW+f/
 I5tXVfBhoOV3BVSIsJS7Px0HcZt7siVtl6ISZZos8ox85S4ysjWm2vXFcQ==
 =MiOr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Fixes for ARM and x86, the latter especially for old processors
  without two-dimensional paging (EPT/NPT)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
  KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
  KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
  KVM: x86: clflushopt should be treated as a no-op by emulation
  KVM: arm64: Handle SCXTNUM_ELx traps
  KVM: arm64: Unify trap handlers injecting an UNDEF
  KVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace
2020-11-15 09:57:58 -08:00
Linus Torvalds
326fd6db61 A small set of fixes for x86:
- Cure the fallout from the MSI irqdomain overhaul which missed that the
    Intel IOMMU does not register virtual function devices and therefore
    never reaches the point where the MSI interrupt domain is assigned. This
    makes the VF devices use the non-remapped MSI domain which is trapped by
    the IOMMU/remap unit.
 
  - Remove an extra space in the SGI_UV architecture type procfs output for
    UV5.
 
  - Remove a unused function which was missed when removing the UV BAU TLB
    shootdown handler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVWxD/9Tq4W6Kniln7mtoEWHRvHRceiiGcS3
 MocvqurhoJwirH4F2gkvCegTBy0r3FdUORy3OMmChVs6nb8XpPpso84SANCRePWp
 JZezpVwLSNC4O1/ZCg1Kjj4eUpzLB/UjUUQV9RsjL5wyQEhfCZgb1D40yLM/2dj5
 SkVm/EAqWuQNtYe/jqAOwTX/7mV+k2QEmKCNOigM13R9EWgu6a4J8ta1gtNSbwvN
 jWMW+M1KjZ76pfRK+y4OpbuFixteSzhSWYPITSGwQz4IpQ+Ty2Rv0zzjidmDnAR+
 Q73cup0dretdVnVDRpMwDc06dBCmt/rbN50w4yGU0YFRFDgjGc8sIbQzuIP81nEQ
 XY4l4rcBgyVufFsLrRpQxu1iYPFrcgU38W1kRkkJ3Kl/rY1a2ZU7sLE4kt4Oh55W
 A9KCmsfqP1PCYppjAQ0QT4NOp4YtecPvAU4UcBOb722DDBd8TfhLWWGw2yG57Q/d
 Wnu8xCJGy7BaLHLGGGseAft+D4aNnCjKC3jgMyvNtRDXaV2cK2Kdd6ehMlWVUapD
 xfLlKXE+igXMyoWJIWjTXQJs4dpKu6QpJCPiorwEZ8rmNaRfxsWEJVbeYwEkmUke
 bMoBBSCbZT86WVOYhI8WtrIemraY0mMYrrcE03M96HU3eYB8BV92KrIzZWThupcQ
 ZqkZbqCZm3vfHA==
 =X/P+
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A small set of fixes for x86:

   - Cure the fallout from the MSI irqdomain overhaul which missed that
     the Intel IOMMU does not register virtual function devices and
     therefore never reaches the point where the MSI interrupt domain is
     assigned. This made the VF devices use the non-remapped MSI domain
     which is trapped by the IOMMU/remap unit

   - Remove an extra space in the SGI_UV architecture type procfs output
     for UV5

   - Remove a unused function which was missed when removing the UV BAU
     TLB shootdown handler"

* tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  iommu/vt-d: Cure VF irqdomain hickup
  x86/platform/uv: Fix copied UV5 output archtype
  x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-15 09:49:56 -08:00
Linus Torvalds
64b609d6a6 A set of fixes for perf:
- A set of commits which reduce the stack usage of various perf event
    handling functions which allocated large data structs on stack causing
    stack overflows in the worst case.
 
  - Use the proper mechanism for detecting soft interrupts in the recursion
    protection.
 
  - Make the resursion protection simpler and more robust.
 
  - Simplify the scheduling of event groups to make the code more robust and
    prepare for fixing the issues vs. scheduling of exclusive event groups.
 
  - Prevent event multiplexing and rotation for exclusive event groups
 
  - Correct the perf event attribute exclusive semantics to take pinned
    events, e.g. the PMU watchdog, into account
 
  - Make the anythread filtering conditional for Intel's generic PMU
    counters as it is not longer guaranteed to be supported on newer
    CPUs. Check the corresponding CPUID leaf to make sure.
 
  - Fixup a duplicate initialization in an array which was probably cause by
    the usual copy & paste - forgot to edit mishap.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xIi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofixD/4+4gc8DhOmAkMrN0Z9tiW8ebgMKmb9
 wZRkMr5Osi0GzLJOPZ6SdY6jd0A3rMN/sW6P1DT6pDtcty4bKFoW5VZBuUDIAhel
 BC4C93L3y1En/GEZu1GTy3LvsBwLBQTOoY4goDjbdAbk60S/0RTHOGyQsRsOQFe6
 fVs3iXozAFuaR6I6N3dlxuJAE51zvr8MyBWaUoByNDB//1+lLNW+JfClaAOG1oXx
 qZIg/niatBVGzSGgKNRUyh3g8G1HJtabsA/NZ4PH8ZHuYABfmj4lmmUPR77ICLfV
 wMITEBG7eaktB8EqM9hvaoOZLA5kpXHO2JbCFSs4c4x11mlC8g7QMV3poCw33YoN
 a5TmT1A3muri1riy1/Ee9lXACOq7/tf2+Xfn9o6dvDdBwd6s5pzlhLGR8gILp2lF
 2bcg3IwYvHT/Kiurb/WGNpbCqQIPJpcUcfs3tNBCCtKegahUQNnGjxN3NVo9RCit
 zfL6xIJ8eZiYnsxXx4NKm744AukWiql3aRNgRkOdBP5WC68xt6VLcxG1YZKUoDhy
 jRSOCD/DuPSMSvAAgN7S8OWlPsKWBxVxxWYV+K8FpwhgzbQ3WbS3UDiYkhgjeOxu
 OlM692oWpllKvQWlvYthr2Be6oPCRRi1vvADNNbTKzgHk5i61bwympsGl1EZx3Pz
 2ROp7NJFRESnqw==
 =FzCf
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
      event handling functions which allocated large data structs on
      stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
      recursion protection

    - Make the resursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
      robust and prepare for fixing the issues vs. scheduling of
      exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
      pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
      counters as it is not longer guaranteed to be supported on newer
      CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
      caused by the usual 'copy & paste - forgot to edit' mishap"

* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix Add BW copypasta
  perf/x86/intel: Make anythread filter support conditional
  perf: Tweak perf_event_attr::exclusive semantics
  perf: Fix event multiplexing for exclusive groups
  perf: Simplify group_sched_in()
  perf: Simplify group_sched_out()
  perf/x86: Make dummy_iregs static
  perf/arch: Remove perf_sample_data::regs_user_copy
  perf: Optimize get_recursion_context()
  perf: Fix get_recursion_context()
  perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
  perf: Reduce stack usage of perf_output_begin()
2020-11-15 09:46:36 -08:00
Jim Mattson
2259c17f01 kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions
On emulated VM-entry and VM-exit, update the CPUID bits that reflect
CR4.OSXSAVE and CR4.PKE.

This fixes a bug where the CPUID bits could continue to reflect L2 CR4
values after emulated VM-exit to L1. It also fixes a related bug where
the CPUID bits could continue to reflect L1 CR4 values after emulated
VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
CPUID is not a required intercept. However, it could also be relevant
to VMX, because the code to conditionally update these CPUID bits
assumes that the guest CPUID and the guest CR4 are always in sync.

Fixes: 8eb3f87d90 ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
Fixes: 2acf923e38 ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
Fixes: b9baba8614 ("KVM, pkeys: expose CPUID/CR4 to guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Dexuan Cui <dexuan.cui@intel.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Message-Id: <20201029170648.483210-1-jmattson@google.com>
2020-11-15 09:49:18 -05:00
Peter Xu
044c59c409 KVM: Don't allocate dirty bitmap if dirty ring is enabled
Because kvm dirty rings and kvm dirty log is used in an exclusive way,
Let's avoid creating the dirty_bitmap when kvm dirty ring is enabled.
At the meantime, since the dirty_bitmap will be conditionally created
now, we can't use it as a sign of "whether this memory slot enabled
dirty tracking".  Change users like that to check against the kvm
memory slot flags.

Note that there still can be chances where the kvm memory slot got its
dirty_bitmap allocated, _if_ the memory slots are created before
enabling of the dirty rings and at the same time with the dirty
tracking capability enabled, they'll still with the dirty_bitmap.
However it should not hurt much (e.g., the bitmaps will always be
freed if they are there), and the real users normally won't trigger
this because dirty bit tracking flag should in most cases only be
applied to kvm slots only before migration starts, that should be far
latter than kvm initializes (VM starts).

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012226.5868-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:16 -05:00
Peter Xu
fb04a1eddb KVM: X86: Implement ring-based dirty memory tracking
This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial.  In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables dirty ring for X86 only.  However it should be
easily extended to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:15 -05:00
Peter Xu
ff5a983cbb KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Originally, we have three code paths that can dirty a page without
vcpu context for X86:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be setup on
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.

To do this, allow __x86_set_memory_region() to return the userspace
address that just allocated to the caller.  Then in both of the
functions we directly write to the userspace address instead of
calling kvm_write_*() APIs.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map() because no
matter what we'll write to the whole page with 4M huge page entries.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:12 -05:00
Vitaly Kuznetsov
c21d54f030 KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl
KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl but its output is now
independent from vCPU and in some cases VMMs may want to use it as a system
ioctl instead. In particular, QEMU doesn CPU feature expansion before any
vCPU gets created so KVM_GET_SUPPORTED_HV_CPUID can't be used.

Convert KVM_GET_SUPPORTED_HV_CPUID to 'dual' system/vCPU ioctl with the
same meaning.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:11 -05:00
Yadong Qi
bf0cd88ce3 KVM: x86: emulate wait-for-SIPI and SIPI-VMExit
Background: We have a lightweight HV, it needs INIT-VMExit and
SIPI-VMExit to wake-up APs for guests since it do not monitor
the Local APIC. But currently virtual wait-for-SIPI(WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit which cause
the L2 guest cannot wake up the APs.

According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.

In this patch:
    1. introduce SIPI exit reason,
    2. introduce wait-for-SIPI state for nVMX,
    3. advertise wait-for-SIPI support to guest.

When L1 hypervisor is not monitoring Local APIC, L0 need to emulate
INIT-VMExit and SIPI-VMExit to L1 to emulate INIT-SIPI-SIPI for
L2. L2 LAPIC write would be traped by L0 Hypervisor(KVM), L0 should
emulate the INIT/SIPI vmexit to L1 hypervisor to set proper state
for L2's vcpu state.

Handle procdure:
Source vCPU:
    L2 write LAPIC.ICR(INIT).
    L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
       vCPU.
Target vCPU:
    L0 emulate an INIT VMExit to L1 if is guest mode.
    L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
    L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
       == WAIT_SIPI).

Source vCPU:
    L2 write LAPIC.ICR(SIPI).
    L0 trap LAPIC.ICR write(INIT): inject a latched SIPI event to traget
       vCPU.
Target vCPU:
    L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
    L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
    L0 resume to L2.
    L2 start-up.

Signed-off-by: Yadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:09 -05:00
Paolo Bonzini
1c96dcceae KVM: x86: fix apic_accept_events vs check_nested_events
vmx_apic_init_signal_blocked is buggy in that it returns true
even in VMX non-root mode.  In non-root mode, however, INITs
are not latched, they just cause a vmexit.  Previously,
KVM was waiting for them to be processed when kvm_apic_accept_events
and in the meanwhile it ate the SIPIs that the processor received.

However, in order to implement the wait-for-SIPI activity state,
KVM will have to process KVM_APIC_SIPI in vmx_check_nested_events,
and it will not be possible anymore to disregard SIPIs in non-root
mode as the code is currently doing.

By calling kvm_x86_ops.nested_ops->check_events, we can force a vmexit
(with the side-effect of latching INITs) before incorrectly injecting
an INIT or SIPI in a guest, and therefore vmx_apic_init_signal_blocked
can do the right thing.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:08 -05:00
Sean Christopherson
ee69c92bac KVM: x86: Return bool instead of int for CR4 and SREGS validity checks
Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).

No functional changed intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:08 -05:00
Sean Christopherson
c2fe3cd460 KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook
Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.

Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.

Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics.  This will be remedied in a future patch.

Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:07 -05:00
Sean Christopherson
311a06593b KVM: SVM: Drop VMXE check from svm_set_cr4()
Drop svm_set_cr4()'s explicit check CR4.VMXE now that common x86 handles
the check by incorporating VMXE into the CR4 reserved bits, via
kvm_cpu_caps.  SVM obviously does not set X86_FEATURE_VMX.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:07 -05:00
Sean Christopherson
a447e38a7f KVM: VMX: Drop explicit 'nested' check from vmx_set_cr4()
Drop vmx_set_cr4()'s explicit check on the 'nested' module param now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, via kvm_cpu_caps.  X86_FEATURE_VMX is set in kvm_cpu_caps
(by vmx_set_cpu_caps()), if and only if 'nested' is true.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:06 -05:00
Sean Christopherson
d3a9e4146a KVM: VMX: Drop guest CPUID check for VMXE in vmx_set_cr4()
Drop vmx_set_cr4()'s somewhat hidden guest_cpuid_has() check on VMXE now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, i.e. in cr4_guest_rsvd_bits.  This fixes a bug where KVM
incorrectly rejects KVM_SET_SREGS with CR4.VMXE=1 if it's executed
before KVM_SET_CPUID{,2}.

Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Reported-by: Stas Sergeev <stsp@users.sourceforge.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:06 -05:00
Paolo Bonzini
c887c9b9ca kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
In some cases where shadow paging is in use, the root page will
be either mmu->pae_root or vcpu->arch.mmu->lm_root.  Then it will
not have an associated struct kvm_mmu_page, because it is allocated
with alloc_page instead of kvm_mmu_alloc_page.

Just return false quickly from is_tdp_mmu_root if the TDP MMU is
not in use, which also includes the case where shadow paging is
enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 08:55:43 -05:00
Steven Rostedt (VMware)
2860cd8a23 livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call
will be able to set the ip of the calling function. This will improve the
performance of live kernel patching where it does not need all the regs to
be stored just to change the instruction pointer.

If all archs that support live kernel patching also support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture specific function
klp_arch_set_pc() could be made generic.

It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but
not HAVE_DYNAMIC_FTRACE_WITH_REGS and then have access to live patching.

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: live-patching@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:15:28 -05:00
Steven Rostedt (VMware)
02a474ca26 ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
Currently, the only way to get access to the registers of a function via a
ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this
saves all regs as if a breakpoint were to trigger (for use with kprobes), it
is expensive.

The regs are already saved on the stack for the default ftrace callbacks, as
that is required otherwise a function being traced will get the wrong
arguments and possibly crash. And on x86, the arguments are already stored
where they would be on a pt_regs structure to use that code for both the
regs version of a callback, it makes sense to pass that information always
to all functions.

If an architecture does this (as x86_64 now does), it is to set
HAVE_DYNAMIC_FTRACE_WITH_ARGS, and this will let the generic code that it
could have access to arguments without having to set the flags.

This also includes having the stack pointer being saved, which could be used
for accessing arguments on the stack, as well as having the function graph
tracer not require its own trampoline!

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:14:59 -05:00
Steven Rostedt (VMware)
d19ad0775d ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.

For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.

This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:14:55 -05:00
Babu Moger
96308b0661 KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
For AMD SEV guests, update the cr3_lm_rsvd_bits to mask
the memory encryption bit in reserved bits.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <160521948301.32054.5783800787423231162.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-13 06:31:14 -05:00
Babu Moger
0107973a80 KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
SEV guests fail to boot on a system that supports the PCID feature.

While emulating the RSM instruction, KVM reads the guest CR3
and calls kvm_set_cr3(). If the vCPU is in the long mode,
kvm_set_cr3() does a sanity check for the CR3 value. In this case,
it validates whether the value has any reserved bits set. The
reserved bit range is 63:cpuid_maxphysaddr(). When AMD memory
encryption is enabled, the memory encryption bit is set in the CR3
value. The memory encryption bit may fall within the KVM reserved
bit range, causing the KVM emulation failure.

Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
cache the reserved bits in the CR3 value. This will be initialized
to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).

If the architecture has any special bits(like AMD SEV encryption bit)
that needs to be masked from the reserved bits, should be cleared
in vendor specific kvm_x86_ops.vcpu_after_set_cpuid handler.

Fixes: a780a3ea62 ("KVM: X86: Fix reserved bits check for MOV to CR3")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-13 06:28:37 -05:00
David Edmondson
51b958e5ae KVM: x86: clflushopt should be treated as a no-op by emulation
The instruction emulator ignores clflush instructions, yet fails to
support clflushopt. Treat both similarly.

Fixes: 13e457e0ee ("KVM: x86: Emulator does not decode clflush well")
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20201103120400.240882-1-david.edmondson@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-13 06:28:33 -05:00
Mike Travis
77c7e1bc06 x86/platform/uv: Fix copied UV5 output archtype
A test shows that the output contains a space:
    # cat /proc/sgi_uv/archtype
    NSGI4 U/UVX

Remove that embedded space by copying the "trimmed" buffer instead of the
untrimmed input character list.  Use sizeof to remove size dependency on
copy out length.  Increase output buffer size by one character just in case
BIOS sends an 8 character string for archtype.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20201111010418.82133-1-mike.travis@hpe.com
2020-11-13 00:00:31 +01:00
Thomas Gleixner
cba08c5dc6 x86/fpu: Make kernel FPU protection RT friendly
Non RT kernels need to protect FPU against preemption and bottom half
processing. This is achieved by disabling bottom halfs via
local_bh_disable() which implictly disables preemption.

On RT kernels this protection mechanism is not sufficient because
local_bh_disable() does not disable preemption. It serializes bottom half
related processing via a CPU local lock.

As bottom halfs are running always in thread context on RT kernels
disabling preemption is the proper choice as it implicitly prevents bottom
half processing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201027101349.588965083@linutronix.de
2020-11-11 14:35:16 +01:00
Thomas Gleixner
5f0c71278d x86/fpu: Simplify fpregs_[un]lock()
There is no point in disabling preemption and then disabling bottom
halfs.

Just disabling bottom halfs is sufficient as it implicitly disables
preemption on !RT kernels.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201027101349.455380473@linutronix.de
2020-11-11 14:35:16 +01:00
Jiri Slaby
b2896458b8 x86/platform/uv: Drop last traces of uv_flush_tlb_others
Commit 39297dde73 ("x86/platform/uv: Remove UV BAU TLB Shootdown
Handler") removed uv_flush_tlb_others. Its declaration was removed also
from asm/uv/uv.h. But only for the CONFIG_X86_UV=y case. The inline
definition (!X86_UV case) is still in place.

So remove this implementation with everything what was added to support
uv_flush_tlb_others:
* include of asm/tlbflush.h
* forward declarations of struct cpumask, mm_struct, and flush_tlb_info

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mike Travis <mike.travis@hpe.com>
Acked-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20201109093653.2042-1-jslaby@suse.cz
2020-11-11 13:16:51 +01:00
Victor Ding
43756a2989 powercap: Add AMD Fam17h RAPL support
Enable AMD Fam17h RAPL support for the power capping framework.

The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh
(Zen1) PPR.

Tested by comparing the results of following two sysfs entries and the
values directly read from corresponding MSRs via /dev/cpu/[x]/msr:
  /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
  /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/energy_uj

Signed-off-by: Victor Ding <victording@google.com>
Acked-by: Kim Phillips <kim.phillips@amd.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-10 19:59:07 +01:00
Victor Ding
298ed2b31f x86/msr-index: sort AMD RAPL MSRs by address
MSRs in the rest of this file are sorted by their addresses; fixing the
two outliers.

No functional changes.

Signed-off-by: Victor Ding <victording@google.com>
Acked-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-10 19:59:06 +01:00
Arvind Sankar
c2fe61d8be efi/x86: Free efi_pgd with free_pages()
Commit

  d9e9a64180 ("x86/mm/pti: Allocate a separate user PGD")

changed the PGD allocation to allocate PGD_ALLOCATION_ORDER pages, so in
the error path it should be freed using free_pages() rather than
free_page().

Commit

    06ace26f4e ("x86/efi: Free efi_pgd with free_pages()")

fixed one instance of this, but missed another.

Move the freeing out-of-line to avoid code duplication and fix this bug.

Fixes: d9e9a64180 ("x86/mm/pti: Allocate a separate user PGD")
Link: https://lore.kernel.org/r/20201110163919.1134431-1-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-10 19:18:11 +01:00
Thomas Gleixner
aec8da04e4 x86/ioapic: Correct the PCI/ISA trigger type selection
PCI's default trigger type is level and ISA's is edge. The recent
refactoring made it the other way round, which went unnoticed as it seems
only to cause havoc on some AMD systems.

Make the comment and code do the right thing again.

Fixes: a27dca645d ("x86/io_apic: Cleanup trigger/polarity helpers")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/87d00lgu13.fsf@nanos.tec.linutronix.de
2020-11-10 18:43:22 +01:00
Arnd Bergmann
1a8cfa24e2 perf/x86/intel/uncore: Fix Add BW copypasta
gcc -Wextra points out a duplicate initialization of one array
member:

arch/x86/events/intel/uncore_snb.c:478:37: warning: initialized field overwritten [-Woverride-init]
  478 |  [SNB_PCI_UNCORE_IMC_DATA_READS]  = { SNB_UNCORE_PCI_IMC_DATA_WRITES_BASE,

The only sensible explanation is that a duplicate 'READS' was used
instead of the correct 'WRITES', so change it back.

Fixes: 24633d901e ("perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201026215203.3893972-1-arnd@kernel.org
2020-11-10 18:38:40 +01:00
Linus Torvalds
407ab57963 ARM:
- Fix compilation error when PMD and PUD are folded
 - Fix regression in reads-as-zero behaviour of ID_AA64ZFR0_EL1
 - Add aarch64 get-reg-list test
 
 x86:
 - fix semantic conflict between two series merged for 5.10
 - fix (and test) enforcement of paravirtual cpuid features
 
 Generic:
 - various cleanups to memory management selftests
 - new selftests testcase for performance of dirty logging
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+pVjkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO3fAf/ZniW/7FC4pD/M0txXUst3mKNcC16
 AbMfN36dvzdWBnAuTVsP2d+XM/sbPNacomcJGfJ5II9TKrb00FUNxU37In7vdbbm
 WjpyDEpRDXnCY+OXs7dwY66dEXzv9GTzlQaGuah67AeGpzSuu3zrXlu07di446Gv
 ZtHvbzFEvos7cByp3LoPfvbnvv9kkD5mQkOW7wG42hUPrxMNxtHC+qyP92DIpV8d
 etDNC95rhdhhZM3LAlvO6Bp4I1uFXpYHEHtIOOT05IB9clNhfdgsuD8wiqWfEo0l
 sVhg3yXWbbfGaP3vEZp5QY9qko8I0XjwIWc5hWsIHST7uPqgi8a/wIbbEA==
 =jBcA
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:
   - fix compilation error when PMD and PUD are folded
   - fix regression in reads-as-zero behaviour of ID_AA64ZFR0_EL1
   - add aarch64 get-reg-list test

  x86:
   - fix semantic conflict between two series merged for 5.10
   - fix (and test) enforcement of paravirtual cpuid features

  selftests:
   - various cleanups to memory management selftests
   - new selftests testcase for performance of dirty logging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
  KVM: selftests: allow two iterations of dirty_log_perf_test
  KVM: selftests: Introduce the dirty log perf test
  KVM: selftests: Make the number of vcpus global
  KVM: selftests: Make the per vcpu memory size global
  KVM: selftests: Drop pointless vm_create wrapper
  KVM: selftests: Add wrfract to common guest code
  KVM: selftests: Simplify demand_paging_test with timespec_diff_now
  KVM: selftests: Remove address rounding in guest code
  KVM: selftests: Factor code out of demand_paging_test
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Add blessed SVE registers to get-reg-list
  KVM: selftests: Add aarch64 get-reg-list test
  selftests: kvm: test enforcement of paravirtual cpuid features
  selftests: kvm: Add exception handling to selftests
  selftests: kvm: Clear uc so UCALL_NONE is being properly reported
  selftests: kvm: Fix the segment descriptor layout to match the actual layout
  KVM: x86: handle MSR_IA32_DEBUGCTLMSR with report_ignored_msrs
  kvm: x86: request masterclock update any time guest uses different msr
  kvm: x86: ensure pv_cpuid.features is initialized when enabling cap
  ...
2020-11-09 13:58:10 -08:00
Stephane Eranian
cadbaa039b perf/x86/intel: Make anythread filter support conditional
Starting with Arch Perfmon v5, the anythread filter on generic counters may be
deprecated. The current kernel was exporting the any filter without checking.
On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf
function to determine if anythread is supported or not as described in the
Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201028194247.3160610-1-eranian@google.com
2020-11-09 18:12:36 +01:00
Peter Zijlstra
e506d1dac0 perf/x86: Make dummy_iregs static
Having pt_regs on-stack is unfortunate, it's 168 bytes. Since it isn't
actually used, make it a static variable. This both gets if off the
stack and ensures it gets 0 initialized, just in case someone does
look at it.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151955.324273677@infradead.org
2020-11-09 18:12:35 +01:00
Peter Zijlstra
76a4efa809 perf/arch: Remove perf_sample_data::regs_user_copy
struct perf_sample_data lives on-stack, we should be careful about it's
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck, solve it differently.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org
2020-11-09 18:12:34 +01:00
Peter Zijlstra
9dfa9a5c9b perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
intel_pmu_drain_pebs_*() is typically called from handle_pmi_common(),
both have an on-stack struct perf_sample_data, which is *big*. Rewire
things so that drain_pebs() can use the one handle_pmi_common() has.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151955.054099690@infradead.org
2020-11-09 18:12:33 +01:00
Peter Zijlstra
267fb27352 perf: Reduce stack usage of perf_output_begin()
__perf_output_begin() has an on-stack struct perf_sample_data in the
unlikely case it needs to generate a LOST record. However, every call
to perf_output_begin() must already have a perf_sample_data on-stack.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201030151954.985416146@infradead.org
2020-11-09 18:12:33 +01:00
Brian Masney
65cae18882 x86/xen: don't unbind uninitialized lock_kicker_irq
When booting a hyperthreaded system with the kernel parameter
'mitigations=auto,nosmt', the following warning occurs:

    WARNING: CPU: 0 PID: 1 at drivers/xen/events/events_base.c:1112 unbind_from_irqhandler+0x4e/0x60
    ...
    Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
    ...
    Call Trace:
     xen_uninit_lock_cpu+0x28/0x62
     xen_hvm_cpu_die+0x21/0x30
     takedown_cpu+0x9c/0xe0
     ? trace_suspend_resume+0x60/0x60
     cpuhp_invoke_callback+0x9a/0x530
     _cpu_up+0x11a/0x130
     cpu_up+0x7e/0xc0
     bringup_nonboot_cpus+0x48/0x50
     smp_init+0x26/0x79
     kernel_init_freeable+0xea/0x229
     ? rest_init+0xaa/0xaa
     kernel_init+0xa/0x106
     ret_from_fork+0x35/0x40

The secondary CPUs are not activated with the nosmt mitigations and only
the primary thread on each CPU core is used. In this situation,
xen_hvm_smp_prepare_cpus(), and more importantly xen_init_lock_cpu(), is
not called, so the lock_kicker_irq is not initialized for the secondary
CPUs. Let's fix this by exiting early in xen_uninit_lock_cpu() if the
irq is not set to avoid the warning from above for each secondary CPU.

Signed-off-by: Brian Masney <bmasney@redhat.com>
Link: https://lore.kernel.org/r/20201107011119.631442-1-bmasney@redhat.com
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-11-09 07:43:45 -06:00
Linus Torvalds
40be821d62 A set of x86 fixes:
- Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
    combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
    integrated assembler upset.
 
  - Correct the mitigation selection logic which prevented the related prctl
    to work correctly.
 
  - Make the UV5 hubless system work correctly by fixing up the malformed
    table entries and adding the missing ones.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+oDNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaN0EACPWY15k1YuAEIjiQxRBhq22J8Y6wNX
 Ui/rF2AZcAnNEJDTIyvjP6COnT9mjX/tuuluMaI6i/XY/9Xp5LpKvivkL2PXNN3X
 onW01ouIc1iYxXwQEVZvhYHsOyhkR9Z8yNG/q9I7xYAXNSZcAHwXVar4VlPBT7Ay
 iP75i8pGmb/NCc4oHNXuBp/dV/0/dCoLTndb5p5pX8oS60AAt9ZuK3IRc3ucayhI
 M4rTTEya1oY+ZNbtP4A4Jp7Qc/NGYDo6q04za+jcxZ5Gqacs+fk/PNuWgL1fZZtW
 sn1D+SMWEb55Xcsdy976b29FFU/DcOcf7TRASzyKgyPW5jg1dP6BZ6U0wpVV3KZw
 S2h5/pt48JZI7olrDsLQ0tzjALlk2CcFNrnRtOMDduHdw9wyz+Sg58lZYuvH3sXK
 5ZblWRJ3JiBNsNO0sA3kd4sp7xWQB3ey6mkYD8Vqb7zRIt8aXT9jqBxhDrP+Vqs/
 /UKv+BJfD6WxC0nQ4x6MS3g4sDvI+1SLfHSZ/UjWJ6NfYJW5/w429pFCaF73xCTd
 cqxja1dZYixn7ioFZjolMUdvuDiC5B2+5+RzEV87kaDzO9QZQyvsl7G74MSfwx6G
 DAydvuyJoxP2qVASobOBcVOzLQO7DsLzFZzJTttZcnkK2iprcz4qrsFLMxF9SxTD
 Amb8qck60dLfqA==
 =JdPk
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of x86 fixes:

   - Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a
     combination of .weak and SYM_FUNC_START_LOCAL which makes LLVMs
     integrated assembler upset

   - Correct the mitigation selection logic which prevented the related
     prctl to work correctly

   - Make the UV5 hubless system work correctly by fixing up the
     malformed table entries and adding the missing ones"

* tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/uv: Recognize UV5 hubless system identifier
  x86/platform/uv: Remove spaces from OEM IDs
  x86/platform/uv: Fix missing OEM_TABLE_ID
  x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
  x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
2020-11-08 10:09:36 -08:00
Pankaj Gupta
2cdef91cf8 KVM: x86: handle MSR_IA32_DEBUGCTLMSR with report_ignored_msrs
Windows2016 guest tries to enable LBR by setting the corresponding bits
in MSR_IA32_DEBUGCTLMSR. KVM does not emulate MSR_IA32_DEBUGCTLMSR and
spams the host kernel logs with error messages like:

	kvm [...]: vcpu1, guest rIP: 0xfffff800a8b687d3 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop"

This patch fixes this by enabling error logging only with
'report_ignored_msrs=1'.

Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Message-Id: <20201105153932.24316-1-pankaj.gupta.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:31 -05:00
Oliver Upton
1e293d1ae8 kvm: x86: request masterclock update any time guest uses different msr
Commit 5b9bb0ebbc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME)
emulation in helper fn", 2020-10-21) subtly changed the behavior of guest
writes to MSR_KVM_SYSTEM_TIME(_NEW). Restore the previous behavior; update
the masterclock any time the guest uses a different msr than before.

Fixes: 5b9bb0ebbc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn", 2020-10-21)
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:30 -05:00
Oliver Upton
01b4f510b9 kvm: x86: ensure pv_cpuid.features is initialized when enabling cap
Make the paravirtual cpuid enforcement mechanism idempotent to ioctl()
ordering by updating pv_cpuid.features whenever userspace requests the
capability. Extract this update out of kvm_update_cpuid_runtime() into a
new helper function and move its other call site into
kvm_vcpu_after_set_cpuid() where it more likely belongs.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:29 -05:00
Oliver Upton
1930e5ddce kvm: x86: reads of restricted pv msrs should also result in #GP
commit 66570e966d ("kvm: x86: only provide PV features if enabled in
guest's CPUID") only protects against disallowed guest writes to KVM
paravirtual msrs, leaving msr reads unchecked. Fix this by enforcing
KVM_CPUID_FEATURES for msr reads as well.

Fixes: 66570e966d ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20201027231044.655110-4-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:29 -05:00
Maxim Levitsky
cc4cb01767 KVM: x86: use positive error values for msr emulation that causes #GP
Recent introduction of the userspace msr filtering added code that uses
negative error codes for cases that result in either #GP delivery to
the guest, or handled by the userspace msr filtering.

This breaks an assumption that a negative error code returned from the
msr emulation code is a semi-fatal error which should be returned
to userspace via KVM_RUN ioctl and usually kill the guest.

Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
and by adding a new KVM_MSR_RET_FILTERED error code for the
userspace filtered msrs.

Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201101115523.115780-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:28 -05:00
Li RongQing
c6c4f961cb KVM: x86/mmu: fix counting of rmap entries in pte_list_add
Fix an off-by-one style bug in pte_list_add() where it failed to
account the last full set of SPTEs, i.e. when desc->sptes is full
and desc->more is NULL.

Merge the two "PTE_LIST_EXT-1" checks as part of the fix to avoid
an extra comparison.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <1601196297-24104-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-08 04:41:26 -05:00
Ingo Molnar
666fab4a3e Merge branch 'linus' into perf/kprobes
Conflicts:
	include/asm-generic/atomic-instrumented.h
	kernel/kprobes.c

Use the upstream atomic-instrumented.h checksum, and pick
the kprobes version of kernel/kprobes.c, which effectively
reverts this upstream workaround:

  645f224e7b: ("kprobes: Tell lockdep about kprobe nesting")

Since the new code *should* be fine without nesting.

Knock on wood ...

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-07 13:20:17 +01:00
Ingo Molnar
0a986ea81e Merge branch 'linus' into perf/kprobes
Merge recent kprobes updates into perf/kprobes that came from -mm.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-07 13:18:49 +01:00
Mike Travis
801284f973 x86/platform/uv: Recognize UV5 hubless system identifier
Testing shows a problem in that UV5 hubless systems were not being
recognized.  Add them to the list of OEM IDs checked.

Fixes: 6c7794423a ("Add UV5 direct references")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-4-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Mike Travis
1aee505e01 x86/platform/uv: Remove spaces from OEM IDs
Testing shows that trailing spaces caused problems with the OEM_ID and
the OEM_TABLE_ID.  One being that the OEM_ID would not string compare
correctly.  Another the OEM_ID and OEM_TABLE_ID would be concatenated
in the printout.  Remove any trailing spaces.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-3-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Mike Travis
1aec69ae56 x86/platform/uv: Fix missing OEM_TABLE_ID
Testing shows a problem in that the OEM_TABLE_ID was missing for
hubless systems.  This is used to determine the APIC type (legacy or
extended).  Add the OEM_TABLE_ID to the early hubless processing.

Fixes: 1e61f5a95f ("Add and decode Arch Type in UVsystab")
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201105222741.157029-2-mike.travis@hpe.com
2020-11-07 11:17:39 +01:00
Paul E. McKenney
3fcd6a230f x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs
Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to
learn their clock frequency.  Which is a bit strange, given that waking
them from idle likely significantly changes their clock frequency.
This commit therefore avoids sending /proc/cpuinfo-induced IPIs to
idle CPUs.

[ paulmck: Also check for idle in arch_freq_prepare_all(). ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
2020-11-06 16:59:11 -08:00
Paul E. McKenney
f4deaf9021 x86/cpu: Avoid cpuinfo-induced IPI pileups
The aperfmperf_snapshot_cpu() function is invoked upon access to
/proc/cpuinfo, and it does do an early exit if the specified CPU has
recently done a snapshot.  Unfortunately, the indication that a snapshot
has been completed is set in an IPI handler, and the execution of this
handler can be delayed by any number of unfortunate events.  This means
that a system that starts a number of applications, each of which
parses /proc/cpuinfo, can suffer from an smp_call_function_single()
storm, especially given that each access to /proc/cpuinfo invokes
smp_call_function_single() for all CPUs.  Please note that this is not
theoretical speculation.  Note also that one CPU's pending IPI serves
all requests, so there is no point in ever having more than one IPI
pending to a given CPU.

This commit therefore suppresses duplicate IPIs to a given CPU via a
new ->scfpending field in the aperfmperf_sample structure.  This field
is set to the value one if an IPI is pending to the corresponding CPU
and to zero otherwise.

The aperfmperf_snapshot_cpu() function uses atomic_xchg() to set this
field to the value one and sample the old value.  If this function's
"wait" parameter is zero, smp_call_function_single() is called only if
the old value of the ->scfpending field was zero.  The IPI handler uses
atomic_set_release() to set this new field to zero just before returning,
so that the prior stores into the aperfmperf_sample structure are seen
by future requests that get to the atomic_xchg().  Future requests that
pass the elapsed-time check are ordered by the fact that on x86 loads act
as acquire loads, just as was the case prior to this change.  The return
value is based off of the age of the prior snapshot, just as before.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
[ paulmck: Allow /proc/cpuinfo to take advantage of arch_freq_get_on_cpu(). ]
[ paulmck: Add comment on memory barrier. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
2020-11-06 16:58:40 -08:00
Thomas Gleixner
351191ad55 io-mapping: Cleanup atomic iomap
Switch the atomic iomap implementation over to kmap_local and stick the
preempt/pagefault mechanics into the generic code similar to the
kmap_atomic variants.

Rename the x86 map function in preparation for a non-atomic variant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20201103095858.625310005@linutronix.de
2020-11-06 23:14:58 +01:00
Thomas Gleixner
157e118b55 x86/mm/highmem: Use generic kmap atomic implementation
Convert X86 to the generic kmap atomic implementation and make the
iomap_atomic() naming convention consistent while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201103095857.375127260@linutronix.de
2020-11-06 23:14:55 +01:00
Zhen Lei
15af36596a x86/mce: Correct the detection of invalid notifier priorities
Commit

  c9c6d216ed ("x86/mce: Rename "first" function as "early"")

changed the enumeration of MCE notifier priorities. Correct the check
for notifier priorities to cover the new range.

 [ bp: Rewrite commit message, remove superfluous brackets in
   conditional. ]

Fixes: c9c6d216ed ("x86/mce: Rename "first" function as "early"")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com
2020-11-06 19:02:48 +01:00
Steven Rostedt (VMware)
773c167050 ftrace: Add recording of functions that caused recursion
This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.

Link: https://lkml.kernel.org/r/20201106023548.102375687@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-06 08:42:26 -05:00
Steven Rostedt (VMware)
c536aa1c5b kprobes/ftrace: Add recursion protection to the ftrace callback
If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless its ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115613.140212174@goodmis.org
Link: https://lkml.kernel.org/r/20201106023546.944907560@goodmis.org

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh  Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-06 08:35:44 -05:00
Kaixu Xia
77080929d5 x86/mce: Assign boolean values to a bool variable
Fix the following coccinelle warnings:

  ./arch/x86/kernel/cpu/mce/core.c:1765:3-20: WARNING: Assignment of 0/1 to bool variable
  ./arch/x86/kernel/cpu/mce/core.c:1584:2-9: WARNING: Assignment of 0/1 to bool variable

 [ bp: Massage commit message. ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1604654363-1463-1-git-send-email-kaixuxia@tencent.com
2020-11-06 11:51:04 +01:00
Chester Lin
25519d6834 ima: generalize x86/EFI arch glue for other EFI architectures
Move the x86 IMA arch code into security/integrity/ima/ima_efi.c,
so that we will be able to wire it up for arm64 in a future patch.

Co-developed-by: Chester Lin <clin@suse.com>
Signed-off-by: Chester Lin <clin@suse.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-06 07:40:42 +01:00
Anand K Mistry
1978b3a53a x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
On AMD CPUs which have the feature X86_FEATURE_AMD_STIBP_ALWAYS_ON,
STIBP is set to on and

  spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED

At the same time, IBPB can be set to conditional.

However, this leads to the case where it's impossible to turn on IBPB
for a process because in the PR_SPEC_DISABLE case in ib_prctl_set() the

  spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED

condition leads to a return before the task flag is set. Similarly,
ib_prctl_get() will return PR_SPEC_DISABLE even though IBPB is set to
conditional.

More generally, the following cases are possible:

1. STIBP = conditional && IBPB = on for spectre_v2_user=seccomp,ibpb
2. STIBP = on && IBPB = conditional for AMD CPUs with
   X86_FEATURE_AMD_STIBP_ALWAYS_ON

The first case functions correctly today, but only because
spectre_v2_user_ibpb isn't updated to reflect the IBPB mode.

At a high level, this change does one thing. If either STIBP or IBPB
is set to conditional, allow the prctl to change the task flag.
Also, reflect that capability when querying the state. This isn't
perfect since it doesn't take into account if only STIBP or IBPB is
unconditionally on. But it allows the conditional feature to work as
expected, without affecting the unconditional one.

 [ bp: Massage commit message and comment; space out statements for
   better readability. ]

Fixes: 21998a3515 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.")
Signed-off-by: Anand K Mistry <amistry@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201105163246.v2.1.Ifd7243cd3e2c2206a893ad0a5b9a4f19549e22c6@changeid
2020-11-05 21:43:34 +01:00
Linus Torvalds
6732b35485 hyperv-fixes for 5.10-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl+kK6cTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXsv3B/9qN84MVeriRKRUn1e+F15NHqGfJezZ
 oS/xjo2XoFaMrTpu8DgzN2C3yMZ0eutJYloUXWCJap1yI1ZaivupAPsOxCc42HwC
 /lRu6vI9jPL2kUzWzusR/yuijZsfj5GYkoNRW9HM3XruXG1Ta59q1JkLhIbUJKFk
 KKtKJoLn2+DQe8GWp3K8gJd5kryUSFWq1j6LO8w3kfSHxzj6AmDLWgHje8d1y0qA
 IKeNNTsnF3kht0/oBNdf7QRKA5X1yb6kpJ9m9+0p/RxMA9eSGmH6iOc5j1VyM4a9
 qf1S++4yENoGtsFzid/6XXSrBPGvI57qCB76uRvwyrDwzKkRmke/SLDj
 =mkuq
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - clarify a comment (Michael Kelley)

 - change a pr_warn() to pr_info() (Olaf Hering)

* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Clarify comment on x2apic mode
  hv_balloon: disable warning when floor reached
2020-11-05 11:32:03 -08:00
Chester Lin
e1ac4b2406 efi: generalize efi_get_secureboot
Generalize the efi_get_secureboot() function so not only efistub but also
other subsystems can use it.

Note that the MokSbState handling is not factored out: the variable is
boot time only, and so it cannot be parameterized as easily. Also, the
IMA code will switch to this version in a future patch, and it does not
incorporate the MokSbState exception in the first place.

Note that the new efi_get_secureboot_mode() helper treats any failures
to read SetupMode as setup mode being disabled.

Co-developed-by: Chester Lin <clin@suse.com>
Signed-off-by: Chester Lin <clin@suse.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-04 23:05:40 +01:00
Thomas Gleixner
b6be002bcd x86/entry: Move nmi entry/exit into common code
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
  between the IRQ and NMI exceptions, and add kernel documentation for
  struct irqentry_state_t ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
2020-11-04 22:55:36 +01:00
Thomas Gleixner
01be83eea0 Merge branch 'core/urgent' into core/entry
Pick up the entry fix before further modifications.
2020-11-04 18:14:52 +01:00
Fangrui Song
4d6ffa27b8 x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
Commit

  393f203f5f ("x86_64: kasan: add interceptors for memset/memmove/memcpy functions")

added .weak directives to arch/x86/lib/mem*_64.S instead of changing the
existing ENTRY macros to WEAK. This can lead to the assembly snippet

  .weak memcpy
  ...
  .globl memcpy

which will produce a STB_WEAK memcpy with GNU as but STB_GLOBAL memcpy
with LLVM's integrated assembler before LLVM 12. LLVM 12 (since
https://reviews.llvm.org/D90108) will error on such an overridden symbol
binding.

Commit

  ef1e03152c ("x86/asm: Make some functions local")

changed ENTRY in arch/x86/lib/memcpy_64.S to SYM_FUNC_START_LOCAL, which
was ineffective due to the preceding .weak directive.

Use the appropriate SYM_FUNC_START_WEAK instead.

Fixes: 393f203f5f ("x86_64: kasan: add interceptors for memset/memmove/memcpy functions")
Fixes: ef1e03152c ("x86/asm: Make some functions local")
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201103012358.168682-1-maskray@google.com
2020-11-04 12:30:20 +01:00
David Woodhouse
f36a74b934 x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
In commit b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to
find remapping irqdomain") the I/O-APIC code was changed to find its
parent irqdomain using irq_find_matching_fwspec(), but the key used
for the lookup was wrong. It shouldn't use 'ioapic' which is the index
into its own ioapics[] array. It should use the actual arbitration
ID of the I/O-APIC in question, which is mpc_ioapic_id(ioapic).

Fixes: b643128b91 ("x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain")
Reported-by: lkp <oliver.sang@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/57adf2c305cd0c5e9d860b2f3007a7e676fd0f9f.camel@infradead.org
2020-11-04 11:11:35 +01:00
Dexuan Cui
d981059e13 x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel run on a CPU
with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
> 255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.

[ tglx: Updated comment as suggested by David ]

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20201103011136.59108-1-decui@microsoft.com
2020-11-04 11:10:52 +01:00
Linus Torvalds
43c834186c A couple of changes to the SEV-ES code to perform more stringent
hypervisor checks before enabling encryption. (Joerg Roedel)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+hKN8ACgkQEsHwGGHe
 VUrkZQ/+LWjbDrbkLCQpWuzLagAocZMKKvr4+2ujU+krj0QU5FFJbfuzhkktQD+H
 cbfOW7+E8lqTDoj/dwoJPj2Xs8HvW4Ua6sbxF5lCPhlEr3NIetRfQ7SPj3qFvQG+
 FKP/55RSnjKIx7aZXKN9YAw2FF3EC1BisjszCBKid5S8HbGqjLMb2Ue0i/nssksY
 CvLwaxtDOGuSzJ8FwL+vmI70NkeLZ0ulTxbuxXAqfMTvJX3e1QA9dgeZMgfU1hng
 eA1Pjlm0X7FOsnwihYP2EZ6NzRrTkYeGl1Iagz1apqlDlQ+bcaxvs2btIyb7MKt5
 6PPDGg0P0WVMNfOEUYTZob31QcLnakA/p8kG8sYE6h2PlqO9Tf5cpmOJ6pv+DYFz
 hfcjAZfamStUbWdWpr33RVCXN5pwZRu+UytD3JYykzgwmKXQxLHqrbjHXLO3zJ7k
 +L0JE+N2vmi/7M5Ghsv3yKwy5fR5rMT5V6qEHSd1qrr9VpKBceNMJgPA8wh4882F
 SD5sD2b6L/Cf9L4FAFqICHb/p4rxPRf5VnUoybo70U7EiwfbZQik5g3X5cO4KO2N
 0z8nMk7dIZncQF0LYJNElIvKonrU8sIa+TbHjYyBWdQlOPgK4IlCvZeyjVUvUG24
 kYx2WbANhCxGFd86rsl5P7xNXvBiSALf1afbQPvU0VTbZ43vSnQ=
 =Pvgr
 -----END PGP SIGNATURE-----

Merge tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV-ES fixes from Borislav Petkov:
 "A couple of changes to the SEV-ES code to perform more stringent
  hypervisor checks before enabling encryption (Joerg Roedel)"

* tag 'x86_seves_for_v5.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev-es: Do not support MMIO to/from encrypted memory
  x86/head/64: Check SEV encryption before switching to kernel page-table
  x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path
  x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handler
  x86/boot/compressed/64: Introduce sev_status
2020-11-03 09:55:09 -08:00
Mauro Carvalho Chehab
4a2d2ed9ba x86/mtrr: Fix a kernel-doc markup
Kernel-doc markup should use this format:
	identifier - description

Fix it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/2217cd4ae9e561da2825485eb97de77c65741489.1603469755.git.mchehab+huawei@kernel.org
2020-11-02 19:58:53 +01:00
Tony Luck
68299a42f8 x86/mce: Enable additional error logging on certain Intel CPUs
The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
optional additional error logging mode which is enabled by an MSR.

Previously, this mode was enabled from the mcelog(8) tool via /dev/cpu,
but userspace should not be poking at MSRs. So move the enabling into
the kernel.

 [ bp: Correct the explanation why this is done. ]

Suggested-by: Boris Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201030190807.GA13884@agluck-desk2.amr.corp.intel.com
2020-11-02 11:15:59 +01:00
Linus Torvalds
7b56fbd83e Three fixes all related to #DB:
- Handle the BTF bit correctly so it doesn't get lost due to a kernel #DB
 
  - Only clear and set the virtual DR6 value used by ptrace on user space
    triggered #DB. A kernel #DB must leave it alone to ensure data
    consistency for ptrace.
 
  - Make the bitmasking of the virtual DR6 storage correct so it does not
    lose DR_STEP.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+evlcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofaAD/4gzRDGRZiorwX60o+RuJjgBN/iS1BN
 SEpprlC2TGtdiOvGNKxMxwMZFbpSExgGtwku6xya1VDDLJS+NwartL0AEnCsei2e
 O4KqUiU6HuBsKi+M0LARVm8bHRzb1s3YjrFFMwrarTm40joOvxNiw2w+nvTtamfJ
 tlkQMp5/iX0KNO5FBLMPDDwVyE3xY+yAyiI0Z7bPmrhNSlNWW6x8QKoBpp8T5v2a
 ATIhAnOAClZVH/ig9A/fbUgsBqxyyIZRSW2wudAtHZg8NhDox+TitUIWG+IEoRDL
 uW2hKSqxrBsapTWxv2+tN5Kk0ORUSqEMqWaog80pjP3o0ezUKZxQsYuAZ430bzDr
 qqVWGpX42UjYet6V4S9P+I/gN3lCZmYoc24zfWLT8T8KRaDOOebYmC7EiJjudaXd
 sEYCKRv10ysBlFqXyzz2LDOyjFOiXodyFbhdBQVCPUuis99mHdYIJHoUxpYhgZSH
 IIbYn9RUsij6mMaCfPhNVoJwRVtY+AGGisfEnk+v5dfFYQruliFWgyjDYWT5+IxQ
 ZsJOJVCqR8JT9bL0xiTNbcZ3SvRfMQ2pLfd6MWX9fQZ9o6LvVSg32gPjgneKpQ/y
 QLRtiyVW3oBw2rJZeoYVIHrD8r3SUOI8iQLcCjXwxQoBiei6G+4GNdFtkUmqAFB3
 VvtPZYyjj05gCw==
 =Hfek
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "Three fixes all related to #DB:

   - Handle the BTF bit correctly so it doesn't get lost due to a kernel
     #DB

   - Only clear and set the virtual DR6 value used by ptrace on user
     space triggered #DB. A kernel #DB must leave it alone to ensure
     data consistency for ptrace.

   - Make the bitmasking of the virtual DR6 storage correct so it does
     not lose DR_STEP"

* tag 'x86-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/debug: Fix DR_STEP vs ptrace_get_debugreg(6)
  x86/debug: Only clear/set ->virtual_dr6 for userspace #DB
  x86/debug: Fix BTF handling
2020-11-01 11:21:26 -08:00
Linus Torvalds
2d38c80d5b ARM:
* selftest fix
 * Force PTE mapping on device pages provided via VFIO
 * Fix detection of cacheable mapping at S2
 * Fallback to PMD/PTE mappings for composite huge pages
 * Fix accounting of Stage-2 PGD allocation
 * Fix AArch32 handling of some of the debug registers
 * Simplify host HYP entry
 * Fix stray pointer conversion on nVHE TLB invalidation
 * Fix initialization of the nVHE code
 * Simplify handling of capabilities exposed to HYP
 * Nuke VCPUs caught using a forbidden AArch32 EL0
 
 x86:
 * new nested virtualization selftest
 * Miscellaneous fixes
 * make W=1 fixes
 * Reserve new CPUID bit in the KVM leaves
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+dhRAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPWCgf/U997UW/11IdNtkehQO/DFdx7lHev
 +IahN1Pnbt92ZoR5nGhK9pgvDahIVhqTmUvgV+3fD24OnqXTpYTu1fliBvL6ynbN
 J9Ycf0zFAgwfgTTD5UexTlEovnhX4xz7NDmd6rpxGDZdMaBHQFPkCXBFK45pf4nd
 O349aHV0X1AA7Tt/sLhpXpi74Vake1xErLHKhIVLHKyo/zDm+Q0UZry068NNBzTr
 St3+QSGlFXhuekVrZLh+DShh6rZGLyY9tcySt6o0Jk7fSs1lmEnPbBgeeqYmyHMd
 Yn+ybhthmNkkpI8so70TA9roiVar4UmjnMBOiav62bo7ue26pKE5cWQyXw==
 =mvBr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:
   - selftest fix
   - force PTE mapping on device pages provided via VFIO
   - fix detection of cacheable mapping at S2
   - fallback to PMD/PTE mappings for composite huge pages
   - fix accounting of Stage-2 PGD allocation
   - fix AArch32 handling of some of the debug registers
   - simplify host HYP entry
   - fix stray pointer conversion on nVHE TLB invalidation
   - fix initialization of the nVHE code
   - simplify handling of capabilities exposed to HYP
   - nuke VCPUs caught using a forbidden AArch32 EL0

  x86:
   - new nested virtualization selftest
   - miscellaneous fixes
   - make W=1 fixes
   - reserve new CPUID bit in the KVM leaves"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: vmx: remove unused variable
  KVM: selftests: Don't require THP to run tests
  KVM: VMX: eVMCS: make evmcs_sanitize_exec_ctrls() work again
  KVM: selftests: test behavior of unmapped L2 APIC-access address
  KVM: x86: Fix NULL dereference at kvm_msr_ignored_check()
  KVM: x86: replace static const variables with macros
  KVM: arm64: Handle Asymmetric AArch32 systems
  arm64: cpufeature: upgrade hyp caps to final
  arm64: cpufeature: reorder cpus_have_{const, final}_cap()
  KVM: arm64: Factor out is_{vhe,nvhe}_hyp_code()
  KVM: arm64: Force PTE mapping on fault resulting in a device mapping
  KVM: arm64: Use fallback mapping sizes for contiguous huge page sizes
  KVM: arm64: Fix masks in stage2_pte_cacheable()
  KVM: arm64: Fix AArch32 handling of DBGD{CCINT,SCRext} and DBGVCR
  KVM: arm64: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT
  KVM: arm64: Drop useless PAN setting on host EL1 to EL2 transition
  KVM: arm64: Remove leftover kern_hyp_va() in nVHE TLB invalidation
  KVM: arm64: Don't corrupt tpidr_el2 on failed HVC call
  x86/kvm: Reserve KVM_FEATURE_MSI_EXT_DEST_ID
2020-11-01 09:43:32 -08:00
Paolo Bonzini
9478dec3b5 KVM: vmx: remove unused variable
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-31 11:38:43 -04:00
Vitaly Kuznetsov
064eedf2c5 KVM: VMX: eVMCS: make evmcs_sanitize_exec_ctrls() work again
It was noticed that evmcs_sanitize_exec_ctrls() is not being executed
nowadays despite the code checking 'enable_evmcs' static key looking
correct. Turns out, static key magic doesn't work in '__init' section
(and it is unclear when things changed) but setup_vmcs_config() is called
only once per CPU so we don't really need it to. Switch to checking
'enlightened_vmcs' instead, it is supposed to be in sync with
'enable_evmcs'.

Opportunistically make evmcs_sanitize_exec_ctrls '__init' and drop unneeded
extra newline from it.

Reported-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201014143346.2430936-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-31 10:27:58 -04:00
Arnd Bergmann
0774a6ed29 timekeeping: default GENERIC_CLOCKEVENTS to enabled
Almost all machines use GENERIC_CLOCKEVENTS, so it feels wrong to
require each one to select that symbol manually.

Instead, enable it whenever CONFIG_LEGACY_TIMER_TICK is disabled as
a simplification. It should be possible to select both
GENERIC_CLOCKEVENTS and LEGACY_TIMER_TICK from an architecture now
and decide at runtime between the two.

For the clockevents arch-support.txt file, this means that additional
architectures are marked as TODO when they have at least one machine
that still uses LEGACY_TIMER_TICK, rather than being marked 'ok' when
at least one machine has been converted. This means that both m68k and
arm (for riscpc) revert to TODO.

At this point, we could just always enable CONFIG_GENERIC_CLOCKEVENTS
rather than leaving it off when not needed. I built an m68k
defconfig kernel (using gcc-10.1.0) and found that this would add
around 5.5KB in kernel image size:

   text	   data	    bss	    dec	    hex	filename
3861936	1092236	 196656	5150828	 4e986c	obj-m68k/vmlinux-no-clockevent
3866201	1093832	 196184	5156217	 4ead79	obj-m68k/vmlinux-clockevent

On Arm (MACH_RPC), that difference appears to be twice as large,
around 11KB on top of an 6MB vmlinux.

Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-10-30 21:57:07 +01:00
Takashi Iwai
d383b3146d KVM: x86: Fix NULL dereference at kvm_msr_ignored_check()
The newly introduced kvm_msr_ignored_check() tries to print error or
debug messages via vcpu_*() macros, but those may cause Oops when NULL
vcpu is passed for KVM_GET_MSRS ioctl.

Fix it by replacing the print calls with kvm_*() macros.

(Note that this will leave vcpu argument completely unused in the
 function, but I didn't touch it to make the fix as small as
 possible.  A clean up may be applied later.)

Fixes: 12bc2132b1 ("KVM: X86: Do the same ignore_msrs check for feature msrs")
BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178280
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Message-Id: <20201030151414.20165-1-tiwai@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-30 13:40:31 -04:00
Paolo Bonzini
8a967d655e KVM: x86: replace static const variables with macros
Even though the compiler is able to replace static const variables with
their value, it will warn about them being unused when Linux is built with W=1.
Use good old macros instead, this is not C++.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-30 13:39:55 -04:00
Arvind Sankar
458c0480dc crypto: hash - Use memzero_explicit() for clearing state
Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-10-30 17:35:03 +11:00
Eric Biggers
d4b3984c9e crypto: x86/aes - remove unused file aes_glue.c
Commit 1d2c327931 ("crypto: x86/aes - drop scalar assembler
implementations") was meant to remove aes_glue.c, but it actually left
it as an unused one-line file.  Remove this unused file.

Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-10-30 17:35:01 +11:00
Arvind Sankar
ea3186b957 x86/build: Fix vmlinux size check on 64-bit
Commit

  b4e0409a36 ("x86: check vmlinux limits, 64-bit")

added a check that the size of the 64-bit kernel is less than
KERNEL_IMAGE_SIZE.

The check uses (_end - _text), but this is not enough. The initial
PMD used in startup_64() (level2_kernel_pgt) can only map upto
KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text, and the
modules area (MODULES_VADDR) starts at KERNEL_IMAGE_SIZE.

The correct check is what is currently done for 32-bit, since
LOAD_OFFSET is defined appropriately for the two architectures. Just
check (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.

Note that on 32-bit, the limit is not strict: KERNEL_IMAGE_SIZE is not
really used by the main kernel. The higher the kernel is located, the
less the space available for the vmalloc area. However, it is used by
KASLR in the compressed stub to limit the maximum address of the kernel
to a safe value.

Clean up various comments to clarify that despite the name,
KERNEL_IMAGE_SIZE is not a limit on the size of the kernel image, but a
limit on the maximum virtual address that the image can occupy.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201029161903.2553528-1-nivedita@alum.mit.edu
2020-10-29 21:54:35 +01:00
Joerg Roedel
2411cd8211 x86/sev-es: Do not support MMIO to/from encrypted memory
MMIO memory is usually not mapped encrypted, so there is no reason to
support emulated MMIO when it is mapped encrypted.

Prevent a possible hypervisor attack where a RAM page is mapped as
an MMIO page in the nested page-table, so that any guest access to it
will trigger a #VC exception and leak the data on that page to the
hypervisor via the GHCB (like with valid MMIO). On the read side this
attack would allow the HV to inject data into the guest.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-6-joro@8bytes.org
2020-10-29 19:27:42 +01:00
Joerg Roedel
c9f09539e1 x86/head/64: Check SEV encryption before switching to kernel page-table
When SEV is enabled, the kernel requests the C-bit position again from
the hypervisor to build its own page-table. Since the hypervisor is an
untrusted source, the C-bit position needs to be verified before the
kernel page-table is used.

Call sev_verify_cbit() before writing the CR3.

 [ bp: Massage. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org
2020-10-29 18:09:59 +01:00
Joerg Roedel
86ce43f7dd x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path
Check whether the hypervisor reported the correct C-bit when running as
an SEV guest. Using a wrong C-bit position could be used to leak
sensitive data from the guest to the hypervisor.

The check function is in a separate file:

  arch/x86/kernel/sev_verify_cbit.S

so that it can be re-used in the running kernel image.

 [ bp: Massage. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-4-joro@8bytes.org
2020-10-29 18:06:52 +01:00
Joerg Roedel
ed7b895f3e x86/boot/compressed/64: Sanity-check CPUID results in the early #VC handler
The early #VC handler which doesn't have a GHCB can only handle CPUID
exit codes. It is needed by the early boot code to handle #VC exceptions
raised in verify_cpu() and to get the position of the C-bit.

But the CPUID information comes from the hypervisor which is untrusted
and might return results which trick the guest into the no-SEV boot path
with no C-bit set in the page-tables. All data written to memory would
then be unencrypted and could leak sensitive data to the hypervisor.

Add sanity checks to the early #VC handler to make sure the hypervisor
can not pretend that SEV is disabled.

 [ bp: Massage a bit. ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-3-joro@8bytes.org
2020-10-29 13:48:49 +01:00
Jens Axboe
c8d5ed6793 x86: Wire up TIF_NOTIFY_SIGNAL
The generic entry code has support for TIF_NOTIFY_SIGNAL already. Just
provide the TIF bit.

[ tglx: Adopted to other TIF changes in x86 ]

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201026203230.386348-4-axboe@kernel.dk
2020-10-29 11:29:51 +01:00
Kan Liang
306e3e91ed perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
The event CYCLE_ACTIVITY.STALLS_MEM_ANY (0x14a3) should be available on
all 8 GP counters on ICL, but it's only scheduled on the first four
counters due to the current ICL constraint table.

Add a line for the CYCLE_ACTIVITY.STALLS_MEM_ANY event in the ICL
constraint table.
Correct the comments for the CYCLE_ACTIVITY.CYCLES_MEM_ANY event.

Fixes: 6017608936 ("perf/x86/intel: Add Icelake support")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201019164529.32154-1-kan.liang@linux.intel.com
2020-10-29 11:00:41 +01:00
Kan Liang
43bc103a80 perf/x86/intel/uncore: Add Rocket Lake support
For Rocket Lake, the MSR uncore, e.g., CBOX, ARB and CLOCKBOX, are the
same as Tiger Lake. Share the perf code with it.

For Rocket Lake and Tiger Lake, the 8th CBOX is not mapped into a
different MSR space anymore. Add rkl_uncore_msr_init_box() to replace
skl_uncore_msr_init_box().

The IMC uncore is the similar to Ice Lake. Add new PCIIDs of IMC for
Rocket Lake.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201019153528.13850-4-kan.liang@linux.intel.com
2020-10-29 11:00:40 +01:00
Kan Liang
907a196fbc perf/x86/msr: Add Rocket Lake CPU support
Like Ice Lake and Tiger Lake, PPERF and SMI_COUNT MSRs are also
supported by Rocket Lake.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201019153528.13850-3-kan.liang@linux.intel.com
2020-10-29 11:00:40 +01:00
Kan Liang
cbea56395c perf/x86/cstate: Add Rocket Lake CPU support
From the perspective of Intel cstate residency counters, Rocket Lake is
the same as Ice Lake and Tiger Lake. Share the code with them. Update
the comments for Rocket Lake.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201019153528.13850-2-kan.liang@linux.intel.com
2020-10-29 11:00:40 +01:00
Kan Liang
b14d0db5b8 perf/x86/intel: Add Rocket Lake CPU support
From the perspective of Intel PMU, Rocket Lake is the same as Ice Lake
and Tiger Lake. Share the perf code with them.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201019153528.13850-1-kan.liang@linux.intel.com
2020-10-29 11:00:39 +01:00
Stephane Eranian
995f088efe perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
When studying code layout, it is useful to capture the page size of the
sampled code address.

Add a new sample type for code page size.
The new sample type requires collecting the ip. The code page size can
be calculated from the NMI-safe perf_get_page_size().

For large PEBS, it's very unlikely that the mapping is gone for the
earlier PEBS records. Enable the feature for the large PEBS. The worst
case is that page-size '0' is returned.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
2020-10-29 11:00:39 +01:00
Kan Liang
76a5433f95 perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE
The new sample type, PERF_SAMPLE_DATA_PAGE_SIZE, requires the virtual
address. Update the data->addr if the sample type is set.

The large PEBS is disabled with the sample type, because perf doesn't
support munmap tracking yet. The PEBS buffer for large PEBS cannot be
flushed for each munmap. Wrong page size may be calculated. The large
PEBS can be enabled later separately when munmap tracking is supported.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001135749.2804-3-kan.liang@linux.intel.com
2020-10-29 11:00:38 +01:00
Joerg Roedel
3ad84246a4 x86/boot/compressed/64: Introduce sev_status
Introduce sev_status and initialize it together with sme_me_mask to have
an indicator which SEV features are enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20201028164659.27002-2-joro@8bytes.org
2020-10-29 10:54:36 +01:00
Jens Axboe
12db8b6900 entry: Add support for TIF_NOTIFY_SIGNAL
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.

In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart().  Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.

For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
2020-10-29 09:37:36 +01:00
David Woodhouse
2e008ffe42 x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
This allows the host to indicate that MSI emulation supports 15-bit
destination IDs, allowing up to 32768 CPUs without interrupt remapping.

cf. https://patchwork.kernel.org/patch/11816693/ for qemu

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20201024213535.443185-36-dwmw2@infradead.org
2020-10-28 20:26:33 +01:00
David Woodhouse
ab0f59c6f1 x86/apic: Support 15 bits of APIC ID in MSI where available
Some hypervisors can allow the guest to use the Extended Destination ID
field in the MSI address to address up to 32768 CPUs.

This applies to all downstream devices which generate MSI cycles,
including HPET, I/O-APIC and PCI MSI.

HPET and PCI MSI use the same __irq_msi_compose_msg() function, while
I/O-APIC generates its own and had support for the extended bits added in
a previous commit.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-33-dwmw2@infradead.org
2020-10-28 20:26:29 +01:00
David Woodhouse
51130d2188 x86/ioapic: Handle Extended Destination ID field in RTE
Bits 63-48 of the I/OAPIC Redirection Table Entry map directly to bits 19-4
of the address used in the resulting MSI cycle.

Historically, the x86 MSI format only used the top 8 of those 16 bits as
the destination APIC ID, and the "Extended Destination ID" in the lower 8
bits was unused.

With interrupt remapping, the lowest bit of the Extended Destination ID
(bit 48 of RTE, bit 4 of MSI address) is now used to indicate a remappable
format MSI.

A hypervisor can use the other 7 bits of the Extended Destination ID to
permit guests to address up to 15 bits of APIC IDs, thus allowing 32768
vCPUs before having to expose a vIOMMU and interrupt remapping to the
guest.

No behavioural change in this patch, since nothing yet permits APIC IDs
above 255 to be used with the non-IR I/OAPIC domain.

[ tglx: Converted it to the cleaned up entry/msi_msg format and added
  	commentry ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-32-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
ed381fca47 x86: Kill all traces of irq_remapping_get_irq_domain()
All users are converted to use the fwspec based parent domain lookup.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-30-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
b643128b91 x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
All possible parent domains have a select method now. Make use of it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-29-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
c2a5881c28 x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
All possible parent domains have a select method now. Make use of it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-28-dwmw2@infradead.org
2020-10-28 20:26:28 +01:00
David Woodhouse
6452ea2a32 x86/apic: Add select() method on vector irqdomain
This will be used to select the irqdomain for I/O-APIC and HPET.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-24-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
David Woodhouse
5d5a971338 x86/ioapic: Generate RTE directly from parent irqchip's MSI message
The I/O-APIC generates an MSI cycle with address/data bits taken from its
Redirection Table Entry in some combination which used to make sense, but
now is just a bunch of bits which get passed through in some seemingly
arbitrary order.

Instead of making IRQ remapping drivers directly frob the I/OA-PIC RTE, let
them just do their job and generate an MSI message. The bit swizzling to
turn that MSI message into the I/O-APIC's RTE is the same in all cases,
since it's a function of the I/O-APIC hardware. The IRQ remappers have no
real need to get involved with that.

The only slight caveat is that the I/OAPIC is interpreting some of those
fields too, and it does want the 'vector' field to be unique to make EOI
work. The AMD IOMMU happens to put its IRTE index in the bits that the
I/O-APIC thinks are the vector field, and accommodates this requirement by
reserving the first 32 indices for the I/O-APIC.  The Intel IOMMU doesn't
actually use the bits that the I/O-APIC thinks are the vector field, so it
fills in the 'pin' value there instead.

[ tglx: Replaced the unreadably macro maze with the cleaned up RTE/msi_msg
  	bitfields and added commentry to explain the mapping magic ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-22-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
Thomas Gleixner
341b4a7211 x86/ioapic: Cleanup IO/APIC route entry structs
Having two seperate structs for the I/O-APIC RTE entries (non-remapped and
DMAR remapped) requires type casts and makes it hard to map.

Combine them in IO_APIC_routing_entry by defining a union of two 64bit
bitfields. Use naming which reflects which bits are shared and which bits
are actually different for the operating modes.

[dwmw2: Fix it up and finish the job, pulling the 32-bit w1,w2 words for
        register access into the same union and eliminating a few more
        places where bits were accessed through masks and shifts.]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-21-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
Thomas Gleixner
a27dca645d x86/io_apic: Cleanup trigger/polarity helpers
'trigger' and 'polarity' are used throughout the I/O-APIC code for handling
the trigger type (edge/level) and the active low/high configuration. While
there are defines for initializing these variables and struct members, they
are not used consequently and the meaning of 'trigger' and 'polarity' is
opaque and confusing at best.

Rename them to 'is_level' and 'active_low' and make them boolean in various
structs so it's entirely clear what the meaning is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-20-dwmw2@infradead.org
2020-10-28 20:26:26 +01:00
Thomas Gleixner
0c1883c1eb x86/msi: Remove msidef.h
Nothing uses the macro maze anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-19-dwmw2@infradead.org
2020-10-28 20:26:26 +01:00
Thomas Gleixner
41bb2115be x86/pci/xen: Use msi_msg shadow structs
Use the msi_msg shadow structs and compose the message with named bitfields
instead of the unreadable macro maze.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-18-dwmw2@infradead.org
2020-10-28 20:26:26 +01:00