2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-17 01:34:00 +08:00
Commit Graph

982089 Commits

Author SHA1 Message Date
Mel Gorman
e6e0dc2d54 sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fc ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210125085909.4600-2-mgorman@techsingularity.net
2021-01-27 17:26:43 +01:00
Peter Oskolkov
1875dc5b8f sched: Correctly sort struct predeclarations
The comment says:
  /* task_struct member predeclarations (sorted alphabetically): */

So move io_uring_task where it belongs (alphabetically).

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210126193449.487547-1-posk@google.com
2021-01-27 17:26:43 +01:00
Yue Hu
432900f816 init/Kconfig: Correct thermal pressure help text
We're using arch_scale_thermal_pressure() to retrieve per CPU thermal
pressure.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210127054451.1240-1-zbestahu@gmail.com
2021-01-27 17:26:43 +01:00
Valentin Schneider
620a6dc407 sched/topology: Make sched_init_numa() use a set for the deduplicating sort
The deduplicating sort in sched_init_numa() assumes that the first line in
the distance table contains all unique values in the entire table. I've
been trying to pen what this exactly means for the topology, but it's not
straightforward. For instance, topology.c uses this example:

  node   0   1   2   3
    0:  10  20  20  30
    1:  20  10  20  20
    2:  20  20  10  20
    3:  30  20  20  10

  0 ----- 1
  |     / |
  |   /   |
  | /     |
  2 ----- 3

Which works out just fine. However, if we swap nodes 0 and 1:

  1 ----- 0
  |     / |
  |   /   |
  | /     |
  2 ----- 3

we get this distance table:

  node   0  1  2  3
    0:  10 20 20 20
    1:  20 10 20 30
    2:  20 20 10 20
    3:  20 30 20 10

Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).

The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
this, implement the set as a 256-bits bitmap. Should this not be
satisfactory (i.e. we want to support 32-bit values), then we'll have to go
for some other sparse set implementation.

This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).

Note: DT binding equivalent (distance-map) decodes distances as 32-bit
values.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
2021-01-27 17:26:42 +01:00
Qais Yousef
0ae78eec8a sched/eas: Don't update misfit status if the task is pinned
If the task is pinned to a cpu, setting the misfit status means that
we'll unnecessarily continuously attempt to migrate the task but fail.

This continuous failure will cause the balance_interval to increase to
a high value, and eventually cause unnecessary significant delays in
balancing the system when real imbalance happens.

Caught while testing uclamp where rt-app calibration loop was pinned to
cpu 0, shortly after which we spawn another task with high util_clamp
value. The task was failing to migrate after over 40ms of runtime due to
balance_interval unnecessary expanded to a very high value from the
calibration loop.

Not done here, but it could be useful to extend the check for pinning to
verify that the affinity of the task has a cpu that fits. We could end
up in a similar situation otherwise.

Fixes: 3b1baa6496 ("sched/fair: Add 'group_misfit_task' load-balance type")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com
2021-01-27 17:26:42 +01:00
Hui Su
65bcf072e2 sched: Use task_current() instead of 'rq->curr == p'
Use the task_current() function where appropriate.

No functional change.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2021-01-14 11:20:11 +01:00
Vincent Guittot
e9b9734b74 sched/fair: Reduce cases for active balance
Active balance is triggered for a number of voluntary cases like misfit
or pinned tasks cases but also after that a number of load balance
attempts failed to migrate a task. There is no need to use active load
balance when the group is overloaded because an overloaded state means
that there is at least one waiting task. Nevertheless, the waiting task
is not selected and detached until the threshold becomes higher than its
load. This threshold increases with the number of failed lb (see the
condition if ((load >> env->sd->nr_balance_failed) > env->imbalance) in
detach_tasks()) and the waiting task will end up to be selected after a
number of attempts.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210107103325.30851-4-vincent.guittot@linaro.org
2021-01-14 11:20:11 +01:00
Vincent Guittot
8a41dfcda7 sched/fair: Don't set LBF_ALL_PINNED unnecessarily
Setting LBF_ALL_PINNED during active load balance is only valid when there
is only 1 running task on the rq otherwise this ends up increasing the
balance interval whereas other tasks could migrate after the next interval
once they become cache-cold as an example.

LBF_ALL_PINNED flag is now always set it by default. It is then cleared
when we find one task that can be pulled when calling detach_tasks() or
during active migration.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210107103325.30851-3-vincent.guittot@linaro.org
2021-01-14 11:20:11 +01:00
Vincent Guittot
fc488ffd42 sched/fair: Skip idle cfs_rq
Don't waste time checking whether an idle cfs_rq could be the busiest
queue. Furthermore, this can end up selecting a cfs_rq with a high load
but being idle in case of migrate_load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210107103325.30851-2-vincent.guittot@linaro.org
2021-01-14 11:20:10 +01:00
Xuewen Yan
8c1f560c1e sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue
CPU (root cfs_rq) estimated utilization (util_est) is currently used in
dequeue_task_fair() to drive frequency selection before it is updated.

with:

CPU_util        : rq->cfs.avg.util_avg
CPU_util_est    : rq->cfs.avg.util_est
CPU_utilization : max(CPU_util, CPU_util_est)
task_util       : p->se.avg.util_avg
task_util_est   : p->se.avg.util_est

dequeue_task_fair():

    /* (1) CPU_util and task_util update + inform schedutil about
           CPU_utilization changes */
    for_each_sched_entity() /* 2 loops */
        (dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
         -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
         -> sugov_get_util() -> cpu_util_cfs()

    /* (2) CPU_util_est and task_util_est update */
    util_est_dequeue()

cpu_util_cfs() uses CPU_utilization which could lead to a false (too
high) utilization value for schedutil in task ramp-down or ramp-up
scenarios during task dequeue.

To mitigate the issue split the util_est update (2) into:

 (A) CPU_util_est update in util_est_dequeue()
 (B) task_util_est update in util_est_update()

Place (A) before (1) and keep (B) where (2) is. The latter is necessary
since (B) relies on task_util update in (1).

Fixes: 7f65ea42eb ("sched/fair: Add util_est on top of PELT")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1608283672-18240-1-git-send-email-xuewen.yan94@gmail.com
2021-01-14 11:20:10 +01:00
Peter Zijlstra
0301925dd0 sched: Add schedutil overview
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Link: https://lkml.kernel.org/r/20201218103258.GA3040@hirez.programming.kicks-ass.net
2021-01-14 11:20:10 +01:00
Anna-Maria Behnsen
e0b257c3b7 sched: Prevent raising SCHED_SOFTIRQ when CPU is !active
SCHED_SOFTIRQ is raised to trigger periodic load balancing. When CPU is not
active, CPU should not participate in load balancing.

The scheduler uses nohz.idle_cpus_mask to keep track of the CPUs which can
do idle load balancing. When bringing a CPU up the CPU is added to the mask
when it reaches the active state, but on teardown the CPU stays in the mask
until it goes offline and invokes sched_cpu_dying().

When SCHED_SOFTIRQ is raised on a !active CPU, there might be a pending
softirq when stopping the tick which triggers a warning in NOHZ code. The
SCHED_SOFTIRQ can also be raised by the scheduler tick which has the same
issue.

Therefore remove the CPU from nohz.idle_cpus_mask when it is marked
inactive and also prevent the scheduler_tick() from raising SCHED_SOFTIRQ
after this point.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201215104400.9435-1-anna-maria@linutronix.de
2021-01-14 11:20:09 +01:00
Viresh Kumar
d1515851ca thermal: cpufreq_cooling: Reuse sched_cpu_util() for SMP platforms
Several parts of the kernel are already using the effective CPU
utilization (as seen by the scheduler) to get the current load on the
CPU, do the same here instead of depending on the idle time of the CPU,
which isn't that accurate comparatively.

This is also the right thing to do as it makes the cpufreq governor
(schedutil) align better with the cpufreq_cooling driver, as the power
requested by cpufreq_cooling governor will exactly match the next
frequency requested by the schedutil governor since they are both using
the same metric to calculate load.

This was tested on ARM Hikey6220 platform with hackbench, sysbench and
schbench. None of them showed any regression or significant
improvements. Schbench is the most important ones out of these as it
creates the scenario where the utilization numbers provide a better
estimate of the future.

Scenario 1: The CPUs were mostly idle in the previous polling window of
the IPA governor as the tasks were sleeping and here are the details
from traces (load is in %):

 Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=203 load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339
 New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=600 load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} dynamic_power=3960

Here, the "Old" line gives the load and requested_power (dynamic_power
here) numbers calculated using the idle time based implementation, while
"New" is based on the CPU utilization from scheduler.

As can be clearly seen, the load and requested_power numbers are simply
incorrect in the idle time based approach and the numbers collected from
CPU's utilization are much closer to the reality.

Scenario 2: The CPUs were busy in the previous polling window of the IPA
governor:

 Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} dynamic_power=5280
 New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} dynamic_power=4672

As can be seen, the idle time based load is 100% for all the CPUs as it
took only the last window into account, but in reality the CPUs aren't
that loaded as shown by the utilization numbers.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/9c255c83d78d58451abc06848001faef94c87a12.1607400596.git.viresh.kumar@linaro.org
2021-01-14 11:20:09 +01:00
Viresh Kumar
a5418be9df sched/core: Rename schedutil_cpu_util() and allow rest of the kernel to use it
There is nothing schedutil specific in schedutil_cpu_util(), rename it
to effective_cpu_util(). Also create and expose another wrapper
sched_cpu_util() which can be used by other parts of the kernel, like
thermal core (that will be done in a later commit).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/db011961fb3bb8bef1c0eda5cd64564637d3ef31.1607400596.git.viresh.kumar@linaro.org
2021-01-14 11:20:09 +01:00
Viresh Kumar
7d6a905f3d sched/core: Move schedutil_cpu_util() to core.c
There is nothing schedutil specific in schedutil_cpu_util(), move it to
core.c and define it only for CONFIG_SMP.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/c921a362c78e1324f8ebc5aaa12f53e309c5a8a2.1607400596.git.viresh.kumar@linaro.org
2021-01-14 11:20:08 +01:00
Linus Torvalds
e71ba9452f Linux 5.11-rc2 2021-01-03 15:55:30 -08:00
Linus Torvalds
3516bd7293 - Update defconfigs and sort config select list.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl/wiW8ACgkQjYWKoQLX
 FBijVQf/TV1fXL3ygvEkuVDnSmfpHfNVCNUEityhumMMoGwrmVv7mj0njMflWN8z
 VcRV1B1drh4R7jmITnlb8Z+Ii1kx8ADMwGLPVeuJRlUUbJoYRtw5HwvpFUXuzEnU
 rVRvLjWP2nYwzSqv6nYn/BZ1FZFSy4t7dnjRGLoGZmLLwSbf71HFmphIIyyIwEht
 LAq3s8x0vNygigR+eNRf8GLjgPbJAvH5A2QDnVmpZMeNiXUBGmd/BTFVwYMPIiJU
 Vm/FlRKiSJGJkqco9le9RQbbByBGV3VB31mgr2OkGAJQwZKmzYOZl/bwDp+q2lKH
 K9HmyrA+T5vP35jrbBAytOYbMg3AnQ==
 =Ev15
 -----END PGP SIGNATURE-----

Merge tag 's390-5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 cleanups from Vasily Gorbik:
 "Update defconfigs and sort config select list"

* tag 's390-5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/Kconfig: sort config S390 select list once again
  s390: update defconfigs
2021-01-02 12:22:46 -08:00
Linus Torvalds
d9296a7bd9 Power management updates for 5.11-rc2
- Fix recently introduced crash in the intel_pstate driver that
    occurs if scale-invariance is disabled during resume from
    suspend-to-RAM due to inconsistent changes of APERF or MPERF
    MSR values made by the platform firmware (Rafael Wysocki).
 
  - Fix a memory leak and add a missing clk_put() in error paths in
    the OPP framework (Quanyang Wang, Viresh Kumar).
 
  - Add new C-states table for SnowRidge processors to the intel_idle
    driver (Artem Bityutskiy).
 
  - Update the MAINTAINERS entry for cpuidle to make it clear that
    the governors are covered by it too (Lukas Bulwahn).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl/wOgQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxe7wQAIP06E92PVDN1tDf88FN2BpBIziQvHPF
 T99v8RWQZCaMOQz5b1nXTqbQrKWFgqHiQK+6VjQcMW6xTWt7EM0eNdzGQtEsv9vd
 Ux4UIQp6CZnhorXLC9ZmTo+7tFze2L/KV4qVUQkkrcOXQ1qIqJiZdktYULRXgl37
 pJwUroPacKT9WiYwzyXZ2dkt337DQBb1AyumHZ/5gOik6Xo+uVK0kuDzECjRHQ+5
 R2aZftqAm/u6/JaeEgKQG+0LOAt1wJ6y9xcsFq303K/xDMKyRw4P3ngcVRfbyUSI
 2t3GfL9h2+41wtUpLKeJiADIJEov5P9ngwMedHNOpIX+aLL4cBbXG/LFc86QakD4
 qVcN+NOl13kNfn6eSSVz8mmf36juoEcYu1RhenPZfClwesqOZk0YcIoovT7k5A8B
 BnfEkQrjN6jFTfgc+A7rDXbLvRjhkIw2qr9Do+77AdYu0LkvgWlV3It3No10SzKw
 NshvNIedX3KF6xBISoFJhsGUrC8Xfpc5DIsfqYyQBIIKMEtSA00gK2XOHIpmhnUY
 0jXwL3DNIf2xC4BXHUgxJa6QpUyjKeLdqqwpnUryChHo+b8z78oE/gMza7iAV56t
 G7pQcdsX2ogLeNKtOvDbhYDa1janQpkiJoD/fy+QD7V5RO+YaxOVM5gFUrQegJKY
 1Qm6LXaswei8
 =51jM
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a crash in intel_pstate during resume from suspend-to-RAM
  that may occur after recent changes and two resource leaks in error
  paths in the operating performance points (OPP) framework, add a new
  C-states table to intel_idle and update the cpuidle MAINTAINERS entry
  to cover the governors too.

  Specifics:

   - Fix recently introduced crash in the intel_pstate driver that
     occurs if scale-invariance is disabled during resume from
     suspend-to-RAM due to inconsistent changes of APERF or MPERF MSR
     values made by the platform firmware (Rafael Wysocki).

   - Fix a memory leak and add a missing clk_put() in error paths in the
     OPP framework (Quanyang Wang, Viresh Kumar).

   - Add new C-states table for SnowRidge processors to the intel_idle
     driver (Artem Bityutskiy).

   - Update the MAINTAINERS entry for cpuidle to make it clear that the
     governors are covered by it too (Lukas Bulwahn)"

* tag 'pm-5.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  intel_idle: add SnowRidge C-state table
  cpufreq: intel_pstate: Fix fast-switch fallback path
  opp: Call the missing clk_put() on error
  opp: fix memory leak in _allocate_opp_table
  MAINTAINERS: include governors into CPU IDLE TIME MANAGEMENT FRAMEWORK
2021-01-02 11:53:05 -08:00
Rafael J. Wysocki
89ecf09e0b Merge branches 'pm-cpufreq' and 'pm-cpuidle'
* pm-cpufreq:
  cpufreq: intel_pstate: Fix fast-switch fallback path

* pm-cpuidle:
  intel_idle: add SnowRidge C-state table
  MAINTAINERS: include governors into CPU IDLE TIME MANAGEMENT FRAMEWORK
2021-01-02 10:16:32 +01:00
Linus Torvalds
eda809aef5 SCSI fixes on 20210101
This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).  The big
 core two fixes are for power management ("block: Do not accept any
 requests while suspended" and "block: Fix a race in the runtime power
 management code") which finally sorts out the resume problems we've
 occasionally been having.  To make the resume fix, there are seven
 necessary precursors which effectively renames REQ_PREEMPT to REQ_PM,
 so every "special" request in block is automatically a power
 management exempt one.  All of the non-PM preempt cases are removed
 except for the one in the SCSI Parallel Interface (spi) domain
 validation which is a genuine case where we have to run requests at
 high priority to validate the bus so this becomes an autopm get/put
 protected request.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCX+98LyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishYvLAP9K+HBT
 Lrkt3VWc9gq6F36+QH/SeW8IyXGaj77ysFHXxwD/UambRjRK8IA24mvf9sWeLLj6
 p8CqCHUkCXqP48IiymE=
 =NHrx
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

  The big core two fixes are for power management ("block: Do not accept
  any requests while suspended" and "block: Fix a race in the runtime
  power management code") which finally sorts out the resume problems
  we've occasionally been having.

  To make the resume fix, there are seven necessary precursors which
  effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
  in block is automatically a power management exempt one.

  All of the non-PM preempt cases are removed except for the one in the
  SCSI Parallel Interface (spi) domain validation which is a genuine
  case where we have to run requests at high priority to validate the
  bus so this becomes an autopm get/put protected request"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
  scsi: cxgb4i: Fix TLS dependency
  scsi: ufs: Un-inline ufshcd_vops_device_reset function
  scsi: ufs: Re-enable WriteBooster after device reset
  scsi: ufs-mediatek: Use correct path to fix compile error
  scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
  scsi: block: Do not accept any requests while suspended
  scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
  scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
  scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
  scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
  scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
  scsi: block: Introduce BLK_MQ_REQ_PM
  scsi: block: Fix a race in the runtime power management code
  scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
  scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
  scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
  scsi: ufs-pci: Fix restore from S4 for Intel controllers
  scsi: ufs-mediatek: Keep VCC always-on for specific devices
  scsi: ufs: Allow regulators being always-on
  scsi: ufs: Clear UAC for RPMB after ufshcd resets
  ...
2021-01-01 12:58:07 -08:00
Linus Torvalds
8b4805c68a block-5.11-2021-01-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/vOAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpryUD/9C4cYltnJmzd4OQLDav3vchGI2dhy8Fh7T
 Lp04YPpsVFswZq/tz1fyrP1gA4r7lD2QGn+rGtel/hgaXkaxLqwoQ9No/lOJ7Y22
 dtDfPGlNrvhjBQL5l+N7xP1DF8BlBOaXHPfMSW2t14InnV/TYUvxuI4YwhZIiuqP
 kWYmAGcTdyFRS/x+tQjiyvqMd8VVYiTlEWyL4TpZoxeHigZF2Q3An3uZ+NdsnO0z
 S19yZ7eMwUks4Kx2X2WQ2uaMea90bX+sU6v4XABqBcgWqVH/1mbL4MZ8kaiCUaBr
 66Im7dN1xS/VMyueB3crDhz7RvjDlZZmCz3i9CNnWUcbHrvUXRms8b9LNsrpXkJ/
 ZZq8YAmqM20EBeZVSXL2WCFK1sDBxxsXv5zX4MYwUk7pZ3B+Uea8Z/DCUHBtTpnN
 FEbeGDFZs4IlhHoQ/UnnykdAYHvxUVEbWSICcQrzgeh0e4aPgS7nZOE6FiLU5q4n
 rl+dOjz5SrdURvFBVPybCFnoV9YCdU7mRDZkx/AWyYpG/zGzQhbS1JQzd9YATIFA
 TF6aAl6TuA5yoq9QIiVfd+7SdGqxhM03rCxelw7I9conVzpfBUFXSphVsEh5XnkW
 X2M4R1aTtQ49cscFALX6okuadqJoRFEH1f4hT4m4C8BszRH2UjD/Up52pP15Wq0Q
 mmtr1MenIw==
 =OvP3
 -----END PGP SIGNATURE-----

Merge tag 'block-5.11-2021-01-01' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Two minor block fixes from this last week that should go into 5.11:

   - Add missing NOWAIT debugfs definition (Andres)

   - Fix kerneldoc warning introduced this merge window (Randy)"

* tag 'block-5.11-2021-01-01' of git://git.kernel.dk/linux-block:
  block: add debugfs stanza for QUEUE_FLAG_NOWAIT
  fs: block_dev.c: fix kernel-doc warnings from struct block_device changes
2021-01-01 12:49:09 -08:00
Linus Torvalds
dc3e24b214 io_uring-5.11-2021-01-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/vOCwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqA7D/9AyFvg16KEgfYCN2OYXU5jyphu7sCCb8Cx
 PJ4H+Lf7fWki+/yFdXLxQnuBMGEOYWqtEIPN9CnO/I1ixzOoNugxiFAyerhd/Noh
 COg2EUsUrWq/zobYP60wN9pBPnW6EHTnFVA02kMVKunm4d5O5DZWPXy5BwA9yU3u
 dE9LoYDjFiaahogi3x+EmYStexxT0FB0d5WTONA7qSFrskeNbyVaYy8mY09jPynG
 IbG41fv2n0Zwlcx4XDCebsZ1+08rAGZFhwiq8VBhPNiz7sOud9jW7rRFHXR2FVoo
 DsW2npiYHVvOYqkl1HjXw5Mo6p8UKrDEDAIS7OOAHXM9Lz2/YGS9h9ogROccBta2
 5er12VaahIEiH05KtxpGv/q+vyJK7Gdqg0jSuSzKHSdSpTS10Ejh82Xo2V6lRedb
 gP03ZiDZjLtvh8F5hrWTJqPTtnFDRkY/I7R3WP1Ga7mqajFhpFDMvjvyEMMBCz+K
 KGjMfahNo2nzc9nu5M1VjX42tz5VxKjA3N2netxBfDMVB/GpGcQ7xygS85wx7VPn
 UUChgqw0aJrrq5slOZEAVqSsBN/wN97+m6uLLdk025CzQngwiw5fkTooakPxnGee
 bW9WKMpWBj/ipPXvU5C1tvHk4gxMg+cmxcr6EZ3uaWfE+MC7Xk9c00lNF62CT0Xm
 e+0RWRV1ig==
 =XYT5
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-01-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few fixes that should go into 5.11, all marked for stable as well:

   - Fix issue around identity COW'ing and users that share a ring
     across processes

   - Fix a hang associated with unregistering fixed files (Pavel)

   - Move the 'process is exiting' cancelation a bit earlier, so
     task_works aren't affected by it (Pavel)"

* tag 'io_uring-5.11-2021-01-01' of git://git.kernel.dk/linux-block:
  kernel/io_uring: cancel io_uring before task works
  io_uring: fix io_sqe_files_unregister() hangs
  io_uring: add a helper for setting a ref node
  io_uring: don't assume mm is constant across submits
2021-01-01 12:29:49 -08:00
Linus Torvalds
cedd1862be depmod: handle the case of /sbin/depmod without /sbin in PATH
Commit 436e980e2e ("kbuild: don't hardcode depmod path") stopped
hard-coding the path of depmod, but in the process caused trouble for
distributions that had that /sbin location, but didn't have it in the
PATH (generally because /sbin is limited to the super-user path).

Work around it for now by just adding /sbin to the end of PATH in the
depmod.sh script.

Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-01 12:26:39 -08:00
Pavel Begunkov
b1b6b5a30d kernel/io_uring: cancel io_uring before task works
For cancelling io_uring requests it needs either to be able to run
currently enqueued task_works or having it shut down by that moment.
Otherwise io_uring_cancel_files() may be waiting for requests that won't
ever complete.

Go with the first way and do cancellations before setting PF_EXITING and
so before putting the task_work infrastructure into a transition state
where task_work_run() would better not be called.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-30 19:36:54 -07:00
Pavel Begunkov
1ffc54220c io_uring: fix io_sqe_files_unregister() hangs
io_sqe_files_unregister() uninterruptibly waits for enqueued ref nodes,
however requests keeping them may never complete, e.g. because of some
userspace dependency. Make sure it's interruptible otherwise it would
hang forever.

Cc: stable@vger.kernel.org # 5.6+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-30 19:35:53 -07:00
Pavel Begunkov
1642b4450d io_uring: add a helper for setting a ref node
Setting a new reference node to a file data is not trivial, don't repeat
it, add and use a helper.

Cc: stable@vger.kernel.org # 5.6+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-30 19:35:53 -07:00
Linus Torvalds
f6e1ea1964 A fix for an edge case in MClientRequest encoding and a couple of
trivial fixups for the new msgr2 support.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl/stgUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi7SECACa5sWn5GKArGNCYesG/Xnbl4FbtorA
 QNmDS7z5SBTsxQfOmkIq7Xon9fRxXEpWzr/eCEZKcXc+REHXGsy8zFz/HA0t97SS
 tSho288zfEifGK0k2aptBTXAiEnzhIweu9f5ZbxfLN+JPyfynxfvXytdXYFeU/Op
 zLUBSjJusmm671APBvyWl/5h/2N7XkpBM3AWNJjaW26+Mft3SpsS8Ui+USWb9SJF
 YEmsVR03W2pbSS66a774bThoLP8exfg7xMfUcWRHqYs3Awpp0mwMnKkHMfooYLGA
 Cc5FsQ7Exin23B4vCm6+0prg/cgwBg516HfsFH/jTcE2D/PjSCEX7tXE
 =ZGwT
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.11-rc2' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for an edge case in MClientRequest encoding and a couple of
  trivial fixups for the new msgr2 support"

* tag 'ceph-for-5.11-rc2' of git://github.com/ceph/ceph-client:
  libceph: add __maybe_unused to DEFINE_MSGR2_FEATURE
  libceph: align session_key and con_secret to 16 bytes
  libceph: fix auth_signature buffer allocation in secure mode
  ceph: reencode gid_list when reconnecting
2020-12-30 12:02:12 -08:00
Artem Bityutskiy
9cf93f056f intel_idle: add SnowRidge C-state table
Add C-state table for the SnowRidge SoC which is found on Intel Jacobsville
platforms.

The following has been changed.

 1. C1E latency changed from 10us to 15us. It was measured using the
    open source "wult" tool (the "nic" method, 15us is the 99.99th
    percentile).

 2. C1E power break even changed from 20us to 25us, which may result
    in less C1E residency in some workloads.

 3. C6 latency changed from 50us to 130us. Measured the same way as C1E.

The C6 C-state is supported only by some SnowRidge revisions, so add a C-state
table commentary about this.

On SnowRidge, C6 support is enumerated via the usual mechanism: "mwait" leaf of
the "cpuid" instruction. The 'intel_idle' driver does check this leaf, so even
though C6 is present in the table, the driver will only use it if the CPU does
support it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-30 18:25:34 +01:00
Rafael J. Wysocki
be1283454b cpufreq: intel_pstate: Fix fast-switch fallback path
When sugov_update_single_perf() falls back to the "frequency"
path due to the missing scale-invariance, it will call
cpufreq_driver_fast_switch() via sugov_fast_switch()
and the driver's ->fast_switch() callback will be invoked,
so it must not be NULL.

However, after commit a365ab6b9d ("cpufreq: intel_pstate: Implement
the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to
NULL when it is going to use intel_cpufreq_adjust_perf(), which is a
mistake, because on x86 the scale-invariance may be turned off
dynamically, so modify it to retain the original ->adjust_perf()
callback pointer.

Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-12-30 18:22:17 +01:00
Rafael J. Wysocki
977a3b0f6e Merge branch 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull operating performance points (OPP) framework fixes for 5.11-rc2
from Viresh Kumar:

"This contains two patches to fix freeing of resources in error paths."

* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  opp: Call the missing clk_put() on error
  opp: fix memory leak in _allocate_opp_table
2020-12-30 18:19:34 +01:00
Heiko Carstens
129975e75b s390/Kconfig: sort config S390 select list once again
...and add comments at the top and bottom.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-12-30 17:08:53 +01:00
Heiko Carstens
742eb4750f s390: update defconfigs
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-12-30 17:08:53 +01:00
Andres Freund
dc30432605 block: add debugfs stanza for QUEUE_FLAG_NOWAIT
This was missed in 021a24460d. Leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.

Fixes: 021a24460d
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-29 16:47:46 -07:00
Randy Dunlap
875b2376fd fs: block_dev.c: fix kernel-doc warnings from struct block_device changes
Fix new kernel-doc warnings in fs/block_dev.c:

../fs/block_dev.c:1066: warning: Excess function parameter 'whole' description in 'bd_abort_claiming'
../fs/block_dev.c:1837: warning: Function parameter or member 'dev' not described in 'lookup_bdev'

Fixes: 4e7b5671c6 ("block: remove i_bdev")
Fixes: 37c3fc9abb ("block: simplify the block device claiming interface")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-29 16:47:18 -07:00
Linus Torvalds
139711f033 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "16 patches

  Subsystems affected by this patch series: mm (selftests, hugetlb,
  pagecache, mremap, kasan, and slub), kbuild, checkpatch, misc, and
  lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: slub: call account_slab_page() after slab page initialization
  zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
  lib/zlib: fix inflating zlib streams on s390
  lib/genalloc: fix the overflow when size is too big
  kdev_t: always inline major/minor helper functions
  sizes.h: add SZ_8G/SZ_16G/SZ_32G macros
  local64.h: make <asm/local64.h> mandatory
  kasan: fix null pointer dereference in kasan_record_aux_stack
  mm: generalise COW SMC TLB flushing race comment
  mm/mremap.c: fix extent calculation
  mm: memmap defer init doesn't work as expected
  mm: add prototype for __add_to_page_cache_locked()
  checkpatch: prefer strscpy to strlcpy
  Revert "kbuild: avoid static_assert for genksyms"
  mm/hugetlb: fix deadlock in hugetlb_cow error path
  selftests/vm: fix building protection keys test
2020-12-29 15:45:49 -08:00
Roman Gushchin
1f3147b49d mm: slub: call account_slab_page() after slab page initialization
It's convenient to have page->objects initialized before calling into
account_slab_page().  In particular, this information can be used to
pre-alloc the obj_cgroup vector.

Let's call account_slab_page() a bit later, after the initialization of
page->objects.

This commit doesn't bring any functional change, but is required for
further optimizations.

[akpm@linux-foundation.org: undo changes needed by forthcoming mm-memcg-slab-pre-allocate-obj_cgroups-for-slab-caches-with-slab_account.patch]

Link: https://lkml.kernel.org/r/20201110195753.530157-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Randy Dunlap
605cc30dea zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
In commit 11fb479ff5 ("zlib: export S390 symbols for zlib modules"), I
added EXPORT_SYMBOL()s to dfltcc_inflate.c but then Mikhail said that
these should probably be in dfltcc_syms.c with the other
EXPORT_SYMBOL()s.

However, that is contrary to the current kernel style, which places
EXPORT_SYMBOL() immediately after the function that it applies to, so
move all EXPORT_SYMBOL()s to their respective function locations and
drop the dfltcc_syms.c file.  Also move MODULE_LICENSE() from the
deleted file to dfltcc.c.

[rdunlap@infradead.org: remove dfltcc_syms.o from Makefile]
  Link: https://lkml.kernel.org/r/20201227171837.15492-1-rdunlap@infradead.org

Link: https://lkml.kernel.org/r/20201219052530.28461-1-rdunlap@infradead.org
Fixes: 11fb479ff5 ("zlib: export S390 symbols for zlib modules")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Zaslonko Mikhail <zaslonko@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Ilya Leoshkevich
f0bb29e8c4 lib/zlib: fix inflating zlib streams on s390
Decompressing zlib streams on s390 fails with "incorrect data check"
error.

Userspace zlib checks inflate_state.flags in order to byteswap checksums
only for zlib streams, and s390 hardware inflate code, which was ported
from there, tries to match this behavior.  At the same time, kernel zlib
does not use inflate_state.flags, so it contains essentially random
values.  For many use cases either zlib stream is zeroed out or checksum
is not used, so this problem is masked, but at least SquashFS is still
affected.

Fix by always passing a checksum to and from the hardware as is, which
matches zlib_inflate()'s expectations.

Link: https://lkml.kernel.org/r/20201215155551.894884-1-iii@linux.ibm.com
Fixes: 1261961000 ("lib/zlib: add s390 hardware support for kernel zlib_inflate")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: <stable@vger.kernel.org>	[5.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Huang Shijie
3684566384 lib/genalloc: fix the overflow when size is too big
Some graphic card has very big memory on chip, such as 32G bytes.

In the following case, it will cause overflow:

    pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
    ret = gen_pool_add(pool, 0x1000000, SZ_32G, NUMA_NO_NODE);

    va = gen_pool_alloc(pool, SZ_4G);

The overflow occurs in gen_pool_alloc_algo_owner():

		....
		size = nbits << order;
		....

The @nbits is "int" type, so it will overflow.
Then the gen_pool_avail() will return the wrong value.

This patch converts some "int" to "unsigned long", and
changes the compare code in while.

Link: https://lkml.kernel.org/r/20201229060657.3389-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Reported-by: Shi Jiasheng <jiasheng.shi@iluvatar.ai>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Josh Poimboeuf
aa8c7db494 kdev_t: always inline major/minor helper functions
Silly GCC doesn't always inline these trivial functions.

Fixes the following warning:

  arch/x86/kernel/sys_ia32.o: warning: objtool: cp_stat64()+0xd8: call to new_encode_dev() with UACCESS enabled

Link: https://lkml.kernel.org/r/984353b44a4484d86ba9f73884b7306232e25e30.1608737428.git.jpoimboe@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>	[build-tested]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Huang Shijie
8b0fac44bd sizes.h: add SZ_8G/SZ_16G/SZ_32G macros
Add these macros, since we can use them in drivers.

Link: https://lkml.kernel.org/r/20201229072819.11183-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Randy Dunlap
87dbc209ea local64.h: make <asm/local64.h> mandatory
Make <asm-generic/local64.h> mandatory in include/asm-generic/Kbuild and
remove all arch/*/include/asm/local64.h arch-specific files since they
only #include <asm-generic/local64.h>.

This fixes build errors on arch/c6x/ and arch/nios2/ for
block/blk-iocost.c.

Build-tested on 21 of 25 arch-es.  (tools problems on the others)

Yes, we could even rename <asm-generic/local64.h> to
<linux/local64.h> and change all #includes to use
<linux/local64.h> instead.

Link: https://lkml.kernel.org/r/20201227024446.17018-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Walter Wu
13384f6125 kasan: fix null pointer dereference in kasan_record_aux_stack
Syzbot reported the following [1]:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 2d993067 P4D 2d993067 PUD 19a3c067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 3852 Comm: kworker/1:2 Not tainted 5.10.0-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: events free_ipc
  RIP: 0010:kasan_record_aux_stack+0x77/0xb0

Add null checking slab object from kasan_get_alloc_meta() in order to
avoid null pointer dereference.

[1] https://syzkaller.appspot.com/x/log.txt?x=10a82a50d00000

Link: https://lkml.kernel.org/r/20201228080018.23041-1-walter-zh.wu@mediatek.com
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Nicholas Piggin
111fe7186b mm: generalise COW SMC TLB flushing race comment
I'm not sure if I'm completely missing something here, but AFAIKS the
reference to the mysterious "COW SMC race" confuses the issue.  The
original changelog and mailing list thread didn't help me either.

This SMC race is where the problem was detected, but isn't the general
problem bigger and more obvious: that the new PTE could be picked up at
any time by any TLB while entries for the old PTE exist in other TLBs
before the TLB flush takes effect?

The case where the iTLB and dTLB of a CPU are pointing at different pages
is an interesting one but follows from the general problem.

The other (minor) thing with the comment I think it makes it a bit clearer
to say what the old code was doing (i.e., it avoids the race as opposed to
what?).

References: 4ce072f1fa ("mm: fix a race condition under SMC + COW")
Link: https://lkml.kernel.org/r/20201215121119.351650-1-npiggin@gmail.com
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Kalesh Singh
e05986ee7a mm/mremap.c: fix extent calculation
When `next < old_addr`, `next - old_addr` arithmetic underflows causing
`extent` to be incorrect.

Make `extent` the smaller of `next - old_addr` or `old_end - old_addr`.

Link: https://lkml.kernel.org/r/20201219170433.2418867-1-kaleshsingh@google.com
Fixes: c49dd34018 ("mm: speedup mremap on 1GB or larger regions")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Baoquan He
dc2da7b45f mm: memmap defer init doesn't work as expected
VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is the current memmap defer init doesn't work as expected.

Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node.  However, since commit 73a6e474cb,
function memmap_init() is adapted to iterater over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.

E.g, on VMware's system, the memory layout is as below, there are two
memory regions in node 2.  The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
iniatialize only one memmory section on the 2nd region [mem
0x10000000000-0x1033fffffff].  In fact, we only expect to see that there's
only one memory section's memmap initialized.  That's why more time is
costed at the time.

[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether defer need be taken in zone wide.

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: commit 73a6e474cb ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Souptick Joarder
6d87d0ece5 mm: add prototype for __add_to_page_cache_locked()
Otherwise it causes a gcc warning:

  mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

A previous attempt to make this function static led to compilation
errors when CONFIG_DEBUG_INFO_BTF is enabled because
__add_to_page_cache_locked() is referred to by BPF code.

Adding a prototype will silence the warning.

Link: https://lkml.kernel.org/r/1608693702-4665-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Joe Perches
5dbdb2d87c checkpatch: prefer strscpy to strlcpy
Prefer strscpy over the deprecated strlcpy function.

Link: https://lkml.kernel.org/r/19fe91084890e2c16fe56f960de6c570a93fa99b.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Masahiro Yamada
3a176b9460 Revert "kbuild: avoid static_assert for genksyms"
This reverts commit 14dc3983b5.

Macro Elver had sent a fix proper fix earlier, and also pointed out
corner cases:
 "I guess what you propose is simpler, but might still have corner cases
  where we still get warnings. In particular, if some file (for whatever
  reason) does not include build_bug.h and uses a raw _Static_assert(),
  then we still get warnings. E.g. I see 1 user of raw _Static_assert()
  (drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h )."

I believe the raw use of _Static_assert() should be allowed, so this
should be fixed in genksyms.

Even after commit 14dc3983b5 ("kbuild: avoid static_assert for
genksyms"), I confirmed the following test code emits the warning.

  ---------------->8----------------
  #include <linux/export.h>

  _Static_assert((1 ?: 0), "");

  void foo(void) { }
  EXPORT_SYMBOL(foo);
  ---------------->8----------------

  WARNING: modpost: EXPORT symbol "foo" [vmlinux] version generation failed, symbol will not be versioned.

Now that commit 869b91992bce ("genksyms: Ignore module scoped
_Static_assert()") fixed this issue properly, the workaround should
be reverted.

Link: https://lkml.org/lkml/2020/12/10/845
Link: https://lkml.kernel.org/r/20201219183911.181442-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00
Mike Kravetz
e7dd91c456 mm/hugetlb: fix deadlock in hugetlb_cow error path
syzbot reported the deadlock here [1].  The issue is in hugetlb cow
error handling when there are not enough huge pages for the faulting
task which took the original reservation.  It is possible that other
(child) tasks could have consumed pages associated with the reservation.
In this case, we want the task which took the original reservation to
succeed.  So, we unmap any associated pages in children so that they can
be used by the faulting task that owns the reservation.

The unmapping code needs to hold i_mmap_rwsem in write mode.  However,
due to commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd
sharing synchronization") we are already holding i_mmap_rwsem in read
mode when hugetlb_cow is called.

Technically, i_mmap_rwsem does not need to be held in read mode for COW
mappings as they can not share pmd's.  Modifying the fault code to not
take i_mmap_rwsem in read mode for COW (and other non-sharable) mappings
is too involved for a stable fix.

Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem before
unmapping.  This is OK as it is technically not needed.  They are
reacquired after unmapping as expected by calling code.  Since this is
done in an uncommon error path, the overhead of dropping and reacquiring
mutexes is acceptable.

While making changes, remove redundant BUG_ON after unmap_ref_private.

[1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com

Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: syzbot+5eee4145df3c15e96625@syzkaller.appspotmail.com
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-29 15:36:49 -08:00