Scheduler updates for v5.12:

[ NOTE: unfortunately this tree had to be freshly rebased today,
         it's a same-content tree of 82891be90f3c (-next published)
         merged with v5.11.
 
         The main reason for the rebase was an authorship misattribution
         problem with a new commit, which we noticed in the last minute,
         and which we didn't want to be merged upstream. The offending
         commit was deep in the tree, and dependent commits had to be
         rebased as well. ]
 
 - Core scheduler updates:
 
   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full),
     to allow distros to build a PREEMPT kernel but fall back to
     close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling
     behavior via a boot time selection.
 
     There's also the /debug/sched_debug switch to do this runtime.
 
     This feature is implemented via runtime patching (a new variant of static calls).
 
     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.
 
     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )
 
     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority
     of distributions, are supposed to be unaffected.
 
   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.
 
     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address
     the underlying issue of missed preemption events. These are the
     initial fixes that should fix current incarnations of the bug.
 
   - Clean up rbtree usage in the scheduler, by providing & using the following
     consistent set of rbtree APIs:
 
      partial-order; less() based:
        - rb_add(): add a new entry to the rbtree
        - rb_add_cached(): like rb_add(), but for a rb_root_cached
 
      total-order; cmp() based:
        - rb_find(): find an entry in an rbtree
        - rb_find_add(): find an entry, and add if not found
 
        - rb_find_first(): find the first (leftmost) matching entry
        - rb_next_match(): continue from rb_find_first()
        - rb_for_each(): iterate a sub-tree using the previous two
 
   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass.
     This is a 4-commit series where each commit improves one aspect of the idle
     sibling scan logic.
 
   - Improve the cpufreq cooling driver by getting the effective CPU utilization
     metrics from the scheduler
 
   - Improve the fair scheduler's active load-balancing logic by reducing the number
     of active LB attempts & lengthen the load-balancing interval. This improves
     stress-ng mmapfork performance.
 
   - Fix CFS's estimated utilization (util_est) calculation bug that can result in
     too high utilization values
 
 - Misc updates & fixes:
 
    - Fix the HRTICK reprogramming & optimization feature
    - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code
    - Reduce dl_add_task_root_domain() overhead
    - Fix uprobes refcount bug
    - Process pending softirqs in flush_smp_call_function_from_idle()
    - Clean up task priority related defines, remove *USER_*PRIO and
      USER_PRIO()
    - Simplify the sched_init_numa() deduplication sort
    - Documentation updates
    - Fix EAS bug in update_misfit_status(), which degraded the quality
      of energy-balancing
    - Smaller cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAtHBsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1itgg/+NGed12pgPjYBzesdou60Lvx7LZLGjfOt
 M1F1EnmQGn/hEH2fCY6ZoqIZQTVltm7GIcBNabzYTzlaHZsdtyuDUJBZyj19vTlk
 zekcj7WVt+qvfjChaNwEJhQ9nnOM/eohMgEOHMAAJd9zlnQvve7NOLQ56UDM+kn/
 9taFJ5ZPvb4avP6C5p3KivvKex6Bjof/Tl0m3utpNyPpI/qK3FyGxwdgCxU0yepT
 ABWQX5ZQCufFvo1bgnBPfqyzab4MqhoM3bNKBsLQfuAlssG1xRv4KQOev4dRwrt9
 pXJikV5C9yez5d2lGe5p0ltH5IZS/l9x2yI/ZQj3OUDTFyV1ic6WfFAqJgDzVF8E
 i/vvA4NPQiI241Bkps+ErcCw4aVOgiY6TWli74cHjLUIX0+As6aHrFWXGSxUmiHB
 WR+B8KmdfzRTTlhOxMA+cvlpZcKCfxWkJJmXzr/lDZzIuKPqM3QCE2wD9sixkfVo
 JNICT0IvZghWOdbMEfZba8Psh/e2LVI9RzdpEiuYJz1ZrVlt1hO0M6jBxY0hMz9n
 k54z81xODw0a8P2FHMtpmB1vhAeqCmvwA6DO8z0Oxs0DFi+KM2bLf2efHsCKafI+
 Bm5v9YFaOk/55R76hJVh+aYLlyFgFkKd+P/niJTPDnxOk3SqJuXvTrql1HeGHkNr
 kYgQa23dsZk=
 =pyaG
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address the
     underlying issue of missed preemption events. These are the initial
     fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthen the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...
This commit is contained in:
Linus Torvalds 2021-02-21 12:35:04 -08:00
commit 657bd90c93
48 changed files with 1903 additions and 790 deletions

View File

@ -3903,6 +3903,13 @@
Format: {"off"}
Disable Hardware Transactional Memory
preempt= [KNL]
Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
can be preempted anytime.
print-fatal-signals=
[KNL] debug: print fatal signals

View File

@ -0,0 +1,169 @@
NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.
PELT (Per Entity Load Tracking)
-------------------------------
With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contribute the other half.
Specifically:
ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
ewma(u) = ewma_sum(u) / ewma_sum(1)
Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.
Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.
Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.
For more detail see: kernel/sched/pelt.c
Frequency- / CPU Invariance
---------------------------
Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:
f_cur
r_dvfs := -----
f_max
For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:
APERF
f_cur := ----- * P0
MPERF
4C-turbo; if available and turbo enabled
f_max := { 1C-turbo; if turbo enabled
P0; otherwise
f_cur
r_dvfs := min( 1, ----- )
f_max
We pick 4C turbo over 1C turbo to make it slightly more sustainable.
r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.
r_tot = r_dvfs * r_cpu
The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
For more detail see:
- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP
--------------------------
Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.
To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.
A further runqueue wide sum (of runnable tasks) is maintained of:
util_est := \Sum_t max( t_running, t_util_est_ewma )
For more detail see: kernel/sched/fair.c:util_est_dequeue()
UCLAMP
------
It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps an max aggregate of these clamps for all running tasks.
For more detail see: include/uapi/linux/sched/types.h
Schedutil / DVFS
----------------
Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.
The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:
max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise
clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK
u_clamp := { u_cfs + u_rt; otherwise
u := u_clamp + u_irq + u_dl; [approx. see source for more detail]
f_des := min( f_max, 1.25 u * f_max )
XXX IO-wait; when the update is due to a task wakeup from IO-completion we
boost 'u' above.
This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.
XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
required to satisfy the workload.
Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive, this reduces effectiveness.
For more information see: kernel/sched/cpufreq_schedutil.c
NOTES
-----
- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.
- In saturated scenarios task movement will cause some transient dips,
suppose we have a CPU saturated with 4 tasks, then when we migrate a task
to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
new CPU will gain 0.25. This is inevitable and time progression will
correct this. XXX do we still guarantee f_max due to no idle-time?
- Much of the above is about avoiding DVFS dips, and independent DVFS domains
having to re-learn / ramp-up when load shifts.

View File

@ -1058,6 +1058,15 @@ config HAVE_STATIC_CALL_INLINE
bool
depends on HAVE_STATIC_CALL
config HAVE_PREEMPT_DYNAMIC
bool
depends on HAVE_STATIC_CALL
depends on GENERIC_ENTRY
help
Select this if the architecture support boot time preempt setting
on top of static calls. It is strongly advised to support inline
static call to avoid any overhead.
config ARCH_WANT_LD_ORPHAN_WARN
bool
help

View File

@ -72,7 +72,7 @@ static struct timer_list spuloadavg_timer;
#define DEF_SPU_TIMESLICE (100 * HZ / (1000 * SPUSCHED_TICK))
#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
max(x * (MAX_PRIO - prio) / (NICE_WIDTH / 2), MIN_SPU_TIMESLICE)
/*
* scale user-nice values [ -20 ... 0 ... 19 ] to time slice values:

View File

@ -224,6 +224,7 @@ config X86
select HAVE_STACK_VALIDATION if X86_64
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_STACK_VALIDATION
select HAVE_PREEMPT_DYNAMIC
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK

View File

@ -5,6 +5,7 @@
#include <asm/rmwcc.h>
#include <asm/percpu.h>
#include <linux/thread_info.h>
#include <linux/static_call_types.h>
DECLARE_PER_CPU(int, __preempt_count);
@ -103,16 +104,45 @@ static __always_inline bool should_resched(int preempt_offset)
}
#ifdef CONFIG_PREEMPTION
extern asmlinkage void preempt_schedule_thunk(void);
# define __preempt_schedule() \
asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT)
extern asmlinkage void preempt_schedule(void);
extern asmlinkage void preempt_schedule_notrace_thunk(void);
# define __preempt_schedule_notrace() \
asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT)
extern asmlinkage void preempt_schedule(void);
extern asmlinkage void preempt_schedule_thunk(void);
extern asmlinkage void preempt_schedule_notrace(void);
#endif
#define __preempt_schedule_func preempt_schedule_thunk
extern asmlinkage void preempt_schedule_notrace(void);
extern asmlinkage void preempt_schedule_notrace_thunk(void);
#define __preempt_schedule_notrace_func preempt_schedule_notrace_thunk
#ifdef CONFIG_PREEMPT_DYNAMIC
DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
#define __preempt_schedule() \
do { \
__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule); \
asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule) : ASM_CALL_CONSTRAINT); \
} while (0)
DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
#define __preempt_schedule_notrace() \
do { \
__STATIC_CALL_MOD_ADDRESSABLE(preempt_schedule_notrace); \
asm volatile ("call " STATIC_CALL_TRAMP_STR(preempt_schedule_notrace) : ASM_CALL_CONSTRAINT); \
} while (0)
#else /* PREEMPT_DYNAMIC */
#define __preempt_schedule() \
asm volatile ("call preempt_schedule_thunk" : ASM_CALL_CONSTRAINT);
#define __preempt_schedule_notrace() \
asm volatile ("call preempt_schedule_notrace_thunk" : ASM_CALL_CONSTRAINT);
#endif /* PREEMPT_DYNAMIC */
#endif /* PREEMPTION */
#endif /* __ASM_PREEMPT_H */

View File

@ -37,4 +37,11 @@
#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "ret; nop; nop; nop; nop")
#define ARCH_ADD_TRAMP_KEY(name) \
asm(".pushsection .static_call_tramp_key, \"a\" \n" \
".long " STATIC_CALL_TRAMP_STR(name) " - . \n" \
".long " STATIC_CALL_KEY_STR(name) " - . \n" \
".popsection \n")
#endif /* _ASM_STATIC_CALL_H */

View File

@ -11,14 +11,26 @@ enum insn_type {
RET = 3, /* tramp / site cond-tail-call */
};
/*
* data16 data16 xorq %rax, %rax - a single 5 byte instruction that clears %rax
* The REX.W cancels the effect of any data16.
*/
static const u8 xor5rax[] = { 0x66, 0x66, 0x48, 0x31, 0xc0 };
static void __ref __static_call_transform(void *insn, enum insn_type type, void *func)
{
const void *emulate = NULL;
int size = CALL_INSN_SIZE;
const void *code;
switch (type) {
case CALL:
code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
if (func == &__static_call_return0) {
emulate = code;
code = &xor5rax;
}
break;
case NOP:
@ -41,7 +53,7 @@ static void __ref __static_call_transform(void *insn, enum insn_type type, void
if (unlikely(system_state == SYSTEM_BOOTING))
return text_poke_early(insn, code, size);
text_poke_bp(insn, code, size, NULL);
text_poke_bp(insn, code, size, emulate);
}
static void __static_call_validate(void *insn, bool tail)
@ -54,7 +66,8 @@ static void __static_call_validate(void *insn, bool tail)
return;
} else {
if (opcode == CALL_INSN_OPCODE ||
!memcmp(insn, ideal_nops[NOP_ATOMIC5], 5))
!memcmp(insn, ideal_nops[NOP_ATOMIC5], 5) ||
!memcmp(insn, xor5rax, 5))
return;
}

View File

@ -1782,6 +1782,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
{
xfer_to_guest_mode_prepare();
return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
xfer_to_guest_mode_work_pending();
}

View File

@ -76,7 +76,9 @@ struct cpufreq_cooling_device {
struct em_perf_domain *em;
struct cpufreq_policy *policy;
struct list_head node;
#ifndef CONFIG_SMP
struct time_in_idle *idle_time;
#endif
struct freq_qos_request qos_req;
};
@ -132,14 +134,25 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
}
/**
* get_load() - get load for a cpu since last updated
* @cpufreq_cdev: &struct cpufreq_cooling_device for this cpu
* get_load() - get load for a cpu
* @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
* @cpu: cpu number
* @cpu_idx: index of the cpu in time_in_idle*
* @cpu_idx: index of the cpu in time_in_idle array
*
* Return: The average load of cpu @cpu in percentage since this
* function was last called.
*/
#ifdef CONFIG_SMP
static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
int cpu_idx)
{
unsigned long max = arch_scale_cpu_capacity(cpu);
unsigned long util;
util = sched_cpu_util(cpu, max);
return (util * 100) / max;
}
#else /* !CONFIG_SMP */
static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
int cpu_idx)
{
@ -161,6 +174,7 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
return load;
}
#endif /* CONFIG_SMP */
/**
* get_dynamic_power() - calculate the dynamic power
@ -346,6 +360,36 @@ static inline bool em_is_sane(struct cpufreq_cooling_device *cpufreq_cdev,
}
#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
#ifdef CONFIG_SMP
static inline int allocate_idle_time(struct cpufreq_cooling_device *cpufreq_cdev)
{
return 0;
}
static inline void free_idle_time(struct cpufreq_cooling_device *cpufreq_cdev)
{
}
#else
static int allocate_idle_time(struct cpufreq_cooling_device *cpufreq_cdev)
{
unsigned int num_cpus = cpumask_weight(cpufreq_cdev->policy->related_cpus);
cpufreq_cdev->idle_time = kcalloc(num_cpus,
sizeof(*cpufreq_cdev->idle_time),
GFP_KERNEL);
if (!cpufreq_cdev->idle_time)
return -ENOMEM;
return 0;
}
static void free_idle_time(struct cpufreq_cooling_device *cpufreq_cdev)
{
kfree(cpufreq_cdev->idle_time);
cpufreq_cdev->idle_time = NULL;
}
#endif /* CONFIG_SMP */
static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
unsigned long state)
{
@ -485,7 +529,7 @@ __cpufreq_cooling_register(struct device_node *np,
struct thermal_cooling_device *cdev;
struct cpufreq_cooling_device *cpufreq_cdev;
char dev_name[THERMAL_NAME_LENGTH];
unsigned int i, num_cpus;
unsigned int i;
struct device *dev;
int ret;
struct thermal_cooling_device_ops *cooling_ops;
@ -496,7 +540,6 @@ __cpufreq_cooling_register(struct device_node *np,
return ERR_PTR(-ENODEV);
}
if (IS_ERR_OR_NULL(policy)) {
pr_err("%s: cpufreq policy isn't valid: %p\n", __func__, policy);
return ERR_PTR(-EINVAL);
@ -514,12 +557,10 @@ __cpufreq_cooling_register(struct device_node *np,
return ERR_PTR(-ENOMEM);
cpufreq_cdev->policy = policy;
num_cpus = cpumask_weight(policy->related_cpus);
cpufreq_cdev->idle_time = kcalloc(num_cpus,
sizeof(*cpufreq_cdev->idle_time),
GFP_KERNEL);
if (!cpufreq_cdev->idle_time) {
cdev = ERR_PTR(-ENOMEM);
ret = allocate_idle_time(cpufreq_cdev);
if (ret) {
cdev = ERR_PTR(ret);
goto free_cdev;
}
@ -579,7 +620,7 @@ remove_qos_req:
remove_ida:
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
free_idle_time:
kfree(cpufreq_cdev->idle_time);
free_idle_time(cpufreq_cdev);
free_cdev:
kfree(cpufreq_cdev);
return cdev;
@ -672,7 +713,7 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
thermal_cooling_device_unregister(cdev);
freq_qos_remove_request(&cpufreq_cdev->qos_req);
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
kfree(cpufreq_cdev->idle_time);
free_idle_time(cpufreq_cdev);
kfree(cpufreq_cdev);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);

View File

@ -403,7 +403,10 @@
. = ALIGN(8); \
__start_static_call_sites = .; \
KEEP(*(.static_call_sites)) \
__stop_static_call_sites = .;
__stop_static_call_sites = .; \
__start_static_call_tramp_key = .; \
KEEP(*(.static_call_tramp_key)) \
__stop_static_call_tramp_key = .;
/*
* Allow architectures to handle ro_after_init data on their

View File

@ -307,7 +307,7 @@ void css_task_iter_end(struct css_task_iter *it);
* Inline functions.
*/
static inline u64 cgroup_id(struct cgroup *cgrp)
static inline u64 cgroup_id(const struct cgroup *cgrp)
{
return cgrp->kn->id;
}
@ -701,7 +701,7 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen);
struct cgroup_subsys_state;
struct cgroup;
static inline u64 cgroup_id(struct cgroup *cgrp) { return 1; }
static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
static inline void css_get(struct cgroup_subsys_state *css) {}
static inline void css_put(struct cgroup_subsys_state *css) {}
static inline int cgroup_attach_task_all(struct task_struct *from,

View File

@ -2,6 +2,7 @@
#ifndef __LINUX_ENTRYCOMMON_H
#define __LINUX_ENTRYCOMMON_H
#include <linux/static_call_types.h>
#include <linux/tracehook.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
@ -454,6 +455,9 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
#endif
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()

View File

@ -46,6 +46,20 @@ static inline int arch_xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu,
*/
int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu);
/**
* xfer_to_guest_mode_prepare - Perform last minute preparation work that
* need to be handled while IRQs are disabled
* upon entering to guest.
*
* Has to be invoked with interrupts disabled before the last call
* to xfer_to_guest_mode_work_pending().
*/
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
rcu_nocb_flush_deferred_wakeup();
}
/**
* __xfer_to_guest_mode_work_pending - Check if work is pending
*

View File

@ -15,7 +15,7 @@
#include <linux/typecheck.h>
#include <linux/printk.h>
#include <linux/build_bug.h>
#include <linux/static_call_types.h>
#include <asm/byteorder.h>
#include <uapi/linux/kernel.h>
@ -81,11 +81,26 @@ struct pt_regs;
struct user;
#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
extern int __cond_resched(void);
# define might_resched() __cond_resched()
#elif defined(CONFIG_PREEMPT_DYNAMIC)
extern int __cond_resched(void);
DECLARE_STATIC_CALL(might_resched, __cond_resched);
static __always_inline void might_resched(void)
{
static_call_mod(might_resched)();
}
#else
# define might_resched() do { } while (0)
#endif
#endif /* CONFIG_PREEMPT_* */
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
extern void ___might_sleep(const char *file, int line, int preempt_offset);

View File

@ -141,12 +141,18 @@ static inline void rb_insert_color_cached(struct rb_node *node,
rb_insert_color(node, &root->rb_root);
}
static inline void rb_erase_cached(struct rb_node *node,
struct rb_root_cached *root)
static inline struct rb_node *
rb_erase_cached(struct rb_node *node, struct rb_root_cached *root)
{
struct rb_node *leftmost = NULL;
if (root->rb_leftmost == node)
root->rb_leftmost = rb_next(node);
leftmost = root->rb_leftmost = rb_next(node);
rb_erase(node, &root->rb_root);
return leftmost;
}
static inline void rb_replace_node_cached(struct rb_node *victim,
@ -158,4 +164,198 @@ static inline void rb_replace_node_cached(struct rb_node *victim,
rb_replace_node(victim, new, &root->rb_root);
}
/*
* The below helper functions use 2 operators with 3 different
* calling conventions. The operators are related like:
*
* comp(a->key,b) < 0 := less(a,b)
* comp(a->key,b) > 0 := less(b,a)
* comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
*
* If these operators define a partial order on the elements we make no
* guarantee on which of the elements matching the key is found. See
* rb_find().
*
* The reason for this is to allow the find() interface without requiring an
* on-stack dummy object, which might not be feasible due to object size.
*/
/**
* rb_add_cached() - insert @node into the leftmost cached tree @tree
* @node: node to insert
* @tree: leftmost cached tree to insert @node into
* @less: operator defining the (partial) node order
*
* Returns @node when it is the new leftmost, or NULL.
*/
static __always_inline struct rb_node *
rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
bool (*less)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_root.rb_node;
struct rb_node *parent = NULL;
bool leftmost = true;
while (*link) {
parent = *link;
if (less(node, parent)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(node, parent, link);
rb_insert_color_cached(node, tree, leftmost);
return leftmost ? node : NULL;
}
/**
* rb_add() - insert @node into @tree
* @node: node to insert
* @tree: tree to insert @node into
* @less: operator defining the (partial) node order
*/
static __always_inline void
rb_add(struct rb_node *node, struct rb_root *tree,
bool (*less)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
while (*link) {
parent = *link;
if (less(node, parent))
link = &parent->rb_left;
else
link = &parent->rb_right;
}
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
}
/**
* rb_find_add() - find equivalent @node in @tree, or add @node
* @node: node to look-for / insert
* @tree: tree to search / modify
* @cmp: operator defining the node order
*
* Returns the rb_node matching @node, or NULL when no match is found and @node
* is inserted.
*/
static __always_inline struct rb_node *
rb_find_add(struct rb_node *node, struct rb_root *tree,
int (*cmp)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
int c;
while (*link) {
parent = *link;
c = cmp(node, parent);
if (c < 0)
link = &parent->rb_left;
else if (c > 0)
link = &parent->rb_right;
else
return parent;
}
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
return NULL;
}
/**
* rb_find() - find @key in tree @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining the node order
*
* Returns the rb_node matching @key or NULL.
*/
static __always_inline struct rb_node *
rb_find(const void *key, const struct rb_root *tree,
int (*cmp)(const void *key, const struct rb_node *))
{
struct rb_node *node = tree->rb_node;
while (node) {
int c = cmp(key, node);
if (c < 0)
node = node->rb_left;
else if (c > 0)
node = node->rb_right;
else
return node;
}
return NULL;
}
/**
* rb_find_first() - find the first @key in @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*
* Returns the leftmost node matching @key, or NULL.
*/
static __always_inline struct rb_node *
rb_find_first(const void *key, const struct rb_root *tree,
int (*cmp)(const void *key, const struct rb_node *))
{
struct rb_node *node = tree->rb_node;
struct rb_node *match = NULL;
while (node) {
int c = cmp(key, node);
if (c <= 0) {
if (!c)
match = node;
node = node->rb_left;
} else if (c > 0) {
node = node->rb_right;
}
}
return match;
}
/**
* rb_next_match() - find the next @key in @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*
* Returns the next node matching @key, or NULL.
*/
static __always_inline struct rb_node *
rb_next_match(const void *key, struct rb_node *node,
int (*cmp)(const void *key, const struct rb_node *))
{
node = rb_next(node);
if (node && cmp(key, node))
node = NULL;
return node;
}
/**
* rb_for_each() - iterates a subtree matching @key
* @node: iterator
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*/
#define rb_for_each(node, key, tree, cmp) \
for ((node) = rb_find_first((key), (tree), (cmp)); \
(node); (node) = rb_next_match((key), (node), (cmp)))
#endif /* _LINUX_RBTREE_H */

View File

@ -114,10 +114,12 @@ static inline void rcu_user_exit(void) { }
void rcu_init_nohz(void);
int rcu_nocb_cpu_offload(int cpu);
int rcu_nocb_cpu_deoffload(int cpu);
void rcu_nocb_flush_deferred_wakeup(void);
#else /* #ifdef CONFIG_RCU_NOCB_CPU */
static inline void rcu_init_nohz(void) { }
static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
static inline void rcu_nocb_flush_deferred_wakeup(void) { }
#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
/**

View File

@ -47,6 +47,7 @@ struct cfs_rq;
struct fs_struct;
struct futex_pi_state;
struct io_context;
struct io_uring_task;
struct mempolicy;
struct nameidata;
struct nsproxy;
@ -65,7 +66,6 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
struct io_uring_task;
/*
* Task state bitmask. NOTE! These bits are also
@ -1871,11 +1871,32 @@ static inline int test_tsk_need_resched(struct task_struct *tsk)
* value indicates whether a reschedule was done in fact.
* cond_resched_lock() will drop the spinlock before scheduling,
*/
#ifndef CONFIG_PREEMPTION
extern int _cond_resched(void);
#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
extern int __cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
DECLARE_STATIC_CALL(cond_resched, __cond_resched);
static __always_inline int _cond_resched(void)
{
return static_call_mod(cond_resched)();
}
#else
static inline int _cond_resched(void)
{
return __cond_resched();
}
#endif /* CONFIG_PREEMPT_DYNAMIC */
#else
static inline int _cond_resched(void) { return 0; }
#endif
#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
#define cond_resched() ({ \
___might_sleep(__FILE__, __LINE__, 0); \
@ -1968,6 +1989,11 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif
#ifdef CONFIG_SMP
/* Returns effective CPU energy utilization, as seen by the scheduler */
unsigned long sched_cpu_util(int cpu, unsigned long max);
#endif /* CONFIG_SMP */
#ifdef CONFIG_RSEQ
/*

View File

@ -11,16 +11,9 @@
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
* tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
* values are inverted: lower p->prio value means higher priority.
*
* The MAX_USER_RT_PRIO value allows the actual maximum
* RT priority to be separate from the value exported to
* user-space. This allows kernel threads to set their
* priority to a value higher than any user task. Note:
* MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
*/
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define MAX_RT_PRIO 100
#define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
@ -33,15 +26,6 @@
#define NICE_TO_PRIO(nice) ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio) ((prio) - DEFAULT_PRIO)
/*
* 'User priority' is the nice value converted to something we
* can work with better when scaling various scheduler parameters,
* it's a [ 0 ... 39 ] range.
*/
#define USER_PRIO(p) ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
/*
* Convert nice value [19,-20] to rlimit style value [1,40].
*/

View File

@ -107,26 +107,10 @@ extern void arch_static_call_transform(void *site, void *tramp, void *func, bool
#define STATIC_CALL_TRAMP_ADDR(name) &STATIC_CALL_TRAMP(name)
/*
* __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
* the symbol table so that objtool can reference it when it generates the
* .static_call_sites section.
*/
#define __static_call(name) \
({ \
__ADDRESSABLE(STATIC_CALL_KEY(name)); \
&STATIC_CALL_TRAMP(name); \
})
#else
#define STATIC_CALL_TRAMP_ADDR(name) NULL
#endif
#define DECLARE_STATIC_CALL(name, func) \
extern struct static_call_key STATIC_CALL_KEY(name); \
extern typeof(func) STATIC_CALL_TRAMP(name);
#define static_call_update(name, func) \
({ \
BUILD_BUG_ON(!__same_type(*(func), STATIC_CALL_TRAMP(name))); \
@ -154,17 +138,25 @@ struct static_call_key {
};
};
/* For finding the key associated with a trampoline */
struct static_call_tramp_key {
s32 tramp;
s32 key;
};
extern void __static_call_update(struct static_call_key *key, void *tramp, void *func);
extern int static_call_mod_init(struct module *mod);
extern int static_call_text_reserved(void *start, void *end);
#define DEFINE_STATIC_CALL(name, _func) \
extern long __static_call_return0(void);
#define __DEFINE_STATIC_CALL(name, _func, _func_init) \
DECLARE_STATIC_CALL(name, _func); \
struct static_call_key STATIC_CALL_KEY(name) = { \
.func = _func, \
.func = _func_init, \
.type = 1, \
}; \
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
#define DEFINE_STATIC_CALL_NULL(name, _func) \
DECLARE_STATIC_CALL(name, _func); \
@ -174,17 +166,23 @@ extern int static_call_text_reserved(void *start, void *end);
}; \
ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
#define static_call(name) __static_call(name)
#define static_call_cond(name) (void)__static_call(name)
#define EXPORT_STATIC_CALL(name) \
EXPORT_SYMBOL(STATIC_CALL_KEY(name)); \
EXPORT_SYMBOL(STATIC_CALL_TRAMP(name))
#define EXPORT_STATIC_CALL_GPL(name) \
EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name)); \
EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
/* Leave the key unexported, so modules can't change static call targets: */
#define EXPORT_STATIC_CALL_TRAMP(name) \
EXPORT_SYMBOL(STATIC_CALL_TRAMP(name)); \
ARCH_ADD_TRAMP_KEY(name)
#define EXPORT_STATIC_CALL_TRAMP_GPL(name) \
EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name)); \
ARCH_ADD_TRAMP_KEY(name)
#elif defined(CONFIG_HAVE_STATIC_CALL)
static inline int static_call_init(void) { return 0; }
@ -193,12 +191,12 @@ struct static_call_key {
void *func;
};
#define DEFINE_STATIC_CALL(name, _func) \
#define __DEFINE_STATIC_CALL(name, _func, _func_init) \
DECLARE_STATIC_CALL(name, _func); \
struct static_call_key STATIC_CALL_KEY(name) = { \
.func = _func, \
.func = _func_init, \
}; \
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
#define DEFINE_STATIC_CALL_NULL(name, _func) \
DECLARE_STATIC_CALL(name, _func); \
@ -207,7 +205,6 @@ struct static_call_key {
}; \
ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
#define static_call(name) __static_call(name)
#define static_call_cond(name) (void)__static_call(name)
static inline
@ -224,14 +221,24 @@ static inline int static_call_text_reserved(void *start, void *end)
return 0;
}
static inline long __static_call_return0(void)
{
return 0;
}
#define EXPORT_STATIC_CALL(name) \
EXPORT_SYMBOL(STATIC_CALL_KEY(name)); \
EXPORT_SYMBOL(STATIC_CALL_TRAMP(name))
#define EXPORT_STATIC_CALL_GPL(name) \
EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name)); \
EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
/* Leave the key unexported, so modules can't change static call targets: */
#define EXPORT_STATIC_CALL_TRAMP(name) \
EXPORT_SYMBOL(STATIC_CALL_TRAMP(name))
#define EXPORT_STATIC_CALL_TRAMP_GPL(name) \
EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
#else /* Generic implementation */
static inline int static_call_init(void) { return 0; }
@ -240,10 +247,15 @@ struct static_call_key {
void *func;
};
#define DEFINE_STATIC_CALL(name, _func) \
static inline long __static_call_return0(void)
{
return 0;
}
#define __DEFINE_STATIC_CALL(name, _func, _func_init) \
DECLARE_STATIC_CALL(name, _func); \
struct static_call_key STATIC_CALL_KEY(name) = { \
.func = _func, \
.func = _func_init, \
}
#define DEFINE_STATIC_CALL_NULL(name, _func) \
@ -252,9 +264,6 @@ struct static_call_key {
.func = NULL, \
}
#define static_call(name) \
((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
static inline void __static_call_nop(void) { }
/*
@ -295,4 +304,10 @@ static inline int static_call_text_reserved(void *start, void *end)
#endif /* CONFIG_HAVE_STATIC_CALL */
#define DEFINE_STATIC_CALL(name, _func) \
__DEFINE_STATIC_CALL(name, _func, _func)
#define DEFINE_STATIC_CALL_RET0(name, _func) \
__DEFINE_STATIC_CALL(name, _func, __static_call_return0)
#endif /* _LINUX_STATIC_CALL_H */

View File

@ -4,11 +4,13 @@
#include <linux/types.h>
#include <linux/stringify.h>
#include <linux/compiler.h>
#define STATIC_CALL_KEY_PREFIX __SCK__
#define STATIC_CALL_KEY_PREFIX_STR __stringify(STATIC_CALL_KEY_PREFIX)
#define STATIC_CALL_KEY_PREFIX_LEN (sizeof(STATIC_CALL_KEY_PREFIX_STR) - 1)
#define STATIC_CALL_KEY(name) __PASTE(STATIC_CALL_KEY_PREFIX, name)
#define STATIC_CALL_KEY_STR(name) __stringify(STATIC_CALL_KEY(name))
#define STATIC_CALL_TRAMP_PREFIX __SCT__
#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
@ -32,4 +34,52 @@ struct static_call_site {
s32 key;
};
#define DECLARE_STATIC_CALL(name, func) \
extern struct static_call_key STATIC_CALL_KEY(name); \
extern typeof(func) STATIC_CALL_TRAMP(name);
#ifdef CONFIG_HAVE_STATIC_CALL
#define __raw_static_call(name) (&STATIC_CALL_TRAMP(name))
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
/*
* __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
* the symbol table so that objtool can reference it when it generates the
* .static_call_sites section.
*/
#define __STATIC_CALL_ADDRESSABLE(name) \
__ADDRESSABLE(STATIC_CALL_KEY(name))
#define __static_call(name) \
({ \
__STATIC_CALL_ADDRESSABLE(name); \
__raw_static_call(name); \
})
#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
#define __STATIC_CALL_ADDRESSABLE(name)
#define __static_call(name) __raw_static_call(name)
#endif /* CONFIG_HAVE_STATIC_CALL_INLINE */
#ifdef MODULE
#define __STATIC_CALL_MOD_ADDRESSABLE(name)
#define static_call_mod(name) __raw_static_call(name)
#else
#define __STATIC_CALL_MOD_ADDRESSABLE(name) __STATIC_CALL_ADDRESSABLE(name)
#define static_call_mod(name) __static_call(name)
#endif
#define static_call(name) __static_call(name)
#else
#define static_call(name) \
((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
#endif /* CONFIG_HAVE_STATIC_CALL */
#endif /* _STATIC_CALL_TYPES_H */

View File

@ -48,6 +48,7 @@ int arch_update_cpu_topology(void);
/* Conform to ACPI 2.0 SLIT distance definitions */
#define LOCAL_DISTANCE 10
#define REMOTE_DISTANCE 20
#define DISTANCE_BITS 8
#ifndef node_distance
#define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
#endif

View File

@ -524,7 +524,7 @@ config SCHED_THERMAL_PRESSURE
i.e. put less load on throttled CPUs than on non/less throttled ones.
This requires the architecture to implement
arch_set_thermal_pressure() and arch_get_thermal_pressure().
arch_set_thermal_pressure() and arch_scale_thermal_pressure().
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"

View File

@ -40,6 +40,7 @@ config PREEMPT
depends on !ARCH_NO_PREEMPT
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
select PREEMPT_DYNAMIC if HAVE_PREEMPT_DYNAMIC
help
This option reduces the latency of the kernel by making
all kernel code (that is not executing in a critical section)
@ -80,3 +81,21 @@ config PREEMPT_COUNT
config PREEMPTION
bool
select PREEMPT_COUNT
config PREEMPT_DYNAMIC
bool
help
This option allows to define the preemption model on the kernel
command line parameter and thus override the default preemption
model defined during compile time.
The feature is primarily interesting for Linux distributions which
provide a pre-built kernel binary to reduce the number of kernel
flavors they offer while still offering different usecases.
The runtime overhead is negligible with HAVE_STATIC_CALL_INLINE enabled
but if runtime patching is not available for the specific architecture
then the potential overhead should be considered.
Interesting if you want the same pre-built kernel should be used for
both Server and Desktop workloads.

View File

@ -184,6 +184,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
* enabled above.
*/
local_irq_disable_exit_to_user();
/* Check if any of the above work has queued a deferred wakeup */
rcu_nocb_flush_deferred_wakeup();
ti_work = READ_ONCE(current_thread_info()->flags);
}
@ -197,6 +201,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
lockdep_assert_irqs_disabled();
/* Flush pending rcuog wakeup before the last need_resched() check */
rcu_nocb_flush_deferred_wakeup();
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
@ -385,6 +392,9 @@ void irqentry_exit_cond_resched(void)
preempt_schedule_irq();
}
}
#ifdef CONFIG_PREEMPT_DYNAMIC
DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
#endif
noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
@ -411,8 +421,13 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
}
instrumentation_begin();
if (IS_ENABLED(CONFIG_PREEMPTION))
if (IS_ENABLED(CONFIG_PREEMPTION)) {
#ifdef CONFIG_PREEMT_DYNAMIC
static_call(irqentry_exit_cond_resched)();
#else
irqentry_exit_cond_resched();
#endif
}
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();

View File

@ -1597,50 +1597,91 @@ static void perf_event_groups_init(struct perf_event_groups *groups)
groups->index = 0;
}
static inline struct cgroup *event_cgroup(const struct perf_event *event)
{
struct cgroup *cgroup = NULL;
#ifdef CONFIG_CGROUP_PERF
if (event->cgrp)
cgroup = event->cgrp->css.cgroup;
#endif
return cgroup;
}
/*
* Compare function for event groups;
*
* Implements complex key that first sorts by CPU and then by virtual index
* which provides ordering when rotating groups for the same CPU.
*/
static bool
perf_event_groups_less(struct perf_event *left, struct perf_event *right)
static __always_inline int
perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
const u64 left_group_index, const struct perf_event *right)
{
if (left->cpu < right->cpu)
return true;
if (left->cpu > right->cpu)
return false;
if (left_cpu < right->cpu)
return -1;
if (left_cpu > right->cpu)
return 1;
#ifdef CONFIG_CGROUP_PERF
if (left->cgrp != right->cgrp) {
if (!left->cgrp || !left->cgrp->css.cgroup) {
{
const struct cgroup *right_cgroup = event_cgroup(right);
if (left_cgroup != right_cgroup) {
if (!left_cgroup) {
/*
* Left has no cgroup but right does, no cgroups come
* first.
* Left has no cgroup but right does, no
* cgroups come first.
*/
return true;
return -1;
}
if (!right->cgrp || !right->cgrp->css.cgroup) {
if (!right_cgroup) {
/*
* Right has no cgroup but left does, no cgroups come
* first.
* Right has no cgroup but left does, no
* cgroups come first.
*/
return false;
return 1;
}
/* Two dissimilar cgroups, order by id. */
if (left->cgrp->css.cgroup->kn->id < right->cgrp->css.cgroup->kn->id)
return true;
if (cgroup_id(left_cgroup) < cgroup_id(right_cgroup))
return -1;
return false;
return 1;
}
}
#endif
if (left->group_index < right->group_index)
return true;
if (left->group_index > right->group_index)
return false;
if (left_group_index < right->group_index)
return -1;
if (left_group_index > right->group_index)
return 1;
return false;
return 0;
}
#define __node_2_pe(node) \
rb_entry((node), struct perf_event, group_node)
static inline bool __group_less(struct rb_node *a, const struct rb_node *b)
{
struct perf_event *e = __node_2_pe(a);
return perf_event_groups_cmp(e->cpu, event_cgroup(e), e->group_index,
__node_2_pe(b)) < 0;
}
struct __group_key {
int cpu;
struct cgroup *cgroup;
};
static inline int __group_cmp(const void *key, const struct rb_node *node)
{
const struct __group_key *a = key;
const struct perf_event *b = __node_2_pe(node);
/* partial/subtree match: @cpu, @cgroup; ignore: @group_index */
return perf_event_groups_cmp(a->cpu, a->cgroup, b->group_index, b);
}
/*
@ -1652,27 +1693,9 @@ static void
perf_event_groups_insert(struct perf_event_groups *groups,
struct perf_event *event)
{
struct perf_event *node_event;
struct rb_node *parent;
struct rb_node **node;
event->group_index = ++groups->index;
node = &groups->tree.rb_node;
parent = *node;
while (*node) {
parent = *node;
node_event = container_of(*node, struct perf_event, group_node);
if (perf_event_groups_less(event, node_event))
node = &parent->rb_left;
else
node = &parent->rb_right;
}
rb_link_node(&event->group_node, parent, node);
rb_insert_color(&event->group_node, &groups->tree);
rb_add(&event->group_node, &groups->tree, __group_less);
}
/*
@ -1720,45 +1743,17 @@ static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu,
struct cgroup *cgrp)
{
struct perf_event *node_event = NULL, *match = NULL;
struct rb_node *node = groups->tree.rb_node;
#ifdef CONFIG_CGROUP_PERF
u64 node_cgrp_id, cgrp_id = 0;
struct __group_key key = {
.cpu = cpu,
.cgroup = cgrp,
};
struct rb_node *node;
if (cgrp)
cgrp_id = cgrp->kn->id;
#endif
node = rb_find_first(&key, &groups->tree, __group_cmp);
if (node)
return __node_2_pe(node);
while (node) {
node_event = container_of(node, struct perf_event, group_node);
if (cpu < node_event->cpu) {
node = node->rb_left;
continue;
}
if (cpu > node_event->cpu) {
node = node->rb_right;
continue;
}
#ifdef CONFIG_CGROUP_PERF
node_cgrp_id = 0;
if (node_event->cgrp && node_event->cgrp->css.cgroup)
node_cgrp_id = node_event->cgrp->css.cgroup->kn->id;
if (cgrp_id < node_cgrp_id) {
node = node->rb_left;
continue;
}
if (cgrp_id > node_cgrp_id) {
node = node->rb_right;
continue;
}
#endif
match = node_event;
node = node->rb_left;
}
return match;
return NULL;
}
/*
@ -1767,27 +1762,17 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
static struct perf_event *
perf_event_groups_next(struct perf_event *event)
{
struct perf_event *next;
#ifdef CONFIG_CGROUP_PERF
u64 curr_cgrp_id = 0;
u64 next_cgrp_id = 0;
#endif
struct __group_key key = {
.cpu = event->cpu,
.cgroup = event_cgroup(event),
};
struct rb_node *next;
next = rb_next_match(&key, &event->group_node, __group_cmp);
if (next)
return __node_2_pe(next);
next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node);
if (next == NULL || next->cpu != event->cpu)
return NULL;
#ifdef CONFIG_CGROUP_PERF
if (event->cgrp && event->cgrp->css.cgroup)
curr_cgrp_id = event->cgrp->css.cgroup->kn->id;
if (next->cgrp && next->cgrp->css.cgroup)
next_cgrp_id = next->cgrp->css.cgroup->kn->id;
if (curr_cgrp_id != next_cgrp_id)
return NULL;
#endif
return next;
}
/*

View File

@ -613,41 +613,56 @@ static void put_uprobe(struct uprobe *uprobe)
}
}
static int match_uprobe(struct uprobe *l, struct uprobe *r)
static __always_inline
int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
const struct uprobe *r)
{
if (l->inode < r->inode)
if (l_inode < r->inode)
return -1;
if (l->inode > r->inode)
if (l_inode > r->inode)
return 1;
if (l->offset < r->offset)
if (l_offset < r->offset)
return -1;
if (l->offset > r->offset)
if (l_offset > r->offset)
return 1;
return 0;
}
#define __node_2_uprobe(node) \
rb_entry((node), struct uprobe, rb_node)
struct __uprobe_key {
struct inode *inode;
loff_t offset;
};
static inline int __uprobe_cmp_key(const void *key, const struct rb_node *b)
{
const struct __uprobe_key *a = key;
return uprobe_cmp(a->inode, a->offset, __node_2_uprobe(b));
}
static inline int __uprobe_cmp(struct rb_node *a, const struct rb_node *b)
{
struct uprobe *u = __node_2_uprobe(a);
return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b));
}
static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
{
struct uprobe u = { .inode = inode, .offset = offset };
struct rb_node *n = uprobes_tree.rb_node;
struct uprobe *uprobe;
int match;
struct __uprobe_key key = {
.inode = inode,
.offset = offset,
};
struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
while (n) {
uprobe = rb_entry(n, struct uprobe, rb_node);
match = match_uprobe(&u, uprobe);
if (!match)
return get_uprobe(uprobe);
if (node)
return get_uprobe(__node_2_uprobe(node));
if (match < 0)
n = n->rb_left;
else
n = n->rb_right;
}
return NULL;
}
@ -668,32 +683,15 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node **p = &uprobes_tree.rb_node;
struct rb_node *parent = NULL;
struct uprobe *u;
int match;
struct rb_node *node;
while (*p) {
parent = *p;
u = rb_entry(parent, struct uprobe, rb_node);
match = match_uprobe(uprobe, u);
if (!match)
return get_uprobe(u);
node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
if (node)
return get_uprobe(__node_2_uprobe(node));
if (match < 0)
p = &parent->rb_left;
else
p = &parent->rb_right;
}
u = NULL;
rb_link_node(&uprobe->rb_node, parent, p);
rb_insert_color(&uprobe->rb_node, &uprobes_tree);
/* get access + creation ref */
refcount_set(&uprobe->ref, 2);
return u;
return NULL;
}
/*

View File

@ -267,27 +267,18 @@ rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
return 1;
}
#define __node_2_waiter(node) \
rb_entry((node), struct rt_mutex_waiter, tree_entry)
static inline bool __waiter_less(struct rb_node *a, const struct rb_node *b)
{
return rt_mutex_waiter_less(__node_2_waiter(a), __node_2_waiter(b));
}
static void
rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
{
struct rb_node **link = &lock->waiters.rb_root.rb_node;
struct rb_node *parent = NULL;
struct rt_mutex_waiter *entry;
bool leftmost = true;
while (*link) {
parent = *link;
entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
if (rt_mutex_waiter_less(waiter, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(&waiter->tree_entry, parent, link);
rb_insert_color_cached(&waiter->tree_entry, &lock->waiters, leftmost);
rb_add_cached(&waiter->tree_entry, &lock->waiters, __waiter_less);
}
static void
@ -300,27 +291,18 @@ rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
RB_CLEAR_NODE(&waiter->tree_entry);
}
#define __node_2_pi_waiter(node) \
rb_entry((node), struct rt_mutex_waiter, pi_tree_entry)
static inline bool __pi_waiter_less(struct rb_node *a, const struct rb_node *b)
{
return rt_mutex_waiter_less(__node_2_pi_waiter(a), __node_2_pi_waiter(b));
}
static void
rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
{
struct rb_node **link = &task->pi_waiters.rb_root.rb_node;
struct rb_node *parent = NULL;
struct rt_mutex_waiter *entry;
bool leftmost = true;
while (*link) {
parent = *link;
entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
if (rt_mutex_waiter_less(waiter, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(&waiter->pi_tree_entry, parent, link);
rb_insert_color_cached(&waiter->pi_tree_entry, &task->pi_waiters, leftmost);
rb_add_cached(&waiter->pi_tree_entry, &task->pi_waiters, __pi_waiter_less);
}
static void

View File

@ -649,7 +649,6 @@ static noinstr void rcu_eqs_enter(bool user)
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
rdp = this_cpu_ptr(&rcu_data);
do_nocb_deferred_wakeup(rdp);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
@ -683,6 +682,50 @@ void rcu_idle_enter(void)
EXPORT_SYMBOL_GPL(rcu_idle_enter);
#ifdef CONFIG_NO_HZ_FULL
#if !defined(CONFIG_GENERIC_ENTRY) || !defined(CONFIG_KVM_XFER_TO_GUEST_WORK)
/*
* An empty function that will trigger a reschedule on
* IRQ tail once IRQs get re-enabled on userspace/guest resume.
*/
static void late_wakeup_func(struct irq_work *work)
{
}
static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
IRQ_WORK_INIT(late_wakeup_func);
/*
* If either:
*
* 1) the task is about to enter in guest mode and $ARCH doesn't support KVM generic work
* 2) the task is about to enter in user mode and $ARCH doesn't support generic entry.
*
* In these cases the late RCU wake ups aren't supported in the resched loops and our
* last resort is to fire a local irq_work that will trigger a reschedule once IRQs
* get re-enabled again.
*/
noinstr static void rcu_irq_work_resched(void)
{
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
if (IS_ENABLED(CONFIG_GENERIC_ENTRY) && !(current->flags & PF_VCPU))
return;
if (IS_ENABLED(CONFIG_KVM_XFER_TO_GUEST_WORK) && (current->flags & PF_VCPU))
return;
instrumentation_begin();
if (do_nocb_deferred_wakeup(rdp) && need_resched()) {
irq_work_queue(this_cpu_ptr(&late_wakeup_work));
}
instrumentation_end();
}
#else
static inline void rcu_irq_work_resched(void) { }
#endif
/**
* rcu_user_enter - inform RCU that we are resuming userspace.
*
@ -697,8 +740,16 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
noinstr void rcu_user_enter(void)
{
lockdep_assert_irqs_disabled();
/*
* Other than generic entry implementation, we may be past the last
* rescheduling opportunity in the entry code. Trigger a self IPI
* that will fire and reschedule once we resume in user/guest mode.
*/
rcu_irq_work_resched();
rcu_eqs_enter(true);
}
#endif /* CONFIG_NO_HZ_FULL */
/**

View File

@ -435,7 +435,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
unsigned long flags);
static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_cpu_nocb_kthread(int cpu);
static void __init rcu_spawn_nocb_kthreads(void);

View File

@ -1632,7 +1632,7 @@ bool rcu_is_nocb_cpu(int cpu)
* Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock
* and this function releases it.
*/
static void wake_nocb_gp(struct rcu_data *rdp, bool force,
static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
unsigned long flags)
__releases(rdp->nocb_lock)
{
@ -1644,7 +1644,7 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
rcu_nocb_unlock_irqrestore(rdp, flags);
return;
return false;
}
del_timer(&rdp->nocb_timer);
rcu_nocb_unlock_irqrestore(rdp, flags);
@ -1657,6 +1657,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
if (needwake)
wake_up_process(rdp_gp->nocb_gp_kthread);
return needwake;
}
/*
@ -2251,20 +2253,23 @@ static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
}
/* Do a deferred wakeup of rcu_nocb_kthread(). */
static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
{
unsigned long flags;
int ndw;
int ret;
rcu_nocb_lock_irqsave(rdp, flags);
if (!rcu_nocb_need_deferred_wakeup(rdp)) {
rcu_nocb_unlock_irqrestore(rdp, flags);
return;
return false;
}
ndw = READ_ONCE(rdp->nocb_defer_wakeup);
WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret;
}
/* Do a deferred wakeup of rcu_nocb_kthread() from a timer handler. */
@ -2280,12 +2285,19 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
* This means we do an inexact common-case check. Note that if
* we miss, ->nocb_timer will eventually clean things up.
*/
static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
{
if (rcu_nocb_need_deferred_wakeup(rdp))
do_nocb_deferred_wakeup_common(rdp);
return do_nocb_deferred_wakeup_common(rdp);
return false;
}
void rcu_nocb_flush_deferred_wakeup(void)
{
do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
}
EXPORT_SYMBOL_GPL(rcu_nocb_flush_deferred_wakeup);
static int rdp_offload_toggle(struct rcu_data *rdp,
bool offload, unsigned long flags)
__releases(rdp->nocb_lock)
@ -2835,8 +2847,9 @@ static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
return false;
}
static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
{
return false;
}
static void rcu_spawn_cpu_nocb_kthread(int cpu)

View File

@ -355,8 +355,9 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
static void __hrtick_restart(struct rq *rq)
{
struct hrtimer *timer = &rq->hrtick_timer;
ktime_t time = rq->hrtick_time;
hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED_HARD);
hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
}
/*
@ -380,7 +381,6 @@ static void __hrtick_start(void *arg)
void hrtick_start(struct rq *rq, u64 delay)
{
struct hrtimer *timer = &rq->hrtick_timer;
ktime_t time;
s64 delta;
/*
@ -388,9 +388,7 @@ void hrtick_start(struct rq *rq, u64 delay)
* doesn't make sense and can cause timer DoS.
*/
delta = max_t(s64, delay, 10000LL);
time = ktime_add_ns(timer->base->get_time(), delta);
hrtimer_set_expires(timer, time);
rq->hrtick_time = ktime_add_ns(timer->base->get_time(), delta);
if (rq == this_rq())
__hrtick_restart(rq);
@ -4970,7 +4968,7 @@ static void __sched notrace __schedule(bool preempt)
schedule_debug(prev, preempt);
if (sched_feat(HRTICK))
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
hrtick_clear(rq);
local_irq_disable();
@ -5264,6 +5262,12 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
NOKPROBE_SYMBOL(preempt_schedule);
EXPORT_SYMBOL(preempt_schedule);
#ifdef CONFIG_PREEMPT_DYNAMIC
DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
EXPORT_STATIC_CALL_TRAMP(preempt_schedule);
#endif
/**
* preempt_schedule_notrace - preempt_schedule called by tracing
*
@ -5316,8 +5320,197 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
}
EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
#ifdef CONFIG_PREEMPT_DYNAMIC
DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
#endif
#endif /* CONFIG_PREEMPTION */
#ifdef CONFIG_PREEMPT_DYNAMIC
#include <linux/entry-common.h>
/*
* SC:cond_resched
* SC:might_resched
* SC:preempt_schedule
* SC:preempt_schedule_notrace
* SC:irqentry_exit_cond_resched
*
*
* NONE:
* cond_resched <- __cond_resched
* might_resched <- RET0
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
*
* VOLUNTARY:
* cond_resched <- __cond_resched
* might_resched <- __cond_resched
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
*
* FULL:
* cond_resched <- RET0
* might_resched <- RET0
* preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
*/
enum {
preempt_dynamic_none = 0,
preempt_dynamic_voluntary,
preempt_dynamic_full,
};
static int preempt_dynamic_mode = preempt_dynamic_full;
static int sched_dynamic_mode(const char *str)
{
if (!strcmp(str, "none"))
return 0;
if (!strcmp(str, "voluntary"))
return 1;
if (!strcmp(str, "full"))
return 2;
return -1;
}
static void sched_dynamic_update(int mode)
{
/*
* Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
* the ZERO state, which is invalid.
*/
static_call_update(cond_resched, __cond_resched);
static_call_update(might_resched, __cond_resched);
static_call_update(preempt_schedule, __preempt_schedule_func);
static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
switch (mode) {
case preempt_dynamic_none:
static_call_update(cond_resched, __cond_resched);
static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
pr_info("Dynamic Preempt: none\n");
break;
case preempt_dynamic_voluntary:
static_call_update(cond_resched, __cond_resched);
static_call_update(might_resched, __cond_resched);
static_call_update(preempt_schedule, (typeof(&preempt_schedule)) NULL);
static_call_update(preempt_schedule_notrace, (typeof(&preempt_schedule_notrace)) NULL);
static_call_update(irqentry_exit_cond_resched, (typeof(&irqentry_exit_cond_resched)) NULL);
pr_info("Dynamic Preempt: voluntary\n");
break;
case preempt_dynamic_full:
static_call_update(cond_resched, (typeof(&__cond_resched)) __static_call_return0);
static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0);
static_call_update(preempt_schedule, __preempt_schedule_func);
static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
pr_info("Dynamic Preempt: full\n");
break;
}
preempt_dynamic_mode = mode;
}
static int __init setup_preempt_mode(char *str)
{
int mode = sched_dynamic_mode(str);
if (mode < 0) {
pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
return 1;
}
sched_dynamic_update(mode);
return 0;
}
__setup("preempt=", setup_preempt_mode);
#ifdef CONFIG_SCHED_DEBUG
static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
{
char buf[16];
int mode;
if (cnt > 15)
cnt = 15;
if (copy_from_user(&buf, ubuf, cnt))
return -EFAULT;
buf[cnt] = 0;
mode = sched_dynamic_mode(strstrip(buf));
if (mode < 0)
return mode;
sched_dynamic_update(mode);
*ppos += cnt;
return cnt;
}
static int sched_dynamic_show(struct seq_file *m, void *v)
{
static const char * preempt_modes[] = {
"none", "voluntary", "full"
};
int i;
for (i = 0; i < ARRAY_SIZE(preempt_modes); i++) {
if (preempt_dynamic_mode == i)
seq_puts(m, "(");
seq_puts(m, preempt_modes[i]);
if (preempt_dynamic_mode == i)
seq_puts(m, ")");
seq_puts(m, " ");
}
seq_puts(m, "\n");
return 0;
}
static int sched_dynamic_open(struct inode *inode, struct file *filp)
{
return single_open(filp, sched_dynamic_show, NULL);
}
static const struct file_operations sched_dynamic_fops = {
.open = sched_dynamic_open,
.write = sched_dynamic_write,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static __init int sched_init_debug_dynamic(void)
{
debugfs_create_file("sched_preempt", 0644, NULL, NULL, &sched_dynamic_fops);
return 0;
}
late_initcall(sched_init_debug_dynamic);
#endif /* CONFIG_SCHED_DEBUG */
#endif /* CONFIG_PREEMPT_DYNAMIC */
/*
* This is the entry point to schedule() from kernel preemption
* off of irq context.
@ -5615,8 +5808,12 @@ SYSCALL_DEFINE1(nice, int, increment)
* @p: the task in question.
*
* Return: The priority value as seen by users in /proc.
* RT tasks are offset by -200. Normal tasks are centered
* around 0, value goes from -16 to +15.
*
* sched policy return value kernel prio user prio/nice
*
* normal, batch, idle [0 ... 39] [100 ... 139] 0/[-20 ... 19]
* fifo, rr [-2 ... -100] [98 ... 0] [1 ... 99]
* deadline -101 -1 0
*/
int task_prio(const struct task_struct *p)
{
@ -5675,6 +5872,120 @@ struct task_struct *idle_task(int cpu)
return cpu_rq(cpu)->idle;
}
#ifdef CONFIG_SMP
/*
* This function computes an effective utilization for the given CPU, to be
* used for frequency selection given the linear relation: f = u * f_max.
*
* The scheduler tracks the following metrics:
*
* cpu_util_{cfs,rt,dl,irq}()
* cpu_bw_dl()
*
* Where the cfs,rt and dl util numbers are tracked with the same metric and
* synchronized windows and are thus directly comparable.
*
* The cfs,rt,dl utilization are the running times measured with rq->clock_task
* which excludes things like IRQ and steal-time. These latter are then accrued
* in the irq utilization.
*
* The DL bandwidth number otoh is not a measured metric but a value computed
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long max, enum cpu_util_type type,
struct task_struct *p)
{
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
}
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
if (unlikely(irq >= max))
return max;
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
* CFS and RT utilization can be boosted or capped, depending on
* utilization clamp constraints requested by currently RUNNABLE
* tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
util = uclamp_rq_util_with(rq, util, p);
dl_util = cpu_util_dl(rq);
/*
* For frequency selection we do not make cpu_util_dl() a permanent part
* of this sum because we want to use cpu_bw_dl() later on, but we need
* to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
* that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
*/
if (util + dl_util >= max)
return max;
/*
* OTOH, for energy computation we need the estimated running time, so
* include util_dl and ignore dl_bw.
*/
if (type == ENERGY_UTIL)
util += dl_util;
/*
* There is still idle time; further improve the number by using the
* irq metric. Because IRQ/steal time is hidden from the task clock we
* need to scale the task numbers:
*
* max - irq
* U' = irq + --------- * U
* max
*/
util = scale_irq_capacity(util, irq, max);
util += irq;
/*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
if (type == FREQUENCY_UTIL)
util += cpu_bw_dl(rq);
return min(max, util);
}
unsigned long sched_cpu_util(int cpu, unsigned long max)
{
return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max,
ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */
/**
* find_process_by_pid - find a process with a matching PID value.
* @pid: the pid in question.
@ -5796,11 +6107,10 @@ recheck:
/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
* 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL,
* SCHED_BATCH and SCHED_IDLE is 0.
*/
if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
(!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
if (attr->sched_priority > MAX_RT_PRIO-1)
return -EINVAL;
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
@ -6667,17 +6977,27 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
}
#ifndef CONFIG_PREEMPTION
int __sched _cond_resched(void)
#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
int __sched __cond_resched(void)
{
if (should_resched(0)) {
preempt_schedule_common();
return 1;
}
#ifndef CONFIG_PREEMPT_RCU
rcu_all_qs();
#endif
return 0;
}
EXPORT_SYMBOL(_cond_resched);
EXPORT_SYMBOL(__cond_resched);
#endif
#ifdef CONFIG_PREEMPT_DYNAMIC
DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
EXPORT_STATIC_CALL_TRAMP(cond_resched);
DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched);
EXPORT_STATIC_CALL_TRAMP(might_resched);
#endif
/*
@ -6868,7 +7188,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
ret = MAX_USER_RT_PRIO-1;
ret = MAX_RT_PRIO-1;
break;
case SCHED_DEADLINE:
case SCHED_NORMAL:
@ -7508,6 +7828,12 @@ int sched_cpu_deactivate(unsigned int cpu)
struct rq_flags rf;
int ret;
/*
* Remove CPU from nohz.idle_cpus_mask to prevent participating in
* load balancing when not active
*/
nohz_balance_exit_idle(rq);
set_cpu_active(cpu, false);
/*
@ -7652,7 +7978,6 @@ int sched_cpu_dying(unsigned int cpu)
calc_load_migrate(rq);
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
return 0;
}

View File

@ -171,112 +171,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}
/*
* This function computes an effective utilization for the given CPU, to be
* used for frequency selection given the linear relation: f = u * f_max.
*
* The scheduler tracks the following metrics:
*
* cpu_util_{cfs,rt,dl,irq}()
* cpu_bw_dl()
*
* Where the cfs,rt and dl util numbers are tracked with the same metric and
* synchronized windows and are thus directly comparable.
*
* The cfs,rt,dl utilization are the running times measured with rq->clock_task
* which excludes things like IRQ and steal-time. These latter are then accrued
* in the irq utilization.
*
* The DL bandwidth number otoh is not a measured metric but a value computed
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
unsigned long max, enum schedutil_type type,
struct task_struct *p)
{
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
}
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
if (unlikely(irq >= max))
return max;
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
* CFS and RT utilization can be boosted or capped, depending on
* utilization clamp constraints requested by currently RUNNABLE
* tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
util = uclamp_rq_util_with(rq, util, p);
dl_util = cpu_util_dl(rq);
/*
* For frequency selection we do not make cpu_util_dl() a permanent part
* of this sum because we want to use cpu_bw_dl() later on, but we need
* to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
* that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
*/
if (util + dl_util >= max)
return max;
/*
* OTOH, for energy computation we need the estimated running time, so
* include util_dl and ignore dl_bw.
*/
if (type == ENERGY_UTIL)
util += dl_util;
/*
* There is still idle time; further improve the number by using the
* irq metric. Because IRQ/steal time is hidden from the task clock we
* need to scale the task numbers:
*
* max - irq
* U' = irq + --------- * U
* max
*/
util = scale_irq_capacity(util, irq, max);
util += irq;
/*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
if (type == FREQUENCY_UTIL)
util += cpu_bw_dl(rq);
return min(max, util);
}
static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
@ -284,7 +178,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util = schedutil_cpu_util(sg_cpu->cpu, cpu_util_cfs(rq), max,
sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(rq), max,
FREQUENCY_UTIL, NULL);
}

View File

@ -517,58 +517,44 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
update_dl_migration(dl_rq);
}
#define __node_2_pdl(node) \
rb_entry((node), struct task_struct, pushable_dl_tasks)
static inline bool __pushable_less(struct rb_node *a, const struct rb_node *b)
{
return dl_entity_preempt(&__node_2_pdl(a)->dl, &__node_2_pdl(b)->dl);
}
/*
* The list of pushable -deadline task is not a plist, like in
* sched_rt.c, it is an rb-tree with tasks ordered by deadline.
*/
static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
{
struct dl_rq *dl_rq = &rq->dl;
struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_root.rb_node;
struct rb_node *parent = NULL;
struct task_struct *entry;
bool leftmost = true;
struct rb_node *leftmost;
BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
while (*link) {
parent = *link;
entry = rb_entry(parent, struct task_struct,
pushable_dl_tasks);
if (dl_entity_preempt(&p->dl, &entry->dl))
link = &parent->rb_left;
else {
link = &parent->rb_right;
leftmost = false;
}
}
leftmost = rb_add_cached(&p->pushable_dl_tasks,
&rq->dl.pushable_dl_tasks_root,
__pushable_less);
if (leftmost)
dl_rq->earliest_dl.next = p->dl.deadline;
rb_link_node(&p->pushable_dl_tasks, parent, link);
rb_insert_color_cached(&p->pushable_dl_tasks,
&dl_rq->pushable_dl_tasks_root, leftmost);
rq->dl.earliest_dl.next = p->dl.deadline;
}
static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
{
struct dl_rq *dl_rq = &rq->dl;
struct rb_root_cached *root = &dl_rq->pushable_dl_tasks_root;
struct rb_node *leftmost;
if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
return;
if (dl_rq->pushable_dl_tasks_root.rb_leftmost == &p->pushable_dl_tasks) {
struct rb_node *next_node;
leftmost = rb_erase_cached(&p->pushable_dl_tasks, root);
if (leftmost)
dl_rq->earliest_dl.next = __node_2_pdl(leftmost)->dl.deadline;
next_node = rb_next(&p->pushable_dl_tasks);
if (next_node) {
dl_rq->earliest_dl.next = rb_entry(next_node,
struct task_struct, pushable_dl_tasks)->dl.deadline;
}
}
rb_erase_cached(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
}
@ -1478,29 +1464,21 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
dec_dl_migration(dl_se, dl_rq);
}
#define __node_2_dle(node) \
rb_entry((node), struct sched_dl_entity, rb_node)
static inline bool __dl_less(struct rb_node *a, const struct rb_node *b)
{
return dl_time_before(__node_2_dle(a)->deadline, __node_2_dle(b)->deadline);
}
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rb_node **link = &dl_rq->root.rb_root.rb_node;
struct rb_node *parent = NULL;
struct sched_dl_entity *entry;
int leftmost = 1;
BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_dl_entity, rb_node);
if (dl_time_before(dl_se->deadline, entry->deadline))
link = &parent->rb_left;
else {
link = &parent->rb_right;
leftmost = 0;
}
}
rb_link_node(&dl_se->rb_node, parent, link);
rb_insert_color_cached(&dl_se->rb_node, &dl_rq->root, leftmost);
rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
inc_dl_tasks(dl_se, dl_rq);
}
@ -1513,6 +1491,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
return;
rb_erase_cached(&dl_se->rb_node, &dl_rq->root);
RB_CLEAR_NODE(&dl_se->rb_node);
dec_dl_tasks(dl_se, dl_rq);
@ -1853,7 +1832,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;
if (hrtick_enabled(rq))
if (hrtick_enabled_dl(rq))
start_hrtick_dl(rq, p);
if (rq->curr->sched_class != &dl_sched_class)
@ -1916,7 +1895,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
* not being the leftmost task anymore. In that case NEED_RESCHED will
* be set and schedule() will start a new hrtick for the next task.
*/
if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
is_leftmost(p, &rq->dl))
start_hrtick_dl(rq, p);
}
@ -2409,9 +2388,13 @@ void dl_add_task_root_domain(struct task_struct *p)
struct rq *rq;
struct dl_bw *dl_b;
rq = task_rq_lock(p, &rf);
if (!dl_task(p))
goto unlock;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
if (!dl_task(p)) {
raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
return;
}
rq = __task_rq_lock(p, &rf);
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
@ -2420,7 +2403,6 @@ void dl_add_task_root_domain(struct task_struct *p)
raw_spin_unlock(&dl_b->lock);
unlock:
task_rq_unlock(rq, p, &rf);
}
@ -2514,7 +2496,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
static void prio_changed_dl(struct rq *rq, struct task_struct *p,
int oldprio)
{
if (task_on_rq_queued(p) || rq->curr == p) {
if (task_on_rq_queued(p) || task_current(rq, p)) {
#ifdef CONFIG_SMP
/*
* This might be too much, but unfortunately

View File

@ -486,7 +486,7 @@ static char *task_group_path(struct task_group *tg)
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
if (rq->curr == p)
if (task_current(rq, p))
SEQ_printf(m, ">R");
else
SEQ_printf(m, " %c", task_state_to_char(p));

View File

@ -531,12 +531,15 @@ static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
return min_vruntime;
}
static inline int entity_before(struct sched_entity *a,
static inline bool entity_before(struct sched_entity *a,
struct sched_entity *b)
{
return (s64)(a->vruntime - b->vruntime) < 0;
}
#define __node_2_se(node) \
rb_entry((node), struct sched_entity, run_node)
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
@ -552,8 +555,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}
if (leftmost) { /* non-empty tree */
struct sched_entity *se;
se = rb_entry(leftmost, struct sched_entity, run_node);
struct sched_entity *se = __node_2_se(leftmost);
if (!curr)
vruntime = se->vruntime;
@ -569,37 +571,17 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
#endif
}
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
return entity_before(__node_2_se(a), __node_2_se(b));
}
/*
* Enqueue an entity into the rb-tree:
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
bool leftmost = true;
/*
* Find the right place in the rbtree:
*/
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
if (entity_before(se, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(&se->run_node, parent, link);
rb_insert_color_cached(&se->run_node,
&cfs_rq->tasks_timeline, leftmost);
rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@ -614,7 +596,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
if (!left)
return NULL;
return rb_entry(left, struct sched_entity, run_node);
return __node_2_se(left);
}
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
@ -624,7 +606,7 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
if (!next)
return NULL;
return rb_entry(next, struct sched_entity, run_node);
return __node_2_se(next);
}
#ifdef CONFIG_SCHED_DEBUG
@ -635,7 +617,7 @@ struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
if (!last)
return NULL;
return rb_entry(last, struct sched_entity, run_node);
return __node_2_se(last);
}
/**************************************************************
@ -3943,6 +3925,22 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
trace_sched_util_est_cfs_tp(cfs_rq);
}
static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
unsigned int enqueued;
if (!sched_feat(UTIL_EST))
return;
/* Update root cfs_rq's estimated utilization */
enqueued = cfs_rq->avg.util_est.enqueued;
enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
}
/*
* Check if a (signed) value is within a specified (unsigned) margin,
* based on the observation that:
@ -3956,23 +3954,16 @@ static inline bool within_margin(int value, int margin)
return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}
static void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
static inline void util_est_update(struct cfs_rq *cfs_rq,
struct task_struct *p,
bool task_sleep)
{
long last_ewma_diff;
struct util_est ue;
int cpu;
if (!sched_feat(UTIL_EST))
return;
/* Update root cfs_rq's estimated utilization */
ue.enqueued = cfs_rq->avg.util_est.enqueued;
ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
/*
* Skip update of task's estimated utilization when the task has not
* yet completed an activation, e.g. being migrated.
@ -4012,8 +4003,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
cpu = cpu_of(rq_of(cfs_rq));
if (task_util(p) > capacity_orig_of(cpu))
if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
return;
/*
@ -4052,7 +4042,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
if (!static_branch_unlikely(&sched_asym_cpucapacity))
return;
if (!p) {
if (!p || p->nr_cpus_allowed == 1) {
rq->misfit_task_load = 0;
return;
}
@ -4096,7 +4086,10 @@ static inline void
util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
static inline void
util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
@ -5419,7 +5412,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
s64 delta = slice - ran;
if (delta < 0) {
if (rq->curr == p)
if (task_current(rq, p))
resched_curr(rq);
return;
}
@ -5436,7 +5429,7 @@ static void hrtick_update(struct rq *rq)
{
struct task_struct *curr = rq->curr;
if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
return;
if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@ -5609,6 +5602,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
util_est_dequeue(&rq->cfs, p);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@ -5659,7 +5654,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
rq->next_balance = jiffies;
dequeue_throttle:
util_est_dequeue(&rq->cfs, p, task_sleep);
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@ -6006,6 +6001,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return new_cpu;
}
static inline int __select_idle_cpu(int cpu)
{
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
return -1;
}
#ifdef CONFIG_SCHED_SMT
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
@ -6064,74 +6067,51 @@ unlock:
* there are no idle cores left in the system; tracked through
* sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
*/
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;
bool idle = true;
int cpu;
if (!static_branch_likely(&sched_smt_present))
return -1;
if (!test_idle_cores(target, false))
return -1;
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
for_each_cpu_wrap(core, cpus, target) {
bool idle = true;
return __select_idle_cpu(core);
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
if (*idle_cpu == -1) {
if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
*idle_cpu = cpu;
break;
}
continue;
}
break;
}
if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
*idle_cpu = cpu;
}
if (idle)
return core;
cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
}
/*
* Failed to find an idle core; stop looking for one.
*/
set_idle_cores(target, 0);
return -1;
}
/*
* Scan the local SMT mask for idle CPUs.
*/
static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
{
int cpu;
if (!static_branch_likely(&sched_smt_present))
return -1;
for_each_cpu(cpu, cpu_smt_mask(target)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
return -1;
}
#else /* CONFIG_SCHED_SMT */
static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
static inline void set_idle_cores(int cpu, int val)
{
return -1;
}
static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
static inline bool test_idle_cores(int cpu, bool def)
{
return -1;
return def;
}
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core);
}
#endif /* CONFIG_SCHED_SMT */
@ -6144,49 +6124,61 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
u64 avg_cost, avg_idle;
u64 time;
int i, cpu, idle_cpu = -1, nr = INT_MAX;
bool smt = test_idle_cores(target, false);
int this = smp_processor_id();
int cpu, nr = INT_MAX;
struct sched_domain *this_sd;
u64 time;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
return -1;
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
/*
* Due to large variance we need a large fuzz factor; hackbench in
* particularly is sensitive here.
* Due to large variance we need a large fuzz factor;
* hackbench in particularly is sensitive here.
*/
avg_idle = this_rq()->avg_idle / 512;
avg_cost = this_sd->avg_scan_cost + 1;
if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
return -1;
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
}
time = cpu_clock(this);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
break;
}
for_each_cpu_wrap(cpu, cpus, target) {
if (smt) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;
} else {
if (!--nr)
return -1;
idle_cpu = __select_idle_cpu(cpu);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
}
}
if (smt)
set_idle_cores(this, false);
if (sched_feat(SIS_PROP) && !smt) {
time = cpu_clock(this) - time;
update_avg(&this_sd->avg_scan_cost, time);
}
return cpu;
return idle_cpu;
}
/*
@ -6315,18 +6307,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
i = select_idle_core(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
return target;
}
@ -6543,7 +6527,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
ENERGY_UTIL, NULL);
/*
@ -6553,7 +6537,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
FREQUENCY_UTIL, tsk);
max_util = max(max_util, cpu_util);
}
@ -6651,7 +6635,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* IOW, placing the task there would make the CPU
* overutilized. Take uclamp into account to see how
* much capacity we can get out of the CPU; this is
* aligned with schedutil_cpu_util().
* aligned with sched_cpu_util().
*/
util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
if (!fits_capacity(util, cpu_cap))
@ -7132,7 +7116,7 @@ done: __maybe_unused;
list_move(&p->se.group_node, &rq->cfs_tasks);
#endif
if (hrtick_enabled(rq))
if (hrtick_enabled_fair(rq))
hrtick_start_fair(rq, p);
update_misfit_status(p, rq);
@ -9389,8 +9373,11 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
capacity = capacity_of(i);
nr_running = rq->cfs.h_nr_running;
if (!nr_running)
continue;
capacity = capacity_of(i);
/*
* For ASYM_CPUCAPACITY domains, don't pick a CPU that could
@ -9496,13 +9483,32 @@ asym_active_balance(struct lb_env *env)
}
static inline bool
voluntary_active_balance(struct lb_env *env)
imbalanced_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
/*
* The imbalanced case includes the case of pinned tasks preventing a fair
* distribution of the load on the system but also the even distribution of the
* threads on a system with spare capacity
*/
if ((env->migration_type == migrate_task) &&
(sd->nr_balance_failed > sd->cache_nice_tries+2))
return 1;
return 0;
}
static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
if (asym_active_balance(env))
return 1;
if (imbalanced_active_balance(env))
return 1;
/*
* The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
* It's worth migrating the task if the src_cpu's capacity is reduced
@ -9522,16 +9528,6 @@ voluntary_active_balance(struct lb_env *env)
return 0;
}
static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
if (voluntary_active_balance(env))
return 1;
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
static int active_load_balance_cpu_stop(void *data);
static int should_we_balance(struct lb_env *env)
@ -9623,6 +9619,8 @@ redo:
env.src_rq = busiest;
ld_moved = 0;
/* Clear this flag as soon as we find a pullable task */
env.flags |= LBF_ALL_PINNED;
if (busiest->nr_running > 1) {
/*
* Attempt to move tasks. If find_busiest_group has found
@ -9630,7 +9628,6 @@ redo:
* still unbalanced. ld_moved simply stays zero, so it is
* correctly treated as an imbalance.
*/
env.flags |= LBF_ALL_PINNED;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
more_balance:
@ -9756,10 +9753,12 @@ more_balance:
if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
raw_spin_unlock_irqrestore(&busiest->lock,
flags);
env.flags |= LBF_ALL_PINNED;
goto out_one_pinned;
}
/* Record that we found at least one task that could run on this_cpu */
env.flags &= ~LBF_ALL_PINNED;
/*
* ->active_balance synchronizes accesses to
* ->active_balance_work. Once set, it's cleared
@ -9781,21 +9780,13 @@ more_balance:
/* We've kicked active balancing, force task migration. */
sd->nr_balance_failed = sd->cache_nice_tries+1;
}
} else
} else {
sd->nr_balance_failed = 0;
}
if (likely(!active_balance) || voluntary_active_balance(&env)) {
if (likely(!active_balance) || need_active_balance(&env)) {
/* We were unbalanced, so reset the balancing interval */
sd->balance_interval = sd->min_interval;
} else {
/*
* If we've begun active balancing, start to back off. This
* case may not be covered by the all_pinned logic if there
* is only 1 task on the busy runqueue (because we don't call
* detach_tasks).
*/
if (sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
}
goto out;
@ -10700,8 +10691,11 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
*/
void trigger_load_balance(struct rq *rq)
{
/* Don't need to rebalance while attached to NULL domain */
if (unlikely(on_null_domain(rq)))
/*
* Don't need to rebalance while attached to NULL domain or
* runqueue CPU is not active
*/
if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
return;
if (time_after_eq(jiffies, rq->next_balance))
@ -10806,7 +10800,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
if (rq->curr == p) {
if (task_current(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
} else
@ -10939,7 +10933,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
if (rq->curr == p)
if (task_current(rq, p))
resched_curr(rq);
else
check_preempt_curr(rq, p, 0);

View File

@ -38,6 +38,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
SCHED_FEAT(WAKEUP_PREEMPTION, true)
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(HRTICK_DL, false)
SCHED_FEAT(DOUBLE_TICK, false)
/*
@ -54,7 +55,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
/*

View File

@ -285,6 +285,7 @@ static void do_idle(void)
}
arch_cpu_idle_enter();
rcu_nocb_flush_deferred_wakeup();
/*
* In poll mode we reenable interrupts and spin. Also if we

View File

@ -2357,7 +2357,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
if (!task_on_rq_queued(p))
return;
if (rq->curr == p) {
if (task_current(rq, p)) {
#ifdef CONFIG_SMP
/*
* If our priority decreases while running, we

View File

@ -140,7 +140,7 @@ extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
* scale_load() and scale_load_down(w) to convert between them. The
* following must be true:
*
* scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
* scale_load(sched_prio_to_weight[NICE_TO_PRIO(0)-MAX_RT_PRIO]) == NICE_0_LOAD
*
*/
#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT)
@ -1031,6 +1031,7 @@ struct rq {
call_single_data_t hrtick_csd;
#endif
struct hrtimer hrtick_timer;
ktime_t hrtick_time;
#endif
#ifdef CONFIG_SCHEDSTATS
@ -2104,17 +2105,39 @@ extern const_debug unsigned int sysctl_sched_migration_cost;
*/
static inline int hrtick_enabled(struct rq *rq)
{
if (!sched_feat(HRTICK))
return 0;
if (!cpu_active(cpu_of(rq)))
return 0;
return hrtimer_is_hres_active(&rq->hrtick_timer);
}
static inline int hrtick_enabled_fair(struct rq *rq)
{
if (!sched_feat(HRTICK))
return 0;
return hrtick_enabled(rq);
}
static inline int hrtick_enabled_dl(struct rq *rq)
{
if (!sched_feat(HRTICK_DL))
return 0;
return hrtick_enabled(rq);
}
void hrtick_start(struct rq *rq, u64 delay);
#else
static inline int hrtick_enabled_fair(struct rq *rq)
{
return 0;
}
static inline int hrtick_enabled_dl(struct rq *rq)
{
return 0;
}
static inline int hrtick_enabled(struct rq *rq)
{
return 0;
@ -2558,27 +2581,24 @@ static inline unsigned long capacity_orig_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity_orig;
}
#endif
/**
* enum schedutil_type - CPU utilization type
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
* @ENERGY_UTIL: Utilization used during energy calculation
*
* The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
* need to be aggregated differently depending on the usage made of them. This
* enum is used within schedutil_freq_util() to differentiate the types of
* enum is used within effective_cpu_util() to differentiate the types of
* utilization expected by the callers, and adjust the aggregation accordingly.
*/
enum schedutil_type {
enum cpu_util_type {
FREQUENCY_UTIL,
ENERGY_UTIL,
};
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
unsigned long max, enum schedutil_type type,
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long max, enum cpu_util_type type,
struct task_struct *p);
static inline unsigned long cpu_bw_dl(struct rq *rq)
@ -2607,14 +2627,7 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
static inline unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
unsigned long max, enum schedutil_type type,
struct task_struct *p)
{
return 0;
}
#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
#endif
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
static inline unsigned long cpu_util_irq(struct rq *rq)

View File

@ -1596,66 +1596,58 @@ static void init_numa_topology_type(void)
}
}
#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
void sched_init_numa(void)
{
int next_distance, curr_distance = node_distance(0, 0);
struct sched_domain_topology_level *tl;
int level = 0;
int i, j, k;
sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL);
if (!sched_domains_numa_distance)
return;
/* Includes NUMA identity node at level 0. */
sched_domains_numa_distance[level++] = curr_distance;
sched_domains_numa_levels = level;
unsigned long *distance_map;
int nr_levels = 0;
int i, j;
/*
* O(nr_nodes^2) deduplicating selection sort -- in order to find the
* unique distances in the node_distance() table.
*
* Assumes node_distance(0,j) includes all distances in
* node_distance(i,j) in order to avoid cubic time.
*/
next_distance = curr_distance;
distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
if (!distance_map)
return;
bitmap_zero(distance_map, NR_DISTANCE_VALUES);
for (i = 0; i < nr_node_ids; i++) {
for (j = 0; j < nr_node_ids; j++) {
for (k = 0; k < nr_node_ids; k++) {
int distance = node_distance(i, k);
int distance = node_distance(i, j);
if (distance > curr_distance &&
(distance < next_distance ||
next_distance == curr_distance))
next_distance = distance;
if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
sched_numa_warn("Invalid distance value range");
return;
}
bitmap_set(distance_map, distance, 1);
}
}
/*
* While not a strong assumption it would be nice to know
* about cases where if node A is connected to B, B is not
* equally connected to A.
* We can now figure out how many unique distance values there are and
* allocate memory accordingly.
*/
if (sched_debug() && node_distance(k, i) != distance)
sched_numa_warn("Node-distance not symmetric");
nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
if (sched_debug() && i && !find_numa_distance(distance))
sched_numa_warn("Node-0 not representative");
sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
if (!sched_domains_numa_distance) {
bitmap_free(distance_map);
return;
}
if (next_distance != curr_distance) {
sched_domains_numa_distance[level++] = next_distance;
sched_domains_numa_levels = level;
curr_distance = next_distance;
} else break;
for (i = 0, j = 0; i < nr_levels; i++, j++) {
j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
sched_domains_numa_distance[i] = j;
}
bitmap_free(distance_map);
/*
* In case of sched_debug() we verify the above assumption.
*/
if (!sched_debug())
break;
}
/*
* 'level' contains the number of unique distances
* 'nr_levels' contains the number of unique distances
*
* The sched_domains_numa_distance[] array includes the actual distance
* numbers.
@ -1664,15 +1656,15 @@ void sched_init_numa(void)
/*
* Here, we should temporarily reset sched_domains_numa_levels to 0.
* If it fails to allocate memory for array sched_domains_numa_masks[][],
* the array will contain less then 'level' members. This could be
* the array will contain less then 'nr_levels' members. This could be
* dangerous when we use it to iterate array sched_domains_numa_masks[][]
* in other functions.
*
* We reset it to 'level' at the end of this function.
* We reset it to 'nr_levels' at the end of this function.
*/
sched_domains_numa_levels = 0;
sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
if (!sched_domains_numa_masks)
return;
@ -1680,7 +1672,7 @@ void sched_init_numa(void)
* Now for each level, construct a mask per node which contains all
* CPUs of nodes that are that many hops away from us.
*/
for (i = 0; i < level; i++) {
for (i = 0; i < nr_levels; i++) {
sched_domains_numa_masks[i] =
kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
if (!sched_domains_numa_masks[i])
@ -1688,12 +1680,17 @@ void sched_init_numa(void)
for (j = 0; j < nr_node_ids; j++) {
struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
int k;
if (!mask)
return;
sched_domains_numa_masks[i][j] = mask;
for_each_node(k) {
if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
if (node_distance(j, k) > sched_domains_numa_distance[i])
continue;
@ -1705,7 +1702,7 @@ void sched_init_numa(void)
/* Compute default topology size */
for (i = 0; sched_domain_topology[i].mask; i++);
tl = kzalloc((i + level + 1) *
tl = kzalloc((i + nr_levels + 1) *
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
if (!tl)
return;
@ -1728,7 +1725,7 @@ void sched_init_numa(void)
/*
* .. and append 'j' levels of NUMA goodness.
*/
for (j = 1; j < level; i++, j++) {
for (j = 1; j < nr_levels; i++, j++) {
tl[i] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
.sd_flags = cpu_numa_flags,
@ -1740,8 +1737,8 @@ void sched_init_numa(void)
sched_domain_topology = tl;
sched_domains_numa_levels = level;
sched_max_numa_distance = sched_domains_numa_distance[level - 1];
sched_domains_numa_levels = nr_levels;
sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
init_numa_topology_type();
}

View File

@ -14,6 +14,7 @@
#include <linux/export.h>
#include <linux/percpu.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/cpu.h>
@ -449,6 +450,9 @@ void flush_smp_call_function_from_idle(void)
local_irq_save(flags);
flush_smp_call_function_queue(true);
if (local_softirq_pending())
do_softirq();
local_irq_restore(flags);
}

View File

@ -12,6 +12,8 @@
extern struct static_call_site __start_static_call_sites[],
__stop_static_call_sites[];
extern struct static_call_tramp_key __start_static_call_tramp_key[],
__stop_static_call_tramp_key[];
static bool static_call_initialized;
@ -323,10 +325,59 @@ static int __static_call_mod_text_reserved(void *start, void *end)
return ret;
}
static unsigned long tramp_key_lookup(unsigned long addr)
{
struct static_call_tramp_key *start = __start_static_call_tramp_key;
struct static_call_tramp_key *stop = __stop_static_call_tramp_key;
struct static_call_tramp_key *tramp_key;
for (tramp_key = start; tramp_key != stop; tramp_key++) {
unsigned long tramp;
tramp = (long)tramp_key->tramp + (long)&tramp_key->tramp;
if (tramp == addr)
return (long)tramp_key->key + (long)&tramp_key->key;
}
return 0;
}
static int static_call_add_module(struct module *mod)
{
return __static_call_init(mod, mod->static_call_sites,
mod->static_call_sites + mod->num_static_call_sites);
struct static_call_site *start = mod->static_call_sites;
struct static_call_site *stop = start + mod->num_static_call_sites;
struct static_call_site *site;
for (site = start; site != stop; site++) {
unsigned long addr = (unsigned long)static_call_key(site);
unsigned long key;
/*
* Is the key is exported, 'addr' points to the key, which
* means modules are allowed to call static_call_update() on
* it.
*
* Otherwise, the key isn't exported, and 'addr' points to the
* trampoline so we need to lookup the key.
*
* We go through this dance to prevent crazy modules from
* abusing sensitive static calls.
*/
if (!kernel_text_address(addr))
continue;
key = tramp_key_lookup(addr);
if (!key) {
pr_warn("Failed to fixup __raw_static_call() usage at: %ps\n",
static_call_addr(site));
return -EINVAL;
}
site->key = (key - (long)&site->key) |
(site->key & STATIC_CALL_SITE_FLAGS);
}
return __static_call_init(mod, start, stop);
}
static void static_call_del_module(struct module *mod)
@ -438,6 +489,11 @@ int __init static_call_init(void)
}
early_initcall(static_call_init);
long __static_call_return0(void)
{
return 0;
}
#ifdef CONFIG_STATIC_CALL_SELFTEST
static int func_a(int x)

View File

@ -14,6 +14,14 @@
#include <linux/rbtree.h>
#include <linux/export.h>
#define __node_2_tq(_n) \
rb_entry((_n), struct timerqueue_node, node)
static inline bool __timerqueue_less(struct rb_node *a, const struct rb_node *b)
{
return __node_2_tq(a)->expires < __node_2_tq(b)->expires;
}
/**
* timerqueue_add - Adds timer to timerqueue.
*
@ -26,28 +34,10 @@
*/
bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node)
{
struct rb_node **p = &head->rb_root.rb_root.rb_node;
struct rb_node *parent = NULL;
struct timerqueue_node *ptr;
bool leftmost = true;
/* Make sure we don't add nodes that are already added */
WARN_ON_ONCE(!RB_EMPTY_NODE(&node->node));
while (*p) {
parent = *p;
ptr = rb_entry(parent, struct timerqueue_node, node);
if (node->expires < ptr->expires) {
p = &(*p)->rb_left;
} else {
p = &(*p)->rb_right;
leftmost = false;
}
}
rb_link_node(&node->node, parent, p);
rb_insert_color_cached(&node->node, &head->rb_root, leftmost);
return leftmost;
return rb_add_cached(&node->node, &head->rb_root, __timerqueue_less);
}
EXPORT_SYMBOL_GPL(timerqueue_add);

View File

@ -152,4 +152,194 @@ static inline void rb_replace_node_cached(struct rb_node *victim,
rb_replace_node(victim, new, &root->rb_root);
}
/*
* The below helper functions use 2 operators with 3 different
* calling conventions. The operators are related like:
*
* comp(a->key,b) < 0 := less(a,b)
* comp(a->key,b) > 0 := less(b,a)
* comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
*
* If these operators define a partial order on the elements we make no
* guarantee on which of the elements matching the key is found. See
* rb_find().
*
* The reason for this is to allow the find() interface without requiring an
* on-stack dummy object, which might not be feasible due to object size.
*/
/**
* rb_add_cached() - insert @node into the leftmost cached tree @tree
* @node: node to insert
* @tree: leftmost cached tree to insert @node into
* @less: operator defining the (partial) node order
*/
static __always_inline void
rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
bool (*less)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_root.rb_node;
struct rb_node *parent = NULL;
bool leftmost = true;
while (*link) {
parent = *link;
if (less(node, parent)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(node, parent, link);
rb_insert_color_cached(node, tree, leftmost);
}
/**
* rb_add() - insert @node into @tree
* @node: node to insert
* @tree: tree to insert @node into
* @less: operator defining the (partial) node order
*/
static __always_inline void
rb_add(struct rb_node *node, struct rb_root *tree,
bool (*less)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
while (*link) {
parent = *link;
if (less(node, parent))
link = &parent->rb_left;
else
link = &parent->rb_right;
}
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
}
/**
* rb_find_add() - find equivalent @node in @tree, or add @node
* @node: node to look-for / insert
* @tree: tree to search / modify
* @cmp: operator defining the node order
*
* Returns the rb_node matching @node, or NULL when no match is found and @node
* is inserted.
*/
static __always_inline struct rb_node *
rb_find_add(struct rb_node *node, struct rb_root *tree,
int (*cmp)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
int c;
while (*link) {
parent = *link;
c = cmp(node, parent);
if (c < 0)
link = &parent->rb_left;
else if (c > 0)
link = &parent->rb_right;
else
return parent;
}
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
return NULL;
}
/**
* rb_find() - find @key in tree @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining the node order
*
* Returns the rb_node matching @key or NULL.
*/
static __always_inline struct rb_node *
rb_find(const void *key, const struct rb_root *tree,
int (*cmp)(const void *key, const struct rb_node *))
{
struct rb_node *node = tree->rb_node;
while (node) {
int c = cmp(key, node);
if (c < 0)
node = node->rb_left;
else if (c > 0)
node = node->rb_right;
else
return node;
}
return NULL;
}
/**
* rb_find_first() - find the first @key in @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*
* Returns the leftmost node matching @key, or NULL.
*/
static __always_inline struct rb_node *
rb_find_first(const void *key, const struct rb_root *tree,
int (*cmp)(const void *key, const struct rb_node *))
{
struct rb_node *node = tree->rb_node;
struct rb_node *match = NULL;
while (node) {
int c = cmp(key, node);
if (c <= 0) {
if (!c)
match = node;
node = node->rb_left;
} else if (c > 0) {
node = node->rb_right;
}
}
return match;
}
/**
* rb_next_match() - find the next @key in @tree
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*
* Returns the next node matching @key, or NULL.
*/
static __always_inline struct rb_node *
rb_next_match(const void *key, struct rb_node *node,
int (*cmp)(const void *key, const struct rb_node *))
{
node = rb_next(node);
if (node && cmp(key, node))
node = NULL;
return node;
}
/**
* rb_for_each() - iterates a subtree matching @key
* @node: iterator
* @key: key to match
* @tree: tree to search
* @cmp: operator defining node order
*/
#define rb_for_each(node, key, tree, cmp) \
for ((node) = rb_find_first((key), (tree), (cmp)); \
(node); (node) = rb_next_match((key), (node), (cmp)))
#endif /* __TOOLS_LINUX_PERF_RBTREE_H */

View File

@ -4,11 +4,13 @@
#include <linux/types.h>
#include <linux/stringify.h>
#include <linux/compiler.h>
#define STATIC_CALL_KEY_PREFIX __SCK__
#define STATIC_CALL_KEY_PREFIX_STR __stringify(STATIC_CALL_KEY_PREFIX)
#define STATIC_CALL_KEY_PREFIX_LEN (sizeof(STATIC_CALL_KEY_PREFIX_STR) - 1)
#define STATIC_CALL_KEY(name) __PASTE(STATIC_CALL_KEY_PREFIX, name)
#define STATIC_CALL_KEY_STR(name) __stringify(STATIC_CALL_KEY(name))
#define STATIC_CALL_TRAMP_PREFIX __SCT__
#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
@ -32,4 +34,52 @@ struct static_call_site {
s32 key;
};
#define DECLARE_STATIC_CALL(name, func) \
extern struct static_call_key STATIC_CALL_KEY(name); \
extern typeof(func) STATIC_CALL_TRAMP(name);
#ifdef CONFIG_HAVE_STATIC_CALL
#define __raw_static_call(name) (&STATIC_CALL_TRAMP(name))
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
/*
* __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
* the symbol table so that objtool can reference it when it generates the
* .static_call_sites section.
*/
#define __STATIC_CALL_ADDRESSABLE(name) \
__ADDRESSABLE(STATIC_CALL_KEY(name))
#define __static_call(name) \
({ \
__STATIC_CALL_ADDRESSABLE(name); \
__raw_static_call(name); \
})
#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
#define __STATIC_CALL_ADDRESSABLE(name)
#define __static_call(name) __raw_static_call(name)
#endif /* CONFIG_HAVE_STATIC_CALL_INLINE */
#ifdef MODULE
#define __STATIC_CALL_MOD_ADDRESSABLE(name)
#define static_call_mod(name) __raw_static_call(name)
#else
#define __STATIC_CALL_MOD_ADDRESSABLE(name) __STATIC_CALL_ADDRESSABLE(name)
#define static_call_mod(name) __static_call(name)
#endif
#define static_call(name) __static_call(name)
#else
#define static_call(name) \
((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
#endif /* CONFIG_HAVE_STATIC_CALL */
#endif /* _STATIC_CALL_TYPES_H */

View File

@ -502,9 +502,22 @@ static int create_static_call_sections(struct objtool_file *file)
key_sym = find_symbol_by_name(file->elf, tmp);
if (!key_sym) {
if (!module) {
WARN("static_call: can't find static_call_key symbol: %s", tmp);
return -1;
}
/*
* For modules(), the key might not be exported, which
* means the module can make static calls but isn't
* allowed to change them.
*
* In that case we temporarily set the key to be the
* trampoline address. This is fixed up in
* static_call_add_module().
*/
key_sym = insn->call_dest;
}
free(key_name);
/* populate reloc for 'key' */

View File

@ -43,75 +43,24 @@ static void elf_hash_init(struct hlist_head *table)
#define elf_hash_for_each_possible(name, obj, member, key) \
hlist_for_each_entry(obj, &name[hash_min(key, elf_hash_bits())], member)
static void rb_add(struct rb_root *tree, struct rb_node *node,
int (*cmp)(struct rb_node *, const struct rb_node *))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
while (*link) {
parent = *link;
if (cmp(node, parent) < 0)
link = &parent->rb_left;
else
link = &parent->rb_right;
}
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
}
static struct rb_node *rb_find_first(const struct rb_root *tree, const void *key,
int (*cmp)(const void *key, const struct rb_node *))
{
struct rb_node *node = tree->rb_node;
struct rb_node *match = NULL;
while (node) {
int c = cmp(key, node);
if (c <= 0) {
if (!c)
match = node;
node = node->rb_left;
} else if (c > 0) {
node = node->rb_right;
}
}
return match;
}
static struct rb_node *rb_next_match(struct rb_node *node, const void *key,
int (*cmp)(const void *key, const struct rb_node *))
{
node = rb_next(node);
if (node && cmp(key, node))
node = NULL;
return node;
}
#define rb_for_each(tree, node, key, cmp) \
for ((node) = rb_find_first((tree), (key), (cmp)); \
(node); (node) = rb_next_match((node), (key), (cmp)))
static int symbol_to_offset(struct rb_node *a, const struct rb_node *b)
static bool symbol_to_offset(struct rb_node *a, const struct rb_node *b)
{
struct symbol *sa = rb_entry(a, struct symbol, node);
struct symbol *sb = rb_entry(b, struct symbol, node);
if (sa->offset < sb->offset)
return -1;
return true;
if (sa->offset > sb->offset)
return 1;
return false;
if (sa->len < sb->len)
return -1;
return true;
if (sa->len > sb->len)
return 1;
return false;
sa->alias = sb;
return 0;
return false;
}
static int symbol_by_offset(const void *key, const struct rb_node *node)
@ -165,7 +114,7 @@ struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset)
{
struct rb_node *node;
rb_for_each(&sec->symbol_tree, node, &offset, symbol_by_offset) {
rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
struct symbol *s = rb_entry(node, struct symbol, node);
if (s->offset == offset && s->type != STT_SECTION)
@ -179,7 +128,7 @@ struct symbol *find_func_by_offset(struct section *sec, unsigned long offset)
{
struct rb_node *node;
rb_for_each(&sec->symbol_tree, node, &offset, symbol_by_offset) {
rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
struct symbol *s = rb_entry(node, struct symbol, node);
if (s->offset == offset && s->type == STT_FUNC)
@ -193,7 +142,7 @@ struct symbol *find_symbol_containing(const struct section *sec, unsigned long o
{
struct rb_node *node;
rb_for_each(&sec->symbol_tree, node, &offset, symbol_by_offset) {
rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
struct symbol *s = rb_entry(node, struct symbol, node);
if (s->type != STT_SECTION)
@ -207,7 +156,7 @@ struct symbol *find_func_containing(struct section *sec, unsigned long offset)
{
struct rb_node *node;
rb_for_each(&sec->symbol_tree, node, &offset, symbol_by_offset) {
rb_for_each(node, &offset, &sec->symbol_tree, symbol_by_offset) {
struct symbol *s = rb_entry(node, struct symbol, node);
if (s->type == STT_FUNC)
@ -442,7 +391,7 @@ static int read_symbols(struct elf *elf)
sym->offset = sym->sym.st_value;
sym->len = sym->sym.st_size;
rb_add(&sym->sec->symbol_tree, &sym->node, symbol_to_offset);
rb_add(&sym->node, &sym->sec->symbol_tree, symbol_to_offset);
pnode = rb_prev(&sym->node);
if (pnode)
entry = &rb_entry(pnode, struct symbol, node)->list;