thread_group_cputimer() is a complete misnomer. The function does two things:
- For arming process wide timers it makes sure that the atomic time
storage is up to date. If no cpu timer is armed yet, then the atomic
time storage is not updated by the scheduler for performance reasons.
In that case a full summing up of all threads needs to be done and the
update needs to be enabled.
- Samples the current time into the caller supplied storage.
Rename it to thread_group_start_cputime(), make it static and fixup the
callsite.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
The thread group accounting is active, otherwise the expiry function would
not be running. Sample the thread group time directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.780348088@linutronix.de
get_itimer() locks sighand lock and checks whether the timer is already
expired. If it is not expired then the thread group cputime accounting is
already enabled. Use the sampling function not the one which is meant for
starting a timer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.689713638@linutronix.de
get_itimer() needs a sample of the current thread group cputime. It invokes
thread_group_cputimer() - which is a misnomer. That function also starts
eventually the group cputime accouting which is bogus because the
accounting is already active when a timer is armed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
Replace the next slightly different copy of permission checks. That also
removes the necessarity to check the return value of the sample functions
because the clock id is already validated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.414813172@linutronix.de
The code contains three slightly different copies of validating whether a
given clock resolves to a valid task and whether the current caller has
permissions to access it.
Create central functions. Replace check_clock() as a first step and rename
it to something sensible.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192919.326097175@linutronix.de
- Add DT binding and new compatible for Allwinner sun4i (Maxime Ripard)
- Register the Atmel tcb clocksource for delays (Alexandre Belloni)
- Add a clock divider for the Freescale imx platforms and new timer node
in the DT (Anson Huang)
- Use DIV_ROUND_CLOSEST macro for the Renesas OSTM (Geert Uytterhoeven)
- Fix GENMASK and timer operation for the npcm timer (Avi Fishman)
- Fix timer-of showing an error message when EPROBE_DEFER is
returned (Jon Hunter)
- Add new SoC DT binding and match for Renesas timers (Magnus Damm)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAl1kXkMACgkQqDIjiipP
6E+ueQf7BpqpjmRa/YVUXXoX9G85nuSdwWvW8v1J14Drly7GvD0aYOX/TSzdqvV3
1cEVDiSSWD/tdp5e8L+ah1c60EvEv1O7eAhvRj3miaI/M35RyluL9yCntjtxSFT9
D9PHeR9u0fDzOfabJ04KrWOwwEyS5/BJfhsBX6rfF0AQkHfxNoU4VdwEBXDZkyfZ
oVRcu75AIvvlqec32kCCF5veU3I4tdYYcnEKD3IEOKMbu84JUrVZtX3SdpJ9Pd1M
pqm2ldP3S1Q1zryXOGJ9yzhrnhqO+jIoMQP0KscGQ7iQJhVvtOwDhMVp89ThNpZm
PLw7chPsBLZrP2D6J7OvJFpRBDqCrw==
=wXfz
-----END PGP SIGNATURE-----
Merge tag 'timers-v5.4' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core
Pull clocksource/events updates from Daniel Lezcano:
- Remove dev_err() when used with platform_get_irq (Stephen Boyd)
- Add DT binding and new compatible for Allwinner sun4i (Maxime Ripard)
- Register the Atmel tcb clocksource for delays (Alexandre Belloni)
- Add a clock divider for the Freescale imx platforms and new timer node
in the DT (Anson Huang)
- Use DIV_ROUND_CLOSEST macro for the Renesas OSTM (Geert Uytterhoeven)
- Fix GENMASK and timer operation for the npcm timer (Avi Fishman)
- Fix timer-of showing an error message when EPROBE_DEFER is
returned (Jon Hunter)
- Add new SoC DT binding and match for Renesas timers (Magnus Damm)
Update the CMT driver to mark "renesas,cmt-48" as deprecated.
Instead of documenting a theoretical hardware device based on current software
support level, define DT bindings top-down based on available data sheet
information and make use of part numbers in the DT compat string.
In case of the only in-tree users r8a7740 and sh73a0 the compat strings
"renesas,r8a7740-cmt1" and "renesas,sh73a0-cmt1" may be used instead.
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Add SoC-specific matching for CMT1 on r8a7740 and sh73a0.
This allows us to move away from the old DT bindings such as
- "renesas,cmt-48-sh73a0"
- "renesas,cmt-48-r8a7740"
- "renesas,cmt-48"
in favour for the now commonly used format "renesas,<soc>-<device>"
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
The R-Car Gen3 SoCs so far come with a total for 4 on-chip CMT devices:
- CMT0
- CMT1
- CMT2
- CMT3
CMT0 includes two rather basic 32-bit timer channels. The rest of the on-chip
CMT devices support 48-bit counters and have 8 channels each.
Based on the data sheet information "CMT2/3 are exactly same as CMT1"
it seems that CMT2 and CMT3 now use the CMT1 compat string in the DTSI.
Clarify this in the DT binding documentation by describing R-Car Gen3 and
RZ/G2 CMT1 as "48-bit CMT devices".
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
This patch adds DT binding documentation for the CMT devices on
the R-Car Gen3 D3 (r8a77995) SoC.
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
This patch adds DT binding documentation for the CMT devices on
the R-Car Gen2 V2H (r8a7792) SoC.
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
This patch reworks the DT binding documentation for the 6-channel
48-bit CMTs known as CMT1 on r8a7740 and sh73a0.
After the update the same style of DT binding as the rest of the upstream
SoCs will now also be used by r8a7740 and sh73a0. The DT binding "cmt-48"
is removed from the DT binding documentation, however software support for
this deprecated binding will still remain in the CMT driver for some time.
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Document the on-chip CMT devices included in r8a7740 and sh73a0.
Included in this patch is DT binding documentation for 32-bit CMTs
CMT0, CMT2, CMT3 and CMT4. They all contain a single channel and are
quite similar however some minor differences still exist:
- "Counter input clock" (clock input and on-device divider)
One example is that RCLK 1/1 is supported by CMT2, CMT3 and CMT4.
- "Wakeup request" (supported by CMT0 and CMT2)
Because of this one unique compat string per CMT device is selected.
Signed-off-by: Magnus Damm <damm+renesas@opensource.se>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Deferred probe is an expected return value on many platforms and so
there's no need to output a warning that may potentially confuse users.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Deferred probe is an expected return value for clk_get() on many
platforms. The driver deals with it properly, so there's no need
to output a warning that may potentially confuse users.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
NPCM7XX_Tx_OPER GENMASK bits are wrong, fix them.
Hopefully the NPCM7XX_REG_TICR0 register reset value of those bits was 0,
so it did not cause an issue.
The function npcm7xx_timer_oneshot() reads the register
NPCM7XX_REG_TCSR0, modifies it and then reads it again overwriting the
previous changes. Remove the extra read which is pointless.
The function npcm7xx_timer_periodic() is correct but the code writes
to the NPCM7XX_REG_TICR0 register while it is dealing with the
NPCM7XX_REG_TCSR0 register, that is confusing. Separate the write to
the registers in the code for the sake of clarity.
Fixes: 1c00289ecd ("clocksource/drivers/npcm: Add NPCM7xx timer driver")
Signed-off-by: Avi Fishman <avifishman70@gmail.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Use the DIV_ROUND_CLOSEST() helper instead of open-coding the same
operation.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
The system counter block guide states that the base clock is
internally divided by 3 before use, that means the clock input of
system counter defined in DT should be base clock which is normally
from OSC, and then internally divided by 3 before use.
Signed-off-by: Anson Huang <Anson.Huang@nxp.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Implement and register delay timer to allow get_cycles() to work properly.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
The newer Allwinner SoCs have a High Speed Timer supported in Linux, with a
matching Device Tree binding.
Now that we have the DT validation in place, let's convert the device tree
bindings for that controller over to a YAML schemas.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Newer Allwinner SoCs have different number of interrupts, let's add
different compatibles for all of them to deal with this properly.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Newer Allwinner SoCs have different number of interrupts, let's add
different compatibles for all of them to deal with this properly.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
The older Allwinner SoCs have a Timer supported in Linux, with a matching
Device Tree binding.
While the original binding only mentions one interrupt, the timer actually
has 6 of them.
Now that we have the DT validation in place, let's convert the device tree
bindings for that controller over to a YAML schemas.
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
We don't need dev_err() messages when platform_get_irq() fails now that
platform_get_irq() prints an error message itself when something goes
wrong. Let's remove these prints with a simple semantic patch.
// <smpl>
@@
expression ret;
struct platform_device *E;
@@
ret =
(
platform_get_irq(E, ...)
|
platform_get_irq_byname(E, ...)
);
if ( \( ret < 0 \| ret <= 0 \) )
{
(
-if (ret != -EPROBE_DEFER)
-{ ...
-dev_err(...);
-... }
|
...
-dev_err(...);
)
...
}
// </smpl>
While we're here, remove braces on if statements that only have one
statement (manually).
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
There is no particular reason to not enable TSC page clocksource on
32-bit. mul_u64_u64_shr() is available and despite the increased
computational complexity (compared to 64bit) TSC page is still a huge win
compared to MSR-based clocksource.
In-kernel reads:
MSR based clocksource: 3361 cycles
TSC page clocksource: 49 cycles
Reads from userspace (utilizing vDSO in case of TSC page):
MSR based clocksource: 5664 cycles
TSC page clocksource: 131 cycles
Enabling TSC page on 32bits allows to get rid of CONFIG_HYPERV_TSCPAGE as
it is now not any different from CONFIG_HYPERV_TIMER.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20190822083630.17059-1-vkuznets@redhat.com
Hyper-V guests use the default native_sched_clock() in
pv_ops.time.sched_clock on x86. But native_sched_clock() directly uses the
raw TSC value, which can be discontinuous in a Hyper-V VM.
Add the generic hv_setup_sched_clock() to set the sched clock function
appropriately. On x86, this sets pv_ops.time.sched_clock to read the
Hyper-V reference TSC value that is scaled and adjusted to be continuous.
Also move the Hyper-V reference TSC initialization much earlier in the boot
process so no discontinuity is observed when pv_ops.time.sched_clock
calculates its offset.
[ tglx: Folded build fix ]
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20190814123216.32245-3-Tianyu.Lan@microsoft.com
Prepare to add Hyper-V sched clock callback and move Hyper-V Reference TSC
initialization much earlier in the boot process. Earlier initialization is
needed so that it happens while the timestamp value is still 0 and no
discontinuity in the timestamp will occur when pv_ops.time.sched_clock
calculates its offset.
The earlier initialization requires that the Hyper-V TSC page be allocated
statically instead of with vmalloc(), so fixup the references to the TSC
page and the method of getting its physical address.
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lkml.kernel.org/r/20190814123216.32245-2-Tianyu.Lan@microsoft.com
Warning when p == NULL and then proceeding and dereferencing p does not
make any sense as the kernel will crash with a NULL pointer dereference
right away.
Bailing out when p == NULL and returning an error code does not cure the
underlying problem which caused p to be NULL. Though it might allow to
do proper debugging.
Same applies to the clock id check in set_process_cpu_timer().
Clean them up and make them return without trying to do further damage.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.846497772@linutronix.de
migration_base is used as a placeholder when an hrtimer is migrated to a
different CPU. In the case that hrtimer_cancel_wait_running() hits a timer
which is currently migrated it would pointlessly acquire the expiry lock of
the migration base, which is even not initialized.
Surely it could be initialized, but there is absolutely no point in
acquiring this lock because the timer is guaranteed not to run it's
callback for which the caller waits to finish on that base. So it would
just do the inc/lock/dec/unlock dance for nothing.
As the base switch is short and non-preemptible, there is no issue when the
wait function returns immediately.
The timer base and base->cpu_base cannot be NULL in the code path which is
invoking that, so just replace those checks with a check whether base is
migration base.
[ tglx: Updated from RT patch. Massaged changelog. Added comment. ]
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-4-julien.grall@arm.com
The update to timer->base is protected by the base->cpu_base->lock().
However, hrtimer_cancel_wait_running() does access it lockless. So the
compiler is allowed to refetch timer->base which can cause havoc when the
timer base is changed concurrently.
Use READ_ONCE() to prevent this.
[ tglx: Adapted from a RT patch ]
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-2-julien.grall@arm.com
The comment above cleanup_timers() is outdated. The timers are only removed
from the task/process list heads but not modified in any other way.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.747233612@linutronix.de
rtc.h is not needed in alarmtimers when a forward declaration of struct
rtc_device is provided. That allows to include posix-timers.h without
adding more includes to alarmtimer.h or creating circular include
dependencies.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.565389536@linutronix.de
- Rename struct siginfo to kernel_siginfo as that is used and required
- Add a forward declaration for task_struct and remove sched.h include
- Remove timex.h include as it is not needed
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.472005793@linutronix.de
The handling of a priority inversion between timer cancelling and a a not
well defined possible preemption of softirq kthread is not very clear.
Especially in the posix timers side it's unclear why there is a specific RT
wait callback.
All the nice explanations can be found in the initial changelog of
f61eff83ce (hrtimer: Prepare support for PREEMPT_RT").
Extract the detailed informations from there and put it into comments.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190820132656.GC2093@lenoir
Posix timer delete retry loops are affected by the same priority inversion
and live lock issues as the other timers.
Provide a RT specific synchronization function which keeps a reference to
the timer by holding rcu read lock to prevent the timer from being freed,
dropping the timer lock and invoking the timer specific wait function via a
new callback.
This does not yet cover posix CPU timers because they need more special
treatment on PREEMPT_RT.
[ This is folded into the original attempt which did not use a callback. ]
Originally-by: Anna-Maria Gleixenr <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.656864506@linutronix.de
Timer deletion on PREEMPT_RT is prone to priority inversion and live
locks. The hrtimer code has a synchronization mechanism for this. Posix CPU
timers will grow one.
But that mechanism cannot be invoked while holding the k_itimer lock
because that can deadlock against the running timer callback. So the lock
must be dropped which allows the timer to be freed.
The timer free can be prevented by taking RCU readlock before dropping the
lock, but because the rcu_head is part of the 'it' union a concurrent free
will overwrite the hrtimer on which the task is trying to synchronize.
Move the rcu_head out of the union to prevent this.
[ tglx: Fixed up kernel-doc. Rewrote changelog ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.965541887@linutronix.de
As a preparatory step for adding the PREEMPT RT specific synchronization
mechanism to wait for a running timer callback, rework the timer cancel
retry loops so they call a common function. This allows trivial
substitution in one place.
Originally-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.874901027@linutronix.de
do_timer_settime() has a 'flags' argument and uses 'flag' for the interrupt
flags, which is confusing at best.
Rename the argument so 'flags' can be used for interrupt flags as usual.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.782664411@linutronix.de
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.
As a benefit the retry loop gains the missing cpu_relax() on !RT.
[ tglx: Split out of combo patch ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.690771827@linutronix.de
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.
[ tglx: Split out of combo patch ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.600085866@linutronix.de
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.
[ tglx: Split out of combo patch ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.508744705@linutronix.de
SCHED_DEADLINE inactive timer needs to run in hardirq context (as
dl_task_timer already does) on PREEMPT_RT
Change the mode to HRTIMER_MODE_REL_HARD.
[ tglx: Fixed up the start site, so mode debugging works ]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190731103715.4047-1-juri.lelli@redhat.com
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted. If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:
- If the caller is on a remote CPU then it has to spin wait for the timer
handler to complete. This can result in unbound priority inversion.
- If the caller originates from the task which preempted the timer
handler on the same CPU, then spin waiting for the timer handler to
complete is never going to end.
To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.
This addresses both the priority inversion and the life lock issues.
This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.
The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.
As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.
[ tglx: Refactored it for mainline ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted. If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling hrtimer_cancel() can lead to two issues:
- If the caller is on a remote CPU then it has to spin wait for the timer
handler to complete. This can result in unbound priority inversion.
- If the caller originates from the task which preempted the timer
handler on the same CPU, then spin waiting for the timer handler to
complete is never going to end.
To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If hrtimer_cancel() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.
This addresses both the priority inversion and the life lock issues.
The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
hrtimer_cancel() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.
[ tglx: Refactored it for mainline ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.737767218@linutronix.de
On PREEMPT_RT enabled kernels hrtimers which are not explicitely marked for
hard interrupt expiry mode are moved into soft interrupt context either for
latency reasons or because the hrtimer callback takes regular spinlocks or
invokes other functions which are not suitable for hard interrupt context
on PREEMPT_RT.
The hrtimer_sleeper callback is RT compatible in hard interrupt context,
but there is a latency concern: Untrusted userspace can spawn many threads
which arm timers for the same expiry time on the same CPU. On expiry that
causes a latency spike due to the wakeup of a gazillion threads.
OTOH, priviledged real-time user space applications rely on the low latency
of hard interrupt wakeups. These syscall related wakeups are all based on
hrtimer sleepers.
If the current task is in a real-time scheduling class, mark the mode for
hard interrupt expiry.
[ tglx: Split out of a larger combo patch. Added changelog ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.645792403@linutronix.de