vmstat: skip periodic vmstat update for isolated CPUs

Problem: The interruption caused by vmstat_update is undesirable
for certain applications.

With workloads that are running on isolated cpus with nohz full mode to
shield off any kernel interruption. For example, a VM running a
time sensitive application with a 50us maximum acceptable interruption
(use case: soft PLC).

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the
        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.

The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:

      CPU 11/KVM-9540    [001] dNh1.  2314.248584: mod_zone_page_state <-__folio_end_writeback
      CPU 11/KVM-9540    [001] dNh1.  2314.248585: <stack trace>
 => 0xffffffffc042b083
 => mod_zone_page_state
 => __folio_end_writeback
 => folio_end_writeback
 => iomap_finish_ioend
 => blk_mq_end_request_batch
 => nvme_irq
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_edge_irq
 => __common_interrupt
 => common_interrupt
 => asm_common_interrupt
 => vmx_do_interrupt_nmi_irqoff
 => vmx_handle_exit_irqoff
 => vcpu_enter_guest
 => vcpu_run
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __x64_sys_ioctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

In kernel users of vmstat counters either require the precise value and
they are using zone_page_state_snapshot interface or they can live with an
imprecision as the regular flushing can happen at arbitrary time and
cumulative error can grow (see calculate_normal_threshold).

From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical infrastructure
ever noticing.  Skip regular flushing from vmstat_shepherd for all
isolated CPUs to avoid interference with the isolated workload.

Suggested by Michal Hocko.

Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpad
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit is contained in:
Marcelo Tosatti 2023-06-07 17:28:07 -03:00 committed by Andrew Morton
parent 56e0d1cb16
commit be5e015d10

View File

@ -28,6 +28,7 @@
#include <linux/mm_inline.h> #include <linux/mm_inline.h>
#include <linux/page_ext.h> #include <linux/page_ext.h>
#include <linux/page_owner.h> #include <linux/page_owner.h>
#include <linux/sched/isolation.h>
#include "internal.h" #include "internal.h"
@ -2022,6 +2023,20 @@ static void vmstat_shepherd(struct work_struct *w)
for_each_online_cpu(cpu) { for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu); struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
/*
* In kernel users of vmstat counters either require the precise value and
* they are using zone_page_state_snapshot interface or they can live with
* an imprecision as the regular flushing can happen at arbitrary time and
* cumulative error can grow (see calculate_normal_threshold).
*
* From that POV the regular flushing can be postponed for CPUs that have
* been isolated from the kernel interference without critical
* infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
* for all isolated CPUs to avoid interference with the isolated workload.
*/
if (cpu_is_isolated(cpu))
continue;
if (!delayed_work_pending(dw) && need_update(cpu)) if (!delayed_work_pending(dw) && need_update(cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0); queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);