Migration of a system under stress (for example, with
"stress-ng --numa 2") triggers on the destination
some kernel watchdog messages like:
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 3489660870s!
NMI watchdog: BUG: soft lockup - CPU#1 stuck for 3489660884s!
This problem appears with the changes introduced by
42043e4 spapr: clock should count only if vm is running
I think this commit only triggers the problem.
Kernel computes the soft lockup duration using the
Virtual Timebase register (VTB), not using the Timebase
Register (TBR, the one 42043e4 stops).
It appears VTB is not migrated, so this patch adds it in
the list of the SPRs to migrate, and fixes the problem.
For the migration, I've tested a migration from qemu-2.8.0 and
pseries-2.8.0 to a patched master (qemu-2.11.0-rc1). The received
VTB is 0 (as is it not initialized by qemu-2.8.0), but the value
seems to be ignored by KVM and a non zero VTB is used by the kernel.
I have no explanation for that, but as the original problem appears
only with SMP system under stress I suspect some problems in KVM
(I think because VTB is shared by all threads of a core).
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>