mirror of
https://github.com/qemu/qemu.git
synced 2024-12-04 01:03:38 +08:00
878ec29b9c
Record/replay feature of icount allows deterministic running of execution scenarios. Some CPUs and peripheral devices read random numbers from external sources making deterministic execution impossible. This patch adds recording and replaying of random read operations into guest-random module, which is used by the virtual hardware. Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> Message-Id: <157675984852.14505.15709141760677102489.stgit@pasha-Precision-3630-Tower> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
365 lines
18 KiB
Plaintext
365 lines
18 KiB
Plaintext
Copyright (c) 2010-2015 Institute for System Programming
|
|
of the Russian Academy of Sciences.
|
|
|
|
This work is licensed under the terms of the GNU GPL, version 2 or later.
|
|
See the COPYING file in the top-level directory.
|
|
|
|
Record/replay
|
|
-------------
|
|
|
|
Record/replay functions are used for the deterministic replay of qemu execution.
|
|
Execution recording writes a non-deterministic events log, which can be later
|
|
used for replaying the execution anywhere and for unlimited number of times.
|
|
It also supports checkpointing for faster rewind to the specific replay moment.
|
|
Execution replaying reads the log and replays all non-deterministic events
|
|
including external input, hardware clocks, and interrupts.
|
|
|
|
Deterministic replay has the following features:
|
|
* Deterministically replays whole system execution and all contents of
|
|
the memory, state of the hardware devices, clocks, and screen of the VM.
|
|
* Writes execution log into the file for later replaying for multiple times
|
|
on different machines.
|
|
* Supports i386, x86_64, and ARM hardware platforms.
|
|
* Performs deterministic replay of all operations with keyboard and mouse
|
|
input devices.
|
|
|
|
Usage of the record/replay:
|
|
* First, record the execution with the following command line:
|
|
qemu-system-i386 \
|
|
-icount shift=7,rr=record,rrfile=replay.bin \
|
|
-drive file=disk.qcow2,if=none,snapshot,id=img-direct \
|
|
-drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
|
|
-device ide-hd,drive=img-blkreplay \
|
|
-netdev user,id=net1 -device rtl8139,netdev=net1 \
|
|
-object filter-replay,id=replay,netdev=net1
|
|
* After recording, you can replay it by using another command line:
|
|
qemu-system-i386 \
|
|
-icount shift=7,rr=replay,rrfile=replay.bin \
|
|
-drive file=disk.qcow2,if=none,snapshot,id=img-direct \
|
|
-drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
|
|
-device ide-hd,drive=img-blkreplay \
|
|
-netdev user,id=net1 -device rtl8139,netdev=net1 \
|
|
-object filter-replay,id=replay,netdev=net1
|
|
The only difference with recording is changing the rr option
|
|
from record to replay.
|
|
* Block device images are not actually changed in the recording mode,
|
|
because all of the changes are written to the temporary overlay file.
|
|
This behavior is enabled by using blkreplay driver. It should be used
|
|
for every enabled block device, as described in 'Block devices' section.
|
|
* '-net none' option should be specified when network is not used,
|
|
because QEMU adds network card by default. When network is needed,
|
|
it should be configured explicitly with replay filter, as described
|
|
in 'Network devices' section.
|
|
* Interaction with audio devices and serial ports are recorded and replayed
|
|
automatically when such devices are enabled.
|
|
|
|
Academic papers with description of deterministic replay implementation:
|
|
http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
|
|
http://dl.acm.org/citation.cfm?id=2786805.2803179
|
|
|
|
Modifications of qemu include:
|
|
* wrappers for clock and time functions to save their return values in the log
|
|
* saving different asynchronous events (e.g. system shutdown) into the log
|
|
* synchronization of the bottom halves execution
|
|
* synchronization of the threads from thread pool
|
|
* recording/replaying user input (mouse, keyboard, and microphone)
|
|
* adding internal checkpoints for cpu and io synchronization
|
|
* network filter for recording and replaying the packets
|
|
* block driver for making block layer deterministic
|
|
* serial port input record and replay
|
|
* recording of random numbers obtained from the external sources
|
|
|
|
Locking and thread synchronisation
|
|
----------------------------------
|
|
|
|
Previously the synchronisation of the main thread and the vCPU thread
|
|
was ensured by the holding of the BQL. However the trend has been to
|
|
reduce the time the BQL was held across the system including under TCG
|
|
system emulation. As it is important that batches of events are kept
|
|
in sequence (e.g. expiring timers and checkpoints in the main thread
|
|
while instruction checkpoints are written by the vCPU thread) we need
|
|
another lock to keep things in lock-step. This role is now handled by
|
|
the replay_mutex_lock. It used to be held only for each event being
|
|
written but now it is held for a whole execution period. This results
|
|
in a deterministic ping-pong between the two main threads.
|
|
|
|
As the BQL is now a finer grained lock than the replay_lock it is almost
|
|
certainly a bug, and a source of deadlocks, to take the
|
|
replay_mutex_lock while the BQL is held. This is enforced by an assert.
|
|
While the unlocks are usually in the reverse order, this is not
|
|
necessary; you can drop the replay_lock while holding the BQL, without
|
|
doing a more complicated unlock_iothread/replay_unlock/lock_iothread
|
|
sequence.
|
|
|
|
Non-deterministic events
|
|
------------------------
|
|
|
|
Our record/replay system is based on saving and replaying non-deterministic
|
|
events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
|
|
from HDD or memory of the VM). Saving only non-deterministic events makes
|
|
log file smaller and simulation faster.
|
|
|
|
The following non-deterministic data from peripheral devices is saved into
|
|
the log: mouse and keyboard input, network packets, audio controller input,
|
|
serial port input, and hardware clocks (they are non-deterministic
|
|
too, because their values are taken from the host machine). Inputs from
|
|
simulated hardware, memory of VM, software interrupts, and execution of
|
|
instructions are not saved into the log, because they are deterministic and
|
|
can be replayed by simulating the behavior of virtual machine starting from
|
|
initial state.
|
|
|
|
We had to solve three tasks to implement deterministic replay: recording
|
|
non-deterministic events, replaying non-deterministic events, and checking
|
|
that there is no divergence between record and replay modes.
|
|
|
|
We changed several parts of QEMU to make event log recording and replaying.
|
|
Devices' models that have non-deterministic input from external devices were
|
|
changed to write every external event into the execution log immediately.
|
|
E.g. network packets are written into the log when they arrive into the virtual
|
|
network adapter.
|
|
|
|
All non-deterministic events are coming from these devices. But to
|
|
replay them we need to know at which moments they occur. We specify
|
|
these moments by counting the number of instructions executed between
|
|
every pair of consecutive events.
|
|
|
|
Instruction counting
|
|
--------------------
|
|
|
|
QEMU should work in icount mode to use record/replay feature. icount was
|
|
designed to allow deterministic execution in absence of external inputs
|
|
of the virtual machine. We also use icount to control the occurrence of the
|
|
non-deterministic events. The number of instructions elapsed from the last event
|
|
is written to the log while recording the execution. In replay mode we
|
|
can predict when to inject that event using the instruction counter.
|
|
|
|
Timers
|
|
------
|
|
|
|
Timers are used to execute callbacks from different subsystems of QEMU
|
|
at the specified moments of time. There are several kinds of timers:
|
|
* Real time clock. Based on host time and used only for callbacks that
|
|
do not change the virtual machine state. For this reason real time
|
|
clock and timers does not affect deterministic replay at all.
|
|
* Virtual clock. These timers run only during the emulation. In icount
|
|
mode virtual clock value is calculated using executed instructions counter.
|
|
That is why it is completely deterministic and does not have to be recorded.
|
|
* Host clock. This clock is used by device models that simulate real time
|
|
sources (e.g. real time clock chip). Host clock is the one of the sources
|
|
of non-determinism. Host clock read operations should be logged to
|
|
make the execution deterministic.
|
|
* Virtual real time clock. This clock is similar to real time clock but
|
|
it is used only for increasing virtual clock while virtual machine is
|
|
sleeping. Due to its nature it is also non-deterministic as the host clock
|
|
and has to be logged too.
|
|
|
|
Checkpoints
|
|
-----------
|
|
|
|
Replaying of the execution of virtual machine is bound by sources of
|
|
non-determinism. These are inputs from clock and peripheral devices,
|
|
and QEMU thread scheduling. Thread scheduling affect on processing events
|
|
from timers, asynchronous input-output, and bottom halves.
|
|
|
|
Invocations of timers are coupled with clock reads and changing the state
|
|
of the virtual machine. Reads produce non-deterministic data taken from
|
|
host clock. And VM state changes should preserve their order. Their relative
|
|
order in replay mode must replicate the order of callbacks in record mode.
|
|
To preserve this order we use checkpoints. When a specific clock is processed
|
|
in record mode we save to the log special "checkpoint" event.
|
|
Checkpoints here do not refer to virtual machine snapshots. They are just
|
|
record/replay events used for synchronization.
|
|
|
|
QEMU in replay mode will try to invoke timers processing in random moment
|
|
of time. That's why we do not process a group of timers until the checkpoint
|
|
event will be read from the log. Such an event allows synchronizing CPU
|
|
execution and timer events.
|
|
|
|
Two other checkpoints govern the "warping" of the virtual clock.
|
|
While the virtual machine is idle, the virtual clock increments at
|
|
1 ns per *real time* nanosecond. This is done by setting up a timer
|
|
(called the warp timer) on the virtual real time clock, so that the
|
|
timer fires at the next deadline of the virtual clock; the virtual clock
|
|
is then incremented (which is called "warping" the virtual clock) as
|
|
soon as the timer fires or the CPUs need to go out of the idle state.
|
|
Two functions are used for this purpose; because these actions change
|
|
virtual machine state and must be deterministic, each of them creates a
|
|
checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so
|
|
starts accounting real time to virtual clock. qemu_account_warp_timer
|
|
is called when the CPUs get an interrupt or when the warp timer fires,
|
|
and it warps the virtual clock by the amount of real time that has passed
|
|
since qemu_start_warp_timer.
|
|
|
|
Bottom halves
|
|
-------------
|
|
|
|
Disk I/O events are completely deterministic in our model, because
|
|
in both record and replay modes we start virtual machine from the same
|
|
disk state. But callbacks that virtual disk controller uses for reading and
|
|
writing the disk may occur at different moments of time in record and replay
|
|
modes.
|
|
|
|
Reading and writing requests are created by CPU thread of QEMU. Later these
|
|
requests proceed to block layer which creates "bottom halves". Bottom
|
|
halves consist of callback and its parameters. They are processed when
|
|
main loop locks the global mutex. These locks are not synchronized with
|
|
replaying process because main loop also processes the events that do not
|
|
affect the virtual machine state (like user interaction with monitor).
|
|
|
|
That is why we had to implement saving and replaying bottom halves callbacks
|
|
synchronously to the CPU execution. When the callback is about to execute
|
|
it is added to the queue in the replay module. This queue is written to the
|
|
log when its callbacks are executed. In replay mode callbacks are not processed
|
|
until the corresponding event is read from the events log file.
|
|
|
|
Sometimes the block layer uses asynchronous callbacks for its internal purposes
|
|
(like reading or writing VM snapshots or disk image cluster tables). In this
|
|
case bottom halves are not marked as "replayable" and do not saved
|
|
into the log.
|
|
|
|
Block devices
|
|
-------------
|
|
|
|
Block devices record/replay module intercepts calls of
|
|
bdrv coroutine functions at the top of block drivers stack.
|
|
To record and replay block operations the drive must be configured
|
|
as following:
|
|
-drive file=disk.qcow2,if=none,snapshot,id=img-direct
|
|
-drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
|
|
-device ide-hd,drive=img-blkreplay
|
|
|
|
blkreplay driver should be inserted between disk image and virtual driver
|
|
controller. Therefore all disk requests may be recorded and replayed.
|
|
|
|
All block completion operations are added to the queue in the coroutines.
|
|
Queue is flushed at checkpoints and information about processed requests
|
|
is recorded to the log. In replay phase the queue is matched with
|
|
events read from the log. Therefore block devices requests are processed
|
|
deterministically.
|
|
|
|
Snapshotting
|
|
------------
|
|
|
|
New VM snapshots may be created in replay mode. They can be used later
|
|
to recover the desired VM state. All VM states created in replay mode
|
|
are associated with the moment of time in the replay scenario.
|
|
After recovering the VM state replay will start from that position.
|
|
|
|
Default starting snapshot name may be specified with icount field
|
|
rrsnapshot as follows:
|
|
-icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name
|
|
|
|
This snapshot is created at start of recording and restored at start
|
|
of replaying. It also can be loaded while replaying to roll back
|
|
the execution.
|
|
|
|
'snapshot' flag of the disk image must be removed to save the snapshots
|
|
in the overlay (or original image) instead of using the temporary overlay.
|
|
-drive file=disk.ovl,if=none,id=img-direct
|
|
-drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
|
|
-device ide-hd,drive=img-blkreplay
|
|
|
|
Use QEMU monitor to create additional snapshots. 'savevm <name>' command
|
|
created the snapshot and 'loadvm <name>' restores it. To prevent corruption
|
|
of the original disk image, use overlay files linked to the original images.
|
|
Therefore all new snapshots (including the starting one) will be saved in
|
|
overlays and the original image remains unchanged.
|
|
|
|
Network devices
|
|
---------------
|
|
|
|
Record and replay for network interactions is performed with the network filter.
|
|
Each backend must have its own instance of the replay filter as follows:
|
|
-netdev user,id=net1 -device rtl8139,netdev=net1
|
|
-object filter-replay,id=replay,netdev=net1
|
|
|
|
Replay network filter is used to record and replay network packets. While
|
|
recording the virtual machine this filter puts all packets coming from
|
|
the outer world into the log. In replay mode packets from the log are
|
|
injected into the network device. All interactions with network backend
|
|
in replay mode are disabled.
|
|
|
|
Audio devices
|
|
-------------
|
|
|
|
Audio data is recorded and replay automatically. The command line for recording
|
|
and replaying must contain identical specifications of audio hardware, e.g.:
|
|
-soundhw ac97
|
|
|
|
Serial ports
|
|
------------
|
|
|
|
Serial ports input is recorded and replay automatically. The command lines
|
|
for recording and replaying must contain identical number of ports in record
|
|
and replay modes, but their backends may differ.
|
|
E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.
|
|
|
|
Replay log format
|
|
-----------------
|
|
|
|
Record/replay log consists of the header and the sequence of execution
|
|
events. The header includes 4-byte replay version id and 8-byte reserved
|
|
field. Version is updated every time replay log format changes to prevent
|
|
using replay log created by another build of qemu.
|
|
|
|
The sequence of the events describes virtual machine state changes.
|
|
It includes all non-deterministic inputs of VM, synchronization marks and
|
|
instruction counts used to correctly inject inputs at replay.
|
|
|
|
Synchronization marks (checkpoints) are used for synchronizing qemu threads
|
|
that perform operations with virtual hardware. These operations may change
|
|
system's state (e.g., change some register or generate interrupt) and
|
|
therefore should execute synchronously with CPU thread.
|
|
|
|
Every event in the log includes 1-byte event id and optional arguments.
|
|
When argument is an array, it is stored as 4-byte array length
|
|
and corresponding number of bytes with data.
|
|
Here is the list of events that are written into the log:
|
|
|
|
- EVENT_INSTRUCTION. Instructions executed since last event.
|
|
Argument: 4-byte number of executed instructions.
|
|
- EVENT_INTERRUPT. Used to synchronize interrupt processing.
|
|
- EVENT_EXCEPTION. Used to synchronize exception handling.
|
|
- EVENT_ASYNC. This is a group of events. They are always processed
|
|
together with checkpoints. When such an event is generated, it is
|
|
stored in the queue and processed only when checkpoint occurs.
|
|
Every such event is followed by 1-byte checkpoint id and 1-byte
|
|
async event id from the following list:
|
|
- REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
|
|
callbacks that affect virtual machine state, but normally called
|
|
asynchronously.
|
|
Argument: 8-byte operation id.
|
|
- REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
|
|
parameters of keyboard and mouse input operations
|
|
(key press/release, mouse pointer movement).
|
|
Arguments: 9-16 bytes depending of input event.
|
|
- REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
|
|
- REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
|
|
initiated by the sender.
|
|
Arguments: 1-byte character device id.
|
|
Array with bytes were read.
|
|
- REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
|
|
operations with disk and flash drives with CPU.
|
|
Argument: 8-byte operation id.
|
|
- REPLAY_ASYNC_EVENT_NET. Incoming network packet.
|
|
Arguments: 1-byte network adapter id.
|
|
4-byte packet flags.
|
|
Array with packet bytes.
|
|
- EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
|
|
e.g., by closing the window.
|
|
- EVENT_CHAR_WRITE. Used to synchronize character output operations.
|
|
Arguments: 4-byte output function return value.
|
|
4-byte offset in the output array.
|
|
- EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
|
|
initiated by qemu.
|
|
Argument: Array with bytes that were read.
|
|
- EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
|
|
initiated by qemu.
|
|
Argument: 4-byte error code.
|
|
- EVENT_CLOCK + clock_id. Group of events for host clock read operations.
|
|
Argument: 8-byte clock value.
|
|
- EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
|
|
CPU, internal threads, and asynchronous input events. May be followed
|
|
by one or more EVENT_ASYNC events.
|
|
- EVENT_END. Last event in the log.
|