atomics: update documentation

Some of the constraints on operand sizes have been relaxed, so adjust the
documentation.

Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
use them correctly because they do not interoperate with sequentially-consistent
RMW operations.

Finally, extend the memory barrier pairing section to cover acquire and
release semantics in general, roughly based on the KVM Forum 2016 talk,
"<atomic.h> weapons".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini 2020-04-06 11:34:12 +02:00
parent 15e8699f00
commit de99dab06f


@ -11,10 +11,15 @@ that is consistent with the expectations of the programmer.
The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like. The two tools that
are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
also justified more often than one would like;
the most performance-critical parts of QEMU in particular require
a very low level approach to concurrency, involving memory barriers
and atomic operations. The semantics of concurrent memory accesses are governed
by the C11 memory model.
Macros defined by ``qemu/atomic.h`` fall in three camps:
QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax. ``qemu/atomic.h``
provides macros that fall in three camps:
- compiler barriers: ``barrier()``;
@ -24,13 +29,21 @@ Macros defined by ``qemu/atomic.h`` fall in three camps:
- sequentially consistent atomic access: everything else.
In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.
Compiler memory barrier
=======================
``barrier()`` prevents the compiler from moving the memory accesses either
side of it to the other side. The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.
``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side. The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.
``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
architectures, CPU guarantees are strong enough that blocking compiler
@ -73,7 +86,8 @@ operations::
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
all of which return the old value of ``*ptr``. These operations are
polymorphic; they operate on any type that is as wide as a pointer.
polymorphic; they operate on any type that is as wide as a pointer or
smaller.
Similar operations return the new value of ``*ptr``::
@ -85,36 +99,28 @@ Similar operations return the new value of ``*ptr``::
typeof(*ptr) atomic_or_fetch(ptr, val)
typeof(*ptr) atomic_xor_fetch(ptr, val)
Sequentially consistent loads and stores can be done using::
atomic_fetch_add(ptr, 0) for loads
atomic_xchg(ptr, val) for stores
However, they are quite expensive on some platforms, notably POWER and
Arm. Therefore, qemu/atomic.h provides two primitives with slightly
weaker constraints::
``qemu/atomic.h`` also provides loads and stores that cannot be reordered
with each other::
typeof(*ptr) atomic_mb_read(ptr)
void atomic_mb_set(ptr, val)
The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).
However these do not provide sequential consistency and, in particular,
they do not participate in the total ordering enforced by
sequentially-consistent operations. For this reason they are deprecated.
They should instead be replaced with any of the following (ordered from
easiest to hardest):
As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.
- accesses inside a mutex or spinlock
However, and this is the important difference between
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
for both threads to access the same volatile variable. It is not the
case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g. The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.
- lightweight synchronization primitives such as ``QemuEvent``
- RCU operations (``atomic_rcu_read``, ``atomic_rcu_set``) when publishing
or accessing a new version of a data structure
These operations operate on any type that is as wide as an int or smaller.
- other atomic accesses: ``atomic_read`` and ``atomic_load_acquire`` for
loads, ``atomic_set`` and ``atomic_store_release`` for stores, ``smp_mb``
to forbid reordering subsequent loads before a store.
Weak atomic access and manual memory barriers
@ -122,9 +128,24 @@ Weak atomic access and manual memory barriers
Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
In general, if the algorithm you are writing includes both writes
and reads on the same side, it is generally simpler to use sequentially
consistent primitives.
The only guarantees that you can rely upon in this case are:
- atomic accesses will not cause data races (and hence undefined behavior);
ordinary accesses instead cause data races if they are concurrent with
other accesses of which at least one is a write. In order to ensure this,
the compiler will not optimize accesses out of existence, create unsolicited
accesses, or perform other similar optimizations.
- acquire operations will appear to happen, with respect to the other
components of the system, before all the LOAD or STORE operations
specified afterwards.
- release operations will appear to happen, with respect to the other
components of the system, after all the LOAD or STORE operations
specified before.
- release operations will *synchronize with* acquire operations;
see :ref:`acqrel` for a detailed explanation.
When using this model, variables are accessed with:
@ -142,9 +163,9 @@ When using this model, variables are accessed with:
- ``atomic_store_release()``, which guarantees the STORE to appear to
happen, with respect to the other components of the system,
after all the LOAD or STORE operations specified afterwards.
after all the LOAD or STORE operations specified before.
Operations coming after ``atomic_store_release()`` can still be
reordered after it.
reordered before it.
Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
@ -208,168 +229,188 @@ They come in six kinds:
dependency and a full read barrier or better is required.
This is the set of barriers that is required *between* two ``atomic_read()``
and ``atomic_set()`` operations to achieve sequential consistency:
Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:
+----------------+-------------------------------------------------------+
| | 2nd operation |
| +------------------+-----------------+------------------+
| 1st operation | (after last) | atomic_read | atomic_set |
+----------------+------------------+-----------------+------------------+
| (before first) | .. | none | smp_mb_release() |
+----------------+------------------+-----------------+------------------+
| atomic_read | smp_mb_acquire() | smp_rmb() [1]_ | [2]_ |
+----------------+------------------+-----------------+------------------+
| atomic_set | none | smp_mb() [3]_ | smp_wmb() |
+----------------+------------------+-----------------+------------------+
+----------------------------------+----------------------------------+
| thread 1 | thread 2 |
+==================================+==================================+
| :: | :: |
| | |
| atomic_store_release(&a, x); | y = atomic_load_acquire(&b); |
| atomic_store_release(&b, y); | x = atomic_load_acquire(&a); |
+----------------------------------+----------------------------------+
.. [1] Or smp_read_barrier_depends().
In this case, correctness is easy to check for using the "pairing"
trick that is explained below.
.. [2] This requires a load-store barrier. This is achieved by
either smp_mb_acquire() or smp_mb_release().
Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables). In this case, it is possible to "hoist" the barriers
outside a loop. For example:
.. [3] This requires a store-load barrier. On most machines, the only
way to achieve this is a full barrier.
+------------------------------------------+----------------------------------+
| before | after |
+==========================================+==================================+
| :: | :: |
| | |
| n = 0; | n = 0; |
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
| n += atomic_load_acquire(&a[i]); | n += atomic_read(&a[i]); |
| | smp_mb_acquire(); |
+------------------------------------------+----------------------------------+
| :: | :: |
| | |
| | smp_mb_release(); |
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
+------------------------------------------+----------------------------------+
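The hoisting transformation above can be tried outside QEMU with plain C11 atomics. The following is only a sketch: it assumes that ``atomic_read()``/``atomic_set()`` behave like relaxed C11 accesses and that ``smp_mb_acquire()``/``smp_mb_release()`` behave like acquire/release fences; ``payload``, ``ready``, ``publisher`` and ``hoisted_barrier_demo`` are hypothetical names that do not appear in ``qemu/atomic.h``.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define N 10

static int payload[N];          /* plain data, published via ready[] */
static atomic_bool ready[N];    /* one flag per payload slot */

static void *publisher(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++) {
        payload[i] = i + 1;
    }
    /* One release fence for the whole loop (stand-in for smp_mb_release()). */
    atomic_thread_fence(memory_order_release);
    for (int i = 0; i < N; i++) {
        /* Relaxed store (stand-in for atomic_set()). */
        atomic_store_explicit(&ready[i], true, memory_order_relaxed);
    }
    return NULL;
}

/* Returns the sum of the payloads once every flag has been observed. */
int hoisted_barrier_demo(void)
{
    pthread_t t;
    int n, sum = 0;

    for (int i = 0; i < N; i++) {
        payload[i] = 0;
        atomic_store(&ready[i], false);
    }
    if (pthread_create(&t, NULL, publisher, NULL) != 0) {
        return -1;
    }
    do {
        n = 0;
        for (int i = 0; i < N; i++) {
            /* Relaxed load (stand-in for atomic_read()). */
            n += atomic_load_explicit(&ready[i], memory_order_relaxed);
        }
    } while (n != N);
    /* One acquire fence after the loop (stand-in for smp_mb_acquire()). */
    atomic_thread_fence(memory_order_acquire);
    for (int i = 0; i < N; i++) {
        sum += payload[i];
    }
    pthread_join(t, NULL);
    return sum;
}
```

Calling ``hoisted_barrier_demo()`` from a test program should return the sum of all ten payloads: the release fence pairs with the acquire fence, so no per-element acquire or release is needed.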
Splitting a loop can also be useful to reduce the number of barriers:
+------------------------------------------+----------------------------------+
| before | after |
+==========================================+==================================+
| :: | :: |
| | |
| n = 0; | smp_mb_release(); |
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
| smp_mb(); | smp_mb(); |
| n += atomic_read(&b[i]); | n = 0; |
| } | for (i = 0; i < 10; i++) |
| | n += atomic_read(&b[i]); |
+------------------------------------------+----------------------------------+
In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
as well) ``smp_wmb()``:
+------------------------------------------+----------------------------------+
| before | after |
+==========================================+==================================+
| :: | :: |
| | |
| | smp_mb_release(); |
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
| atomic_store_release(&a[i], false); | atomic_set(&a[i], false); |
| atomic_store_release(&b[i], false); | smp_wmb(); |
| } | for (i = 0; i < 10; i++) |
| | atomic_set(&b[i], false); |
+------------------------------------------+----------------------------------+
You can see that the two possible definitions of ``atomic_mb_read()``
and ``atomic_mb_set()`` are the following:
.. _acqrel:
1) | atomic_mb_read(p) = atomic_read(p); smp_mb_acquire()
| atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
Acquire/release pairing and the *synchronizes-with* relation
------------------------------------------------------------
2) | atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire()
| atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
Atomic operations other than ``atomic_set()`` and ``atomic_read()`` have
either *acquire* or *release* semantics [#rmw]_. This has two effects:
Usually the former is used, because ``smp_mb()`` is expensive and a program
normally has more reads than writes. Therefore it makes more sense to
make ``atomic_mb_set()`` the more expensive operation.
.. [#rmw] Read-modify-write operations can have both---acquire applies to the
read part, and release to the write.
There are two common cases in which atomic_mb_read and atomic_mb_set
generate too many memory barriers, and thus it can be useful to manually
place barriers, or use atomic_load_acquire/atomic_store_release instead:
- within a thread, they are ordered either before subsequent operations
(for acquire) or after previous operations (for release).
- when a data structure has one thread that is always a writer
and one thread that is always a reader, manual placement of
memory barriers makes the write side faster. Furthermore,
correctness is easy to check for in this case using the "pairing"
trick that is explained below:
- if a release operation in one thread *synchronizes with* an acquire operation
in another thread, the ordering constraints propagate from the first to the
second thread. That is, everything before the release operation in the
first thread is guaranteed to *happen before* everything after the
acquire operation in the second thread.
+----------------------------------------------------------------------+
| thread 1 |
+-----------------------------------+----------------------------------+
| before | after |
+===================================+==================================+
| :: | :: |
| | |
| (other writes) | |
| atomic_mb_set(&a, x) | atomic_store_release(&a, x) |
| atomic_mb_set(&b, y) | atomic_store_release(&b, y) |
+-----------------------------------+----------------------------------+
The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics. For example:
+----------------------------------------------------------------------+
| thread 2 |
+-----------------------------------+----------------------------------+
| before | after |
+===================================+==================================+
| :: | :: |
| | |
| y = atomic_mb_read(&b) | y = atomic_load_acquire(&b) |
| x = atomic_mb_read(&a) | x = atomic_load_acquire(&a) |
| (other reads) | |
+-----------------------------------+----------------------------------+
- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
release semantics and synchronizes with a ``pthread_mutex_lock`` for the
same mutex.
Note that the barrier between the stores in thread 1, and between
the loads in thread 2, has been optimized here to a write or a
read memory barrier respectively. On some architectures, notably
ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
smp_mb, but smp_rmb and/or smp_wmb are more efficient.
- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
``pthread_cond_wait`` has both release semantics (synchronizing with
``pthread_mutex_lock``) and acquire semantics (synchronizing with
``pthread_mutex_unlock`` and signaling of the condition variable).
- sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables). In this case, it is possible to "hoist" the implicit
barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
a loop. For example, the above definition ``atomic_mb_read()`` gives
the following transformation:
- ``pthread_create`` has release semantics and synchronizes with the start
of the new thread; ``pthread_join`` has acquire semantics and synchronizes
with the exiting of the thread.
+-----------------------------------+----------------------------------+
| before | after |
+===================================+==================================+
| :: | :: |
| | |
| n = 0; | n = 0; |
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
| n += atomic_mb_read(&a[i]); | n += atomic_read(&a[i]); |
| | smp_mb_acquire(); |
+-----------------------------------+----------------------------------+
- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
acquire semantics.
Similarly, atomic_mb_set() can be transformed as follows:
In the following example there are no atomic accesses, but
thread 2 still relies on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):
+-----------------------------------+----------------------------------+
| before | after |
+===================================+==================================+
| :: | :: |
| | |
| | smp_mb_release(); |
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
| atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
| | smp_mb(); |
+-----------------------------------+----------------------------------+
+----------------------+-------------------------------+
| thread 1 | thread 2 |
+======================+===============================+
| :: | :: |
| | |
| *a = 1; | |
| pthread_exit(a); | pthread_join(thread1, &a); |
| | x = *a; |
+----------------------+-------------------------------+
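A complete version of this example might look as follows. This is a minimal sketch using only POSIX threads; ``thread1`` and ``join_demo`` are hypothetical names, and the integer is passed to the thread by pointer purely for illustration.

```c
#include <assert.h>
#include <pthread.h>

static void *thread1(void *arg)
{
    int *a = arg;

    *a = 1;             /* plain store, no atomics involved */
    pthread_exit(a);    /* release: synchronizes with pthread_join() */
}

/* Returns the value that thread 2 reads after joining thread 1. */
int join_demo(void)
{
    pthread_t t;
    void *ret;
    int value = 0;

    if (pthread_create(&t, NULL, thread1, &value) != 0) {
        return -1;
    }
    pthread_join(t, &ret);      /* acquire: everything before the exit is visible */
    return *(int *)ret;         /* plain load, no data race */
}
```

Because the join happens after the exit, the plain load in ``join_demo()`` is guaranteed to observe the plain store made by ``thread1``.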
Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation. Therefore, atomic operations
other than ``atomic_set()`` and ``atomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa. This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!
The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``atomic_load_acquire()`` and ``atomic_rcu_read()`` all count
as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
``atomic_store_release()`` and ``atomic_rcu_set()`` all count as release
operations. ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation. Here is an example:
The two tricks can be combined. In this case, splitting a loop in
two lets you hoist the barriers out of the loops _and_ eliminate the
expensive ``smp_mb()``:
+----------------------+------------------------------+
| thread 1 | thread 2 |
+======================+==============================+
| :: | :: |
| | |
| atomic_set(&a, 1); | |
| smp_wmb(); | |
| atomic_set(&b, 2); | x = atomic_read(&b); |
| | smp_rmb(); |
| | y = atomic_read(&a); |
+----------------------+------------------------------+
+-----------------------------------+----------------------------------+
| before | after |
+===================================+==================================+
| :: | :: |
| | |
| | smp_mb_release(); |
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
| atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
| atomic_mb_set(&b[i], false); | smp_wmb(); |
| } | for (i = 0; i < 10; i++) |
| | atomic_set(&b[i], false); |
| | smp_mb(); |
+-----------------------------------+----------------------------------+
Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on a variable ``x``, while a release barrier
synchronizes with any acquire operation. The following example shows
correct synchronization:
+--------------------------------+--------------------------------+
| thread 1 | thread 2 |
+================================+================================+
| :: | :: |
| | |
| atomic_set(&a, 1); | |
| atomic_store_release(&b, 2); | x = atomic_load_acquire(&b); |
| | y = atomic_read(&a); |
+--------------------------------+--------------------------------+
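The same pattern can be exercised as a standalone program. The sketch below uses C11 ``<stdatomic.h>`` rather than ``qemu/atomic.h``, on the assumption that ``atomic_store_release()`` and ``atomic_load_acquire()`` correspond to release stores and acquire loads, and ``atomic_set()``/``atomic_read()`` to relaxed accesses; ``writer`` and ``message_passing_demo`` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int a;    /* data */
static atomic_int b;    /* flag that carries the release/acquire ordering */

static void *writer(void *arg)
{
    (void)arg;
    /* Stand-in for atomic_set(&a, 1). */
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    /* Stand-in for atomic_store_release(&b, 2). */
    atomic_store_explicit(&b, 2, memory_order_release);
    return NULL;
}

/* Returns the value of a observed after b has been seen as 2. */
int message_passing_demo(void)
{
    pthread_t t;
    int y;

    atomic_store(&a, 0);
    atomic_store(&b, 0);
    if (pthread_create(&t, NULL, writer, NULL) != 0) {
        return -1;
    }
    /* Stand-in for atomic_load_acquire(&b); wait until the store is seen. */
    while (atomic_load_explicit(&b, memory_order_acquire) != 2) {
        /* spin */
    }
    /* Stand-in for atomic_read(&a); ordered after the acquire load. */
    y = atomic_load_explicit(&a, memory_order_relaxed);
    pthread_join(t, NULL);
    return y;
}
```

Once the acquire load has read the value written by the release store on the same variable ``b``, the relaxed read of ``a`` cannot return the old value.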
Memory barrier pairing
----------------------
A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier. In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!
For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
both count as read barriers. A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier. A full barrier can pair with anything.
For example:
+--------------------+------------------------------+
| thread 1 | thread 2 |
+====================+==============================+
| :: | :: |
| | |
| a = 1; | |
| smp_wmb(); | |
| b = 2; | x = b; |
| | smp_rmb(); |
| | y = a; |
+--------------------+------------------------------+
Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes with*
relation.
Note that the "writing" thread is accessing the variables in the
opposite order as the "reading" thread. This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa. The same is true for more than 2
access and for data dependency barriers:
before a release operation will normally match the loads after
the acquire operation, and vice versa. In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.
Finally, this more complex example has more than two accesses and data
dependency barriers. It also does not use atomic accesses whenever there
cannot be a data race:
+----------------------+------------------------------+
| thread 1 | thread 2 |
@ -380,19 +421,15 @@ access and for data dependency barriers:
| smp_wmb(); | |
| x->i = 2; | |
| smp_wmb(); | |
| a = x; | x = a; |
| atomic_set(&a, x); | x = atomic_read(&a); |
| | smp_read_barrier_depends(); |
| | y = x->i; |
| | smp_read_barrier_depends(); |
| | z = b[y]; |
+----------------------+------------------------------+
``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
Comparison with Linux kernel memory barriers
============================================
Comparison with Linux kernel primitives
=======================================
Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:
@ -426,19 +463,43 @@ and memory barriers, and the equivalents in QEMU:
``atomic_cmpxchg`` returns the old value of the variable
===================== =========================================
In QEMU, the second kind does not exist. Currently Linux has
atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub.
In QEMU, the second kind is named ``atomic_OP_fetch``.
- different atomic read-modify-write operations in Linux imply
a different set of memory barriers; in QEMU, all of them enforce
sequential consistency, which means they imply full memory barriers
before and after the operation.
sequential consistency.
- Linux does not have an equivalent of ``atomic_mb_set()``. In particular,
note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
``atomic_mb_read()`` compiles to the same instructions as Linux's
``smp_load_acquire()``, but this should be treated as an implementation
detail.
- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
the total ordering enforced by sequentially-consistent operations.
This is because QEMU uses the C11 memory model. The following example
is correct in Linux but not in QEMU:
+----------------------------------+--------------------------------+
| Linux (correct) | QEMU (incorrect) |
+==================================+================================+
| :: | :: |
| | |
| a = atomic_fetch_add(&x, 2); | a = atomic_fetch_add(&x, 2); |
| b = READ_ONCE(y); | b = atomic_read(&y); |
+----------------------------------+--------------------------------+
because the read of ``y`` can be moved (by either the processor or the
compiler) before the write of ``x``.
Fixing this requires an ``smp_mb()`` memory barrier between the write
of ``x`` and the read of ``y``. In the common case where only one thread
writes ``x``, it is also possible to write it like this:
+--------------------------------+
| QEMU (correct) |
+================================+
| :: |
| |
| a = atomic_read(&x); |
| atomic_set(&x, a + 2); |
| smp_mb(); |
| b = atomic_read(&y); |
+--------------------------------+
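The store-load ordering that ``smp_mb()`` provides can be illustrated with C11 atomics. This is a sketch under the assumption that ``smp_mb()`` behaves like a sequentially consistent fence and that ``atomic_set()``/``atomic_read()`` behave like relaxed accesses; ``t0``, ``t1`` and ``store_load_demo`` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;
static int r0, r1;      /* results, read only after both threads are joined */

static void *t0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);  /* like atomic_set() */
    atomic_thread_fence(memory_order_seq_cst);           /* like smp_mb() */
    r0 = atomic_load_explicit(&y, memory_order_relaxed); /* like atomic_read() */
    return NULL;
}

static void *t1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Returns how many of the two threads observed the other thread's store. */
int store_load_demo(void)
{
    pthread_t a, b;

    atomic_store(&x, 0);
    atomic_store(&y, 0);
    r0 = r1 = 0;
    if (pthread_create(&a, NULL, t0, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&b, NULL, t1, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return r0 + r1;
}
```

With the full fence between each store and the following load, at least one thread must see the other's store, so ``store_load_demo()`` never returns 0. Without the fences, both loads could be moved before the stores and both threads could read 0.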
Sources
=======