The raw block driver was blindly copying all limits from bs->file,
even though: 1. the main bdrv_refresh_limits() already does this
for many of the limits, and 2. blindly copying from the children
can weaken any stricter limits that were already inherited from
the backing chain during the main bdrv_refresh_limits(). Also,
a future patch is about to move .request_alignment into
BlockLimits, and that is a limit that should NOT be copied from
other layers in the BDS chain.
Thus, we can completely drop raw_refresh_limits(), and rely on
the block layer setting up the proper limits.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sector-based limits are awkward to think about; in our on-going
quest to move to byte-based interfaces, convert max_discard and
discard_alignment. Rename them, using 'pdiscard' as an aid to
track which remaining discard interfaces need conversion, and so
that the compiler will help us catch the change in semantics
across any rebased code. The BlockLimits type is now completely
byte-based; and in iscsi.c, sector_limits_lun2qemu() is no
longer needed.
pdiscard_alignment is made unsigned (we use power-of-2 alignments
as bitmasks, where unsigned is easier to think about) while
leaving max_pdiscard signed (since we still have an 'int'
interface); this is comparable to what commit cf081fc did for
write zeroes limits. We may later want to make everything an
unsigned 64-bit limit - but that requires a bigger code audit.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sector-based limits are awkward to think about; in our on-going
quest to move to byte-based interfaces, convert max_transfer_length
and opt_transfer_length. Rename them (dropping the _length suffix)
so that the compiler will help us catch the change in semantics
across any rebased code, and improve the documentation. Use unsigned
values, so that we don't have to worry about negative values and
so that bit-twiddling is easier; however, we are still constrained
by 2^31 of signed int in most APIs.
When a value comes from an external source (iscsi and raw-posix),
sanitize the results to ensure that opt_transfer is a power of 2.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
Now that all drivers have been updated to supply an override
of request_alignment during their .bdrv_refresh_limits(), as
needed, the block layer itself can defer setting the default
alignment until part of the overall bdrv_refresh_limits().
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
Add a .bdrv_refresh_limits() to all four of our legacy devices
that will always be sector-only (bochs, cloop, dmg, vvfat), in
spite of their recent conversion to expose a byte interface.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
In this case, raw_probe_alignment() already did what we needed,
so just fix its signature and wire it in correctly.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to eventually stick request_alignment alongside other
BlockLimits, but first, we must ensure it is populated at the
same time as all other limits, rather than being a special case
that is set only when a block is first opened.
Note that when the user does not provide "align", then we were
defaulting to bs->request_alignment - but at this stage in the
initialization, that was always 512. We were also rejecting an
explicit "align":0 from the user; this patch now allows that,
as an explicit request for the default alignment (which may not
always be 512 in the future).
qemu-iotests 77 is particularly sensitive to the fact that we
can specify an artificial alignment override in blkdebug, and
that override must continue to work even when limits are
refreshed on an already open device.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Making all callers special-case 0 as unlimited is awkward,
and we DO have a hard maximum of BDRV_REQUEST_MAX_SECTORS given
our current block layer API limits.
In the case of scsi, this means that we now always advertise a
limit to the guest, even in cases where the underlying layers
previously use 0 for no inherent limit beyond the block layer.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function sector_limits_lun2qemu() returns a value in units of
the block layer's 512-byte sector, and can be as large as
0x40000000, which is much larger than the block layer's inherent
limit of BDRV_REQUEST_MAX_SECTORS. The block layer already
handles '0' as a synonym to the inherent limit, and it is nicer
to return this value than it is to calculate an arbitrary
maximum, for two reasons: we want to ensure that the block layer
continues to special-case '0' as 'no limit beyond the inherent
limits'; and we want to be able to someday expand the block
layer to allow 64-bit limits, where auditing for uses of
BDRV_REQUEST_MAX_SECTORS will help us make sure we aren't
artificially constraining iscsi to old block layer limits.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We were basing the advertisement of maximum discard and transfer
length off of UINT32_MAX, but since the rest of the block layer
has signed int limits on a transaction, nothing could ever reach
that maximum, and we risk overflowing an int once things are
converted to byte-based rather than sector-based limits. What's
more, we DO have a much smaller limit: both the current kernel
and qemu-nbd have a hard limit of 32M on a read or write
transaction, and while they may also permit up to a full 32 bits
on a discard transaction, the upstream NBD protocol is proposing
wording that without any explicit advertisement otherwise,
clients should limit ALL requests to the same limits as read and
write, even though the other requests do not actually require as
many bytes across the wire. So the better limit to tell the
block layer is 32M for both values.
Behavior doesn't actually change with this patch (the block layer
is currently ignoring the max_transfer advertisements); but when
that problem is fixed in a later series, this patch will prevent
the exposure of a latent bug.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The NBD layer was breaking up request at a limit of 2040 sectors
(just under 1M) to cater to old qemu-nbd. But the server limit
was raised to 32M in commit 2d8214885 to match the kernel, more
than three years ago; and the upstream NBD Protocol is proposing
documentation that without any explicit communication to state
otherwise, a client should be able to safely assume that a 32M
transaction will work. It is time to rely on the larger sizing,
and any downstream distro that cares about maximum
interoperability to older qemu-nbd servers can just tweak the
value of #define NBD_MAX_SECTORS.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-stable@nongnu.org
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the amount of data to read ends exactly on the total size
of the bs, then we were wasting time creating a local qiov
to read the data in preparation for what would normally be
appending zeroes beyond the end, even though this corner case
has nothing further to do.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We don't pass any flags on to drivers to handle. Tighten an
assert to explain why we pass 0 to bdrv_driver_preadv(), and add
some comments on things to be aware of if we want to turn on
per-BDS BDRV_REQ_FUA support during reads in the future. Also,
document that we may want to consider using unmap during
copy-on-read operations where the read is all zeroes.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For symmetry with bdrv_aligned_preadv(), assert that the caller
really has aligned things properly. This requires adding an align
parameter, which is used now only in the new asserts, but will
come in handy in a later patch that adds auto-fragmentation to the
max transfer size, since that value need not always be a multiple
of the alignment, and therefore must be rounded down.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 94d047a added an assertion the the request alignment check.
This introduced 2 issues:
a) A off-by-one error since a request of BDRV_REQUEST_MAX_SECTORS
is actually allowed.
b) The bdrv_get_block_status call in the read path to check the allocation
status requests up to INT_MAX sectors which triggers the assertion.
Fixes: 94d047a35b
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1466414680-18383-1-git-send-email-pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
s/target bs/to_replace/, also we check to_replace bs is not
blocked in qmp_drive_mirror() not here
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1466672241-22485-3-git-send-email-xiecl.fnst@cn.fujitsu.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Callback for block job should always exist
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1466672241-22485-2-git-send-email-xiecl.fnst@cn.fujitsu.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
During the refactor of mirror_iteration in e5b43573,
we regressed the fix introduced in cae98cb8.
This patch re-adds IOV_MAX checking to cases where we
aren't checking alignment (and size) already.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466625064-11280-3-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
mirror_do_read intends to return the number of sectors processed after
the starting sector, without regard to how many sectors were processed
before the starting sector due to alignment.
Clean up the comments and code to hopefully illustrate this more clearly.
This also fixes an issue in initialization where if the mirror buffer size
is initialized to smaller than the number of sectors being requested for
transfer, we report back an incorrectly large number to the caller.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466625064-11280-2-git-send-email-jsnow@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This adds commandline support for the logging level of the
gluster protocol driver, output to stdout. The option is 'debug',
e.g.:
-drive filename=gluster://192.168.15.180/gv2/test.qcow2,debug=9
Debug levels are 0-9, with 9 being the most verbose, and 0 representing
no debugging output. The default is the same as it was before, which
is a level of 4. The current logging levels defined in the gluster
source are:
0 - None
1 - Emergency
2 - Alert
3 - Critical
4 - Error
5 - Warning
6 - Notice
7 - Info
8 - Debug
9 - Trace
(From: glusterfs/logging.h)
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
trace_mirror_yield_in_flight accepts 2nd arguments in sectors while here
we pass chunks instead.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1466518157-27140-1-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
upcoming libnfs will have support for a read cache that can
significantly help to speed up requests since libnfs by design
circumvents the kernel cache.
Example:
qemu -cdrom nfs://127.0.0.1/iso/my.iso?pagecache=1024
The pagecache parameters takes the maximum amount of pages to
cache. A page in libnfs is always the NFS_BLKSIZE which is
4KB.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1463662083-20814-3-git-send-email-pl@kamp.de
Signed-off-by: Jeff Cody <jcody@redhat.com>
if we open a NFS export with disabled cache we should refuse
the readahead feature as it will cache data inside libnfs.
If a export was opened with readahead enabled it should
futher not be allowed to disable the cache while running.
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 1463662083-20814-2-git-send-email-pl@kamp.de
Signed-off-by: Jeff Cody <jcody@redhat.com>
GlusterFS 3.8 contains support for SEEK_DATA and SEEK_HOLE. This makes
it possible to detect sparse areas in files.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Version: GnuPG v1
iQEcBAABAgAGBQJXaFInAAoJEJykq7OBq3PI6VsH/0Sfgbdo1RksYuQwb/y92sCW
EN+lxUZ+OLfgrc8PYgNZwfSM3rsfYhznL0MAXOeEe7Ahabi07w7DhGR8WvwfAOlI
G96FRuvrIPfv5u6U6fwS4CvG3TIHVLxfHKCsTpPUmH8U5CNx/x/tpjNiWN1dj6t+
sXybSjYHfZfiZy2tI9MFIFWCdxnF/pl0QAPhbRqc8Y/RQTDrPKRjLpz+nitN/u96
5TS7KlELyQuP91YMmLceYSmIkHbxW703h+iE2n4hov0uZCP8Jil+2Jsd3ziQSRlL
j6LqexQ2ViBGdDSfiZGYES2VPlsHOCwb4G+IgWBStfZg1ppaXENvcDzPrgrB+L4=
=eUnF
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/stefanha/tags/tracing-pull-request' into staging
# gpg: Signature made Mon 20 Jun 2016 21:29:27 BST
# gpg: using RSA key 0x9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>"
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8
* remotes/stefanha/tags/tracing-pull-request: (42 commits)
trace: split out trace events for linux-user/ directory
trace: split out trace events for qom/ directory
trace: split out trace events for target-ppc/ directory
trace: split out trace events for target-s390x/ directory
trace: split out trace events for target-sparc/ directory
trace: split out trace events for net/ directory
trace: split out trace events for audio/ directory
trace: split out trace events for ui/ directory
trace: split out trace events for hw/alpha/ directory
trace: split out trace events for hw/arm/ directory
trace: split out trace events for hw/acpi/ directory
trace: split out trace events for hw/vfio/ directory
trace: split out trace events for hw/s390x/ directory
trace: split out trace events for hw/pci/ directory
trace: split out trace events for hw/ppc/ directory
trace: split out trace events for hw/9pfs/ directory
trace: split out trace events for hw/i386/ directory
trace: split out trace events for hw/isa/ directory
trace: split out trace events for hw/sd/ directory
trace: split out trace events for hw/sparc/ directory
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Move all trace-events for files in the block/ directory to
their own file.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 1466066426-16657-7-git-send-email-berrange@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use Coccinelle script to replace 'ret = E; return ret' with
'return E'. The script will do the substitution only when the
function return type and variable type are the same.
Manual fixups:
* audio/audio.c: coding style of "read (...)" and "write (...)"
* block/qcow2-cluster.c: wrap line to make it shorter
* block/qcow2-refcount.c: change indentation of wrapped line
* target-tricore/op_helper.c: fix coding style of
"remainder|quotient"
* target-mips/dsp_helper.c: reverted changes because I don't
want to argue about checkpatch.pl
* ui/qemu-pixman.c: fix line indentation
* block/rbd.c: restore blank line between declarations and
statements
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <1465855078-19435-4-git-send-email-ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Unused Coccinelle rule name dropped along with a redundant comment;
whitespace touched up in block/qcow2-cluster.c; stale commit message
paragraph deleted]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
This patch simplifies code that uses a local_err variable just to
immediately use it for an error_propagate() call.
Coccinelle patch used to perform the changes added to
scripts/coccinelle/remove_local_err.cocci.
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <1465855078-19435-3-git-send-email-ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Blank line in s390-virtio-ccw.c restored]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
error_propagate() already ignores local_err==NULL, so there's no
need to check it before calling.
Coccinelle patch used to perform the changes added to
scripts/coccinelle/error_propagate_null.cocci.
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <1465855078-19435-2-git-send-email-ehabkost@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Move s->target to the new AioContext when there is an AioContext change.
The backup_run() coroutine does not use asynchronous I/O so there is no
need to wait for in-flight requests in a BlockJobDriver->pause()
callback.
Guest writes are intercepted by the backup job. Treat them as guest
activity and do it even while the job is paused. This is necessary
since the only alternative would be to fail a job that experienced guest
writes during pause once the job is resumed. In practice the guest
writes don't interfere with AioContext switching since bdrv_drain() is
used by bdrv_set_aio_context().
Loops already contain pause points because of block_job_sleep_ns() calls
in the yield_and_check() helper function. It is necessary to convert a
raw qemu_coroutine_yield() to block_job_yield() so the
MIRROR_SYNC_MODE_NONE case can pause.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466096189-6477-9-git-send-email-stefanha@redhat.com
Add block_job_pause_point() calls to mark quiescent points and make sure
to complete in-flight requests when switching AioContexts.
This patch solves undefined behavior in the mirror block job when the
BDS AioContext is changed by dataplane.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1466096189-6477-8-git-send-email-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is mandatory for correct backup creation. In the other case the
content under this area would be lost.
Dirty bits are set exactly like in bdrv_aligned_pwritev, i.e. they are set
even if notifier has returned a error.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1466093381-6120-4-git-send-email-den@openvz.org
CC: Fam Zheng <famz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Actually we must set dirty bitmap dirty after we have written all our
zeroes for correct processing in drive mirror code. In the other case
we can face not zeroes in this area in mirror_iteration.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1466093381-6120-3-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The request area is specified in bytes, not in sectors.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy<vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1466093381-6120-2-git-send-email-den@openvz.org
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu/osdep.h checks whether MAP_ANONYMOUS is defined, but this check
is bogus without a previous inclusion of sys/mman.h. Include it in
sysemu/os-posix.h and remove it from everywhere else.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The null block driver ignores any filename used for creating its BDSs,
which allows creating such BDSs even without any filename at all. In
that case, we currently construct a JSON filename when queried instead
of a plain "null-co://" or "null-aio://". This patch implements
bdrv_refresh_filename() to remedy this behavior.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-4-mreitz@redhat.com
[mreitz@redhat.com: Added commit message]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Currently, we are trying to move the backing BDS from the source to the
target in bdrv_replace_in_backing_chain() which is called from
mirror_exit(). However, mirror_complete() already tries to open the
target's backing chain with a call to bdrv_open_backing_file().
First, we should only set the target's backing BDS once. Second, the
mirroring block job has a better idea of what to set it to than the
generic code in bdrv_replace_in_backing_chain() (in fact, the latter's
conditions on when to move the backing BDS from source to target are not
really correct).
Therefore, remove that code from bdrv_replace_in_backing_chain() and
leave it to mirror_complete().
Depending on what kind of mirroring is performed, we furthermore want to
use different strategies to open the target's backing chain:
- If blockdev-mirror is used, we can assume the user made sure that the
target already has the correct backing chain. In particular, we should
not try to open a backing file if the target does not have any yet.
- If drive-mirror with mode=absolute-paths is used, we can and should
reuse the already existing chain of nodes that the source BDS is in.
In case of sync=full, no backing BDS is required; with sync=top, we
just link the source's backing BDS to the target, and with sync=none,
we use the source BDS as the target's backing BDS.
We should not try to open these backing files anew because this would
lead to two BDSs existing per physical file in the backing chain, and
we would like to avoid such concurrent access.
- If drive-mirror with mode=existing is used, we have to use the
information provided in the physical image file which means opening
the target's backing chain completely anew, just as it has been done
already.
If the target's backing chain shares images with the source, this may
lead to multiple BDSs per physical image file. But since we cannot
reliably ascertain this case, there is nothing we can do about it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20160610185750.30956-3-mreitz@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Ceph RBD block driver does not use error_setg_errno() where
it is possible to use. This patch replaces error_setg()
from error_setg_errno().
Signed-off-by: Vikhyat Umrao <vumrao@redhat.com>
Message-id: 1462780319-5796-1-git-send-email-vumrao@redhat.com
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
If the base or overlay images need to be reopened in read-write mode
but the block_job_create() call fails then no one will put those
images back in read-only mode.
We can solve this problem easily by calling block_job_create() first.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: aa495045770a6f1a7cc5d408397a17c75097fdd8.1464346103.git.berto@igalia.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_drain_all() pauses all block jobs by using bdrv_next() to iterate
over all top-level BlockDriverStates. Therefore the code is unable to
find block jobs in other nodes.
This patch uses block_job_next() to iterate over all block jobs.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Message-id: 55ee7d7d4a65c28aa1a1b28823897ef326f328e2.1464346103.git.berto@igalia.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
It is always true for open images now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
We don't really want to go through the block layer in order to read from
or write to the vmstate in a qcow2 image. Doing so required a few ugly
hacks like saving and restoring the old image size (because writing to
vmstate offsets would increase the image size) or disabling the "reads
after EOF = zeroes" logic. When calling the right functions directly,
these hacks aren't necessary any more.
Note that .bdrv_vmstate_load/save() return 0 instead of the number of
bytes in case of success now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This allows drivers to share code between normal I/O and vmstate
accesses.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The return value of .bdrv_load/save_vmstate() can be any non-negative
number in case of success now. It used to be bytes/-errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This brings it in line with .bdrv_save_vmstate().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
We already have a byte-based bdrv_pwritev(), but the read counterpart
was still missing. This commit adds it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
linux-aio uses a BH in order to make sure that the remaining completions
are processed even in nested event loops of completion callbacks in
order to avoid deadlocks.
There is no need, however, to have the BH overhead for the first call
into qemu_laio_completion_bh() or after all pending completions have
already been processed. Therefore, this patch calls directly into
qemu_laio_completion_bh() in qemu_laio_completion_cb() and cancels
the BH after qemu_laio_completion_bh() has processed all pending
completions.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
If block drivers say that they can do an alignment < 512 bytes, let's
just suppose they mean it. raw-posix used to be an offender with respect
to this, but it can actually deal with byte-aligned requests now.
The default is still 512 bytes for any drivers that only implement
sector-based interfaces, but it is 1 now for drivers that implement
.bdrv_co_preadv.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The raw-posix block driver actually supports byte-aligned requests now
on non-O_DIRECT images, like it already (and previously incorrectly)
claimed in bs->request_alignment.
For some block drivers this means that a RMW cycle can be avoided when
they write sub-sector metadata e.g. for cluster allocation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
In order to use the modern byte-based .bdrv_co_preadv/pwritev()
interface, this patch switches raw-posix to coroutine-based interfaces
as a first step. In terms of semantics and performance, it doesn't make
a difference with the existing code whether we go from a coroutine to a
callback-based interface already in block/io.c or only in linux-aio.c
As there have been concerns in the past that this change may be a step
in the wrong direction with respect to a possible AIO fast path, the
old callback-based interface for linux-aio is left around and can be
reactivated when a fast path (e.g. directly from virtio-blk dataplane,
bypassing the whole block layer) is implemented.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch makes bdrv_aligned_preadv() ready to accept byte-aligned
requests. Note that this doesn't mean that such requests are actually
made. The caller still ensures that all requests are aligned to at least
512 bytes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
In a first step to convert the common I/O path to work on bytes rather
than sectors, this converts the copy-on-read logic that is used by
bdrv_aligned_preadv().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Back in the 2.3.0 release we declared qcow[2] encryption as
deprecated, warning people that it would be removed in a future
release.
commit a1f688f415
Author: Markus Armbruster <armbru@redhat.com>
Date: Fri Mar 13 21:09:40 2015 +0100
block: Deprecate QCOW/QCOW2 encryption
The code still exists today, but by a (happy?) accident we entirely
broke the ability to use qcow[2] encryption in the system emulators
in the 2.4.0 release due to
commit 8336aafae1
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Tue May 12 17:09:18 2015 +0100
qcow2/qcow: protect against uninitialized encryption key
This commit was designed to prevent future coding bugs which
might cause QEMU to read/write data on an encrypted block
device in plain text mode before a decryption key is set.
It turns out this preventative measure was a little too good,
because we already had a long standing bug where QEMU read
encrypted data in plain text mode during system emulator
startup, in order to guess disk geometry:
Thread 10 (Thread 0x7fffd3fff700 (LWP 30373)):
#0 0x00007fffe90b1a28 in raise () at /lib64/libc.so.6
#1 0x00007fffe90b362a in abort () at /lib64/libc.so.6
#2 0x00007fffe90aa227 in __assert_fail_base () at /lib64/libc.so.6
#3 0x00007fffe90aa2d2 in () at /lib64/libc.so.6
#4 0x000055555587ae19 in qcow2_co_readv (bs=0x5555562accb0, sector_num=0, remaining_sectors=1, qiov=0x7fffffffd260) at block/qcow2.c:1229
#5 0x000055555589b60d in bdrv_aligned_preadv (bs=bs@entry=0x5555562accb0, req=req@entry=0x7fffd3ffea50, offset=offset@entry=0, bytes=bytes@entry=512, align=align@entry=512, qiov=qiov@entry=0x7fffffffd260, flags=0) at block/io.c:908
#6 0x000055555589b8bc in bdrv_co_do_preadv (bs=0x5555562accb0, offset=0, bytes=512, qiov=0x7fffffffd260, flags=<optimized out>) at block/io.c:999
#7 0x000055555589c375 in bdrv_rw_co_entry (opaque=0x7fffffffd210) at block/io.c:544
#8 0x000055555586933b in coroutine_thread (opaque=0x555557876310) at coroutine-gthread.c:134
#9 0x00007ffff64e1835 in g_thread_proxy (data=0x5555562b5590) at gthread.c:778
#10 0x00007ffff6bb760a in start_thread () at /lib64/libpthread.so.0
#11 0x00007fffe917f59d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7ffff7ecab40 (LWP 30343)):
#0 0x00007fffe91797a9 in syscall () at /lib64/libc.so.6
#1 0x00007ffff64ff87f in g_cond_wait (cond=cond@entry=0x555555e085f0 <coroutine_cond>, mutex=mutex@entry=0x555555e08600 <coroutine_lock>) at gthread-posix.c:1397
#2 0x00005555558692c3 in qemu_coroutine_switch (co=<optimized out>) at coroutine-gthread.c:117
#3 0x00005555558692c3 in qemu_coroutine_switch (from_=0x5555562b5e30, to_=to_@entry=0x555557876310, action=action@entry=COROUTINE_ENTER) at coroutine-gthread.c:175
#4 0x0000555555868a90 in qemu_coroutine_enter (co=0x555557876310, opaque=0x0) at qemu-coroutine.c:116
#5 0x0000555555859b84 in thread_pool_completion_bh (opaque=0x7fffd40010e0) at thread-pool.c:187
#6 0x0000555555859514 in aio_bh_poll (ctx=ctx@entry=0x5555562953b0) at async.c:85
#7 0x0000555555864d10 in aio_dispatch (ctx=ctx@entry=0x5555562953b0) at aio-posix.c:135
#8 0x0000555555864f75 in aio_poll (ctx=ctx@entry=0x5555562953b0, blocking=blocking@entry=true) at aio-posix.c:291
#9 0x000055555589c40d in bdrv_prwv_co (bs=bs@entry=0x5555562accb0, offset=offset@entry=0, qiov=qiov@entry=0x7fffffffd260, is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at block/io.c:591
#10 0x000055555589c503 in bdrv_rw_co (bs=bs@entry=0x5555562accb0, sector_num=sector_num@entry=0, buf=buf@entry=0x7fffffffd2e0 "\321,", nb_sectors=nb_sectors@entry=21845, is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at block/io.c:614
#11 0x000055555589c562 in bdrv_read_unthrottled (nb_sectors=21845, buf=0x7fffffffd2e0 "\321,", sector_num=0, bs=0x5555562accb0) at block/io.c:622
#12 0x000055555589c562 in bdrv_read_unthrottled (bs=0x5555562accb0, sector_num=sector_num@entry=0, buf=buf@entry=0x7fffffffd2e0 "\321,", nb_sectors=nb_sectors@entry=21845) at block/io.c:634
nb_sectors@entry=1) at block/block-backend.c:504
#14 0x0000555555752e9f in guess_disk_lchs (blk=blk@entry=0x5555562a5290, pcylinders=pcylinders@entry=0x7fffffffd52c, pheads=pheads@entry=0x7fffffffd530, psectors=psectors@entry=0x7fffffffd534) at hw/block/hd-geometry.c:68
#15 0x0000555555752ff7 in hd_geometry_guess (blk=0x5555562a5290, pcyls=pcyls@entry=0x555557875d1c, pheads=pheads@entry=0x555557875d20, psecs=psecs@entry=0x555557875d24, ptrans=ptrans@entry=0x555557875d28) at hw/block/hd-geometry.c:133
#16 0x0000555555752b87 in blkconf_geometry (conf=conf@entry=0x555557875d00, ptrans=ptrans@entry=0x555557875d28, cyls_max=cyls_max@entry=65536, heads_max=heads_max@entry=16, secs_max=secs_max@entry=255, errp=errp@entry=0x7fffffffd5e0) at hw/block/block.c:71
#17 0x0000555555799bc4 in ide_dev_initfn (dev=0x555557875c80, kind=IDE_HD) at hw/ide/qdev.c:174
#18 0x0000555555768394 in device_realize (dev=0x555557875c80, errp=0x7fffffffd640) at hw/core/qdev.c:247
#19 0x0000555555769a81 in device_set_realized (obj=0x555557875c80, value=<optimized out>, errp=0x7fffffffd730) at hw/core/qdev.c:1058
#20 0x00005555558240ce in property_set_bool (obj=0x555557875c80, v=<optimized out>, opaque=0x555557875de0, name=<optimized out>, errp=0x7fffffffd730)
at qom/object.c:1514
#21 0x0000555555826c87 in object_property_set_qobject (obj=obj@entry=0x555557875c80, value=value@entry=0x55555784bcb0, name=name@entry=0x55555591cb3d "realized", errp=errp@entry=0x7fffffffd730) at qom/qom-qobject.c:24
#22 0x0000555555825760 in object_property_set_bool (obj=obj@entry=0x555557875c80, value=value@entry=true, name=name@entry=0x55555591cb3d "realized", errp=errp@entry=0x7fffffffd730) at qom/object.c:905
#23 0x000055555576897b in qdev_init_nofail (dev=dev@entry=0x555557875c80) at hw/core/qdev.c:380
#24 0x0000555555799ead in ide_create_drive (bus=bus@entry=0x555557629630, unit=unit@entry=0, drive=0x5555562b77e0) at hw/ide/qdev.c:122
#25 0x000055555579a746 in pci_ide_create_devs (dev=dev@entry=0x555557628db0, hd_table=hd_table@entry=0x7fffffffd830) at hw/ide/pci.c:440
#26 0x000055555579b165 in pci_piix3_ide_init (bus=<optimized out>, hd_table=0x7fffffffd830, devfn=<optimized out>) at hw/ide/piix.c:218
#27 0x000055555568ca55 in pc_init1 (machine=0x5555562960a0, pci_enabled=1, kvmclock_enabled=<optimized out>) at /home/berrange/src/virt/qemu/hw/i386/pc_piix.c:256
#28 0x0000555555603ab2 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4249
So the safety net is correctly preventing QEMU reading cipher
text as if it were plain text, during startup and aborting QEMU
to avoid bad usage of this data.
For added fun this bug only happens if the encrypted qcow2
file happens to have data written to the first cluster,
otherwise the cluster won't be allocated and so qcow2 would
not try the decryption routines at all, just return all 0's.
That no one even noticed, let alone reported, this bug that
has shipped in 2.4.0, 2.5.0 and 2.6.0 shows that the number
of actual users of encrypted qcow2 is approximately zero.
So rather than fix the crash, and backport it to stable
releases, just go ahead with what we have warned users about
and disable any use of qcow2 encryption in the system
emulators. qemu-img/qemu-io/qemu-nbd are still able to access
qcow2 encrypted images for the sake of data conversion.
In the future, qcow2 will gain support for the alternative
luks format, but when this happens it'll be using the
'-object secret' infrastructure for getting keys, which
avoids this problematic scenario entirely.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new BDRV_REQ_MASK constant, and use it to make sure that
caller flags are always valid.
Tested with 'make check' and with qemu-iotests on both '-raw'
and '-qcow2'; the only failure turned up was fixed in the
previous commit.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit e253f4b8 converted mirroring from sector-based bdrv_aio_*
to byte-based blk_aio_*, but failed to account for the subtle
difference in signatures (the former takes a semi-redundant length,
the latter takes a flags parameter). Since all of our flags are
currently smaller in size than BDRV_SECTOR_SIZE, it has no ill
effects until we either perform sub-sector mirroring, or we start
asserting that no unexpected flags are set. I found it while
testing new asserts when qemu-iotests 132 started warning about an
unknown flag 0x200000.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes qcow2 to implement the byte-based .bdrv_co_pwritev
interface rather than the sector-based old one.
As preallocation uses the same allocation function as normal writes, and
the interface of that function needs to be changed, it is converted in
the same patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This will allow copy on write operations where the overwritten part of
the cluster is not aligned to sector boundaries.
Also rename the function because it has nothing to do with sectors any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reading from qcow2 images is now byte granularity.
Most of the affected code in qcow2 actually gets simpler with this
change. The only exception is encryption, which is fixed on 512 bytes
blocks; in order to keep this working, bs->request_alignment is set for
encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This patch changes the units that qcow2_get_cluster_offset() uses
internally, without touching the interface just yet. This will be done
in another patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Drivers that implement .bdrv_co_pwritev() get the flags passed as an
argument to said function, but we also unconditionally emulate the flags
anyway. We shouldn't do that.
Fix this by clearing all flags that the driver supports natively after
it returns from .bdrv_co_pwritev().
Fixes: 4df863f3 ('block: Make supported_write_flags a per-bds property')
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The code to exit the loop after bdrv_snapshot_delete_by_id_or_name()
returned failure was duplicated. The first copy of it was too early so
that the AioContext lock would not be freed. This patch removes it so
that only the second, correct copy remains.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
The problem with excessive flushing was found by a couple of performance
tests:
- parallel directory tree creation (from 2 processes)
- 32 cached writes + fsync at the end in a loop
For the first one results improved from 2.6 loops/sec to 3.5 loops/sec.
Each loop creates 10^3 directories with 10 files in each.
For the second one results improved from ~600 fsync/sec to ~1100
fsync/sec. Though, it was run on SSD so it probably won't show such
performance gain on rotational media.
qcow2_cache_flush() calls bdrv_flush() unconditionally after writing
cache entries of a particular cache. This can lead to as many as
2 additional fdatasyncs inside bdrv_flush.
We can simply skip all fdatasync calls inside qcow2_co_flush_to_os
as bdrv_flush for sure will do the job. These flushes are necessary to
keep the right order of writes to the different caches. Though this is
not necessary in the current code base as this ordering is ensured through
the flush in qcow2_cache_flush_dependency().
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is sometimes a useful value we should count in.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that all drivers have been converted to a byte interface,
we no longer need a sector interface.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
[ kwolf: Fixed up trace_paio_submit_co() call for qiov == NULL ]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Kill an abuse of the comma operator while at it (fortunately,
the semantics were still right). Also, the test for requests
not aligned to clusters should be applied always, not just
when a backing file is present.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step on our continuing quest to switch to byte-based
interfaces.
As this is the first byte-based iscsi interface, convert
is_request_lun_aligned() into two versions, one for sectors
and one for bytes. Also, change from outright -EINVAL failure
on an unaligned request, to instead failing with -ENOTSUP to
trigger a read-modify-write fallback, particularly since the
block layer should be honoring bs->request_alignment to avoid
-EINVAL on read/write requests.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rename to bdrv_pwrite_zeroes() to let the compiler ensure we
cater to the updated semantics. Do the same for bdrv_co_write_zeroes().
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Update bdrv_co_do_write_zeroes() to be byte-based, and select
between the new byte-based bdrv_co_pwrite_zeroes() or the old
bdrv_co_write_zeroes(). The next patches will convert drivers,
then remove the old interface.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Another step towards removing sector-based interfaces: convert
the maximum write and minimum alignment values from sectors to
bytes. Rename the variables to let the compiler check that all
users are converted to the new semantics.
The maximum remains an int as long as BDRV_REQUEST_MAX_SECTORS
is constrained by INT_MAX (this means that we can't even
support a 2G write_zeroes, but just under it) - changing
operation lengths to unsigned or to 64-bits is a much bigger
audit, and debatable if we even want to do it (since at the
core, a 32-bit platform will still have ssize_t as its
underlying limit on write()).
Meanwhile, alignment is changed to 'uint32_t', since it makes no
sense to have an alignment larger than the maximum write, and
less painful to use an unsigned type with well-defined behavior
in bit operations than to have to worry about what happens if
a driver mistakenly supplies a negative alignment.
Add an assert that no one was trying to use sectors to get a
write zeroes larger than 2G, and therefore that a later conversion
to bytes won't be impacted by keeping the limit at 32 bits.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If hardware does not advertise a minimum zero/discard
alignment, we still want to guarantee that the block layer
will align requests to our blocks, rather than the arbitrary
512-byte BDRV sector size.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
is_zero_cluster() and is_zero_cluster_top_locked() are used only
by qcow2_co_write_zeroes(). The former is too broad (we don't
care if the sectors we are about to overwrite are non-zero, only
that all other sectors in the cluster are zero), so it needs to
be called up to twice but with smaller limits - rename it along
with adding the neeeded parameter. The latter can be inlined for
more compact code.
The testsuite change shows that we now have a sparser top file
when an unaligned write_zeroes overwrites the only portion of
the backing file with data.
Based on a patch proposal by Denis V. Lunev.
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch follows guidelines of all other tracepoints in qcow2, like ones
in qcow2_co_writev. I think that they should dump values in the same
quantities or be changed all together.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Eric Blake <eblake@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1463476543-3087-4-git-send-email-den@openvz.org>
[eblake: typo fix in commit message]
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unaligned requests will occupy only one cluster. This is true since the
previous commit. Simplify the code taking this consideration into
account.
In other words, the caller is now buggy if it ever passes us an unaligned
request that crosses cluster boundaries (the only requests that can cross
boundaries will be aligned).
There are no other changes so far.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1463476543-3087-3-git-send-email-den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We should split requests even if they are less than write_zeroes_alignment.
For example we can have the following request:
offset 62k
size 4k
write_zeroes_alignment 64k
The original code sent 1 request covering 2 qcow2 clusters, and resulted
in both clusters being allocated. But by splitting the request, we can
cater to the case where one of the two clusters can be zeroed as a
whole, for only 1 cluster allocated after the operation.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Eric Blake <eblake@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1463476543-3087-2-git-send-email-den@openvz.org>
[eblake: Avoid exceeding nb_sectors, hoist alignment checks out of
loop, and update testsuite to show that patch works]
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace (((n) + (d) - 1) /(d)) by DIV_ROUND_UP(n,d).
This patch is the result of coccinelle script
scripts/coccinelle/round.cocci
CC: qemu-block@nongnu.org
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Replace (((n) + (d) - 1) /(d)) by DIV_ROUND_UP(n,d).
This patch is the result of coccinelle script
scripts/coccinelle/round.cocci
CC: qemu-block@nongnu.org
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Replace (((n) + (d) - 1) /(d)) by DIV_ROUND_UP(n,d).
This patch is the result of coccinelle script
scripts/coccinelle/round.cocci
CC: qemu-block@nongnu.org
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Replace (((n) + (d) - 1) /(d)) by DIV_ROUND_UP(n,d).
This patch is the result of coccinelle script
scripts/coccinelle/round.cocci
CC: qemu-block@nongnu.org
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Remove glib.h includes, as it is provided by osdep.h.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Similar to the "!drv || !drv->bdrv_aio_ioctl" case above, here it is
okay to set co.ret and return. As pointed out by Paolo, a BH will be
created as necessary by the caller (bdrv_co_maybe_schedule_bh).
Besides, as pointed out by Kevin, "data" was leaked before.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 20160601015223.19277-1-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
I was thrown by the fact that the public type BlockRequest had
an anonymous union, but no obvious discriminator. Turns out
that the only client of the second branch of the union was code
internal to io.c, now that commit 91c6e4b killed public
multiwrite, so move it into io.c and improve the comments.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1463699150-19445-1-git-send-email-eblake@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
in a read-modify-write cycle a small request might cause
head and tail to fall into the same aligned block. Currently
QEMU reads the same block twice in this case which is
not necessary.
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-id: 1464607873-28206-1-git-send-email-pl@kamp.de
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
at least in the path via virtio-blk the maximum size is not
restricted.
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1464080368-29584-1-git-send-email-pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This changes the commit block job to use the job's BlockBackend for
performing its I/O. job->bs isn't used by the commit code any more
afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This changes the backup block job to use the job's BlockBackend for
performing its I/O. job->bs isn't used by the backup code any more
afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Now that we pass the job to the function, bs is implied by that.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Instead of relying on peeking at bs->job, we want to explicitly get
a reference to the job that was involved in this notifier callback.
Pack the Notifier inside of the BackupBlockJob so we can use
container_of to get a reference back to the BackupBlockJob object.
This cuts out one more case where we rely unnecessarily on bs->job.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This changes the mirror block job to use the job's BlockBackend for
performing its I/O. job->bs isn't used by the mirroring code any more
afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
We had to forbid mirroring to a target BDS that already had a BB
attached because the node swapping at job completion would add a second
BB and we didn't support multiple BBs on a single BDS at the time. Now
we do, so we can lift the restriction.
As we allow additional BlockBackends for the target, we must expect
other users to be sending requests. There may no requests be in flight
during the graph modification, so we have to drain those users now.
The core part of this patch is a revert of commit 40365552.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This changes the streaming block job to use the job's BlockBackend for
performing the COR reads. job->bs isn't used by the streaming code any
more afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Also add trace points now that the function can be directly called.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
This adds a new BlockBackend field to the BlockJob struct, which
coexists with the BlockDriverState while converting the individual jobs.
When creating a block job, a new BlockBackend is created on top of the
given BlockDriverState, and it is destroyed when the BlockJob ends. The
reference to the BDS is now held by the BlockBackend instead of calling
bdrv_ref/unref manually.
We have to be careful when we use bdrv_replace_in_backing_chain() in
block jobs because this changes the BDS that job->blk points to. At the
moment block jobs are too tightly coupled with their BDS, so that moving
a job to another BDS isn't easily possible; therefore, we need to just
manually undo this change afterwards.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The existing users of the function are:
1. blk_new_open(), which already enabled the write cache
2. Some test cases that don't care about the setting
3. blockdev_init() for empty drives, where the cache mode is overridden
with the value from the options when a medium is inserted
Therefore, this patch doesn't change the current behaviour. It will be
convenient, however, for additional users of blk_new() (like block
jobs) if the most sensible WCE setting is the default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Commit 983a1600 changed the semantics of blk_write_zeroes() to
be byte-based rather than sector-based, but did not change the
name, which is an open invitation for other code to misuse the
function. Renaming to pwrite_zeroes() makes it more in line
with other byte-based interfaces, and will help make it easier
to track which remaining write_zeroes interfaces still need
conversion.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
When changing the BlockDriverState that a BdrvChild points to while the
node is currently drained, we must call the .drained_end() parent
callback. Conversely, when this means attaching a new node that is
already drained, we need to call .drained_begin().
bdrv_root_attach_child() takes now an opaque parameter, which is needed
because the callbacks must also be called if we're attaching a new child
to the BlockBackend when the root node is already drained, and they need
a way to identify the BlockBackend. Previously, child->opaque was set
too late and the callbacks would still see it as NULL.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Until now, bdrv_drained_begin() used bdrv_drain() internally to drain
the queue. This is kind of backwards and caused quiescing code to be
duplicated because bdrv_drained_begin() had to ensure that no new
requests come in even after bdrv_drain() returns, whereas bdrv_drain()
had to have them because it could be called from other places.
Instead move the bdrv_drain() code to bdrv_drained_begin() and make
bdrv_drain() a simple wrapper around bdrv_drained_begin/end().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
blk_new() cannot fail so its Error ** parameter has become superfluous.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are no callers to bdrv_open() or bdrv_open_inherit() left that
pass a pointer to a non-NULL BDS pointer as the first argument of these
functions, so we can finally drop that parameter and just make them
return the new BDS.
Generally, the following pattern is applied:
bs = NULL;
ret = bdrv_open(&bs, ..., &local_err);
if (ret < 0) {
error_propagate(errp, local_err);
...
}
by
bs = bdrv_open(..., errp);
if (!bs) {
ret = -EINVAL;
...
}
Of course, there are only a few instances where the pattern is really
pure.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Its only caller is blk_new_open(), so we can just inline it there.
The bdrv_new_root() call is dropped in the process because we can just
let bdrv_open() create the BDS.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_next() users all leaked the BdrvNextIterator after completing
the iteration. Simply changing bdrv_next() to free the iterator before
returning NULL at the end of list doesn't work because some callers exit
the loop before looking at all BDSes.
This patch moves the BdrvNextIterator from the heap to the stack of
the caller and switches to a bdrv_first()/bdrv_next() interface for
initialising the iterator.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
In order to correctly check whether a given cluster is read as zero, we
don't only need to check whether bdrv_get_block_status_above() sets
BDRV_BLOCK_ZERO, but also if all sectors for the whole cluster have the
same status.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
We should check for (res & BDRV_BLOCK_ZERO) only. The situation when we
will have !(res & BDRV_BLOCK_DATA) and will not have BDRV_BLOCK_ZERO is
not possible for images with bdi.unallocated_blocks_are_zero == true.
For those images where it's false, however, it can happen and we must
not consider the data zeroed then or we would corrupt the image.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of propagating any change of a BDS's AioContext only to its file
and backing children and letting driver-specific code do the rest, just
propagate it to all and drop the thus superfluous implementations of
bdrv_{at,de}tach_aio_context() in Quorum, blkverify and VMDK.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch removes the remaining users of bs->blk, which will allow us
to have multiple BBs on top of a single BDS. In the meantime, all checks
that are currently in place to prevent the user from creating such
setups can be switched to bdrv_has_blk() instead of accessing BDS.blk.
Future patches can allow them and e.g. enable users to mirror to a block
device that already has a BlockBackend on it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
query-named-block-nodes should not return information that is related
to the attached BlockBackend rather than the node itself, so throttling
information needs to be removed from it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
We need to introduce a separate BdrvNextIterator struct that can keep
more state than just the current BDS in order to avoid using the bs->blk
pointer.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In many cases we just want to know whether a BDS has at least one BB
attached, without needing to know the exact BB that is attached. In
contrast to bs->blk, this is still a valid question when more than one
BB can be attached, so just answer it by checking the parents list.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since virtio-blk implements request merging itself these days, the only
remaining users are test cases for the function. That doesn't make the
function exactly useful any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Block jobs don't actually make use of the iostatus for their BDSes, but
they manage a separate block job iostatus. Still, they require that it
is enabled for the source BDS and they enable it automatically for the
target and set the error handling mode - which ends up never being used
by the job.
This patch removes all of the BDS iostatus handling from the block job,
which removes another few bs->blk accesses.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
When block job errors were introduced, we assigned the iostatus of the
target BDS "just in case". The field has never been accessible for the
user because the target isn't listed in query-block.
Before we can allow the user to have a second BlockBackend on the
target, we need to clean this up. If anything, we would want to set the
iostatus for the internal BB of the job (which we can always do later),
but certainly not for a separate BB which the job doesn't even use.
As a nice side effect, this gets us rid of another bs->blk use.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In order to get rid of bs->blk for bdrv_get_device_name() and
bdrv_get_device_or_node_name(), ask all parents for their name and
simply pick the first one.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
We want to get rid of BlockDriverState.blk in order to allow multiple
BlockBackends per BDS. Converting the device callbacks in block.c (which
assume a single BlockBackend) to per-child callbacks gets us rid of the
first few instances.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Checking whether there are throttled requests requires going to the
associated BlockBackend, which we want to avoid.
All users of bdrv_requests_pending() in block/io.c already call
bdrv_parent_drained_begin() first, which restarts all throttled
requests, so no throttled requests can be left here and this is removal
of dead code.
The remaining users (assertions during graph manipulation in block.c)
don't care about requests that are still queued in the BlockBackend and
haven't been issued for a BlockDriverState yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the throttling related part of the BDS life cycle management
to BlockBackend. The throttling group reference is now kept even when no
medium is inserted.
With this commit, throttling isn't disabled and then re-enabled any more
during graph reconfiguration. This fixes the temporary breakage of I/O
throttling when used with live snapshots or block jobs that manipulate
the graph.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
So far, bdrv_parent_drained_begin/end() was called for the duration of
the actual bdrv_drain() at the beginning of a drained section, but we
really should keep parents quiesced until the end of the drained
section.
This does not actually change behaviour at this point because the only
user of the .drained_begin/end BdrvChildRole callback is I/O throttling,
which already doesn't send any new requests after flushing its queue in
.drained_begin. The patch merely removes a trap for future users.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This removes the last part of I/O throttling from block/io.c and moves
it to the BlockBackend.
Instead of having knowledge about throttling inside io.c, we can call a
BdrvChild callback .drained_begin/end, which happens to drain the
throttled requests for BlockBackend parents.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
BlockBackends use it to get a back pointer from BdrvChild to
BlockBackend in any BdrvChildRole callbacks.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch changes where the throttling state is stored (used to be the
BlockDriverState, now it is the BlockBackend), but it doesn't actually
make it a BB level feature yet. For example, throttling is still
disabled when the BDS is detached from the BB.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
As a first step towards moving I/O throttling to the BlockBackend level,
this patch changes all pointers in struct ThrottleGroup from referencing
a BlockDriverState to referencing a BlockBackend.
This change is valid because we made sure that throttling can only be
enabled on BDSes which have a BB attached.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Some features, like I/O throttling, are implemented outside
block-backend.c, but still want to keep information in BlockBackend,
e.g. list entries that allow keeping a list of BlockBackends.
In order to avoid exposing the whole struct layout in the public header
file, this patch introduces an embedded public struct where such
information can be added and a pair of functions to convert between
BlockBackend and BlockBackendPublic.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
It was already true in principle that a throttled BDS always has a BB
attached, except that the order of operations while attaching or
detaching a BDS to/from a BB wasn't careful enough.
This commit breaks graph manipulations while I/O throttling is enabled.
It would have been possible to keep things working with some temporary
hacks, but quite cumbersome, so it's not worth the hassle. We'll fix
things again in a minute.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Move it to the actual users. There are still a few includes of
qemu/bswap.h in headers; removing them is left for future work.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJXNIcBAAoJEH8JsnLIjy/WiUQP/Rzfo8pe7TWA2InxdcDOPsx4
2/tHHJdVkffnNX5rdBvc0mOUZNJxej0NtJu2e63BB+ydYju//xw8gruKU7TR+Nd3
nPNSsqk80prK3RNgWu7qymBvIkHDDcDQhlp48HKq+dxrfConXtHmoXapGsqc0S47
xu03oC6WzSIyLf7TLytcjUmEprQSaCGOwsb/XaHAWL750fFAGcdy/K5PWBpUv6DN
T0jZ3u4UneE1jeabRmqAwjgDJXC9l6riH9fP/ZtYhgNlNj84zlMXajUHSULhGknP
cTGjwwg9tOvhcjTdhdRmWlvG1m0T77ZX3icfZLhcTdb/Uz68NXVqs8P25IGV9McD
DPrb3T/M8JUoqLXJxIpxUm2Levof5v0dUF1PHmN5bT7pshcqv/1J7v8Fdtf9l9mp
zI0+FK1TZ102C0H2F7AWYZSlo2EfNUSd02QQx6MbfDokDIlIxY+EgP1/Es5XlkqC
wc7HrJvq+uix2zXw9bn9Vg9p/nDuxlRx+ppRRarNNRonaqTrx/1qAaas4bsqc9Gz
H6gxw7BHybm0TZFdHqAdIonpesecYw6yWUXT/mQehbfphsmQmu/d2HvF2C9uUm4X
O0JduBlKOTm2hMcg5qL6Gko8WaQIctdCJH/1Onts92cZnm8Vr/9zcmMgwGoCd7sE
+t6Yg0jqpTUJwhZhIuCw
=NbjJ
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches
# gpg: Signature made Thu 12 May 2016 14:37:05 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
* remotes/kevin/tags/for-upstream: (69 commits)
qemu-iotests: iotests: fail hard if not run via "check"
block: enable testing of LUKS driver with block I/O tests
block: add support for encryption secrets in block I/O tests
block: add support for --image-opts in block I/O tests
qemu-io: Add 'write -z -u' to test MAY_UNMAP flag
qemu-io: Add 'write -f' to test FUA flag
qemu-io: Allow unaligned access by default
qemu-io: Use bool for command line flags
qemu-io: Make 'open' subcommand more like command line
qemu-io: Add missing option documentation
qmp: add monitor command to add/remove a child
quorum: implement bdrv_add_child() and bdrv_del_child()
Add new block driver interface to add/delete a BDS's child
qemu-img: check block status of backing file when converting.
iotests: fix the redirection order in 083
block: Inactivate all children
block: Drop superfluous invalidating bs->file from drivers
block: Invalidate all children
nbd: Simplify client FUA handling
block: Honor BDRV_REQ_FUA during write_zeroes
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Now they are invalidated by the block layer, so it's not necessary to
do this in block drivers' implementations of .bdrv_invalidate_cache.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that the block layer honors per-bds FUA support, we don't
have to duplicate the fallback flush at the NBD layer. The
static function nbd_co_writev_flags() is no longer needed, and
the driver can just directly use nbd_client_co_writev().
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The block layer has a couple of cases where it can lose
Force Unit Access semantics when writing a large block of
zeroes, such that the request returns before the zeroes
have been guaranteed to land on underlying media.
SCSI does not support FUA during WRITESAME(10/16); FUA is only
supported if it falls back to WRITE(10/16). But where the
underlying device is new enough to not need a fallback, it
means that any upper layer request with FUA semantics was
silently ignoring BDRV_REQ_FUA.
Conversely, NBD has situations where it can support FUA but not
ZERO_WRITE; when that happens, the generic block layer fallback
to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu
2.6) was losing the FUA flag.
The problem of losing flags unrelated to ZERO_WRITE has been
latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but
back then, it did not matter because there was no FUA flag. It
became observable when commit 93f5e6d8 paved the way for flags
that can impact correctness, when we should have been using
bdrv_co_writev_flags() with modified flags. Compare to commit
9eeb6dd, which got flag manipulation right in
bdrv_co_do_zero_pwritev().
Symptoms: I tested with qemu-io with default writethrough cache
(which is supposed to use FUA semantics on every write), and
targetted an NBD client connected to a server that intentionally
did not advertise NBD_FLAG_SEND_FUA. When doing 'write 0 512',
the NBD client sent two operations (NBD_CMD_WRITE then
NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing
'write -z 0 512', the NBD client sent only NBD_CMD_WRITE.
The fix is do to a cleanup bdrv_co_flush() at the end of the
operation if any step in the middle relied on a BDS that does
not natively support FUA for that step (note that we don't
need to flush after every operation, if the operation is broken
into chunks based on bounce-buffer sizing). Each BDS gains a
new flag .supported_zero_flags, which parallels the use of
.supported_write_flags but only when accessing a zero write
operation (the flags MUST be different, because of SCSI having
different semantics based on WRITE vs. WRITESAME; and also
because BDRV_REQ_MAY_UNMAP only makes sense on zero writes).
Also fix some documentation to describe -ENOTSUP semantics,
particularly since iscsi depends on those semantics.
Down the road, we may want to add a driver where its
.bdrv_co_pwritev() honors all three of BDRV_REQ_FUA,
BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise
this via bs->supported_write_flags for blocks opened by that
driver; such a driver should NOT supply .bdrv_co_write_zeroes
nor .supported_zero_flags. But none of the drivers touched
in this patch want to do that (the act of writing zeroes is
different enough from normal writes to deserve a second
callback).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Pre-patch, .supported_write_flags lives at the driver level, which
means we are blindly declaring that all block devices using a
given driver will either equally support FUA, or that we need a
fallback at the block layer. But there are drivers where FUA
support is a per-block decision: the NBD block driver is dependent
on the remote server advertising NBD_FLAG_SEND_FUA (and has
fallback code to duplicate the flush that the block layer would do
if NBD had not set .supported_write_flags); and the iscsi block
driver is dependent on the mode sense bits advertised by the
underlying device (and is currently silently ignoring FUA requests
if the underlying device does not support FUA).
The fix is to make supported flags as a per-BDS option, set during
.bdrv_open(). This patch moves the variable and fixes NBD and iscsi
to set it only conditionally; later patches will then further
simplify the NBD driver to quit duplicating work done at the block
layer, as well as tackle the fact that SCSI does not support FUA
semantics on WRITESAME(10/16) but only on WRITE(10/16).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is a possibility that qcow2_co_write_zeroes() will be called
with the partial block. This could be synthetically triggered with
qemu-io -c "write -z 32k 4k"
and can happen in the real life in qemu-nbd. The latter happens under
the following conditions:
(1) qemu-nbd is started with --detect-zeroes=on and is connected to the
kernel NBD client
(2) third party program opens kernel NBD device with O_DIRECT
(3) third party program performs write operation with memory buffer
not aligned to the page
In this case qcow2_co_write_zeroes() is unable to perform the operation
and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
switches to non-optimized version and writes real zeroes to the disk.
The patch creates a shortcut. If the block is read as zeroes, f.e. if
it is unallocated, the request is extended to cover full block.
User-visible situation with this block is not changed. Before the patch
the block is filled in the image with real zeroes. After that patch the
block is marked as zeroed in metadata. Thus any subsequent changes in
backing store chain are not affected.
Kevin, thank you for a cool suggestion.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that there are no remaining clients, we can drop the
sector-based blk_read(), blk_write(), blk_aio_readv(), and
blk_aio_writev(). Sadly, there are still remaining
sector-based interfaces, such as blk_*discard(), or
blk_write_compressed(); those will have to wait for another
day.
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>