this patch tries to optimize zero write requests
by automatically using bdrv_write_zeroes if it is
supported by the format.
This significantly speeds up file system initialization and
should speed zero write test used to test backend storage
performance.
I ran the following 2 tests on my internal SSD with a
50G QCOW2 container and on an attached iSCSI storage.
a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
QCOW2 [off] [on] [unmap]
-----
runtime: 14secs 1.1secs 1.1secs
filesize: 937M 18M 18M
iSCSI [off] [on] [unmap]
----
runtime: 9.3s 0.9s 0.9s
b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
QCOW2 [off] [on] [unmap]
-----
runtime: 246secs 18secs 18secs
filesize: 51G 192K 192K
throughput: 203M/s 2.3G/s 2.3G/s
iSCSI* [off] [on] [unmap]
----
runtime: 8mins 45secs 33secs
throughput: 106M/s 1.2G/s 1.6G/s
allocated: 100% 100% 0%
* The storage was connected via an 1Gbit interface.
It seems to internally handle writing zeroes
via WRITESAME16 very fast.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the filename given to bdrv_open() is prefixed with "json:", parse the
rest as a JSON object and merge the result into the options QDict. If
there are conflicts, the options QDict takes precedence.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_is_allocated() shouldn't return true for sectors that are
unallocated, but after the end of a short backing file, even though
such sectors are (correctly) marked as containing zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The immediately visible effect of this patch is that it fixes committing
a temporary snapshot to its backing file. Previously, it would fail with
a "permission denied" error because bdrv_inherited_flags() forced the
backing file to be read-only, ignoring the r/w reopen of bdrv_commit().
The bigger problem this revealed is that the original open flags must
actually only be applied to the temporary snapshot, and the original
image file must be treated as a backing file of the temporary snapshot
and get the right flags for that.
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use the same function as bdrv_open() for determining what the right
flags for bs->file are. Without doing this, a reopen means that
bs->file loses BDRV_O_CACHE_WB or BDRV_O_UNMAP if bs doesn't have it as
well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
This reverts commit 3a389e7926. The commit
was wrong and what it tried to fix just works today without any change.
What the commit tried to fix:
When creating live snapshots, the new image file is opened with
BDRV_O_NO_BACKING because the whole backing chain is already opened.
It is then appended to the chain using bdrv_append(). The result of
this was that the image had a backing file, but BDRV_O_NO_BACKING
was still set. This is obviously inconsistent.
There used to be some places in qemu that closed and image and then
opened it again, with its old flags (a bdrv_open()/close() sequence
involves reopening the whole backing file chain, too). In this case
the BDRV_O_NO_BACKING flag meant that the backing chain wasn't
reopened and only the top layer was left.
(Most, but not all of these places are replaced by bdrv_reopen()
today, which doesn't touch the backing files at all.)
Other places that looked at bs->open_flags weren't interested in
BDRV_O_NO_BACKING, so no breakage there.
What it actually did:
The commit moved the BDRV_O_NO_BACKING away to the backing file.
Because the bdrv_open()/close() sequences only looked at the flags
of the top level BlockDriverState and used it for the whole chain,
the flag didn't hurt there any more. Obviously, it is still
inconsistent because the backing file may have another backing file,
but without practical impact.
At the same time, it swapped all other flags. This is practically
irrelevant as long as live snapshots only allow opening the new
layer with the same flags as the old top layer. It still doesn't
make any sense, and it is a time bomb that explodes as soon as the
flags can differ.
bdrv_append_temp_snapshot() is such a case: It adds the new flag
BDRV_O_TEMPORARY for the temporary snapshot. The swapping of commit
3a389e79 results in the following nonsensical configuration:
bs->open_flags: BDRV_O_TEMPORARY cleared
bs->file->open_flags: BDRV_O_TEMPORARY set
bs->backing_hd->open_flags: BDRV_O_TEMPORARY set
bs->backing_hd->file->open_flags: BDRV_O_TEMPORARY cleared
We're still lucky because the format layer ignores the flag and the
protocol layer happens to get the right value, but sooner or later
this is bound to go wrong...
What the right fix would have been:
Simply clear the BDRV_O_NO_BACKING flag when the BlockDriverState is
appended to an existing backing file chain, because now it does have
a backing file.
Commit 4ddc07ca already implemented this silently in bdrv_append(),
so we don't have to come up with a new fix.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of having unlink() calls in the generic block layer, where we
aren't even guarateed to have a file name, move them to those block
drivers that are actually used and that always have a filename. Gets us
rid of some #ifdefs as well.
The patch also converts bs->is_temporary to a new BDRV_O_TEMPORARY open
flag so that it is inherited in the protocol layer and the raw-posix and
raw-win32 drivers can unlink the file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Copy on Read makes sense on the format level where backing files are
implemented, but it's not required on the protocol level. While it
shouldn't actively break anything to have COR enabled on both layers,
needless serialisation and allocation checks may impact performance.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of manipulation flags inline, move the derivation of the flags
of a backing file into a new function next to the existing functions
that derive flags for bs->file and for the block driver open function.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Instead of having bdrv_open_flags() as a function that creates flags for
several unrelated places and then adding open-coded flags on top, create
a new function that derives the flags for bs->file from the flags for bs.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Some block drivers have multiple BlockDriver instances with identical
format_name fields (e.g. gluster, nbd).
Both qemu-img and qemu will use bdrv_iterate_format() to list the
supported formats when a help option is invoked. As protocols and
formats may register multiple drivers, redundant listings of formats
occur (e.g., "Supported formats: ... gluster gluster gluster gluster ...
").
Since the list of driver formats will be small, this performs a simple
linear search on format_name, and ignores any duplicates.
The end result change is that the iterator will no longer receive
duplicate string names, nor will it receive NULL pointers.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
error_is_set(&var) is the same as var != NULL, but it takes
whole-program analysis to figure that out. Unnecessarily hard for
optimizers, static checkers, and human readers. Commit 84d18f0 dumbed
it down to obvious, but a few more have crept in since, and
documentation was overlooked. Dumb these down, too.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In 1.7.1 qcow2_create2 reopen the file for flushing without the BDRV_O_NO_BACKING
flags.
As a consequence the code would recursively open the whole backing chain.
These three stack arrays would pile up through the recursion and lead to a coroutine
stack overflow.
Convert these array to malloced buffers in order to streamline the coroutine
footprint.
Symptoms where freezes or segfaults on production machines while taking QMP externals
snapshots. The overflow disturbed coroutine switching.
[Resolved conflicts on qemu.git/master since the patch was against v1.7.1
--Stefan]
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Since commit f298d071, block devices added with blockdev-add don't have
a QemuOpts around in dinfo->opts. Consequently, we can't rely any more
on QemuOpts catching duplicate IDs for block devices.
This patch adds a new check for duplicate IDs to bdrv_new(), and moves
the existing check that the ID isn't already taken for a node-name there
as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This patch adds an errp parameter to bdrv_new() and updates all its
callers. The next patches will make use of this in order to check for
duplicate IDs. Most of the callers know that their ID is fine, so they
can simply assert that there is no error.
Behaviour doesn't change with this patch yet as bdrv_new() doesn't
actually assign errors to errp.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
bdrv_getlength could fail, check the return value before using it.
Return NULL and set errno if it fails. Callers are updated to handle
the error case.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Insanely large requests could cause an integer overflow in
bdrv_rw_co() while converting sectors to bytes. This patch catches the
problem and returns an error (if we hadn't overflown the integer here,
bdrv_check_byte_request() would have rejected the request, so we're not
breaking anything that was supposed to work before).
We actually do have a test case that triggers behaviour where we
accidentally let such a request pass, so that it would return success,
but read 0 bytes instead of the requested 4 GB. It fails now like it
should.
If the vdi block driver wants to be able to deal with huge images, it
can't read the whole block bitmap at once into memory like it does
today, but needs to use a metadata cache like qcow2 does.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 8f4754ed intended to protect against integer overflow bugs in
block drivers by making sure that a single request that is passed to
drivers is no longer than INT_MAX bytes.
However, meanwhile there are some callers that don't use that code path
any more but call bdrv_check_byte_request() directy, so let's add a
check there as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Since commit 9fd3171a, BDRV_O_SNAPSHOT uses an option QDict to specify
the originally requested image as the backing file of the newly created
temporary snapshot. This means that the filename is stored in
"file.filename", which is an option that is not parsed for protocol
names. Therefore things like -drive file=nbd:localhost:10809 were
broken because it looked for a local file with the literal name
'nbd:localhost:10809'.
This patch changes the way BDRV_O_SNAPSHOT works once again. We now open
the originally requested image as normal, and then do a similar
operation as for live snapshots to put the temporary snapshot on top.
This way, both driver specific options and parsed filenames work.
As a nice side effect, this results in code movement to factor
bdrv_append_temp_snapshot() out. This is a good preparation for moving
its call to drive_init() and friends eventually.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
When using the QDict option 'filename', it is supposed to be interpreted
literally. The code did correctly avoid guessing the protocol from any
string before the first colon, but it still called bdrv_parse_filename()
which would, for example, incorrectly remove a 'file:' prefix in the
raw-posix driver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Limiting the size of a single request to INT_MAX not only fixes a
direct integer overflow in bdrv_check_request() (which would only
trigger bad behaviour with ridiculously huge images, as in close to
2^64 bytes), but can also prevent overflows in all block drivers.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If it returns an error, the migrated VM will not be started, but qemu
exits with an error message.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Opening an encrypted image takes an additional step: setting the key.
Between open and the key set, the image must not be used.
We have some protection against accidental use in place: you can't
unpause a guest while we're missing keys. You can, however, hot-plug
block devices lacking keys into a running guest just fine, or insert
media lacking keys. In the latter case, notifying the guest of the
insert is delayed until the key is set, which may suffice to protect
at least some guests in common usage.
This patch makes the protection apply in more cases, in a rather
heavy-handed way: it doesn't let you open encrypted images unless
we're in a paused state.
It doesn't extend the protection to users other than the guest (block
jobs?). Use of runstate_check() from block.c is disgusting. Best I
can do right now.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If the image file cannot be opened and was created as a temporary file,
it should be deleted; thus, in this case, we should jump to the
"unlink_and_fail" label and not just to "fail".
Reported-by: Benoît Canet <benoit@irqsave.net>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch keep the recursive way of doing things but simplify it by giving
two responsabilities to all block filters implementors.
They will need to do two things:
-Set the is_filter field of their block driver to true.
-Implement the bdrv_recurse_is_first_non_filter method of their block driver like
it is done on the Quorum block driver. (block/quorum.c)
[Paolo Bonzini <pbonzini@redhat.com> pointed out that this patch changes
the semantics of blkverify, which now recurses down both bs->file and
s->test_file.
-- Stefan]
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently, bdrv_debug_resume() requires every bs->drv in the BDS stack
to be NULL until a bs->drv with an implementation of bdrv_debug_resume()
is found. For a normal function, this would be fine, but this is a
function for debugging purposes and should therefore allow intermediate
BDS not to have a driver (i.e., be "ejected"). Otherwise, it is hard to
debug such situations.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
After migration has completed, we call bdrv_invalidate_cache() so that
drivers which cache some data drop their stale copy of the data and
reread it from the image file to get a new version of data that the
source modified while the migration was running.
Reloading metadata from the image file is useless, though, if the size
of the image file stays stale (this is a value that is cached for all
image formats in block.c). Reads from (meta)data after the old EOF
return only zeroes, causing image corruption.
We need to update bs->total_sectors in all layers that could potentially
have changed their size (i.e. backing files are not a concern - if they
are changed, we're in bigger trouble)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Using an invalid option for a block device that is opened with
BDRV_O_PROTOCOL led to drv = NULL, and when trying to include the driver
name in the error message, qemu dereferenced it:
$ x86_64-softmmu/qemu-system-x86_64 -drive file=/tmp/test.qcow2,file.foo=bar
Segmentation fault (core dumped)
With this patch applied, the expected error message is printed:
$ x86_64-softmmu/qemu-system-x86_64 -drive file=/tmp/test.qcow2,file.foo=bar
qemu-system-x86_64: -drive file=/tmp/test.qcow2,file.foo=bar: could
not open disk image /tmp/test.qcow2: Block protocol 'file' doesn't
support the option 'foo'
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Currently, bdrv_file_open() always removes the "filename" option from
the options QDict after bdrv_parse_filename() has been (successfully)
called. However, for drivers with bdrv_needs_filename, it makes more
sense for bdrv_parse_filename() to overwrite the "filename" option and
for bdrv_file_open() to fetch the filename from there.
Since there currently are no drivers that implement
bdrv_parse_filename() and have bdrv_needs_filename set, this does not
change current behavior.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Moving only the node_name one field could lead to some inconsitencies where a
node_name was defined on a bs which was not registered in the graph node list.
bdrv_swap between a named node bs and a non named node bs would lead to this.
bdrv_make_anon would then crash because it would try to remove the bs from the
graph node list while it is not in it.
This patch remove named node bses from the graph node list before doing the swap
then insert them back.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For sg backends, bs->request_alignment is meaningless and may be 0.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
If TMPDIR is not specified, the default was to use /tmp for the working
copy of the block devices. Update this to /var/tmp instead, so systems
using tmp-on-tmpfs don't end up inadvertently using RAM for the block
device.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This option is now unnecessary since specifying BDRV_O_PROTOCOL as flag
will do exactly the same.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fail and success paths of bdrv_file_open() may be further shortened
by reusing code already existent in bdrv_open(). This includes
bdrv_file_open() not taking the reference to options which allows the
removal of QDECREF(options) in that function.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fail paths of bdrv_file_open() and bdrv_open() naturally exhibit
similarities, thus it is possible to reuse the one from bdrv_open() and
shorten the one in bdrv_file_open() accordingly.
Also, setting bs->options in bdrv_file_open() is not necessary if it is
already done in bdrv_open().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Change bdrv_file_open() to take a simple pointer to an already existing
BDS instead of an indirect one. The BDS will be created in bdrv_open()
if necessary.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Remove the reference parameter and the related handling code from
bdrv_file_open(), since it exists in bdrv_open() now as well.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add the bdrv_open() option BDRV_O_PROTOCOL which results in passing the
call to bdrv_file_open(). Additionally, make bdrv_file_open() static and
therefore bdrv_open() the only way to call it.
Consequently, all existing calls to bdrv_file_open() have to be adjusted
to use bdrv_open() with the BDRV_O_PROTOCOL flag instead.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow bdrv_open() to handle references to existing block devices just as
bdrv_file_open() is already capable of.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make bdrv_open() take a pointer to a BDS pointer, similarly to
bdrv_file_open(). If a pointer to a NULL pointer is given, bdrv_open()
will create a new BDS with an empty name; if the BDS pointer is not
NULL, that existing BDS will be reused (in the same way as bdrv_open()
already did).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Consider top level BlockDriverStates as well.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Tested-by: Benoit Canet <benoit@irqsave.net>
error_is_set(&var) is the same as var != NULL, but it takes
whole-program analysis to figure that out. Unnecessarily hard for
optimizers, static checkers, and human readers. Dumb it down to
obvious.
Gets rid of several dozen Coverity false positives.
Note that the obvious form is already used in many places.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Since we introduced node_name for named bs of the graph modify the opening by
reference to use it as a fallback.
This patch also enforce the separation of the device id and graph node
namespaces.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The following patch will reuse bdrv_lookup_bs in order to open images by
references so the rules of usage of bdrv_lookup_bs must be relaxed a bit.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On 32 bit hosts, size_t is too small for align as the bitmask
~(align - 1) will zero out the higher 32 bits of the offset.
While at it, change the local overlap_bytes variable to unsigned to
match the field in BdrvTrackedRequest.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>