Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1453832250-766-22-git-send-email-peter.maydell@linaro.org
The PCI spec recommends devices use additional alignment for MSI-X
data structures to allow software to map them to separate processor
pages. One advantage of doing this is that we can emulate those data
structures without a significant performance impact to the operation
of the device. Some devices fail to implement that suggestion and
assigned device performance suffers.
One such case of this is a Mellanox MT27500 series, ConnectX-3 VF,
where the MSI-X vector table and PBA are aligned on separate 4K
pages. If PBA emulation is enabled, performance suffers. It's not
clear how much value we get from PBA emulation, but the solution here
is to only lazily enable the emulated PBA when a masked MSI-X vector
fires. We then attempt to more aggresively disable the PBA memory
region any time a vector is unmasked. The expectation is then that
a typical VM will run entirely with PBA emulation disabled, and only
when used is that emulation re-enabled.
Reported-by: Shyam Kaushik <shyam.kaushik@gmail.com>
Tested-by: Shyam Kaushik <shyam.kaushik@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For quirks that support the full PCIe extended config space, limit the
quirk to only the size of config space available through vfio. This
allows host systems with broken MMCONFIG regions to still make use of
these quirks without generating bad address faults trying to access
beyond the end of config space exposed through vfio. This may expose
direct access to the mirror of extended config space, only trapping
the sub-range of standard config space, but allowing this makes the
quirk, and thus the device, functional. We expect that only device
specific accesses make use of the mirror, not general extended PCI
capability accesses, so any virtualization in this space is likely
unnecessary anyway, and the device is still IOMMU isolated, so it
should only be able to hurt itself through any bogus configurations
enabled by this space.
Link: https://www.redhat.com/archives/vfio-users/2015-November/msg00192.html
Reported-by: Ronnie Swanink <ronnie@ronnieswanink.nl>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
This commit only touches allocations with size arguments of the form
sizeof(T). Same Coccinelle semantic patch as in commit b45c03f.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When we have a PCIe VM, such as Q35, guests start to care more about
valid configurations of devices relative to the VM view of the PCI
topology. Windows will error with a Code 10 for an assigned device if
a PCIe capability is found for a device on a conventional bus. We
also have the possibility of IOMMUs, like VT-d, where the where the
guest may be acutely aware of valid express capabilities on physical
hardware.
Some devices, like tg3 are adversely affected by this due to driver
dependencies on the PCIe capability. The only solution for such
devices is to attach them to an express capable bus in the VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In-kernel ITS emulation on ARM64 will require to supply requester IDs.
These IDs can now be retrieved from the device pointer using new
pci_requester_id() function.
This patch adds pci_dev pointer to KVM GSI routing functions and makes
callers passing it.
x86 architecture does not use requester IDs, but hw/i386/kvm/pci-assign.c
also made passing PCI device pointer instead of NULL for consistency with
the rest of the code.
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Message-Id: <ce081423ba2394a4efc30f30708fca07656bc500.1444916432.git.p.fedin@samsung.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At present the memory listener used by vfio to keep host IOMMU mappings
in sync with the guest memory image assumes that if a guest IOMMU
appears, then it has no existing mappings.
This may not be true if a VFIO device is hotplugged onto a guest bus
which didn't previously include a VFIO device, and which has existing
guest IOMMU mappings.
Therefore, use the memory_region_register_iommu_notifier_replay()
function in order to fix this case, replaying existing guest IOMMU
mappings, bringing the host IOMMU into sync with the guest IOMMU.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Depending on the host IOMMU type we determine and record the available page
sizes for IOMMU translation. We'll need this for other validation in
future patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The current vfio core code assumes that the host IOMMU is capable of
mapping any IOVA the guest wants to use to where we need. However, real
IOMMUs generally only support translating a certain range of IOVAs (the
"DMA window") not a full 64-bit address space.
The common x86 IOMMUs support a wide enough range that guests are very
unlikely to go beyond it in practice, however the IOMMU used on IBM Power
machines - in the default configuration - supports only a much more limited
IOVA range, usually 0..2GiB.
If the guest attempts to set up an IOVA range that the host IOMMU can't
map, qemu won't report an error until it actually attempts to map a bad
IOVA. If guest RAM is being mapped directly into the IOMMU (i.e. no guest
visible IOMMU) then this will show up very quickly. If there is a guest
visible IOMMU, however, the problem might not show up until much later when
the guest actually attempt to DMA with an IOVA the host can't handle.
This patch adds a test so that we will detect earlier if the guest is
attempting to use IOVA ranges that the host IOMMU won't be able to deal
with.
For now, we assume that "Type1" (x86) IOMMUs can support any IOVA, this is
incorrect, but no worse than what we have already. We can't do better for
now because the Type1 kernel interface doesn't tell us what IOVA range the
IOMMU actually supports.
For the Power "sPAPR TCE" IOMMU, however, we can retrieve the supported
IOVA range and validate guest IOVA ranges against it, and this patch does
so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a DMA mapping operation fails in vfio_listener_region_add() it
checks to see if we've already completed initial setup of the
container. If so it reports an error so the setup code can fail
gracefully, otherwise throws a hw_error().
There are other potential failure cases in vfio_listener_region_add()
which could benefit from the same logic, so move it to its own
fail: block. Later patches can use this to extend other failure cases
to fail as gracefully as possible under the circumstances.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the VFIOContainer iommu_data field contains a union with
different information for different host iommu types. However:
* It only actually contains information for the x86-like "Type1" iommu
* Because we have a common listener the Type1 fields are actually used
on all IOMMU types, including the SPAPR TCE type as well
In fact we now have a general structure for the listener which is unlikely
to ever need per-iommu-type information, so this patch removes the union.
In a similar way we can unify the setup of the vfio memory listener in
vfio_connect_container() that is currently split across a switch on iommu
type, but is effectively the same in both cases.
The iommu_data.release pointer was only needed as a cleanup function
which would handle potentially different data in the union. With the
union gone, it too can be removed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In irqfd mode, current code attempts to set a resamplefd whatever
the type of the IRQ. For an edge-sensitive IRQ this attempt fails
and as a consequence, the whole irqfd setup fails and we fall back
to the slow mode. This patch bypasses the resamplefd setting for
non level-sentive IRQs.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
unmask EventNotifier might not be initialized in case of edge
sensitive irq. Using EventNotifier pointers make life simpler to
handle the edge-sensitive irqfd setup.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With current implementation, eventfd VFIO signaling is first set up and
then irqfd is setup, if supported and allowed.
This start sequence causes several issues with IRQ forwarding setup
which, if supported, is transparently attempted on irqfd setup:
IRQ forwarding setup is likely to fail if the IRQ is detected as under
injection into the guest (active at irqchip level or VFIO masked).
This currently always happens because the current sequence explicitly
VFIO-masks the IRQ before setting irqfd.
Even if that masking were removed, we couldn't prevent the case where
the IRQ is under injection into the guest.
So the simpler solution is to remove this 2-step startup and directly
attempt irqfd setup. This is what this patch does.
Also in case the eventfd setup fails, there is no reason to go farther:
let's abort.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Specifying an emulated PCI vendor/device ID can be useful for testing
various quirk paths, even though the behavior and functionality of
the device with bogus IDs is fully unsupportable. We need to use a
uint32_t for the vendor/device IDs, even though the registers
themselves are only 16-bit in order to be able to determine whether
the value is valid and user set.
The same support is added for subsystem vendor/device ID, though these
have the possibility of being useful and supported for more than a
testing tool. An emulated platform might want to impose their own
subsystem IDs or at least hide the physical subsystem ID. Windows
guests will often reinstall drivers due to a change in subsystem IDs,
something that VM users may want to avoid. Of course careful
attention would be required to ensure that guest drivers do not rely
on the subsystem ID as a basis for device driver quirks.
All of these options are added using the standard experimental option
prefix and should not be considered stable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify access to commonly referenced PCI vendor and device ID by
caching it on the VFIOPCIDevice struct.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is just another quirk, for reset rather than affecting memory
regions. Move it to our new quirks file.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Config windows make use of an address register and a data register.
In VGA cards, these are often used to provide real mode code in the
BIOS an easy way to access MMIO registers since the window often
resides in an I/O port register. When the MMIO register has a mirror
of PCI config space, we need to trap those accesses and redirect them
to emulated config space.
The previous version of this functionality made use of a single
MemoryRegion and single match address. This version uses separate
MemoryRegions for each of the address and data registers and allows
for multiple match addresses. This is useful for Nvidia cards which
have two ranges which index into PCI config space.
The previous implementation is left for the follow-on patch for a more
reviewable diff.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Another rework of this quirk, this time to update to the new quirk
structure. We can handle the address and data registers with
separate MemoryRegions and a quirk specific data structure, making the
code much more understandable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The Nvidia 0x3d0 quirk makes use of a two separate registers and gives
us our first chance to make use of separate memory regions for each to
simplify the code a bit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is an easy quirk that really doesn't need a data structure if
its own. We can pass vdev as the opaque data and access to the
MemoryRegion isn't required.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIOQuirk hosts a single memory region and a fixed set of data fields
that try to handle all the quirk cases, but end up making those that
don't exactly match really confusing. This patch introduces a struct
intended to provide more flexibility and simpler code. VFIOQuirk is
stripped to its basics, an opaque data pointer for quirk specific
data and a pointer to an array of MemoryRegions with a counter. This
still allows us to have common teardown routines, but adds much
greater flexibility to support multiple memory regions and quirk
specific data structures that are easier to maintain. The existing
VFIOQuirk is transformed into VFIOLegacyQuirk, which further patches
will eliminate entirely.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Create a vendor:device ID helper that we'll also use as we rework the
rest of the quirks. Re-reading the config entries, even if we get
more blacklist entries, is trivial overhead and only incurred during
device setup. There's no need to typedef the blacklist structure,
it's a static private data type used once. The elements get bumped
up to uint32_t to avoid future maintenance issues if PCI_ANY_ID gets
used for a blacklist entry (avoiding an actual hardware match). Our
test loop is also crying out to be simplified as a for loop.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The default should be to allow mmap and new drivers shouldn't need to
expose an option or set it to other than the allocation default in
their initfn. Take advantage of the experimental flag to change this
option to the correct polarity.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tracing is more effective when we can completely disable all KVM
bypass paths. Make these runtime rather than build-time configurable.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This allows vfio_msi* tracing. The MSI/X interrupt tracing is also
pulled out of #ifdef DEBUG_VFIO to avoid a recompile for tracing this
path. A few cycles to read the message is hardly anything if we're
already in QEMU.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Rename functions and tracing callbacks so that we can trace vfio_intx*
to see all the INTx related activities.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With the addition of the Chelsio quirk we have an error path out of
vfio_early_setup_msix() that doesn't free the allocated VFIOMSIXInfo
struct. This doesn't introduce a leak as it still gets freed in the
vfio_put_device() path, but it's complicated and sloppy to rely on
that. Restructure to free the allocated data on error and only link
it into the vdev on success.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
There's quite a bit of cleanup that can be done to the RTL8168 quirk,
as well as the tracing to prevent a spew of uninteresting accesses
for anything else the driver might choose to use the window registers
for besides the MSI-X table. There should be no functional change,
but it's now possible to get compact and useful traces by enabling
vfio_rtl8168_quirk*, ex:
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f000
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f000
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0xfee0100c
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f004
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f004
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x0
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f008
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f008
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x49b1
vfio_rtl8168_quirk_write 0000:04:00.0 [address]: 0x1f00c
vfio_rtl8168_quirk_read 0000:04:00.0 [address]: 0x8001f00c
vfio_rtl8168_quirk_read 0000:04:00.0 [data]: 0x0
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Minor cleanup.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
A number of files were including dirent.h but not using any
of the functions it provides
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Many source files have doubled words (eg "the the", "to to",
and so on). Most of these can simply be removed, but a couple
were actual mis-spellings (eg "to to" instead of "to do").
There was even one triple word score "to to to" :-)
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
bootindex was incorrectly changed to a device Property during the
platform code split, resulting in it no longer working. Remove it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-stable@nongnu.org # v2.3+
The RTL8168 quirk correctly describes using bit 31 as a signal to
mark a latch/completion, but the code mistakenly uses bit 28. This
causes the Realtek driver to spin on this register for quite a while,
20k cycles on Windows 7 v7.092 driver. Then it gets frustrated and
tries to set the bit itself and spins for another 20k cycles. For
some this still results in a working driver, for others not. About
the only thing the code really does in its current form is protect
the guest from sneaking in writes to the real hardware MSI-X table.
The fix is obviously to use bit 31 as we document that we should.
The other problem doesn't seem to affect current drivers as nobody
seems to use these window registers for writes to the MSI-X table, but
we need to use the stored data when a write is triggered, not the
value of the current write, which only provides the offset.
Note that only the Windows drivers from Realtek seem to use these
registers, the Microsoft drivers provided with Windows 8.1 do not
access them, nor do Linux in-kernel drivers.
Link: https://bugs.launchpad.net/qemu/+bug/1384892
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-stable@nongnu.org # v2.1+
Fix pba_offset initialization value for Chelsio T5 Virtual Function
device. The T5 hardware has a bug in it where it reports a Pending Interrupt
Bit Array Offset of 0x8000 for its SR-IOV Virtual Functions instead
of the 0x1000 that the hardware actually uses internally. As the hardware
doesn't return the correct pba_offset value, add a quirk to instead
return a hardcoded value of 0x1000 when a Chelsio T5 VF device is
detected.
This bug has been fixed in the Chelsio's next chip series T6 but there are
no plans to respin the T5 ASIC for this bug. It is just documented in the
T5 Errata and left it at that.
Signed-off-by: Gabriel Laupre <glaupre@chelsio.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
On systems with guest visible IOMMU, adding a new memory region onto
PCI bus calls vfio_listener_region_add() for every DMA window. This
installs a notifier for IOMMU memory regions. The notifier is supposed
to be removed vfio_listener_region_del(), however in the case of mixed
PHB (emulated + VFIO devices) when last VFIO device is unplugged and
container gets destroyed, all existing DMA windows stay alive altogether
with the notifiers which are on the linked list which head was in
the destroyed container.
This unregisters IOMMU memory region notifier when a container is
destroyed.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch aims at optimizing IRQ handling using irqfd framework.
Instead of handling the eventfds on user-side they are handled on
kernel side using
- the KVM irqfd framework,
- the VFIO driver virqfd framework.
the virtual IRQ completion is trapped at interrupt controller
This removes the need for fast/slow path swap.
Overall this brings significant performance improvements.
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Vikram Sethi <vikrams@codeaurora.org>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Anticipating for the introduction of new add/remove functions taking
a qemu_irq parameter, let's rename existing ones with a gsi suffix.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Vikram Sethi <vikrams@codeaurora.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is system level code, and should only depend on the host page
size, not the target page size.
Note that HOST_PAGE_SIZE is misleadingly lead and is really aligning
to both host and target page size. Hence it's replacement with
REAL_HOST_PAGE_SIZE.
Signed-off-by: Peter Crosthwaite <crosthwaite.peter@gmail.com>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
size_t is an unsigned type, thus the error case is never reached in
the below call to pread. If bytes is negative, it will be seen as
a very high positive value.
Spotted by Coverity.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Include linux/vfio.h after sys/ioctl.h, just like in hw/vfio/common.c.
Signed-off-by: Leon Alrae <leon.alrae@imgtec.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-id: 1434544500-22405-1-git-send-email-leon.alrae@imgtec.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
g_malloc0_n() is introduced since glib-2.24 while QEMU currently
requires glib-2.22. This may cause a link error on some distributions.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The platform device class has become abstract. This patch introduces
a calxeda xgmac device that derives from it.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the code requested to assign interrupts to
a guest. The interrupts are mediated through user handled
eventfds only.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Vikram Sethi <vikrams@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>