Commit Graph

479 Commits

Author SHA1 Message Date
Alex Williamson
0dd0e297f0 vfio-pci: Disable binding to PFs with SR-IOV enabled
We expect to receive PFs with SR-IOV disabled, however some host
drivers leave SR-IOV enabled at unbind.  This puts us in a state where
we can potentially assign both the PF and the VF, leading to both
functionality as well as security concerns due to lack of managing the
SR-IOV state as well as vendor dependent isolation from the PF to VF.
If we were to attempt to actively disable SR-IOV on driver probe, we
risk VF bound drivers blocking, potentially risking live lock
scenarios.  Therefore simply refuse to bind to PFs with SR-IOV enabled
with a warning message indicating the issue.  Users can resolve this
by re-binding to the host driver and disabling SR-IOV before
attempting to use the device with vfio-pci.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-06 12:23:19 -06:00
Gustavo A. R. Silva
544c05a60a vfio: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-06 12:22:54 -06:00
Linus Torvalds
ffb48e7924 powerpc fixes for 4.18 #4
Two regression fixes, one for xmon disassembly formatting and the other to fix
 the E500 build.
 
 Two commits to fix a potential security issue in the VFIO code under obscure
 circumstances.
 
 And finally a fix to the Power9 idle code to restore SPRG3, which is user
 visible and used for sched_getcpu().
 
 Thanks to:
   Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy, James Clarke.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJbUrFqExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCW
 mQ//eYpaYIkEthXH0uHUN2wpWZlhbg0wUtjclsT5RUonDwMC2PM9BUuLk61RtIlD
 jbi39GrUISHf13U1Ydhb3e1I5B+TpDe6hgb/dGUMAXPe6rt+jFREogZR+vgIU0ep
 Q8ta1GIgbI6La0zXn5o3apUtqR7bAQ3cWD2T8vN4tQmAQZPw1fV13cZZ2kFs5JFO
 aYX8pD76wkAUz8Im+bweziRSRYdAIi8Oxt1Cdyg9Oti5y4fp4LuQKa/qbfkswkkk
 2ycG3TWr4Ln8RM/GaUPW1UPh1Zd8b6et7vUnhxO1g/JdEXlnm9A+DifJFrUk/+y8
 DsyofdXUj+u+LXX/H8uqqse7ysfvkC53R15Jo8irISsIgYZcv4yKsZKGJR27wHGV
 h9KBDA0ZK+czU2jK4gZAdMTs1IICgXGUVIL+nI8U1sRBep3CI3HguqBVoC/MGes3
 WR2+8i2diL1m2jAHR5Nd3+ArBpI876ZalhDmM1Mv3GRUAvjo9nsehk32/nfBDYUM
 nPQd1vi3F6w1MSlj7aqA3quTeGANwlnOcdC4QB++tz7oO1sY6ZYZtii3jsGSluXL
 vP/Mxvz2AG3v9KPCDQJqzx0n3vMAmGphetddxYl1X1dSwbwLNx0GggKNe0xw6sjp
 6lbin0w/Ot79VoSEncuRdm1hrtIktwubsH5CGB7Gxke5Kgk=
 =Sfmf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Two regression fixes, one for xmon disassembly formatting and the
  other to fix the E500 build.

  Two commits to fix a potential security issue in the VFIO code under
  obscure circumstances.

  And finally a fix to the Power9 idle code to restore SPRG3, which is
  user visible and used for sched_getcpu().

  Thanks to: Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy,
  James Clarke"

* tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
  powerpc/Makefile: Assemble with -me500 when building for E500
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  powerpc/xmon: Fix disassembly since printf changes
2018-07-21 16:46:53 -07:00
Sinan Kaya
c6a44ba950 PCI: Rename pci_try_reset_bus() to pci_reset_bus()
Now that the old implementation of pci_reset_bus() is gone, replace
pci_try_reset_bus() with pci_reset_bus().

Compared to the old implementation, new code will fail immmediately with
-EAGAIN if object lock cannot be obtained.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-19 18:04:23 -05:00
Sinan Kaya
811c5cb37d PCI: Unify try slot and bus reset API
Drivers are expected to call pci_try_reset_slot() or pci_try_reset_bus() by
querying if a system supports hotplug or not.  A survey showed that most
drivers don't do this and we are leaking hotplug capability to the user.

Hide pci_try_slot_reset() from drivers and embed into pci_try_bus_reset().
Change pci_try_reset_bus() parameter from struct pci_bus to struct pci_dev.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-19 18:04:23 -05:00
Gustavo A. R. Silva
0e714d2778 vfio/pci: Fix potential Spectre v1
info.index can be indirectly controlled by user-space, hence leading
to a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl()
warn: potential spectre issue 'vdev->region'

Fix this by sanitizing info.index before indirectly using it to index
vdev->region

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-07-18 12:57:25 -06:00
Alexey Kardashevskiy
76fa4975f3 KVM: PPC: Check if IOMMU page is contained in the pinned physical page
A VM which has:
 - a DMA capable device passed through to it (eg. network card);
 - running a malicious kernel that ignores H_PUT_TCE failure;
 - capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.

The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.

The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.

We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,

This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.

This calculates maximum page size as a minimum of the natural region
alignment and compound page size. For the page shift this uses the shift
returned by find_linux_pte() which indicates how the page is mapped to
the current userspace - if the page is huge and this is not a zero, then
it is a leaf pte and the page is mapped within the range.

Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-18 16:17:17 +10:00
Alexey Kardashevskiy
1463edca67 vfio/spapr: Use IOMMU pageshift rather than pagesize
The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.

This should cause no behavioral change.

Cc: stable@vger.kernel.org # v4.12+
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-18 16:17:07 +10:00
Alexey Kardashevskiy
a68bd1267b powerpc/powernv/ioda: Allocate indirect TCE levels on demand
At the moment we allocate the entire TCE table, twice (hardware part and
userspace translation cache). This normally works as we normally have
contigous memory and the guest will map entire RAM for 64bit DMA.

However if we have sparse RAM (one example is a memory device), then
we will allocate TCEs which will never be used as the guest only maps
actual memory for DMA. If it is a single level TCE table, there is nothing
we can really do but if it a multilevel table, we can skip allocating
TCEs we know we won't need.

This adds ability to allocate only first level, saving memory.

This changes iommu_table::free() to avoid allocating of an extra level;
iommu_table::set() will do this when needed.

This adds @alloc parameter to iommu_table::exchange() to tell the callback
if it can allocate an extra level; the flag is set to "false" for
the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
H_TOO_HARD.

This still requires the entire table to be counted in mm::locked_vm.

To be conservative, this only does on-demand allocation when
the usespace cache table is requested which is the case of VFIO.

The example math for a system replicating a powernv setup with NVLink2
in a guest:
16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000

the table to cover that all with 64K pages takes:
(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB

If we allocate only necessary TCE levels, we will only need:
(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
levels).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16 22:53:11 +10:00
Alexey Kardashevskiy
090bad39b2 powerpc/powernv: Add indirect levels to it_userspace
We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window big enough to require 2 or more
indirect levels, and a DMA window is used to map all RAM (which is
a default case for 64bit window), we can actually save some memory by
not allocation TCE for regions which we are not going to map anyway.

The hardware tables alreary support indirect levels but we also keep
host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.

This converts it_userspace from vmalloc'ed array to a multi level table.

As the format becomes platform dependend, this replaces the direct access
to it_usespace with a iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; future extension will return
NULL if the level was not allocated.

This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16 22:53:10 +10:00
Alexey Kardashevskiy
00a5c58d94 KVM: PPC: Make iommu_table::it_userspace big endian
We are going to reuse multilevel TCE code for the userspace copy of
the TCE table and since it is big endian, let's make the copy big endian
too.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16 22:53:09 +10:00
Jason Gunthorpe
bb94b55af3 vfio: Use get_user_pages_longterm correctly
The patch noted in the fixes below converted get_user_pages_fast() to
get_user_pages_longterm(), however the two calls differ in a few ways.

First _fast() is documented to not require the mmap_sem, while _longterm()
is documented to need it. Hold the mmap sem as required.

Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
constant instead.

Fixes: 94db151dc8 ("vfio: disable filesystem-dax page pinning")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-30 13:58:09 -06:00
Alex Williamson
08ca1b52f6 vfio/pci: Make IGD support a configurable option
Allow the code which provides extensions to support direct assignment
of Intel IGD (GVT-d) to be compiled out of the kernel if desired.  The
config option for this was previously automatically enabled on X86,
therefore the default remains Y.  This simply provides the option to
disable it even for X86.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-18 16:39:50 -06:00
Linus Torvalds
467590e055 VFIO updates for v4.18
- Bind type1 task tracking to group_leader to facilitate vCPU hotplug
    in QEMU (Alex Williamson)
 
  - Sample mdev display drivers, including region-based host and
    guest Linux drivers and bochs compatible dmabuf device
    (Gerd Hoffmann)
 
  - Fix vfio-platform reset module leak (Geert Uytterhoeven)
 
  - vfio-platform error message consistency (Geert Uytterhoeven)
 
  - Global checking for mdev uuid collisions rather than per parent
    device (Alex Williamson)
 
  - Use match_string() helper (Yisheng Xie)
 
  - vfio-platform PM domain fixes (Geert Uytterhoeven)
 
  - Fix sample mbochs driver build dependency (Arnd Bergmann)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJbHt7lAAoJECObm247sIsi+1UP/Av+uDmJ3NkSvDprI2WqgL7+
 /AsUJTa0Nb64KI6Z2ZOYjdYHuJBHfF/txGpBTf2EW94Gnn3qisZxBblaVbBsluK+
 0BsPArgXbmMsMh2A/WJ+RgyHBvNlnE8tHcxDYlXsjzzXo6gypXziTZOYb1qexxIf
 YrZRJOX9LJV8o2HwapcoAkKrbJA83X6mJM4vx/o5+GtEeKr35pSImCB9I6CcvDE0
 tUP9nbZjo5YMT7c4GWhL+aB/vL5mHS7c1z8nog6pl5Dp3I/HNlzq4kfhhlIsrAvI
 AGAwLQN9f3uKnE9hKKJEfbJC7CTbMBJAyYx3OSH3M2JLctwky2I7zQWCl2654Z+1
 z81BI/bTKkrV9t88lJKdslOIff0tzmSx81bqFEQPtCpqXSPa4MS90mRo+KFsagw4
 dlgEgL9pHRUqBF/OJmq6OAKtKV6SHsJnOwRd7bLI8VLrY67ipLxE1cvmKkK2LkQ5
 cwyoGuF2f2RYfaENEnKlxoBZCfpxnLgT0Dk5/DWnsOMbGKF6x0KigVjzFhigvaGC
 FPsnE8mTJgCmkbe26l+tRkQdlVLiqzqbpE0mAjhaX0GXLsEe+50R52731AaQwPVC
 IZ51ti3+2VF6NcU1xs0QXIWf76g2nVjy7vzyAeHxVy3lcypTwWnR39InHpQt89h7
 E/IDSHgFl6O5G0hAt8yn
 =po45
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.18-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Bind type1 task tracking to group_leader to facilitate vCPU hotplug
   in QEMU (Alex Williamson)

 - Sample mdev display drivers, including region-based host and guest
   Linux drivers and bochs compatible dmabuf device
   (Gerd Hoffmann)

 - Fix vfio-platform reset module leak (Geert Uytterhoeven)

 - vfio-platform error message consistency (Geert Uytterhoeven)

 - Global checking for mdev uuid collisions rather than per parent
   device (Alex Williamson)

 - Use match_string() helper (Yisheng Xie)

 - vfio-platform PM domain fixes (Geert Uytterhoeven)

 - Fix sample mbochs driver build dependency (Arnd Bergmann)

* tag 'vfio-v4.18-rc1' of git://github.com/awilliam/linux-vfio:
  samples: mbochs: add DMA_SHARED_BUFFER dependency
  vfio: platform: Fix using devices in PM Domains
  vfio: use match_string() helper
  vfio/mdev: Re-order sysfs attribute creation
  vfio/mdev: Check globally for duplicate devices
  vfio: platform: Make printed error messages more consistent
  vfio: platform: Fix reset module leak in error path
  sample: vfio bochs vbe display (host device for bochs-drm)
  sample: vfio mdev display - guest driver
  sample: vfio mdev display - host device
  vfio/type1: Fix task tracking for QEMU vCPU hotplug
2018-06-12 13:11:26 -07:00
Geert Uytterhoeven
415eb9fc0e vfio: platform: Fix using devices in PM Domains
If a device is part of a PM Domain (e.g. power and/or clock domain), its
power state is managed using Runtime PM.  Without Runtime PM, the device
may not be powered up or clocked, causing subtle failures, crashes, or
system lock-ups when the device is accessed by the guest.

Fix this by adding Runtime PM support, powering the device when the VFIO
device is opened by the guest.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:37 -06:00
Yisheng Xie
e77addf018 vfio: use match_string() helper
match_string() returns the index of an array for a matching string,
which can be used intead of open coded variant.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:33 -06:00
Alex Williamson
6a62c1dfb5 vfio/mdev: Re-order sysfs attribute creation
There exists a gap at the end of mdev_device_create() where the device
is visible to userspace, but we're not yet ready to handle removal, as
triggered through the 'remove' attribute.  We handle this properly in
mdev_device_remove() with an -EAGAIN return, but we can marginally
reduce this gap by adding this attribute as a final step of our sysfs
setup.

Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:30 -06:00
Alex Williamson
002fe996f6 vfio/mdev: Check globally for duplicate devices
When we create an mdev device, we check for duplicates against the
parent device and return -EEXIST if found, but the mdev device
namespace is global since we'll link all devices from the bus.  We do
catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
with it comes a kernel warning and stack trace for trying to create
duplicate sysfs links, which makes it an undesirable response.

Therefore we should really be looking for duplicates across all mdev
parent devices, or as implemented here, against our mdev device list.
Using mdev_list to prevent duplicates means that we can remove
mdev_parent.lock, but in order not to serialize mdev device creation
and removal globally, we add mdev_device.active which allows UUIDs to
be reserved such that we can drop the mdev_list_lock before the mdev
device is fully in place.

Two behavioral notes; first, mdev_parent.lock had the side-effect of
serializing mdev create and remove ops per parent device.  This was
an implementation detail, not an intentional guarantee provided to
the mdev vendor drivers.  Vendor drivers can trivially provide this
serialization internally if necessary.  Second, review comments note
the new -EAGAIN behavior when the device, and in particular the remove
attribute, becomes visible in sysfs.  If a remove is triggered prior
to completion of mdev_device_create() the user will see a -EAGAIN
error.  While the errno is different, receiving an error during this
period is not, the previous implementation returned -ENODEV for the
same condition.  Furthermore, the consistency to the user is improved
in the case where mdev_device_remove_ops() returns error.  Previously
concurrent calls to mdev_device_remove() could see the device
disappear with -ENODEV and return in the case of error.  Now a user
would see -EAGAIN while the device is in this transitory state.

Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:27 -06:00
Geert Uytterhoeven
d59502bc05 vfio: platform: Make printed error messages more consistent
- Capitalize the first word of error messages,
  - Unwrap statements that fit on a single line,
  - Use "VFIO" instead of "vfio" as the error message prefix.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:23 -06:00
Geert Uytterhoeven
28a6838788 vfio: platform: Fix reset module leak in error path
If the IOMMU group setup fails, the reset module is not released.

Fixes: b5add544d6 ("vfio, platform: make reset driver a requirement by default")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:24:20 -06:00
Alex Williamson
48d8476b41 vfio/type1: Fix task tracking for QEMU vCPU hotplug
MAP_DMA ioctls might be called from various threads within a process,
for example when using QEMU, the vCPU threads are often generating
these calls and we therefore take a reference to that vCPU task.
However, QEMU also supports vCPU hotplug on some machines and the task
that called MAP_DMA may have exited by the time UNMAP_DMA is called,
resulting in the mm_struct pointer being NULL and thus a failure to
match against the existing mapping.

To resolve this, we instead take a reference to the thread
group_leader, which has the same mm_struct and resource limits, but
is less likely exit, at least in the QEMU case.  A difficulty here is
guaranteeing that the capabilities of the group_leader match that of
the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
time of calling rather than at an indeterminate time in the future.
Potentially this also results in better efficiency as this is now
recorded once per MAP_DMA ioctl.

Reported-by: Xu Yandong <xuyandong2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-08 10:21:39 -06:00
Linus Torvalds
408afb8d78 Merge branch 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio updates from Al Viro:
 "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

  The only thing I'm holding back for a day or so is Adam's aio ioprio -
  his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
  but let it sit in -next for decency sake..."

* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  aio: sanitize the limit checking in io_submit(2)
  aio: fold do_io_submit() into callers
  aio: shift copyin of iocb into io_submit_one()
  aio_read_events_ring(): make a bit more readable
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
  aio: take list removal to (some) callers of aio_complete()
  aio: add missing break for the IOCB_CMD_FDSYNC case
  random: convert to ->poll_mask
  timerfd: convert to ->poll_mask
  eventfd: switch to ->poll_mask
  pipe: convert to ->poll_mask
  crypto: af_alg: convert to ->poll_mask
  net/rxrpc: convert to ->poll_mask
  net/iucv: convert to ->poll_mask
  net/phonet: convert to ->poll_mask
  net/nfc: convert to ->poll_mask
  net/caif: convert to ->poll_mask
  net/bluetooth: convert to ->poll_mask
  net/sctp: convert to ->poll_mask
  net/tipc: convert to ->poll_mask
  ...
2018-06-04 13:57:43 -07:00
Alex Williamson
89c29def6b Revert "vfio/type1: Improve memory pinning process for raw PFN mapping"
Bisection by Amadeusz Sławiński implicates this commit leading to bad
page state issues after VM shutdown, likely due to unbalanced page
references.  The original commit was intended only as a performance
improvement, therefore revert for offline rework.

Link: https://lkml.org/lkml/2018/6/2/97
Fixes: 356e88ebe4 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
Cc: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Reported-by: Amadeusz Sławiński <amade@asmblr.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-02 08:41:44 -06:00
Christoph Hellwig
9965ed174e fs: add new vfs_poll and file_can_poll helpers
These abstract out calls to the poll method in preparation for changes
in how we poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-26 09:16:44 +02:00
Linus Torvalds
f605ba97fb VFIO updates for v4.17-rc1
- Adopt iommu_unmap_fast() interface to type1 backend
    (Suravee Suthikulpanit)
 
  - mdev sample driver fixup (Shunyong Yang)
 
  - More efficient PFN mapping handling in type1 backend
    (Jason Cai)
 
  - VFIO device ioeventfd interface (Alex Williamson)
 
  - Tag new vfio-platform sub-maintainer (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJax+VBAAoJECObm247sIsijnUQAI/TEvEGxZkUEXZ5DeFvcXjM
 N5ICSkApFaAyCDmrR5ljfd4u0k1OCePH9v9BR2BsfdMZtNRGCMMeQGYv0NR44Ude
 8wwJh3aitg3angZTaetaWt4o43A1SDHXg4JsjDqcSL6XR7b465gzr10OjuXCN0Wa
 9ltdlRaxEZ/SMrR7oITqJ6CGrSu6OWtQnaMUA9c2lLsNRTVUt8wyv54HhMSdBA4E
 Sm2IjdLwLbVPvStMbVzsd+Rm9nIoVNbaGuURfS7yx6FU30URTuajmbY3AtewA/1w
 BBMgTAbdGaLN7xbxxzZwAApMbHDFoiNrLGT63Y+ylEL4IPSBBksqvqpijsHDy/5g
 ASI9O32i04Wy1x7744nhSPI3XPWBL0rXdRvZHk5OIisIJS7NFk4g05S3wEz1Kfxz
 Vb0DW7AXZmunCFgPH3Oli0V41HfZrDx5F/X8FqtucnGSv2c3CVwMiHgueKDIXx96
 mtujLuXb/qrIUM+/nJ36090DOmiTVD8k5GMcetc9Wu7S4AFQlkTmmOroGQQxRyXA
 giP3rxHCt+H0OSjn0OwjjCsoB0MmMbeUD9Y9Ak0CQU2gSrj2G4/2tVpNYO8Uz8u+
 sInZWClJrRskG6vegLFBoR6um9vYbFU6/WaSb6cDPiixScmSwbm7c1hu0PVOo56p
 8WwkBomAv4iFcAxDCTQG
 =yaeE
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Adopt iommu_unmap_fast() interface to type1 backend
   (Suravee Suthikulpanit)

 - mdev sample driver fixup (Shunyong Yang)

 - More efficient PFN mapping handling in type1 backend
   (Jason Cai)

 - VFIO device ioeventfd interface (Alex Williamson)

 - Tag new vfio-platform sub-maintainer (Alex Williamson)

* tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio:
  MAINTAINERS: vfio/platform: Update sub-maintainer
  vfio/pci: Add ioeventfd support
  vfio/pci: Use endian neutral helpers
  vfio/pci: Pull BAR mapping setup from read-write path
  vfio/type1: Improve memory pinning process for raw PFN mapping
  vfio-mdev/samples: change RDI interrupt condition
  vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
2018-04-06 19:44:27 -07:00
Alex Williamson
30656177c4 vfio/pci: Add ioeventfd support
The ioeventfd here is actually irqfd handling of an ioeventfd such as
supported in KVM.  A user is able to pre-program a device write to
occur when the eventfd triggers.  This is yet another instance of
eventfd-irqfd triggering between KVM and vfio.  The impetus for this
is high frequency writes to pages which are virtualized in QEMU.
Enabling this near-direct write path for selected registers within
the virtualized page can improve performance and reduce overhead.
Specifically this is initially targeted at NVIDIA graphics cards where
the driver issues a write to an MMIO register within a virtualized
region in order to allow the MSI interrupt to re-trigger.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-26 13:22:58 -06:00
Alex Williamson
07fd7ef3a1 vfio/pci: Use endian neutral helpers
The iowriteXX/ioreadXX functions assume little endian hardware and
convert to little endian on a write and from little endian on a read.
We currently do our own explicit conversion to negate this.  Instead,
add some endian dependent defines to avoid all byte swaps.  There
should be no functional change other than big endian systems aren't
penalized with wasted swaps.

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-26 13:22:58 -06:00
Alex Williamson
0d77ed3589 vfio/pci: Pull BAR mapping setup from read-write path
This creates a common helper that we'll use for ioeventfd setup.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-26 13:22:58 -06:00
Jason Cai (Xiang Feng)
356e88ebe4 vfio/type1: Improve memory pinning process for raw PFN mapping
When using vfio to pass through a PCIe device (e.g. a GPU card) that
has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
pinning because PFNs of PCI BAR are not backed by struct page, and
the corresponding VMA has flag VM_PFNMAP.

With this change, when pinning a region which is a raw PFN mapping,
it can skip unnecessary user memory pinning process, and thus, can
significantly improve VM's boot up time when passing through devices
via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
BAR was reduced from about 0.4s to 1.5us.

Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-22 13:18:27 -06:00
Alex Williamson
834814e802 Revert: "vfio-pci: Mask INTx if a device is not capabable of enabling it"
This reverts commit 2170dd0431

The intent of commit 2170dd0431 ("vfio-pci: Mask INTx if a device is
not capabable of enabling it") was to disallow the user from seeing
that the device supports INTx if the platform is incapable of enabling
it.  The detection of this case however incorrectly includes devices
which natively do not support INTx, such as SR-IOV VFs, and further
discussions reveal gaps even for the target use case.

Reported-by: Arjun Vynipadath <arjun@chelsio.com>
Fixes: 2170dd0431 ("vfio-pci: Mask INTx if a device is not capabable of enabling it")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-21 22:50:19 -06:00
Suravee Suthikulpanit
6bd06f5a48 vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
VFIO IOMMU type1 currently upmaps IOVA pages synchronously, which requires
IOTLB flushing for every unmapping. This results in large IOTLB flushing
overhead when handling pass-through devices has a large number of mapped
IOVAs. This can be avoided by using the new IOTLB flushing interface.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[aw - use LIST_HEAD]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-21 12:46:19 -06:00
Dan Williams
94db151dc8 vfio: disable filesystem-dax page pinning
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding of filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.

RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.

Note that xfs and ext4 still report:

   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-02 18:00:04 -08:00
Linus Torvalds
a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds
34b1cf60ab VFIO updates for v4.16-rc1
- Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)
 
  - Capability helper cleanup (Alex Williamson)
 
  - Allow mmaps overlapping MSI-X vector table with region capability
    exposing this feature (Alexey Kardashevskiy)
 
  - mdev static cleanups (Xiongwei Song)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJac2tQAAoJECObm247sIsiUr8P/2zrK0G/mVrHWGEjlnlvgYjg
 FSozgN0fmc9mJ+Hg8ntfXn4GXuTaCq0uz96eMl9Wy1tMpLs4hoQf20IRojIpqgpb
 aVUHKg+7sJUWNsd+u1jSBb64SvQLdTbesfGgL8WLcJB90jWxzaZjEZ3mgKO4Lb88
 rNEpiTVoOURDgJo+bOMEnJJ6okmpRLBgw5pqrdLT2BButZg3QfLtcuoY18pFYxc7
 INy4YPWPe93aiDGloUrjj6xKRKfTaL7L8KGZBlk4FR5JENDRoOtGEguGcQRl0u8w
 IYLIkIE9p172S5bkeCYqawyxAPgQQIk5Wd4buFArg9w7tdWJOkEiCMwSu/LCocR8
 CiwZHKOWp7mJWv3XxJGY+rU3nHIuB+IaeKknE6rLXgkLVXED5Ta4pPzrpoUZezsT
 yyI5U2BnWNL5ISaWY3i+YQKvFUgPe8Az9Zw4p3zGUJ+zu2QheN7x+BVXT0xENMfk
 2sdNFeZwkxmB18pdsJdb+/DpL9yCrS7VxP3IDnAdfR8VIyLE3QFRTnsfNX7F1Uvr
 zKBZChuah2osiiM3k3ncwkovsM6iN1ZLVnHUg7xRBmp6WQnBYN02wrTXM/e1UqsS
 nKIBYA9fIcpLZ4hXMuiMmk7LFlScl/HngW+h+jWAyoP/j3X0gw/YM5hrPEJEV3fN
 WuIPpBYN36+0V9cUDMnN
 =Ww3n
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)

 - Capability helper cleanup (Alex Williamson)

 - Allow mmaps overlapping MSI-X vector table with region capability
   exposing this feature (Alexey Kardashevskiy)

 - mdev static cleanups (Xiongwei Song)

* tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
  vfio-pci: Allow mapping MSIX BAR
  vfio: Simplify capability helper
  vfio-pci: Mask INTx if a device is not capabable of enabling it
2018-02-01 13:18:25 -08:00
Xiongwei Song
46ed90f157 vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
The functions vfio_mdev_probe, vfio_mdev_remove and the structure
vfio_mdev_driver are only used in this file, so make them static.

Clean up sparse warnings:
drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype
for 'vfio_mdev_probe' [-Wmissing-prototypes]
drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype
for 'vfio_mdev_remove' [-Wmissing-prototypes]

Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-01-09 12:13:08 -07:00
Alexey Kardashevskiy
a32295c612 vfio-pci: Allow mapping MSIX BAR
By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.

This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.

This allows mmap of the entire BAR containing MSIX vector table.

This removes the sparse capability for PCI devices as it becomes useless.

As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
that the entire BAR can be mmapped.

This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20 10:05:06 -07:00
Alex Williamson
dda01f787d vfio: Simplify capability helper
The vfio_info_add_capability() helper requires the caller to pass a
capability ID, which it then uses to fill in header fields, assuming
hard coded versions.  This makes for an awkward and rigid interface.
The only thing we want this helper to do is allocate sufficient
space in the caps buffer and chain this capability into the list.
Reduce it to that simple task.

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20 09:53:54 -07:00
Alexey Kardashevskiy
2170dd0431 vfio-pci: Mask INTx if a device is not capabable of enabling it
At the moment VFIO rightfully assumes that INTx is supported if
the interrupt pin is not set to zero in the device config space.
However if that is not the case (the pin is not zero but pdev->irq is),
vfio_intx_enable() fails.

In order to prevent the userspace from trying to enable INTx when we know
that it cannot work, let's mask the PCI_INTERRUPT_PIN register.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20 09:53:52 -07:00
Al Viro
3ad6f93e98 annotate poll-related wait keys
__poll_t is also used as wait key in some waitqueues.
Verify that wait_..._poll() gets __poll_t as key and
provide a helper for wakeup functions to get back to
that __poll_t value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:54 -05:00
Al Viro
e6c8adca20 anntotate the places where ->poll() return values go
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:53 -05:00
Linus Torvalds
e0ca3826b1 VFIO Updates for Linux v4.15
* Virtualize PCI MPS and MRRS registers
 
  * Avoid soft lockups on SPAPR when clearing TCE
 
  * Broadcom FlexRM platform device support
 
  * Samples driver cleanup & type1 integer overflow fix
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJaCgl2AAoJECObm247sIsiT3kP/A7GsPfKMjKwkTFHUzlQ0/gj
 vCaMeQbfrSSmQE4c3dad5X53Y//FPlB9U+4y4eJiF+VwNYy8Emcue9SuMv8gp4WK
 HUONhRYcbfpL/qdYAcWRkdlS5In9fBsWJAauHSBbBW7jfCGmXiSVcHlEXscMfX1O
 VLs5aW9SHwabHmIgIVu3BL8rMXgm/nLeUHy5mjMq9SiBE1Wc69xkEn+gq4rn7Ibd
 JoIZymGpvQ5L4G4L9fZNAOKma9WQgXOvENfNbYqWxSJCm5/8Ke30tmZEljoIPsqE
 O/Twa/KBJ8Ah9FwLheSPmk0+qxq5VPfcprNb3ZlFYBUJl97EihSzEnw6KtiMgOZA
 GPEfzZOTip+lPShIhGfTuE/EwHcubYUXnFt6jLmjZHIFuMrYy/uDsIyNL6hgYjDl
 qJCiRzCyXvD8f2BvLqp7Ar1yd2khqYoLrxEshmxEt3ysDz4iqb3+RcHqVgmV9bP1
 xHip0SY+VrdZBEICCFas8rhS9fhE+8JuAAeMfvNHn30v7lHi3Wgp34tghy3nIikf
 /tb+1N4IRjo0TxTs1an/L1AaM+4J1D8qWvDE8NuiW+reY3oTtq0rmkIkD/HBZRhL
 zm8Y0+fNcfMAPVBd/amXWKuqKyxtKjqtIO5IQEpBB6fV+HThf1/UcguiI0x//ZfW
 TdP6GydKc1oiNsEv7217
 =86A4
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.15-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Virtualize PCI MPS and MRRS registers

 - Avoid soft lockups on SPAPR when clearing TCE

 - Broadcom FlexRM platform device support

 - Samples driver cleanup & type1 integer overflow fix

* tag 'vfio-v4.15-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: platform: reset: Add Broadcom FlexRM reset module
  vfio/type1: silence integer overflow warning
  vfio-mdev/samples: make mdev_fops const and static
  vfio/spapr: Add cond_resched() for huge updates
  vfio/pci: Virtualize Maximum Read Request Size
  vfio/pci: Virtualize Maximum Payload Size
2017-11-14 16:47:47 -08:00
Alex Williamson
f3199673e6 Merge branches 'vfio/fixes' and 'vfio/platform' into vfio-next-20171113.0 2017-11-13 13:59:45 -07:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Mark Rutland
6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
Anup Patel
e25883411e vfio: platform: reset: Add Broadcom FlexRM reset module
This patch adds Broadcom FlexRM low-level reset for
VFIO platform.

It will do the following:
1. Disable/Deactivate each FlexRM ring
2. Flush each FlexRM ring

The cleanup sequence for FlexRM rings is adapted from
Broadcom FlexRM mailbox driver.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Oza Oza <oza.oza@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-20 11:43:24 -06:00
Dan Carpenter
71a7d3d78e vfio/type1: silence integer overflow warning
I get a static checker warning about the potential integer overflow if
we add "unmap->iova + unmap->size".  The integer overflow isn't really
harmful, but we may as well fix it.  Also unmap->size gets truncated to
size_t when we pass it to vfio_find_dma() so we could check for too high
values of that as well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-20 11:41:56 -06:00
Alexey Kardashevskiy
5c2fefd882 vfio/spapr: Add cond_resched() for huge updates
Clearing very big IOMMU tables can trigger soft lockups. This adds
cond_resched() to allow the scheduler to do context switching when
it decides to.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-02 12:39:11 -06:00
Alex Williamson
cf0d53ba49 vfio/pci: Virtualize Maximum Read Request Size
MRRS defines the maximum read request size a device is allowed to
make.  Drivers will often increase this to allow more data transfer
with a single request.  Completions to this request are bound by the
MPS setting for the bus.  Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology.  Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-02 12:39:10 -06:00
Alex Williamson
523184972b vfio/pci: Virtualize Maximum Payload Size
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting.  Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed.  Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
2017-10-02 12:39:09 -06:00
Arvind Yadav
417fb50d55 vfio: platform: constify amba_id
amba_id are not supposed to change at runtime. All functions
working with const amba_id. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-08-30 14:03:42 -06:00
Alex Williamson
6586b561a9 vfio: Stall vfio_del_group_dev() for container group detach
When the user unbinds the last device of a group from a vfio bus
driver, the devices within that group should be available for other
purposes.  We currently have a race that makes this generally, but
not always true.  The device can be unbound from the vfio bus driver,
but remaining IOMMU context of the group attached to the container
can result in errors as the next driver configures DMA for the device.

Wait for the group to be detached from the IOMMU backend before
allowing the bus driver remove callback to complete.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-08-30 14:02:16 -06:00
Eric Auger
d935ad91f0 vfio: fix noiommu vfio_iommu_group_get reference count
In vfio_iommu_group_get() we want to increase the reference
count of the iommu group.

In noiommu case, the group does not exist and is allocated.
iommu_group_add_device() increases the group ref count. However we
then call iommu_group_put() which decrements it.

This leads to a "refcount_t: underflow WARN_ON".

Only decrement the ref count in case of iommu_group_add_device
failure.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-08-30 14:00:47 -06:00
Robin Murphy
f203f7f1db vfio/type1: Give hardware MSI regions precedence
If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.

Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-08-10 13:11:50 -06:00
Robin Murphy
db406cc0ac vfio/type1: Cope with hardware MSI reserved regions
For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.

Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-08-10 13:11:49 -06:00
Alex Williamson
796b755066 vfio/pci: Fix handling of RC integrated endpoint PCIe capability size
Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map.  Add a case for these to avoid reporting an
erroneous overlap.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-27 10:39:33 -06:00
Alex Williamson
9f47803503 vfio/pci: Use pci_try_reset_function() on initial open
Device lock bites again; if a device .remove() callback races a user
calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
the device lock, but the user ioctl may have already taken a vfio_device
reference.  In the case of a PCI device, the initial open will attempt
to reset the device, which again attempts to get the device lock,
resulting in deadlock.  Use the trylock PCI reset interface and return
error on the open path if reset fails due to lock contention.

Link: https://lkml.org/lkml/2017/7/25/381
Reported-by: Wen Congyang <wencongyang2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-26 14:33:15 -06:00
Linus Torvalds
8c6f5e7359 VFIO updates for v4.13-rc1
- Include Intel XXV710 in INTx workaround (Alex Williamson)
 
  - Make use of ERR_CAST() for error return (Dan Carpenter)
 
  - Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
 
  - Unset KVM-VFIO attributes only on group match (Alex Williamson)
 
  - Fix release path group/file matching with KVM-VFIO (Alex Williamson)
 
  - Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJZZ71TAAoJECObm247sIsi1mUP/1/5kubhHRE/Y11SnX4Bpie0
 X6YVbV3WLGuV9jaFMI7EgYJLZ6pBvCRX0CLUzEdrbS4LAlTLQu2GSY5tqhcLJumE
 0mV1Wj1jOO9BAur13/sJ+IohbZeK10dtbTkWv+YhrVSpRaLP+ituaKsajEV12WRH
 6dxho2bqfvNcTC8yjE+vqe9mU3jXYPGuCx8oYGaDXEsCjzrdbOFnAG0/s/2cWpb+
 D1ADHd3020VAHZRnHRBLFfMczza1jqllhSAUfdMw1gRGCQDq3k1XenzVNLLTbtYy
 VmEWHa+R/OCfbKVxaDqPsgOTK7x7DKn+Pzb3lWCdQ8v5X+2ubHpVIYjxTDSSTbt3
 YJ7a+hNk8AHkFgwS7x8BdOT8mmNGb1NZldjS4dv2VWkfcTnMQnubnYSCzGztto9h
 P2THKBil6djPb9S3pCvtKUiHSIZedYZlKofUldrOdGDAZmzLLlf8lTzijGjDYKFM
 pQeZC+xeEhZXURipgkH+a+paYgDtKEfwSlABODghjCcJf7S/GbyVPLOKLXzoVb2y
 Ml8eGlo4O/cNniQK5faH447ilM7hzS1aG83uGnHTfe8VgKjI7Z5ZSxKOtoEq5bDz
 bb91E6GVLKHqT0LVS1YZfrnqK0hX/QAd/sK1REM5nN8JNmPLyLpjv8FaJEhpk1vC
 z4At5+pfKM8DYrW3EGmc
 =3A4K
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Include Intel XXV710 in INTx workaround (Alex Williamson)

 - Make use of ERR_CAST() for error return (Dan Carpenter)

 - Fix vfio_group release deadlock from iommu notifier (Alex Williamson)

 - Unset KVM-VFIO attributes only on group match (Alex Williamson)

 - Fix release path group/file matching with KVM-VFIO (Alex Williamson)

 - Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)

* tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: Remove unnecessary uses of vfio_container.group_lock
  vfio: New external user group/file match
  kvm-vfio: Decouple only when we match a group
  vfio: Fix group release deadlock
  vfio: Use ERR_CAST() instead of open coding it
  vfio/pci: Add Intel XXV710 to hidden INTx devices
2017-07-13 12:23:54 -07:00
Alex Williamson
7f56c30bd0 vfio: Remove unnecessary uses of vfio_container.group_lock
The original intent of vfio_container.group_lock is to protect
vfio_container.group_list, however over time it's become a crutch to
prevent changes in container composition any time we call into the
iommu driver backend.  This introduces problems when we start to have
more complex interactions, for example when a user's DMA unmap request
triggers a notification to an mdev vendor driver, who responds by
attempting to unpin mappings within that request, re-entering the
iommu backend.  We incorrectly assume that the use of read-locks here
allow for this nested locking behavior, but a poorly timed write-lock
could in fact trigger a deadlock.

The current use of group_lock seems to fall into the trap of locking
code, not data.  Correct that by removing uses of group_lock that are
not directly related to group_list.  Note that the vfio type1 iommu
backend has its own mutex, vfio_iommu.lock, which it uses to protect
itself for each of these interfaces anyway.  The group_lock appears to
be a redundancy for these interfaces and type1 even goes so far as to
release its mutex to allow for exactly the re-entrant code path above.

Reported-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: stable@vger.kernel.org # v4.10+
2017-07-07 15:37:38 -06:00
Alex Williamson
5d6dee80a1 vfio: New external user group/file match
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen.  The group can be in a state where we wouldn't allow a
new reference to be added.  This new helper function allows a caller
to match a file to a group to facilitate this.  Given a file and
group, report if they match.  Thus the caller needs to already have a
group reference to match to the file.  This allows the deletion of a
group without acquiring a new reference.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
2017-06-28 13:50:05 -06:00
Alex Williamson
811642d8d8 vfio: Fix group release deadlock
If vfio_iommu_group_notifier() acquires a group reference and that
reference becomes the last reference to the group, then vfio_group_put
introduces a deadlock code path where we're trying to unregister from
the iommu notifier chain from within a callout of that chain.  Use a
work_struct to release this reference asynchronously.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
2017-06-28 13:49:38 -06:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Dan Carpenter
7b3a10df1d vfio: Use ERR_CAST() instead of open coding it
It's a small cleanup to use ERR_CAST() here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-06-13 09:24:21 -06:00
Alex Williamson
7d57e5e933 vfio/pci: Add Intel XXV710 to hidden INTx devices
XXV710 has the same broken INTx behavior as the rest of the X/XL710
series, the interrupt status register is not wired to report pending
INTx interrupts, thus we never associate the interrupt to the device.
Extend the device IDs to include these so that we hide that the
device supports INTx at all to the user.

Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
2017-06-13 09:22:57 -06:00
Linus Torvalds
7246f60068 powerpc updates for 4.12 part 1.
Highlights include:
 
  - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
    virtual address space, but a process can request access to the full 512TB by
    passing a hint to mmap().
 
  - Support for the new Power9 "XIVE" interrupt controller.
 
  - TLB flushing optimisations for the radix MMU on Power9.
 
  - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
    Architecture 2.0".
 
  - The ability to configure the mmap randomisation limits at build and runtime.
 
  - Several small fixes and cleanups to the kprobes code, as well as support for
    KPROBES_ON_FTRACE.
 
  - Major improvements to handling of system reset interrupts, correctly treating
    them as NMIs, giving them a dedicated stack and using a new hypervisor call
    to trigger them, all of which should aid debugging and robustness.
 
 Many fixes and other minor enhancements.
 
 Thanks to:
   Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
   Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
   Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
   Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
   Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
   Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
   Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
   R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
   Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
   Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
   Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
 Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
 V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
 KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
 EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
 I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
 Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
 EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
 gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
 m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
 Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
 ediqGEj0/N1HMB31V5tS
 =vSF3
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Larger virtual address space on 64-bit server CPUs. By default we
     use a 128TB virtual address space, but a process can request access
     to the full 512TB by passing a hint to mmap().

   - Support for the new Power9 "XIVE" interrupt controller.

   - TLB flushing optimisations for the radix MMU on Power9.

   - Support for CAPI cards on Power9, using the "Coherent Accelerator
     Interface Architecture 2.0".

   - The ability to configure the mmap randomisation limits at build and
     runtime.

   - Several small fixes and cleanups to the kprobes code, as well as
     support for KPROBES_ON_FTRACE.

   - Major improvements to handling of system reset interrupts,
     correctly treating them as NMIs, giving them a dedicated stack and
     using a new hypervisor call to trigger them, all of which should
     aid debugging and robustness.

   - Many fixes and other minor enhancements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
  Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
  Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
  Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
  Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
  Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
  Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
  Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
  R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
  Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
  Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
  Yang Shi"

* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
  powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
  powerpc/powernv: Fix TCE kill on NVLink2
  powerpc/mm/radix: Drop support for CPUs without lockless tlbie
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
  powerpc/smp: Document irq enable/disable after migrating IRQs
  powerpc/mpc52xx: Don't select user-visible RTAS_PROC
  powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
  powerpc/eeh: Clean up and document event handling functions
  powerpc/eeh: Avoid use after free in eeh_handle_special_event()
  cxl: Mask slice error interrupts after first occurrence
  cxl: Route eeh events to all drivers in cxl_pci_error_detected()
  cxl: Force context lock during EEH flow
  powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
  powerpc/xmon: Teach xmon oops about radix vectors
  powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
  powerpc/pseries: Enable VFIO
  powerpc/powernv: Fix iommu table size calculation hook for small tables
  powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
  powerpc: Add arch/powerpc/tools directory
  ...
2017-05-05 11:36:44 -07:00
Alex Williamson
7cb671e7a3 vfio/type1: Reduce repetitive calls in vfio_pin_pages_remote()
vfio_pin_pages_remote() is typically called to iterate over a range
of memory.  Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value.  This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled.  Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-18 15:01:15 -06:00
Alex Williamson
80dbe1fbaf vfio/type1: Prune vfio_pin_page_external()
With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page.  We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK.  Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error.  Since the function
only pins a single page, there's no need to return the number of pages
pinned.

N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct().  If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.

Suggested-by: Kirti Wankhede <kwankhede@nvidia.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-18 15:01:15 -06:00
Alex Williamson
0cfef2b741 vfio/type1: Remove locked page accounting workqueue
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task.  This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed.  The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness.  Remove this workqueue mechanism and update the
callers to allow for failure.  We can also now recheck the limit under
write lock to make sure we don't exceed it.

vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits.  This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.

Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-13 14:10:15 -06:00
Alexey Kardashevskiy
3393af24b6 vfio/spapr_tce: Check kzalloc() return when preregistering memory
This adds missing checking for kzalloc() return value.

Fixes: 4b6fad7097 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-11 13:40:09 -06:00
Alexey Kardashevskiy
1282ba7fc2 vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
picks the v2.

Normally the userspace would create a container, attach an IOMMU group
to it and only then set the IOMMU type (which would normally be v2).

However a specific IOMMU group may not support v2, in other words
it may not implement set_window/unset_window/take_ownership/
release_ownership and such a group should not be attached to
a v2 container.

This adds extra checks that a new group can do what the selected IOMMU
type suggests. The userspace can then test the return value from
ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
VFIO_SPAPR_TCE_IOMMU.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-11 13:38:46 -06:00
Alexey Kardashevskiy
e5afdf9dd5 powerpc/vfio_spapr_tce: Add reference counting to iommu_table
So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-30 21:42:11 +11:00
Alexey Kardashevskiy
11edf116e3 powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.

As we are going to have reference counting on tables, we need an unified
way of disposing tables.

This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

As from now on the iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table() so this
moves this initialization in pnv_pci_ioda2_create_table().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-30 21:42:11 +11:00
Linus Torvalds
d07c6f46c4 VFIO fixes for v4.11-rc4
- Rework sanity check for mdev driver group notifier de-registration
    (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJY1UFMAAoJECObm247sIsiKiwQAIiB+7sobUsKHaBtlqlybyGR
 xQuU/vv0YfaSmRQJGpSSCP8jctUszpVc1fzce/mP8G/HTOBnbMGk3c2VP3gAZQb7
 bkGXfvslmFstj301Kf7xOYazGUXHkDXTQMzYf9v8JsAt4WfoYORhTuUR1//rj0gf
 qocEwfrxVOotwN6C2EeuEGa4WBZ7n9HC2fB8cSAo9FOH7HhT/Y/vztZ9NwYoJSZo
 sgufl42hRusQaiS4JD9sakny2KhBujrCWxT/qPwPHaGHxNvM/8z+96IrMhhL4CMi
 85VMIEoPZIZ9vQY26PQRR/b1hXgTF1vwn7bwL/HCCKnIxe+0e0VaFZ4xxWTfuHKz
 g86VpGVpxfYfZl4KSMatNS+M4eJwrANGHnT6c2tJWxIG8ApVgWE1Ae9ae9L9fkYV
 mYUNuQuugQh6WQaNyhlY7Watc7/6FM5XuCtxo8BfdqoKWtZGUU2TO8c/jjOdzfBU
 4UJjyzbROAGGNRnCp7qekakBGpKUjlaMenKyf9ZbAJzi5RGjJF41HXCq2P5nA8+x
 xsSUVXUW33+M06MTmfQ4THdelRTPZkMLDxGvvIfHc3KKztDKZdt8R2CoHKWxgOLi
 mlNYnSmMaVTBbZxO2jaEf3CKJu/Rcs4kkDqXPYcd/KPFtQRETEwNHC8sgx6VMDhJ
 4nG+YnX6Shz0mRheOiuZ
 =nnnX
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio

Pull VFIO fix from Alex Williamson:
 "Rework sanity check for mdev driver group notifier de-registration
  (Alex Williamson)"

* tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio:
  vfio: Rework group release notifier warning
2017-03-24 14:39:36 -07:00
Robin Murphy
9d3a4de4cb iommu: Disambiguate MSI region types
The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.

For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.

Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.

Fixes: d30ddcaa7b ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-03-22 16:16:17 +01:00
Alex Williamson
65b1adebfe vfio: Rework group release notifier warning
The intent of the original warning is make sure that the mdev vendor
driver has removed any group notifiers at the point where the group
is closed by the user.  Theoretically this would be through an
orderly shutdown where any devices are release prior to the group
release.  We can't always count on an orderly shutdown, the user can
close the group before the notifier can be removed or the user task
might be killed.  We'd like to add this sanity test when the group is
idle and the only references are from the devices within the group
themselves, but we don't have a good way to do that.  Instead check
both when the group itself is removed and when the group is opened.
A bit later than we'd prefer, but better than the current over
aggressive approach.

Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
Cc: Jike Song <jike.song@intel.com>
2017-03-21 13:19:09 -06:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar
6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Linus Torvalds
f14cc3b13d VFIO updates for v4.11
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
  - Module softdep rather than request_module to simplify usage from
    initrd (Alex Williamson)
  - Comment typo fix (Changbin Du)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJYry1kAAoJECObm247sIsieQ8P/0iydvBdW3AACPADQIvfV3+K
 PfAj3RwO1AvKB2Uiiql8t+S+qr5WO2Z4oeyOsaapmaOq8fxN76nTtIPUPg9SzkUx
 DDFNT+Zl/wFGA+vDfPcDFaFAf35mNTLN2hmqcrkTwLL7flA0ICYZzpT0U3nP6aEt
 go10DkQXQ1aGJkTfBluIqe0FTBd2EZ3+/mJbjiaHRlV5CcXGEtAAe8IC+XJoJJds
 AZ1RQiLauE6Rj4t2JQegZJH+ntX7zPk329jD0DjxZq9y08uEqdyeXWj31O8SRucB
 hD4QGN6NS8Qv+TlpdMVnsv6E3oZhUZIObH2p0nhUJ8ZpdEkTZoR/73rJKipz6svX
 9TuuiIJL2ibZ40rEpABIx1N0ykHiUmXgtCuVfjiKSXmsdZYSu7bK/CbOVBGvS+WE
 dKbgob44QozgD34wt5mv9by1jnxxeDodbIfkXE3p4b2TKaaFFnpCG2Yex6XOjjCU
 T4peLwxTyszqVDdghJ5l8S2661+UqS+VZQXSXMQRg4fLLx+HjB0JPo/NcfjNFhUE
 /RwXvOoVRRn0EKTX5v5Ty5YVx+/KWGm3wAsa1Ar+YGATKn3tS91Vz4cKYZA/bdUc
 GKkGI5FrARjtWF860cBtq/KWk2qYy7EzCdVfkk56e5s4clrZwjbet6W/RmZGSvVn
 R+XAN7GucRg/aMJyrwJM
 =6Gze
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)

 - Module softdep rather than request_module to simplify usage from
   initrd (Alex Williamson)

 - Comment typo fix (Changbin Du)

* tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: fix a typo in comment of function vfio_pin_pages
  vfio: Replace module request with softdep
  vfio/mdev: Use a module softdep for vfio_mdev
  vfio: Fix build break when SPAPR_TCE_IOMMU=n
2017-02-23 11:26:09 -08:00
Changbin Du
d9d84780f1 vfio: fix a typo in comment of function vfio_pin_pages
Correct the description that 'unpinned' -> 'pinned'.

Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-22 11:40:15 -07:00
Linus Torvalds
ebb4949eb3 IOMMU Updates for Linux v4.11
The changes include:
 
 	* KVM PCIe/MSI passthrough support on ARM/ARM64
 
 	* Introduction of a core representation for individual hardware
 	  iommus
 
 	* Support for IOMMU privileged mappings as supported by some
 	  ARM IOMMUS
 
 	* 16-bit SID support for ARM-SMMUv2
 
 	* Stream table optimization for ARM-SMMUv3
 
 	* Various fixes and other small improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYqw3hAAoJECvwRC2XARrjPy0P/35ykfHIAESJuF+72ziaoAYA
 ZvMrli8rGq7n+ntaIGPx9rV+hZTUSF8V2bfsYV7SAn5iYuViXZqvOtC3BAEp6GNC
 cdMeQfqXoHiWVMdXdOihzk+6YCQvBxqPOvUtYFqVhOo3Yrz8Dc71KsKvrTndEUVY
 f7bXHKssVONkWMga9lIVDgEefG5VyJPEQaxJXB9ymLHXbwWOcISe1lgtkrzFSxSH
 H9YNI07Tfcxfn6rN8jGmcYFYM58xwBicpB4HBw5uytMBYAsxqTEsx4X5dGpOF6RH
 cFW9nby+9ZlcTMyuWXKAck3o8df2ZC1xiSjnz+DHQdBPFiFNqIL3PVUcaz9PnF2e
 e6Y+DA3s+jykeiCvi2K0Z9RwTg7t8S5spel+UCeNVSnIjE9pqZNLF8vsDjF17zuR
 +zcFm7RVI397QVQGp0dbqhtxnwqt/3CX/wlzpvuNdEZa4vwujpcnM9tfl6gyFrF8
 awK9Fj5ryAn4DEiM+8yiRHwLrU5ij1cfc8jQdqleEB2ca7Wv3g1uhhS0QTXOFY9u
 A7ygOna25U1EcOwjC6ebjiEL115ZEOrXo+eChhzCHoUEHCVxL+L/NAMEsUcMqPIw
 3XsHhru0HbXgd5O5wHX39s2je8G3+ElqQwy8Ja3DimV6tvon7yaKCXy9QU+2aa1u
 3r53R/0mW1ijtOfK+I0b
 =5b3I
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU UPDATES from Joerg Roedel:

 - KVM PCIe/MSI passthrough support on ARM/ARM64

 - introduction of a core representation for individual hardware iommus

 - support for IOMMU privileged mappings as supported by some ARM IOMMUS

 - 16-bit SID support for ARM-SMMUv2

 - stream table optimization for ARM-SMMUv3

 - various fixes and other small improvements

* tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits)
  vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
  iommu: Remove iommu_register_instance interface
  iommu/exynos: Make use of iommu_device_register interface
  iommu/mediatek: Make use of iommu_device_register interface
  iommu/msm: Make use of iommu_device_register interface
  iommu/arm-smmu: Make use of the iommu_register interface
  iommu: Add iommu_device_set_fwnode() interface
  iommu: Make iommu_device_link/unlink take a struct iommu_device
  iommu: Add sysfs bindings for struct iommu_device
  iommu: Introduce new 'struct iommu_device'
  iommu: Rename struct iommu_device
  iommu: Rename iommu_get_instance()
  iommu: Fix static checker warning in iommu_insert_device_resv_regions
  iommu: Avoid unnecessary assignment of dev->iommu_fwspec
  iommu/mediatek: Remove bogus 'select' statements
  iommu/dma: Remove bogus dma_supported() implementation
  iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space
  iommu/vt-d: Don't over-free page table directories
  iommu/vt-d: Tylersburg isoch identity map check is done too late.
  iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu
  ...
2017-02-20 16:42:43 -08:00
Joerg Roedel
8d2932dd06 Merge branches 'iommu/fixes', 'arm/exynos', 'arm/renesas', 'arm/smmu', 'arm/mediatek', 'arm/core', 'x86/vt-d' and 'core' into next 2017-02-10 15:13:10 +01:00
Wei Yongjun
2c9f1af528 vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 5d70499218 ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-10 15:09:11 +01:00
Alex Williamson
0ca582fd04 vfio: Replace module request with softdep
Rather than doing a module request from within the init function, add
a soft dependency on the available IOMMU backend drivers.  This makes
the dependency visible to userspace when picking modules for the
ram disk.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-09 12:13:53 -07:00
Alex Williamson
f790eb57e6 vfio/mdev: Use a module softdep for vfio_mdev
Use an explicit module softdep rather than a request module call such
that the dependency is exposed to userspace.  This allows us to more
easily support modules loaded at initrd time.

Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-08 13:13:25 -07:00
Michael Ellerman
d88423f784 vfio: Fix build break when SPAPR_TCE_IOMMU=n
Currently the kconfig logic for VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH
is broken when SPAPR_TCE_IOMMU=n. Leading to:

    warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
    warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
    drivers/vfio/vfio_iommu_spapr_tce.c:113:8: error: implicit declaration of function 'mm_iommu_find'

This stems from the fact that VFIO selects VFIO_IOMMU_SPAPR_TCE, and
although it has an if clause, the condition is not correct.

We could fix it by doing select VFIO_IOMMU_SPAPR_TCE if SPAPR_TCE_IOMMU,
but the cleaner fix is to drop the selects and tie VFIO_IOMMU_SPAPR_TCE
to the value of VFIO, and express the dependencies in only once place.

Do the same for VFIO_SPAPR_EEH.

The end result is that the values of VFIO_IOMMU_SPAPR_TCE and
VFIO_SPAPR_EEH follow the value of VFIO, except when SPAPR_TCE_IOMMU=n
and/or EEH=n. Which is exactly what we want to happen.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-08 13:13:25 -07:00
Alexey Kardashevskiy
930a42ded3 vfio/spapr_tce: Set window when adding additional groups to container
If a container already has a group attached, attaching a new group
should just program already created IOMMU tables to the hardware via
the iommu_table_group_ops::set_window() callback.

However commit 6f01cc692a ("vfio/spapr: Add a helper to create
default DMA window") did not just simplify the code but also removed
the set_window() calls in the case of attaching groups to a container
which already has tables so it broke VFIO PCI hotplug.

This reverts set_window() bits in tce_iommu_take_ownership_ddw().

Fixes: 6f01cc692a ("vfio/spapr: Add a helper to create default DMA window")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-07 11:48:16 -07:00
Alexey Kardashevskiy
2da64d20a0 vfio/spapr: Fix missing mutex unlock when creating a window
Commit d9c728949d ("vfio/spapr: Postpone default window creation")
added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and
made it possible to return from tce_iommu_ioctl() without unlocking
container->lock; this fixes the issue.

Fixes: d9c728949d ("vfio/spapr: Postpone default window creation")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-01 09:48:34 -07:00
Joerg Roedel
93fa6cf60a Merge branch 'iommu/guest-msi' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/core 2017-01-30 15:58:47 +01:00
Greg Kurz
bd00fdf198 vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
The recently added mediated VFIO driver doesn't know about powerpc iommu.
It thus doesn't register a struct iommu_table_group in the iommu group
upon device creation. The iommu_data pointer hence remains null.

This causes a kernel oops when userspace tries to set the iommu type of a
container associated with a mediated device to VFIO_SPAPR_TCE_v2_IOMMU.

[   82.585440] mtty mtty: MDEV: Registered
[   87.655522] iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 10
[   87.655527] vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 10
[  116.297184] Unable to handle kernel paging request for data at address 0x00000030
[  116.297389] Faulting instruction address: 0xd000000007870524
[  116.297465] Oops: Kernel access of bad area, sig: 11 [#1]
[  116.297611] SMP NR_CPUS=2048
[  116.297611] NUMA
[  116.297627] PowerNV
...
[  116.297954] CPU: 33 PID: 7067 Comm: qemu-system-ppc Not tainted 4.10.0-rc5-mdev-test #8
[  116.297993] task: c000000e7718b680 task.stack: c000000e77214000
[  116.298025] NIP: d000000007870524 LR: d000000007870518 CTR: 0000000000000000
[  116.298064] REGS: c000000e77217990 TRAP: 0300   Not tainted  (4.10.0-rc5-mdev-test)
[  116.298103] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[  116.298107]   CR: 84004444  XER: 00000000
[  116.298154] CFAR: c00000000000888c DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1
               GPR00: d000000007870518 c000000e77217c10 d00000000787b0ed c000000eed2103c0
               GPR04: 0000000000000000 0000000000000000 c000000eed2103e0 0000000f24320000
               GPR08: 0000000000000104 0000000000000001 0000000000000000 d0000000078729b0
               GPR12: c00000000025b7e0 c00000000fe08400 0000000000000001 000001002d31d100
               GPR16: 000001002c22c850 00003ffff315c750 0000000043145680 0000000043141bc0
               GPR20: ffffffffffffffed fffffffffffff000 0000000020003b65 d000000007706018
               GPR24: c000000f16cf0d98 d000000007706000 c000000003f42980 c000000003f42980
               GPR28: c000000f1575ac00 c000000003f429c8 0000000000000000 c000000eed2103c0
[  116.298504] NIP [d000000007870524] tce_iommu_attach_group+0x10c/0x360 [vfio_iommu_spapr_tce]
[  116.298555] LR [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce]
[  116.298601] Call Trace:
[  116.298610] [c000000e77217c10] [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce] (unreliable)
[  116.298671] [c000000e77217cb0] [d0000000077033a0] vfio_fops_unl_ioctl+0x278/0x3e0 [vfio]
[  116.298713] [c000000e77217d40] [c0000000002a3ebc] do_vfs_ioctl+0xcc/0x8b0
[  116.298745] [c000000e77217de0] [c0000000002a4700] SyS_ioctl+0x60/0xc0
[  116.298782] [c000000e77217e30] [c00000000000b220] system_call+0x38/0xfc
[  116.298812] Instruction dump:
[  116.298828] 7d3f4b78 409effc8 3d220000 e9298020 3c800140 38a00018 608480c0 e8690028
[  116.298869] 4800249d e8410018 7c7f1b79 41820230 <e93e0030> 2fa90000 419e0114 e9090020
[  116.298914] ---[ end trace 1e10b0ced08b9120 ]---

This patch fixes the oops.

Reported-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-24 10:12:04 -07:00
Eric Auger
9d72f87bab vfio/type1: Check MSI remapping at irq domain level
In case the IOMMU translates MSI transactions (typical case
on ARM), we check MSI remapping capability at IRQ domain
level. Otherwise it is checked at IOMMU level.

At this stage the arm-smmu-(v3) still advertise the
IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
removed in subsequent patches.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:46 +00:00
Eric Auger
5d70499218 vfio/type1: Allow transparent MSI IOVA allocation
When attaching a group to the container, check the group's
reserved regions and test whether the IOMMU translates MSI
transactions. If yes, we initialize an IOVA allocator through
the iommu_get_msi_cookie API. This will allow the MSI IOVAs
to be transparently allocated on MSI controller's compose().

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:46 +00:00
Alex Williamson
94a6fa899d vfio/type1: Remove pid_namespace.h include
Using has_capability() rather than ns_capable(), we're no longer using
this header.

Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-13 08:23:33 -07:00
Jike Song
d1b333d12c vfio iommu type1: fix the testing of capability for remote task
Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised.  ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.

Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().

Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-12 16:05:35 -07:00
Arvind Yadav
e19f32da5d vfio-pci: Handle error from pci_iomap
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-04 08:34:39 -07:00
Arnd Bergmann
45e8697144 vfio-pci: use 32-bit comparisons for register address for gcc-4.5
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:

ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!

The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30 08:13:47 -07:00
Alex Williamson
99e3123e3d vfio-mdev: Make mdev_device private and abstract interfaces
Abstract access to mdev_device so that we can define which interfaces
are public rather than relying on comments in the structure.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30 08:13:44 -07:00
Alex Williamson
9372e6feaa vfio-mdev: Make mdev_parent private
Rather than hoping for good behavior by marking some elements
internal, enforce it by making the entire structure private and
creating an accessor function for the one useful external field.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30 08:13:41 -07:00
Alex Williamson
42930553a7 vfio-mdev: de-polute the namespace, rename parent_device & parent_ops
Add an mdev_ prefix so we're not poluting the namespace so much.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30 08:13:38 -07:00
Alex Williamson
49550787a9 vfio-mdev: Fix remove race
Using the mtty mdev sample driver we can generate a remove race by
starting one shell that continuously creates mtty devices and several
other shells all attempting to remove devices, in my case four remove
shells.  The fault occurs in mdev_remove_sysfs_files() where the
passed type arg is NULL, which suggests we've received a struct device
in mdev_device_remove() but it's in some sort of teardown state.  The
solution here is to make use of the accidentally unused list_head on
the mdev_device such that the mdev core keeps a list of all the mdev
devices.  This allows us to validate that we have a valid mdev before
we start removal, remove it from the list to prevent others from
working on it, and if the vendor driver refuses to remove, we can
re-add it to the list.

Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30 08:13:33 -07:00
Alex Williamson
6c38c055cc vfio/type1: Restore mapping performance with mdev support
As part of the mdev support, type1 now gets a task reference per
vfio_dma and uses that to get an mm reference for the task while
working on accounting.  That's correct, but it's not fast.  For some
paths, like vfio_pin_pages_remote(), we know we're only called from
user context, so we can restore the lighter weight calls.  In other
cases, we're effectively already testing whether we're in the stored
task context elsewhere, extend this vfio_lock_acct() as well.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30 08:13:31 -07:00
Linus Torvalds
de399813b5 powerpc updates for 4.10
Highlights include:
 
  - Support for the kexec_file_load() syscall, which is a prereq for secure and
    trusted boot.
 
  - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
 
  - Sort the exception tables at build time, to save time at boot, and store
    them as relative offsets to save space in the kernel image & memory.
 
  - Allow building the kernel with thin archives, which should allow us to build
    an allyesconfig once some other fixes land.
 
  - Build fixes to allow us to correctly rebuild when changing the kernel endian
    from big to little or vice versa.
 
  - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
 
  - Initial stack protector support (-fstack-protector).
 
  - Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
 
  - Fix an oops in cxl coredump generation when cxl_get_fd() is used.
 
  - Freescale updates from Scott: "Highlights include 8xx hugepage support,
    qbman fixes/cleanup, device tree updates, and some misc cleanup."
 
  - Many and varied fixes and minor enhancements as always.
 
 Thanks to:
   Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
   Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
   Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
   Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
   Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
   Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
   Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
 mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
 JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
 WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
 rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
 kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
 9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
 Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
 Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
 vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
 snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
 Z2W/m3gxafnOeGgBqvyv
 =xOzf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for the kexec_file_load() syscall, which is a prereq for
     secure and trusted boot.

   - Prevent kernel execution of userspace on P9 Radix (similar to
     SMEP/PXN).

   - Sort the exception tables at build time, to save time at boot, and
     store them as relative offsets to save space in the kernel image &
     memory.

   - Allow building the kernel with thin archives, which should allow us
     to build an allyesconfig once some other fixes land.

   - Build fixes to allow us to correctly rebuild when changing the
     kernel endian from big to little or vice versa.

   - Plumbing so that we can avoid doing a full mm TLB flush on P9
     Radix.

   - Initial stack protector support (-fstack-protector).

   - Support for dumping the radix (aka. Linux) and hash page tables via
     debugfs.

   - Fix an oops in cxl coredump generation when cxl_get_fd() is used.

   - Freescale updates from Scott: "Highlights include 8xx hugepage
     support, qbman fixes/cleanup, device tree updates, and some misc
     cleanup."

   - Many and varied fixes and minor enhancements as always.

  Thanks to:
    Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
    Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
    Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
    Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
    Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
    Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
    Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
    Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"

[ And thanks to Michael, who took time off from a new baby to get this
  pull request done.   - Linus ]

* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
  powerpc/fsl/dts: add FMan node for t1042d4rdb
  powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
  powerpc/fsl/dts: add QMan and BMan nodes on t1024
  powerpc/fsl/dts: add QMan and BMan nodes on t1023
  soc/fsl/qman: test: use DEFINE_SPINLOCK()
  powerpc/fsl-lbc: use DEFINE_SPINLOCK()
  powerpc/8xx: Implement support of hugepages
  powerpc: get hugetlbpage handling more generic
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc/boot: Request no dynamic linker for boot wrapper
  soc/fsl/bman: Use resource_size instead of computation
  soc/fsl/qe: use builtin_platform_driver
  powerpc/fsl_pmc: use builtin_platform_driver
  powerpc/83xx/suspend: use builtin_platform_driver
  powerpc/ftrace: Fix the comments for ftrace_modify_code
  powerpc/perf: macros for power9 format encoding
  powerpc/perf: power9 raw event format encoding
  powerpc/perf: update attribute_group data structure
  powerpc/perf: factor out the event format field
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  ...
2016-12-16 09:26:42 -08:00
Linus Torvalds
0ab7b12c49 pci-v4.10-changes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
 mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
 fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
 v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
 iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
 kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
 eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
 1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
 wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
 juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
 84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
 tcfJPEcz1019PKZNSvWy
 =ut0k
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes:

   - add support for PCI on ARM64 boxes with ACPI. We already had this
     for theoretical spec-compliant hardware; now we're adding quirks
     for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)

   - add runtime PM support for hotplug ports

   - enable runtime suspend for Intel UHCI that uses platform-specific
     wakeup signaling

   - add yet another host bridge registration interface. We hope this is
     extensible enough to subsume the others

   - expose device revision in sysfs for DRM

   - to avoid device conflicts, make sure any VF BAR updates are done
     before enabling the VF

   - avoid unnecessary link retrains for ASPM

   - allow INTx masking on Mellanox devices that support it

   - allow access to non-standard VPD for Chelsio devices

   - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
     etc

   - update Rockchip support for max-link-speed

   - add NVIDIA Tegra210 support

   - add Layerscape LS1046a support

   - update R-Car compatibility strings

   - add Qualcomm MSM8996 support

   - remove some uninformative bootup messages"

* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
  PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
  PCI: Expand "VPD access disabled" quirk message
  PCI: pciehp: Remove loading message
  PCI: hotplug: Remove hotplug core message
  PCI: Remove service driver load/unload messages
  PCI/AER: Log AER IRQ when claiming Root Port
  PCI/AER: Log errors with PCI device, not PCIe service device
  PCI/AER: Remove unused version macros
  PCI/PME: Log PME IRQ when claiming Root Port
  PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
  PCI: Move config space size macros to pci_regs.h
  x86/platform/intel-mid: Constify mid_pci_platform_pm
  PCI/ASPM: Don't retrain link if ASPM not possible
  PCI: iproc: Skip check for legacy IRQ on PAXC buses
  PCI: pciehp: Leave power indicator on when enabling already-enabled slot
  PCI: pciehp: Prioritize data-link event over presence detect
  PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
  PCI: rcar: Use gen2 fallback compatibility last
  PCI: rcar-gen2: Use gen2 fallback compatibility last
  PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
  ..
2016-12-15 12:46:48 -08:00
Lorenzo Stoakes
5b56d49fc3 mm: add locked parameter to get_user_pages_remote()
Patch series "mm: unexport __get_user_pages_unlocked()".

This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.

It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality.  This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.

Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)

This patch (of 2):

Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().

[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
  Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Wang Sheng-Hui
cc10385b6f PCI: Move config space size macros to pci_regs.h
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.

[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2016-12-12 10:05:22 -06:00
Kirti Wankhede
2b8bb1d771 vfio iommu type1: Fix size argument to vfio_find_dma() in pin_pages/unpin_pages
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges.
In functions pin_pages and unpin_pages, the iova which is being searched
is base address of page to be pinned or unpinned. So here size should be
set to PAGE_SIZE, as argument to vfio_find_dma().

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-06 12:35:53 -07:00
Kirti Wankhede
7c03f42846 vfio iommu type1: Fix size argument to vfio_find_dma() during DMA UNMAP.
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(),
while checking boundary conditions, size should be set to 1 for verifying
start address of unmap range.
vfio_find_dma() is also used to verify last address in unmap range with
size = 0, but in that case address to be searched is calculated with
start + size - 1 and so it works correctly.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
[aw: changelog tweak]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-06 12:28:04 -07:00
Kirti Wankhede
3cedd7d75f vfio iommu type1: WARN_ON if notifier block is not unregistered
mdev vendor driver should unregister the iommu notifier since the vfio
iommu can persist beyond the attachment of the mdev group. WARN_ON will
show warning if vendor driver doesn't unregister the notifier and is
forced to follow the implementations steps.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-05 16:04:33 -07:00
Alexey Kardashevskiy
4b6fad7097 powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.

This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:34 +11:00
Alexey Kardashevskiy
bc82d122ae vfio/spapr: Reference mm in tce_container
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.

DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.

This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:33 +11:00
Alexey Kardashevskiy
d9c728949d vfio/spapr: Postpone default window creation
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.

At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.

This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:32 +11:00
Alexey Kardashevskiy
6f01cc692a vfio/spapr: Add a helper to create default DMA window
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.

Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.

This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:31 +11:00
Alexey Kardashevskiy
39701e56f5 vfio/spapr: Postpone allocation of userspace version of TCE table
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.

As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.

This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:30 +11:00
Alexey Kardashevskiy
d7baee6901 powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:29 +11:00
Jike Song
ccd46dbae7 vfio: support notifier chain in vfio_group
Beyond vfio_iommu events, users might also be interested in
vfio_group events. For example, if a vfio_group is used along
with Qemu/KVM, whenever kvm pointer is set to/cleared from the
vfio_group, users could be notified.

Currently only VFIO_GROUP_NOTIFY_SET_KVM supported.

Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: remove use of new typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 10:40:05 -07:00
Jike Song
22195cbd34 vfio: vfio_register_notifier: classify iommu notifier
Currently vfio_register_notifier assumes that there is only one
notifier chain, which is in vfio_iommu. However, the user might
also be interested in events other than vfio_iommu, for example,
vfio_group. Refactor vfio_{un}register_notifier implementation
to make it feasible.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: merge with commit 816ca69ea9c7 ("vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'"), remove typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 09:38:47 -07:00
Christophe JAILLET
d256459fae vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'
'vfio_group_get_from_dev()' seems to return only NULL on error, not an
error pointer.

Fixes: 2169037dc3 ("vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops")
Fixes: c086de818d ("vfio iommu: Add blocking notifier to notify DMA_UNMAP")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 08:45:49 -07:00
Eric Auger
5ba6de98c7 vfio: fix vfio_info_cap_add/shift
Capability header next field is an offset relative to the start of
the INFO buffer. tmp->next is assigned the proper value but iterations
implemented in vfio_info_cap_add and vfio_info_cap_shift use next
as an offset between the headers. When coping with multiple capabilities
this leads to an Oops.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-21 11:51:53 -07:00
Cao jin
f4cb410019 vfio/pci: Drop unnecessary pcibios_err_to_errno()
As of commit d97ffe2368 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.

pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.

[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-18 11:06:42 -07:00
Kirti Wankhede
8e1c5a4048 docs: Add Documentation for Mediated devices
Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
a1e03e9bcc vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
Updated vfio_platform_common.c file to use
vfio_set_irqs_validate_and_prepare()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
ef198aaa16 vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
Updated vfio_pci.c file to use vfio_set_irqs_validate_and_prepare()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
c747f08aea vfio: Introduce vfio_set_irqs_validate_and_prepare()
Vendor driver using mediated device framework would use same mechnism to
validate and prepare IRQs. Introducing this function to reduce code
replication in multiple drivers.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
c535d34569 vfio_pci: Update vfio_pci to use vfio_info_add_capability()
Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
Update region type capability to use vfio_info_add_capability()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
b3c0a866f1 vfio: Introduce common function to add capabilities
Vendor driver using mediated device framework should use
vfio_info_add_capability() to add capabilities.
Introduced this function to reduce code duplication in vendor drivers.

vfio_info_cap_shift() manipulated a data buffer to add an offset to each
element in a chain. This data buffer is documented in a uapi header.
Changing vfio_info_cap_shift symbol to be available to all drivers.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede
c086de818d vfio iommu: Add blocking notifier to notify DMA_UNMAP
Added blocking notifier to IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs vfio_register_notifier() and vfio_unregister_notifier().
Notifier should be registered, if external user wants to use
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
Vendor driver should use VFIO_IOMMU_NOTIFY_DMA_UNMAP action to invalidate
mappings.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:07 -07:00
Kirti Wankhede
a54eb55045 vfio iommu type1: Add support for mediated devices
VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
Mediated device only uses IOMMU APIs, the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- Domain for external user is tracked separately in vfio_iommu structure.
  It is allocated when group for first mdev device is attached.
- Pages pinned for external domain are tracked in each vfio_dma structure
  for that iova range.
- Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
  rb-tree is iova, but it actually aims to track pfns.
- On external pin request for an iova, page is pinned once, if iova is
  already pinned and tracked, ref_count is incremented.
- External unpin request unpins pages only when ref_count is 0.
- Pinned pages list is used to find pfn from iova and then unpin it.
  WARN_ON is added if there are entires in pfn_list while detaching the
  group and releasing the domain.
- Page accounting is updated to account in its address space where the
  pages are pinned/unpinned, i.e dma->task
-  Accouting for mdev device is only done if there is no iommu capable
  domain in the container. When there is a direct device assigned to the
  container and that domain is iommu capable, all pages are already pinned
  during DMA_MAP.
- Page accouting is updated on hot plug and unplug mdev device and pass
  through device.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
  exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
  exist

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:11 -07:00
Kirti Wankhede
8f0d5bb95f vfio iommu type1: Add task structure to vfio_dma
Add task structure to vfio_dma structure. Task structure is used for:
- During DMA_UNMAP, same task who mapped it or other task who shares same
address space is allowed to unmap, otherwise unmap fails.
QEMU maps few iova ranges initially, then fork threads and from the child
thread calls DMA_UNMAP on previously mapped iova. Since child shares same
address space, DMA_UNMAP is successful.
- Avoid accessing struct mm while process is exiting by acquiring
reference of task's mm during page accounting.
- It is also used to get task mlock capability and rlimit for mlock.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:09 -07:00
Kirti Wankhede
7896c998f0 vfio iommu type1: Add find_iommu_group() function
Add find_iommu_group()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:06 -07:00
Kirti Wankhede
ea85cf353e vfio iommu type1: Update argument of vaddr_get_pfn()
Update arguments of vaddr_get_pfn() to take struct mm_struct *mm as input
argument.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:03 -07:00
Kirti Wankhede
3624a2486c vfio iommu type1: Update arguments of vfio_lock_acct
Added task structure as input argument to vfio_lock_acct() function.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:01 -07:00
Kirti Wankhede
2169037dc3 vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops
Added APIs for pining and unpining set of pages. These call back into
backend iommu module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. Backend
IOMMU module that supports pining and unpinning pages for mdev devices
should provide these functions.

Renamed static functions in vfio_type1_iommu.c to resolve conflicts

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:58 -07:00
Kirti Wankhede
32f55d835b vfio: Common function to increment container_users
This change rearrange functions to have common function to increment
container_users

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:55 -07:00
Kirti Wankhede
7ed3ea8a71 vfio: Rearrange functions to get vfio_group from dev
This patch rearranges functions to get vfio_group from device

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:52 -07:00
Kirti Wankhede
fa3da00cb8 vfio: VFIO based driver for Mediated devices
vfio_mdev driver registers with mdev core driver.
mdev core driver creates mediated device and calls probe routine of
vfio_mdev driver for each device.
Probe routine of vfio_mdev driver adds mediated device to VFIO core module

This driver forms a shim layer that pass through VFIO devices operations
to vendor driver for mediated devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:50 -07:00
Kirti Wankhede
7b96953bc6 vfio: Mediated device Core driver
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

Mediated bus driver for mdev device should use this interface to register
and unregister with core driver respectively:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

Mediated bus driver is responsible to add/delete mediated devices to/from
VFIO group when devices are bound and unbound to the driver.

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in its driver. APIs are :

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported type. This is
			 mandatory field.
* create: to allocate basic resources in vendor driver for a mediated
         device. This is mandatory to be provided by vendor driver.
* remove: to free resources in vendor driver when mediated device is
         destroyed. This is mandatory to be provided by vendor driver.
* open: open callback of mediated device
* release: release callback of mediated device
* read : read emulation callback.
* write: write emulation callback.
* ioctl: ioctl callback.
* mmap: mmap emulation callback.

Drivers should use these interfaces to register and unregister device to
mdev core driver respectively:

extern int  mdev_register_device(struct device *dev,
                                 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks to serialize above callbacks in mdev driver and
vfio_mdev driver. If required, vendor driver can have locks to serialize
above APIs in their driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:48 -07:00
Vlad Tsyrklevich
05692d7005 vfio/pci: Fix integer overflows, bitmask check
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().

Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.

Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-10-26 13:49:29 -06:00
Christoph Hellwig
61771468e0 vfio_pci: use pci_alloc_irq_vectors
Simplify the interrupt setup by using the new PCI layer helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-29 13:36:38 -06:00
Alex Williamson
c93a97ee05 vfio-pci: Disable INTx after MSI/X teardown
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-26 13:52:19 -06:00
Alex Williamson
ddf9dc0eb5 vfio-pci: Virtualize PCIe & AF FLR
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state.  This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR.  Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit.  We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup.  We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.

This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
2016-09-26 13:52:16 -06:00
Baoyou Xie
2e06285655 vfio: platform: mark symbols static where possible
We get a few warnings when building kernel with W=1:
drivers/vfio/platform/vfio_platform_common.c:76:5: warning: no previous prototype for 'vfio_platform_acpi_call_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:98:6: warning: no previous prototype for 'vfio_platform_acpi_has_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:640:5: warning: no previous prototype for 'vfio_platform_of_probe' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_amdxgbe.c:59:5: warning: no previous prototype for 'vfio_platform_amdxgbe_reset' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c:60:5: warning: no previous prototype for 'vfio_platform_calxedaxgmac_reset' [-Wmissing-prototypes]
....

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.

Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-13 16:11:37 -06:00
Wei Jiangang
8138dabbab vfio/pci: Fix typos in comments
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-08-29 12:39:09 -06:00
Alex Williamson
c8952a7075 vfio/pci: Fix NULL pointer oops in error interrupt setup handling
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type.  The other data types assume the data pointer contains
one element of their type as well.

In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users.  Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely.  For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
2016-08-08 16:16:23 -06:00
Sinan Kaya
0991bbdbf5 vfio: platform: check reset call return code during release
Release call is ignoring the return code from reset call and can
potentially continue even though reset call failed.

If reset_required module parameter is set, this patch is going
to validate the return code and will cause stack dump with
WARN_ON and warn the user of failure.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:54:45 -06:00
Sinan Kaya
e99442323f vfio: platform: check reset call return code during open
Open call is ignoring the return code from reset call and can
potentially continue even though reset call failed.

If reset_required module parameter is set, this patch is going
to validate the return code and will abort open if reset fails.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:54:45 -06:00
Sinan Kaya
b5add544d6 vfio, platform: make reset driver a requirement by default
The code was allowing platform devices to be used without a supporting
VFIO reset driver. The hardware can be left in some inconsistent state
after a guest machine abort.

The reset driver will put the hardware back to safe state and disable
interrupts before returning the control back to the host machine.

Adding a new reset_required kernel module option to platform VFIO drivers.
The default value is true for the DT and ACPI based drivers.
The reset requirement value for AMBA drivers is set to false and is
unchangeable to maintain the existing functionality.

New requirements are:
1. A reset function needs to be implemented by the corresponding driver
via DT/ACPI.
2. The reset function needs to be discovered via DT/ACPI.

The probe of the driver will fail if any of the above conditions are
not satisfied.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:46 -06:00
Sinan Kaya
d30daa33ec vfio: platform: call _RST method when using ACPI
The device tree code checks for the presence of a reset driver and calls
the of_reset function pointer by looking up the reset driver as a module.

ACPI defines _RST method to perform device level reset. After the _RST
method is executed, the OS can resume using the device. _RST method is
expected to stop DMA transfers and IRQs.

This patch introduces two functions as vfio_platform_acpi_has_reset and
vfio_platform_acpi_call_reset. The has reset method is used to declare
reset capability via the ioctl flag VFIO_DEVICE_FLAGS_RESET. The call
reset function is used to execute the _RST ACPI method.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:44 -06:00
Sinan Kaya
5afec27474 vfio: platform: add extra debug info argument to call reset
Getting ready to bring out extra debug information to the caller
so that more verbose information can be printed when an error is
observed.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:43 -06:00
Sinan Kaya
a12a9368e1 vfio: platform: add support for ACPI probe
The code is using the compatible DT string to associate a reset driver
with the actual device itself. The compatible string does not exist on
ACPI based systems. HID is the unique identifier for a device driver
instead.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:41 -06:00
Sinan Kaya
dc5542fb11 vfio: platform: determine reset capability
Creating a new function to determine if this driver supports reset
function or not. This is an attempt to abstract device tree calls
from the rest of the code.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:40 -06:00
Sinan Kaya
f084aa7495 vfio: platform: move reset call to a common function
The reset call sequence seems to replicate itself multiple times
across the file. Grouping them together for maintenance reasons.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:38 -06:00