[RFC 0/2] VFIO SRIOV support

* [RFC 0/2] VFIO SRIOV support
@ 2015-12-22 13:42 Ilya Lesokhin
  2015-12-22 13:42 ` [RFC 1/2] PCI: Expose iov_set_numvfs and iov_resource_size for modules Ilya Lesokhin
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ilya Lesokhin @ 2015-12-22 13:42 UTC (permalink / raw)
  To: kvm, linux-pci
  Cc: bhelgaas, alex.williamson, noaos, haggaie, ogerlitz, liranl, ilyal

Today the QEMU hypervisor allows assigning a physical device to a VM,
facilitating driver development. However, it does not support enabling
SR-IOV by the VM kernel driver. Our goal is to implement such support,
allowing developers working on SR-IOV physical function drivers to work
inside VMs as well.

This patch series implements the kernel side of our solution.  It extends
the VFIO driver to support the PCIE SRIOV extended capability with
following features:
1. The ability to probe SRIOV BAR sizes.
2. The ability to enable and disable sriov.

This patch series is going to be used by QEMU to expose sriov capabilities
to VM. We already have an early prototype based on Knut Omang's patches for
SRIOV[1]. 

Open issues:
1. Binding the new VFs to VFIO driver.
Once the VM enables sriov it expects the new VFs to appear inside the VM.
To this end we need to bind the new vfs to the VFIO driver and have QEMU
grab them. We are currently achieve this goal using:
echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
but we are not happy about this solution as a system might have another
device with the same id that is unrelated to our VM.
Other solution we've considered are:
 a. Having user space unbind and then bind the VFs to VFIO.
     Typically resulting in an unnecessary probing of the device.
 b. Adding a driver argument to pci_enable_sriov(...) and have
    vfio call pci_enable_sriov with the vfio driver as argument.
    This solution avoids the unnecessary but is more intrusive.

2. How to tell if it is safe to disable SRIOV?
In the current implementation, a userspace can enable sriov, grab one of
the VFs and then call disable sriov without releasing the device.  This
will result in a deadlock where the user process is stuck inside disable
sriov waiting for itself to release the device. Killing the process leaves
it in a zombie state.
We also get a strange warning saying:
[  181.668492] WARNING: CPU: 22 PID: 3684 at kernel/sched/core.c:7497 __might_sleep+0x77/0x80() 
[  181.668502] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810aa193>] prepare_to_wait_event+0x63/0xf0

3. How to expose the Supported Page Sizes and System Page Size registers in
the SRIOV capability? 
Presently the hypervisor initializes Supported Page Sizes once and assumes
it doesn't change therefore we cannot allow user space to change this
register at will. The first solution that comes to mind is to expose a
device that only supports the page size selected by the hypervisor.
Unfourtently, Per SR-IOV spec section 3.3.12, PFs are required to support
4-KB, 8-KB, 64-KB, 256-KB, 1-MB, and 4-MB page sizes. We currently map both
registers as virtualized and read only and leave user space to worry about
this problem.

4. Other SRIOV capabilities.
Do we want to hide capabilities we do not support in the SR-IOV
Capabilities register? or leave it to the userspace application?

[1] https://github.com/knuto/qemu/tree/sriov_patches_v6

Ilya Lesokhin (2):
  PCI: Expose iov_set_numvfs and iov_resource_size for modules.
  VFIO: Add support for SRIOV extended capablity

 drivers/pci/iov.c                  |   4 +-
 drivers/vfio/pci/vfio_pci_config.c | 169 +++++++++++++++++++++++++++++++++----
 include/linux/pci.h                |   4 +
 3 files changed, 159 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 10+ messages in thread