linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] VFIO: capability chains
@ 2015-11-23 20:43 Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 1/3] vfio: Define " Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Alex Williamson @ 2015-11-23 20:43 UTC (permalink / raw)
  To: alex.williamson; +Cc: linux-kernel, kvm

Please see the commit log and comments in patch 1 for a general
explanation of the problems that this series tries to address.  The
general problem is that we have several cases where we want to expose
variable sized information to the user, whether it's sparse mmaps for
a region, as implemented here, or DMA mapping ranges of an IOMMU, or
reserved MSI mapping ranges, etc.  Extending data structures is hard;
extending them to report variable sized data is really hard.  After
considering several options, I think the best approach is to copy how
PCI does capabilities.  This allows the ioctl to only expose the
capabilities that are relevant for them, avoids data structures that
are too complicated to parse, and avoids creating a new ioctl each
time we think of something else that we'd like to report.  This method
also doesn't preclude extensions to the fixed structure since the
offset of these capabilities is entirely dynamic.

Comments welcome, I'll also follow-up to the QEMU and KVM lists with
an RFC making use of this for mmaps skipping over the MSI-X table.
Thanks,

Alex

---

Alex Williamson (3):
      vfio: Define capability chains
      vfio: Define sparse mmap capability for regions
      vfio/pci: Include sparse mmap capability for MSI-X table regions


 drivers/vfio/pci/vfio_pci.c |  101 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h   |   53 ++++++++++++++++++++++-
 2 files changed, 152 insertions(+), 2 deletions(-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/3] vfio: Define capability chains
  2015-11-23 20:43 [RFC PATCH 0/3] VFIO: capability chains Alex Williamson
@ 2015-11-23 20:43 ` Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 2/3] vfio: Define sparse mmap capability for regions Alex Williamson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2015-11-23 20:43 UTC (permalink / raw)
  To: alex.williamson; +Cc: linux-kernel, kvm

We have a few cases where we need to extend the data returned from the
INFO ioctls in VFIO.  For instance we already have devices exposed
through vfio-pci where VFIO_DEVICE_GET_REGION_INFO reports the region
as mmap-capable, but really only supports sparse mmaps, avoiding the
MSI-X table.  If we wanted to provide in-kernel emulation or extended
functionality for devices, we'd also want the ability to tell the
user not to mmap various regions, rather than forcing them to figure
it out on their own.

Another example is VFIO_IOMMU_GET_INFO.  We'd really like to expose
the actual IOVA capabilities of the IOMMU rather than letting the
user assume the address space they have available to them.  We could
add IOVA base and size fields to struct vfio_iommu_type1_info, but
what if we have multiple IOVA ranges.  For instance x86 uses a range
of addresses at 0xfee00000 for MSI vectors.  These typically are not
available for standard DMA IOVA mappings and splits our available IOVA
space into two ranges.  POWER systems have both an IOVA window below
4G as well as dynamic data window which they can use to remap all of
guest memory.

Representing variable sized arrays within a fixed structure makes it
very difficult to parse, we'd therefore like to put this data beyond
fixed fields within the data structures.  One way to do this is to
emulate capabilities in PCI configuration space.  A new flag indciates
whether capabilties are supported and a new fixed field reports the
offset of the first entry.  Users can then walk the chain to find
capabilities, adding capabilities does not require additional fields
in the fixed structure, and parsing variable sized data becomes
trivial.

This patch outlines the theory and base header structure, which
should be shared by all future users.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/uapi/linux/vfio.h |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 751b69f..432569f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -59,6 +59,33 @@
 #define VFIO_TYPE	(';')
 #define VFIO_BASE	100
 
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities.  A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set.  This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id.  The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+	__u16	id;		/* Identifies capability */
+	__u16	version;	/* Version specific to the capability ID */
+	__u32	next;		/* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size.  For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
 /* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
 
 /**


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 2/3] vfio: Define sparse mmap capability for regions
  2015-11-23 20:43 [RFC PATCH 0/3] VFIO: capability chains Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 1/3] vfio: Define " Alex Williamson
@ 2015-11-23 20:43 ` Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 3/3] vfio/pci: Include sparse mmap capability for MSI-X table regions Alex Williamson
  2015-12-18  2:05 ` [RFC PATCH 0/3] VFIO: capability chains Alexey Kardashevskiy
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2015-11-23 20:43 UTC (permalink / raw)
  To: alex.williamson; +Cc: linux-kernel, kvm

We can't always support mmap across an entire device region, for
example we deny mmaps covering the MSI-X table of PCI devices, but
we don't really have a way to report it.  We expect the user to
implicitly know this restriction.  We also can't split the region
because vfio-pci defines an API with fixed region index to BAR
number mapping.  We therefore define a new capability which lists
areas within the region that may be mmap'd.  In addition to the
MSI-X case, this potentially enables in-kernel emulation and
extensions to devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/uapi/linux/vfio.h |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 432569f..d3f6499 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -221,13 +221,37 @@ struct vfio_region_info {
 #define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
 #define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
 #define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS	(1 << 3) /* Info supports caps */
 	__u32	index;		/* Region index */
-	__u32	resv;		/* Reserved for alignment */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
 	__u64	size;		/* Region size (bytes) */
 	__u64	offset;		/* Region offset from start of device fd */
 };
 #define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
 
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support.  When specified, the user should only
+ * mmap the offset ranges specified by the areas array.  mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP	1
+
+struct vfio_region_sparse_mmap_area {
+	__u64	offset;	/* Offset of mmap'able area within region */
+	__u64	size;	/* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+	struct vfio_info_cap_header header;
+	__u32	nr_areas;
+	__u32	reserved;
+	struct vfio_region_sparse_mmap_area areas[];
+};
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  *				    struct vfio_irq_info)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 3/3] vfio/pci: Include sparse mmap capability for MSI-X table regions
  2015-11-23 20:43 [RFC PATCH 0/3] VFIO: capability chains Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 1/3] vfio: Define " Alex Williamson
  2015-11-23 20:43 ` [RFC PATCH 2/3] vfio: Define sparse mmap capability for regions Alex Williamson
@ 2015-11-23 20:43 ` Alex Williamson
  2015-12-18  2:05 ` [RFC PATCH 0/3] VFIO: capability chains Alexey Kardashevskiy
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2015-11-23 20:43 UTC (permalink / raw)
  To: alex.williamson; +Cc: linux-kernel, kvm

vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this.  Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c |  101 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 32b88bd..46e7aed 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -421,6 +421,77 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
 	return walk.ret;
 }
 
+struct caps {
+	struct vfio_info_cap_header *buf;
+	size_t size;
+	size_t head;
+};
+
+static void *add_region_info_cap(struct caps *caps,
+				 size_t size, u16 id, u16 version)
+{
+	void *tmp;
+	struct vfio_info_cap_header *header;
+
+	/* This would be ridiculous and exceeds the ioctl's abilities */
+	BUG_ON(caps->size + size + sizeof(struct vfio_region_info) > U32_MAX);
+
+	tmp = krealloc(caps->buf, caps->size + size, GFP_KERNEL);
+	if (!tmp) {
+		kfree(caps->buf);
+		caps->size = 0;
+		return ERR_PTR(-ENOMEM);
+	}
+
+	caps->buf = tmp;
+	header = tmp + caps->size;
+	header->id = id;
+	header->version = version;
+	header->next = caps->head;
+	caps->head = caps->size + sizeof(struct vfio_region_info);
+	caps->size += size;
+
+	return header;
+}
+
+static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev, struct caps *caps)
+{
+	struct vfio_region_info_cap_sparse_mmap *sparse;
+	size_t end, size;
+	int nr_areas = 2, i = 0;
+
+	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
+
+	/* If MSI-X table is aligned to the start or end, only one area */
+	if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+	    (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
+		nr_areas = 1;
+
+	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+	sparse = add_region_info_cap(caps, size,
+				     VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+	if (IS_ERR(sparse))
+		return PTR_ERR(sparse);
+
+	sparse->nr_areas = nr_areas;
+
+	if (vdev->msix_offset & PAGE_MASK) {
+		sparse->areas[i].offset = 0;
+		sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
+		i++;
+	}
+
+	if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) {
+		sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset +
+						     vdev->msix_size);
+		sparse->areas[i].size = end - sparse->areas[i].offset;
+		i++;
+	}
+
+	return 0;
+}
+
 static long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
@@ -451,6 +522,8 @@ static long vfio_pci_ioctl(void *device_data,
 	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
 		struct pci_dev *pdev = vdev->pdev;
 		struct vfio_region_info info;
+		struct caps caps = { .buf = NULL, .size = 0, .head = 0 };
+		int ret;
 
 		minsz = offsetofend(struct vfio_region_info, offset);
 
@@ -479,8 +552,15 @@ static long vfio_pci_ioctl(void *device_data,
 				     VFIO_REGION_INFO_FLAG_WRITE;
 			if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
 			    pci_resource_flags(pdev, info.index) &
-			    IORESOURCE_MEM && info.size >= PAGE_SIZE)
+			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+				if (info.index == vdev->msix_bar) {
+					ret = msix_sparse_mmap_cap(vdev, &caps);
+					if (ret)
+						return ret;
+				}
+			}
+
 			break;
 		case VFIO_PCI_ROM_REGION_INDEX:
 		{
@@ -520,6 +600,25 @@ static long vfio_pci_ioctl(void *device_data,
 			return -EINVAL;
 		}
 
+		if (caps.size) {
+			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+			if (info.argsz < sizeof(info) + caps.size) {
+				info.argsz = sizeof(info) + caps.size;
+				info.cap_offset = 0;
+			} else {
+				ret = copy_to_user((void __user *)arg +
+						   sizeof(info), caps.buf,
+						   caps.size);
+				if (ret) {
+					kfree(caps.buf);
+					return ret;
+				}
+				info.cap_offset = caps.head;
+			}
+
+			kfree(caps.buf);
+		}
+
 		return copy_to_user((void __user *)arg, &info, minsz);
 
 	} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/3] VFIO: capability chains
  2015-11-23 20:43 [RFC PATCH 0/3] VFIO: capability chains Alex Williamson
                   ` (2 preceding siblings ...)
  2015-11-23 20:43 ` [RFC PATCH 3/3] vfio/pci: Include sparse mmap capability for MSI-X table regions Alex Williamson
@ 2015-12-18  2:05 ` Alexey Kardashevskiy
  2015-12-18  2:38   ` Alex Williamson
  3 siblings, 1 reply; 7+ messages in thread
From: Alexey Kardashevskiy @ 2015-12-18  2:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-kernel, kvm

On 11/24/2015 07:43 AM, Alex Williamson wrote:
> Please see the commit log and comments in patch 1 for a general
> explanation of the problems that this series tries to address.  The
> general problem is that we have several cases where we want to expose
> variable sized information to the user, whether it's sparse mmaps for
> a region, as implemented here, or DMA mapping ranges of an IOMMU, or
> reserved MSI mapping ranges, etc.  Extending data structures is hard;
> extending them to report variable sized data is really hard.  After
> considering several options, I think the best approach is to copy how
> PCI does capabilities.  This allows the ioctl to only expose the
> capabilities that are relevant for them, avoids data structures that
> are too complicated to parse, and avoids creating a new ioctl each
> time we think of something else that we'd like to report.  This method
> also doesn't preclude extensions to the fixed structure since the
> offset of these capabilities is entirely dynamic.
>
> Comments welcome, I'll also follow-up to the QEMU and KVM lists with
> an RFC making use of this for mmaps skipping over the MSI-X table.
> Thanks,

Out of curiosity - could this information be exposed to the userspace via 
/sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change after 
vfio_pci driver is bound to a device.




> Alex
>
> ---
>
> Alex Williamson (3):
>        vfio: Define capability chains
>        vfio: Define sparse mmap capability for regions
>        vfio/pci: Include sparse mmap capability for MSI-X table regions
>
>
>   drivers/vfio/pci/vfio_pci.c |  101 +++++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/vfio.h   |   53 ++++++++++++++++++++++-
>   2 files changed, 152 insertions(+), 2 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/3] VFIO: capability chains
  2015-12-18  2:05 ` [RFC PATCH 0/3] VFIO: capability chains Alexey Kardashevskiy
@ 2015-12-18  2:38   ` Alex Williamson
  2015-12-18  4:34     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2015-12-18  2:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linux-kernel, kvm

On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
> On 11/24/2015 07:43 AM, Alex Williamson wrote:
> > Please see the commit log and comments in patch 1 for a general
> > explanation of the problems that this series tries to address.  The
> > general problem is that we have several cases where we want to
> > expose
> > variable sized information to the user, whether it's sparse mmaps
> > for
> > a region, as implemented here, or DMA mapping ranges of an IOMMU,
> > or
> > reserved MSI mapping ranges, etc.  Extending data structures is
> > hard;
> > extending them to report variable sized data is really hard.  After
> > considering several options, I think the best approach is to copy
> > how
> > PCI does capabilities.  This allows the ioctl to only expose the
> > capabilities that are relevant for them, avoids data structures
> > that
> > are too complicated to parse, and avoids creating a new ioctl each
> > time we think of something else that we'd like to report.  This
> > method
> > also doesn't preclude extensions to the fixed structure since the
> > offset of these capabilities is entirely dynamic.
> > 
> > Comments welcome, I'll also follow-up to the QEMU and KVM lists
> > with
> > an RFC making use of this for mmaps skipping over the MSI-X table.
> > Thanks,
> 
> Out of curiosity - could this information be exposed to the userspace
> via 
> /sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change
> after 
> vfio_pci driver is bound to a device.

For what purpose?  vfio doesn't have a sysfs interface, why start one? 
Thanks,

Alex

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/3] VFIO: capability chains
  2015-12-18  2:38   ` Alex Williamson
@ 2015-12-18  4:34     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 7+ messages in thread
From: Alexey Kardashevskiy @ 2015-12-18  4:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-kernel, kvm

On 12/18/2015 01:38 PM, Alex Williamson wrote:
> On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
>> On 11/24/2015 07:43 AM, Alex Williamson wrote:
>>> Please see the commit log and comments in patch 1 for a general
>>> explanation of the problems that this series tries to address.  The
>>> general problem is that we have several cases where we want to
>>> expose
>>> variable sized information to the user, whether it's sparse mmaps
>>> for
>>> a region, as implemented here, or DMA mapping ranges of an IOMMU,
>>> or
>>> reserved MSI mapping ranges, etc.  Extending data structures is
>>> hard;
>>> extending them to report variable sized data is really hard.  After
>>> considering several options, I think the best approach is to copy
>>> how
>>> PCI does capabilities.  This allows the ioctl to only expose the
>>> capabilities that are relevant for them, avoids data structures
>>> that
>>> are too complicated to parse, and avoids creating a new ioctl each
>>> time we think of something else that we'd like to report.  This
>>> method
>>> also doesn't preclude extensions to the fixed structure since the
>>> offset of these capabilities is entirely dynamic.
>>>
>>> Comments welcome, I'll also follow-up to the QEMU and KVM lists
>>> with
>>> an RFC making use of this for mmaps skipping over the MSI-X table.
>>> Thanks,
>>
>> Out of curiosity - could this information be exposed to the userspace
>> via
>> /sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change
>> after
>> vfio_pci driver is bound to a device.
>
> For what purpose?  vfio doesn't have a sysfs interface, why start one?
> Thanks,

well, it could simplify debugging a bit if this information was available 
from the userspace without programming a test tool doing some ioctl()'s. 
Not a big deal though...



-- 
Alexey

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-12-18  4:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23 20:43 [RFC PATCH 0/3] VFIO: capability chains Alex Williamson
2015-11-23 20:43 ` [RFC PATCH 1/3] vfio: Define " Alex Williamson
2015-11-23 20:43 ` [RFC PATCH 2/3] vfio: Define sparse mmap capability for regions Alex Williamson
2015-11-23 20:43 ` [RFC PATCH 3/3] vfio/pci: Include sparse mmap capability for MSI-X table regions Alex Williamson
2015-12-18  2:05 ` [RFC PATCH 0/3] VFIO: capability chains Alexey Kardashevskiy
2015-12-18  2:38   ` Alex Williamson
2015-12-18  4:34     ` Alexey Kardashevskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).