kvm.vger.kernel.org archive mirror
* [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
@ 2021-01-19 20:02 Matthew Rosato
  2021-01-19 20:02 ` [PATCH 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:02 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Today, ISM devices are completely disallowed for vfio-pci passthrough as
QEMU will reject the device due to an (inappropriate) MSI-X check.
However, in an effort to enable ISM device passthrough, I realized that the
manner in which ISM performs block write operations is highly incompatible
with the way that QEMU s390 PCI instruction interception and
vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
devices have particular requirements in regards to the alignment, size and
order of writes performed.  Furthermore, they require that legacy/non-MIO
s390 PCI instructions are used, which is also not guaranteed when the I/O
is passed through the typical userspace channels.

As a result, this patchset proposes a new VFIO region to allow a guest to
pass certain PCI instruction intercepts directly to the s390 host kernel
PCI layer for execution, pinning the guest buffer in memory briefly in
order to execute the requested PCI instruction.

Changes from RFC -> v1:
- No functional changes, just minor commentary changes -- Re-posting along
with updated QEMU set.

Matthew Rosato (4):
  s390/pci: track alignment/length strictness for zpci_dev
  vfio-pci/zdev: Pass the relaxed alignment flag
  s390/pci: Get hardware-reported max store block length
  vfio-pci/zdev: Introduce the zPCI I/O vfio region

 arch/s390/include/asm/pci.h         |   4 +-
 arch/s390/include/asm/pci_clp.h     |   7 +-
 arch/s390/pci/pci_clp.c             |   2 +
 drivers/vfio/pci/vfio_pci.c         |   8 ++
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           |   4 +
 include/uapi/linux/vfio_zdev.h      |  34 ++++++++
 8 files changed, 222 insertions(+), 3 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] s390/pci: track alignment/length strictness for zpci_dev
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
@ 2021-01-19 20:02 ` Matthew Rosato
  2021-01-19 20:02 ` [PATCH 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:02 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Some zpci device types (e.g., ISM) follow different rules for length
and alignment of pci instructions.  Recognize this and keep track of
it in the zpci_dev.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 arch/s390/include/asm/pci.h     | 3 ++-
 arch/s390/include/asm/pci_clp.h | 4 +++-
 arch/s390/pci/pci_clp.c         | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 2126289..f16ffba 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -133,7 +133,8 @@ struct zpci_dev {
 	u8		has_hp_slot	: 1;
 	u8		is_physfn	: 1;
 	u8		util_str_avail	: 1;
-	u8		reserved	: 4;
+	u8		relaxed_align	: 1;
+	u8		reserved	: 3;
 	unsigned int	devfn;		/* DEVFN part of the RID*/
 
 	struct mutex lock;
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 1f4b666..9fb7cbf 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -150,7 +150,9 @@ struct clp_rsp_query_pci_grp {
 	u16			:  4;
 	u16 noi			: 12;	/* number of interrupts */
 	u8 version;
-	u8			:  6;
+	u8			:  4;
+	u8 relaxed_align	:  1;	/* Relax length and alignment rules */
+	u8			:  1;
 	u8 frame		:  1;
 	u8 refresh		:  1;	/* TLB refresh mode */
 	u16 reserved2;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index 153720d..630f8fc 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -103,6 +103,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
 	zdev->max_msi = response->noi;
 	zdev->fmb_update = response->mui;
 	zdev->version = response->version;
+	zdev->relaxed_align = response->relaxed_align;
 
 	switch (response->version) {
 	case 1:
-- 
1.8.3.1



* [PATCH 2/4] vfio-pci/zdev: Pass the relaxed alignment flag
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2021-01-19 20:02 ` [PATCH 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
@ 2021-01-19 20:02 ` Matthew Rosato
  2021-01-19 20:02 ` [PATCH 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:02 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Use an additional bit of the VFIO_DEVICE_INFO_CAP_ZPCI_GROUP flags
field to pass whether or not the associated device supports relaxed
length and alignment for some I/O operations.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_zdev.c | 2 ++
 include/uapi/linux/vfio_zdev.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2296856..57e19ff 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -60,6 +60,8 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 		.version = zdev->version
 	};
 
+	if (zdev->relaxed_align)
+		cap.flags |= VFIO_DEVICE_INFO_ZPCI_FLAG_RELAXED;
 	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
 }
 
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b430939..b0b6596 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -43,6 +43,7 @@ struct vfio_device_info_cap_zpci_group {
 	__u64 msi_addr;		/* MSI address */
 	__u64 flags;
 #define VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH 1 /* Program-specified TLB refresh */
+#define VFIO_DEVICE_INFO_ZPCI_FLAG_RELAXED 2 /* Relaxed Length and Alignment */
 	__u16 mui;		/* Measurement Block Update Interval */
 	__u16 noi;		/* Maximum number of MSIs */
 	__u16 maxstbl;		/* Maximum Store Block Length */
-- 
1.8.3.1



* [PATCH 3/4] s390/pci: Get hardware-reported max store block length
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2021-01-19 20:02 ` [PATCH 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
  2021-01-19 20:02 ` [PATCH 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
@ 2021-01-19 20:02 ` Matthew Rosato
  2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:02 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

We'll need to know this information for vfio passthrough.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 arch/s390/include/asm/pci.h     | 1 +
 arch/s390/include/asm/pci_clp.h | 3 ++-
 arch/s390/pci/pci_clp.c         | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index f16ffba..04f1f48 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -181,6 +181,7 @@ struct zpci_dev {
 	atomic64_t mapped_pages;
 	atomic64_t unmapped_pages;
 
+	u16		maxstbl;
 	u8		version;
 	enum pci_bus_speed max_bus_speed;
 
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 9fb7cbf..ddd1604 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -155,7 +155,8 @@ struct clp_rsp_query_pci_grp {
 	u8			:  1;
 	u8 frame		:  1;
 	u8 refresh		:  1;	/* TLB refresh mode */
-	u16 reserved2;
+	u16			:  3;
+	u16 maxstbl		: 13;
 	u16 mui;
 	u16			: 16;
 	u16 maxfaal;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index 630f8fc..99b64fe 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -104,6 +104,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
 	zdev->fmb_update = response->mui;
 	zdev->version = response->version;
 	zdev->relaxed_align = response->relaxed_align;
+	zdev->maxstbl = response->maxstbl;
 
 	switch (response->version) {
 	case 1:
-- 
1.8.3.1



* [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (2 preceding siblings ...)
  2021-01-19 20:02 ` [PATCH 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
@ 2021-01-19 20:02 ` Matthew Rosato
  2021-01-20 13:21   ` Niklas Schnelle
                     ` (2 more replies)
  2021-01-19 20:50 ` [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2021-01-20  9:02 ` Pierre Morel
  5 siblings, 3 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:02 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
specific requirements in terms of alignment as well as the patterns in
which the data is read/written. Allowing these to proceed through the
typical vfio_pci_bar_rw path will cause them to be broken up in such a
way that these requirements can't be guaranteed. In addition, ISM devices
do not support the MIO codepaths that might be triggered on vfio I/O coming
from userspace; we must be able to ensure that these devices use the
non-MIO instructions.  To facilitate this, provide a new vfio region by
which non-MIO instructions can be passed directly to the host kernel s390
PCI layer, to be reliably issued as non-MIO instructions.

This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
and implements the ability to pass PCISTB and PCILG instructions over it,
as these are what is required for ISM devices.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c         |   8 ++
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 drivers/vfio/pci/vfio_pci_zdev.c    | 158 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           |   4 +
 include/uapi/linux/vfio_zdev.h      |  33 ++++++++
 5 files changed, 209 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 706de3e..e1c156e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -407,6 +407,14 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+		ret = vfio_pci_zdev_io_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup zPCI I/O region\n");
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e56..bc49980 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -217,12 +217,18 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 #ifdef CONFIG_VFIO_PCI_ZDEV
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
+extern int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 					      struct vfio_info_cap *caps)
 {
 	return -ENODEV;
 }
+
+static inline int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
 #endif
 
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 57e19ff..a962043 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -18,6 +18,7 @@
 #include <linux/vfio_zdev.h>
 #include <asm/pci_clp.h>
 #include <asm/pci_io.h>
+#include <asm/pci_insn.h>
 
 #include "vfio_pci_private.h"
 
@@ -143,3 +144,160 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 
 	return ret;
 }
+
+static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_region_zpci_io *region = vdev->region[i].data;
+	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *base = region;
+	struct page *gpage;
+	void *gaddr;
+	u64 *data;
+	int ret;
+	u64 req;
+
+	if ((!vdev->pdev->bus) || (!zdev))
+		return -ENODEV;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (!iswrite) {
+		/* Only allow reads to the _hdr area */
+		if (pos + count > offsetof(struct vfio_region_zpci_io, req))
+			return -EFAULT;
+		if (copy_to_user(buf, base + pos, count))
+			return -EFAULT;
+		return count;
+	}
+
+	/* Only allow writes to the _req area */
+	if (pos < offsetof(struct vfio_region_zpci_io, req))
+		return -EFAULT;
+	if (copy_from_user(base + pos, buf, count))
+		return -EFAULT;
+
+	/*
+	 * Read operations are limited to 8B
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_READ) &&
+		(region->req.len > 8)) {
+		return -EIO;
+	}
+
+	/*
+	 * Block write operations are limited to hardware-reported max
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) &&
+		(region->req.len > zdev->maxstbl)) {
+		return -EIO;
+	}
+
+	/*
+	 * While some devices may allow relaxed alignment for the PCISTB
+	 * instruction, the VFIO region requires the input buffer to be on a
+	 * DWORD boundary for all operations for simplicity.
+	 */
+	if (!IS_ALIGNED(region->req.gaddr, sizeof(uint64_t)))
+		return -EIO;
+
+	/*
+	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
+	 * and cannot exceed a page boundary - so a single page is enough.  The
+	 * guest should have validated this but let's double-check that the
+	 * request will not cross a page boundary.
+	 */
+	if (((region->req.gaddr & ~PAGE_MASK)
+			+ region->req.len - 1) & PAGE_MASK) {
+		return -EIO;
+	}
+
+	mutex_lock(&zdev->lock);
+
+	ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
+	if (ret <= 0) {
+		count = -EIO;
+		goto out;
+	}
+	gaddr = page_address(gpage);
+	gaddr += (region->req.gaddr & ~PAGE_MASK);
+	data = (u64 *)gaddr;
+
+	req = ZPCI_CREATE_REQ(zdev->fh, region->req.pcias, region->req.len);
+
+	/* Perform the requested I/O operation */
+	if (region->req.flags & VFIO_ZPCI_IO_FLAG_READ) {
+		/* PCILG */
+		ret = __zpci_load(data, req,
+				region->req.offset);
+	} else if (region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) {
+		/* PCISTB */
+		ret = __zpci_store_block(data, req,
+					region->req.offset);
+	} else {
+		/* Undefined Operation or none provided */
+		count = -EIO;
+	}
+	if (ret < 0)
+		count = -EIO;
+
+	put_page(gpage);
+
+out:
+	mutex_unlock(&zdev->lock);
+	return count;
+}
+
+static void vfio_pci_zdev_io_release(struct vfio_pci_device *vdev,
+				     struct vfio_pci_region *region)
+{
+	kfree(region->data);
+}
+
+static const struct vfio_pci_regops vfio_pci_zdev_io_regops = {
+	.rw		= vfio_pci_zdev_io_rw,
+	.release	= vfio_pci_zdev_io_release,
+};
+
+int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
+{
+	struct vfio_region_zpci_io *region;
+	struct zpci_dev *zdev;
+	int ret;
+
+	if (!vdev->pdev->bus)
+		return -ENODEV;
+
+	zdev = to_zpci(vdev->pdev);
+	if (!zdev)
+		return -ENODEV;
+
+	region = kmalloc(sizeof(struct vfio_region_zpci_io), GFP_KERNEL);
+
+	ret = vfio_pci_register_dev_region(vdev,
+		PCI_VENDOR_ID_IBM | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+		VFIO_REGION_SUBTYPE_IBM_ZPCI_IO,
+		&vfio_pci_zdev_io_regops,
+		sizeof(struct vfio_region_zpci_io),
+		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+		region);
+
+	if (ret) {
+		kfree(region);
+		return ret;
+	}
+
+	/* Setup the initial header information */
+	region->hdr.flags = 0;
+	region->hdr.max = zdev->maxstbl;
+	region->hdr.reserved = 0;
+	region->hdr.reserved2 = 0;
+
+	return ret;
+}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d181277..5547f9b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -338,6 +338,10 @@ struct vfio_region_info_cap_type {
  * to do TLB invalidation on a GPU.
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
+/*
+ * IBM zPCI I/O region
+ */
+#define VFIO_REGION_SUBTYPE_IBM_ZPCI_IO		(2)
 
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b0b6596..830acca4 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -76,4 +76,37 @@ struct vfio_device_info_cap_zpci_pfip {
 	__u8 pfip[];
 };
 
+/**
+ * VFIO_REGION_SUBTYPE_IBM_ZPCI_IO - VFIO zPCI PCI Direct I/O Region
+ *
+ * This region is used to transfer I/O operations from the guest directly
+ * to the host zPCI I/O layer.  The same instruction requested by the guest
+ * (e.g. PCISTB) will always be used, even if the MIO variant is available.
+ *
+ * The _hdr area is user-readable and is used to provide setup information.
+ * The _req area is user-writable and is used to provide the I/O operation.
+ */
+struct vfio_zpci_io_hdr {
+	__u64 flags;
+	__u16 max;		/* Max block operation size allowed */
+	__u16 reserved;
+	__u32 reserved2;
+};
+
+struct vfio_zpci_io_req {
+	__u64 flags;
+#define VFIO_ZPCI_IO_FLAG_READ (1 << 0) /* Read Operation Specified */
+#define VFIO_ZPCI_IO_FLAG_BLOCKW (1 << 1) /* Block Write Operation Specified */
+	__u64 gaddr;		/* Address of guest data */
+	__u64 offset;		/* Offset into target PCI Address Space */
+	__u32 reserved;
+	__u16 len;		/* Length of guest operation */
+	__u8 pcias;		/* Target PCI Address Space */
+	__u8 reserved2;
+};
+
+struct vfio_region_zpci_io {
+	struct vfio_zpci_io_hdr hdr;
+	struct vfio_zpci_io_req req;
+};
 #endif
-- 
1.8.3.1



* Re: [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (3 preceding siblings ...)
  2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
@ 2021-01-19 20:50 ` Matthew Rosato
  2021-01-20  9:02 ` Pierre Morel
  5 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-19 20:50 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

On 1/19/21 3:02 PM, Matthew Rosato wrote:
> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> QEMU will reject the device due to an (inappropriate) MSI-X check.
> However, in an effort to enable ISM device passthrough, I realized that the
> manner in which ISM performs block write operations is highly incompatible
> with the way that QEMU s390 PCI instruction interception and
> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> devices have particular requirements in regards to the alignment, size and
> order of writes performed.  Furthermore, they require that legacy/non-MIO
> s390 PCI instructions are used, which is also not guaranteed when the I/O
> is passed through the typical userspace channels.
> 
> As a result, this patchset proposes a new VFIO region to allow a guest to
> pass certain PCI instruction intercepts directly to the s390 host kernel
> PCI layer for execution, pinning the guest buffer in memory briefly in
> order to execute the requested PCI instruction.
> 
> Changes from RFC -> v1:
> - No functional changes, just minor commentary changes -- Re-posting along
> with updated QEMU set.
> 

Link to latest QEMU patch set:
https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg04881.html

> Matthew Rosato (4):
>    s390/pci: track alignment/length strictness for zpci_dev
>    vfio-pci/zdev: Pass the relaxed alignment flag
>    s390/pci: Get hardware-reported max store block length
>    vfio-pci/zdev: Introduce the zPCI I/O vfio region
> 
>   arch/s390/include/asm/pci.h         |   4 +-
>   arch/s390/include/asm/pci_clp.h     |   7 +-
>   arch/s390/pci/pci_clp.c             |   2 +
>   drivers/vfio/pci/vfio_pci.c         |   8 ++
>   drivers/vfio/pci/vfio_pci_private.h |   6 ++
>   drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/vfio.h           |   4 +
>   include/uapi/linux/vfio_zdev.h      |  34 ++++++++
>   8 files changed, 222 insertions(+), 3 deletions(-)
> 



* Re: [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (4 preceding siblings ...)
  2021-01-19 20:50 ` [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
@ 2021-01-20  9:02 ` Pierre Morel
  2021-01-20 14:02   ` Matthew Rosato
  5 siblings, 1 reply; 24+ messages in thread
From: Pierre Morel @ 2021-01-20  9:02 UTC (permalink / raw)
  To: Matthew Rosato, alex.williamson, cohuck, schnelle
  Cc: borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm, linux-kernel



On 1/19/21 9:02 PM, Matthew Rosato wrote:
> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> QEMU will reject the device due to an (inappropriate) MSI-X check.
> However, in an effort to enable ISM device passthrough, I realized that the
> manner in which ISM performs block write operations is highly incompatible
> with the way that QEMU s390 PCI instruction interception and
> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> devices have particular requirements in regards to the alignment, size and
> order of writes performed.  Furthermore, they require that legacy/non-MIO
> s390 PCI instructions are used, which is also not guaranteed when the I/O
> is passed through the typical userspace channels.
> 
> As a result, this patchset proposes a new VFIO region to allow a guest to
> pass certain PCI instruction intercepts directly to the s390 host kernel
> PCI layer for execution, pinning the guest buffer in memory briefly in
> order to execute the requested PCI instruction.
> 
> Changes from RFC -> v1:
> - No functional changes, just minor commentary changes -- Re-posting along
> with updated QEMU set.
> 

Hi,

there is a concern about this patch series:
as the title says, it is strongly tied to ISM hardware.

Why being so specific?

Regards,
Pierre

-- 
Pierre Morel
IBM Lab Boeblingen


* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
@ 2021-01-20 13:21   ` Niklas Schnelle
  2021-01-20 17:10     ` Matthew Rosato
  2021-01-21 20:50     ` Alex Williamson
  2021-01-21 10:01   ` Niklas Schnelle
  2021-01-22 23:48   ` Alex Williamson
  2 siblings, 2 replies; 24+ messages in thread
From: Niklas Schnelle @ 2021-01-20 13:21 UTC (permalink / raw)
  To: Matthew Rosato, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel



On 1/19/21 9:02 PM, Matthew Rosato wrote:
> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
.. snip ...
> +
> +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
> +				  char __user *buf, size_t count,
> +				  loff_t *ppos, bool iswrite)
> +{
... snip ...
> +	/*
> +	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
> +	 * and cannot exceed a page boundary - so a single page is enough.  The
> +	 * guest should have validated this but let's double-check that the
> +	 * request will not cross a page boundary.
> +	 */
> +	if (((region->req.gaddr & ~PAGE_MASK)
> +			+ region->req.len - 1) & PAGE_MASK) {
> +		return -EIO;
> +	}
> +
> +	mutex_lock(&zdev->lock);

I plan on using the zdev->lock for preventing concurrent zPCI devices
removal/configuration state changes between zPCI availability/error
events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.

With that use in place, using it here causes a deadlock when doing
"echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
attached to a guest.

This is because the (soft) hot unplug will cause vfio to notify QEMU, which
sends a deconfiguration request to the guest, which then tries to
gracefully shutdown the device. During that shutdown the device will
be accessed, running into this code path which then blocks on
the lock held by the disable_slot() code which waits on vfio
releasing the device.

Alex may correct me if I'm wrong but I think instead vfio should
be holding the PCI device lock via pci_device_lock(pdev).

The annotated trace with my test code looks as follows:

[  618.025091] Call Trace:
[  618.025093]  [<00000007c1a139e0>] __schedule+0x360/0x960
[  618.025104]  [<00000007c1a14760>] schedule_preempt_disabled+0x60/0x100
[  618.025107]  [<00000007c1a16b48>] __mutex_lock+0x358/0x880
[  618.025110]  [<00000007c1a170a2>] mutex_lock_nested+0x32/0x40
[  618.025112]  [<000003ff805a3948>] vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
[  618.025120]  [<000003ff8059b2b0>] vfio_pci_write+0xd0/0xe0 [vfio_pci]
[  618.025124]  [<00000007c0fa5392>] __s390x_sys_pwrite64+0x112/0x360
[  618.025129]  [<00000007c1a0aaf6>] __do_syscall+0x116/0x190
[  618.025132]  [<00000007c1a1deda>] system_call+0x72/0x98
[  618.025137] 1 lock held by qemu-system-s39/1315:
[  618.025139]  #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
[  618.025151]
               Showing all locks held in the system:
[  618.025166] 1 lock held by khungtaskd/99:
[  618.025168]  #0: 00000007c1ed4748 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x210
[  618.025194] 6 locks held by zsh/1190:
[  618.025196]  #0: 0000000095fc0488 (sb_writers#3){....}-{0:0}, at: __do_syscall+0x116/0x190
[  618.025226]  #1: 00000000975bf090 (&of->mutex){....}-{3:3}, at: kernfs_fop_write+0x9a/0x240
[  618.025236]  #2: 000000008584be78 (kn->active#245){....}-{0:0}, at: kernfs_fop_write+0xa6/0x240
[  618.025243]  #3: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: disable_slot+0x32/0x130 <-------------------------------------|
[  618.025252]  #4: 00000007c1f53468 (pci_rescan_remove_lock){....}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x240   |
[  618.025260]  #5: 0000000085d8a1a0 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x1d0                              |
[  618.025271] 1 lock held by qemu-system-s39/1312:                                                                               D
[  618.025273] 1 lock held by qemu-system-s39/1313:                                                                               E
[  618.025275]  #0: 00000000d47e80d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]                              A
[  618.025322] 1 lock held by qemu-system-s39/1314:                                                                               D
[  618.025324]  #0: 00000000d34700d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]                              |
[  618.025345] 1 lock held by qemu-system-s39/1315:                                                                               |
[  618.025347]  #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci] <------------------|
[  618.025355] 1 lock held by qemu-system-s39/1317:
[  618.025357]  #0: 00000000d34480d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
[  618.025378] 1 lock held by qemu-system-s39/1318:
[  618.025380]  #0: 00000000d34380d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
[  618.025400] 1 lock held by qemu-system-s39/1319:
[  618.025403]  #0: 00000000d47e8a90 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
[  618.025424] 2 locks held by zsh/1391:
[  618.025426]  #0: 00000000d4a708a0 (&tty->ldisc_sem){....}-{0:0}, at: tty_ldisc_ref_wait+0x34/0x70
[  618.025435]  #1: 0000038002fc72f0 (&ldata->atomic_read_lock){....}-{3:3}, at: n_tty_read+0xc8/0xa50


> +
> +	ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
> +	if (ret <= 0) {
> +		count = -EIO;
... snip ...


* Re: [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2021-01-20  9:02 ` Pierre Morel
@ 2021-01-20 14:02   ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-20 14:02 UTC (permalink / raw)
  To: Pierre Morel, alex.williamson, cohuck, schnelle
  Cc: borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm, linux-kernel

On 1/20/21 4:02 AM, Pierre Morel wrote:
> 
> 
> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>> QEMU will reject the device due to an (inappropriate) MSI-X check.
>> However, in an effort to enable ISM device passthrough, I realized 
>> that the
>> manner in which ISM performs block write operations is highly 
>> incompatible
>> with the way that QEMU s390 PCI instruction interception and
>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
>> devices have particular requirements in regards to the alignment, size 
>> and
>> order of writes performed.  Furthermore, they require that legacy/non-MIO
>> s390 PCI instructions are used, which is also not guaranteed when the I/O
>> is passed through the typical userspace channels.
>>
>> As a result, this patchset proposes a new VFIO region to allow a guest to
>> pass certain PCI instruction intercepts directly to the s390 host kernel
>> PCI layer for execution, pinning the guest buffer in memory briefly in
>> order to execute the requested PCI instruction.
>>
>> Changes from RFC -> v1:
>> - No functional changes, just minor commentary changes -- Re-posting 
>> along
>> with updated QEMU set.
>>
> 
> Hi,
> 
> there is a concern about this patch series:
> as the title says, it is strongly tied to ISM hardware.
> 
> Why being so specific?

Because prior investigations have shown that the region can only be 
safely used by a device type that does not implement MSI-X (use of this 
region by a vfio-pci device that has MSI-X capability interferes with 
vfio-pci MSI-X masking, since we are bypassing the typical VFIO bar 
regions and vfio-pci MSI-X masking is triggered by those region accesses).

So, in lieu of another suggestion that would overcome that issue (nobody 
has suggested anything thus far), the proposal is to limit the region's 
use to fix the specific problem at hand (ISM devices won't function 
properly if passed through).  That doesn't preclude this region from 
being used for a different device type later, but ISM is why we are 
introducing it now.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-20 13:21   ` Niklas Schnelle
@ 2021-01-20 17:10     ` Matthew Rosato
  2021-01-20 17:28       ` Niklas Schnelle
  2021-01-21 20:50     ` Alex Williamson
  1 sibling, 1 reply; 24+ messages in thread
From: Matthew Rosato @ 2021-01-20 17:10 UTC (permalink / raw)
  To: Niklas Schnelle, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

On 1/20/21 8:21 AM, Niklas Schnelle wrote:
> 
> 
> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> .. snip ...
>> +
>> +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
>> +				  char __user *buf, size_t count,
>> +				  loff_t *ppos, bool iswrite)
>> +{
> ... snip ...
>> +	/*
>> +	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
>> +	 * and cannot exceed a page boundary - so a single page is enough.  The
>> +	 * guest should have validated this but let's double-check that the
>> +	 * request will not cross a page boundary.
>> +	 */
>> +	if (((region->req.gaddr & ~PAGE_MASK)
>> +			+ region->req.len - 1) & PAGE_MASK) {
>> +		return -EIO;
>> +	}
>> +
>> +	mutex_lock(&zdev->lock);
> 
> I plan on using the zdev->lock for preventing concurrent zPCI devices
> removal/configuration state changes between zPCI availability/error
> events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.
> 
> With that use in place, using it here causes a deadlock when doing
> "echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
> attached to a guest.
> 
> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
> sends a deconfiguration request to the guest, which then tries to
> gracefully shutdown the device. During that shutdown the device will
> be accessed, running into this code path which then blocks on
> the lock held by the disable_slot() code which waits on vfio
> releasing the device.
> 

Oof, good catch.  The primary purpose of acquiring the zdev lock here 
was to ensure that the region only processes one operation at a time, 
and when I initially wrote this the zdev lock seemed like a reasonable 
candidate :)

> Alex may correct me if I'm wrong but I think instead vfio should
> be holding the PCI device lock via pci_device_lock(pdev).
> 

OK, I can have a look at this.  Thanks.


> The annotated trace with my test code looks as follows:
> 
> [  618.025091] Call Trace:
> [  618.025093]  [<00000007c1a139e0>] __schedule+0x360/0x960
> [  618.025104]  [<00000007c1a14760>] schedule_preempt_disabled+0x60/0x100
> [  618.025107]  [<00000007c1a16b48>] __mutex_lock+0x358/0x880
> [  618.025110]  [<00000007c1a170a2>] mutex_lock_nested+0x32/0x40
> [  618.025112]  [<000003ff805a3948>] vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
> [  618.025120]  [<000003ff8059b2b0>] vfio_pci_write+0xd0/0xe0 [vfio_pci]
> [  618.025124]  [<00000007c0fa5392>] __s390x_sys_pwrite64+0x112/0x360
> [  618.025129]  [<00000007c1a0aaf6>] __do_syscall+0x116/0x190
> [  618.025132]  [<00000007c1a1deda>] system_call+0x72/0x98
> [  618.025137] 1 lock held by qemu-system-s39/1315:
> [  618.025139]  #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
> [  618.025151]
>                 Showing all locks held in the system:
> [  618.025166] 1 lock held by khungtaskd/99:
> [  618.025168]  #0: 00000007c1ed4748 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x210
> [  618.025194] 6 locks held by zsh/1190:
> [  618.025196]  #0: 0000000095fc0488 (sb_writers#3){....}-{0:0}, at: __do_syscall+0x116/0x190
> [  618.025226]  #1: 00000000975bf090 (&of->mutex){....}-{3:3}, at: kernfs_fop_write+0x9a/0x240
> [  618.025236]  #2: 000000008584be78 (kn->active#245){....}-{0:0}, at: kernfs_fop_write+0xa6/0x240
> [  618.025243]  #3: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: disable_slot+0x32/0x130 <-------------------------------------|
> [  618.025252]  #4: 00000007c1f53468 (pci_rescan_remove_lock){....}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x240   |
> [  618.025260]  #5: 0000000085d8a1a0 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x1d0                              |
> [  618.025271] 1 lock held by qemu-system-s39/1312:                                                                               D
> [  618.025273] 1 lock held by qemu-system-s39/1313:                                                                               E
> [  618.025275]  #0: 00000000d47e80d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]                              A
> [  618.025322] 1 lock held by qemu-system-s39/1314:                                                                               D
> [  618.025324]  #0: 00000000d34700d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]                              |
> [  618.025345] 1 lock held by qemu-system-s39/1315:                                                                               |
> [  618.025347]  #0: 000000008524b4e8 (&zdev->lock){....}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci] <------------------|
> [  618.025355] 1 lock held by qemu-system-s39/1317:
> [  618.025357]  #0: 00000000d34480d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
> [  618.025378] 1 lock held by qemu-system-s39/1318:
> [  618.025380]  #0: 00000000d34380d0 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
> [  618.025400] 1 lock held by qemu-system-s39/1319:
> [  618.025403]  #0: 00000000d47e8a90 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x90/0x780 [kvm]
> [  618.025424] 2 locks held by zsh/1391:
> [  618.025426]  #0: 00000000d4a708a0 (&tty->ldisc_sem){....}-{0:0}, at: tty_ldisc_ref_wait+0x34/0x70
> [  618.025435]  #1: 0000038002fc72f0 (&ldata->atomic_read_lock){....}-{3:3}, at: n_tty_read+0xc8/0xa50
> 
> 
>> +
>> +	ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
>> +	if (ret <= 0) {
>> +		count = -EIO;
> ... snip ...
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-20 17:10     ` Matthew Rosato
@ 2021-01-20 17:28       ` Niklas Schnelle
  2021-01-20 17:40         ` Matthew Rosato
  0 siblings, 1 reply; 24+ messages in thread
From: Niklas Schnelle @ 2021-01-20 17:28 UTC (permalink / raw)
  To: Matthew Rosato, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel



On 1/20/21 6:10 PM, Matthew Rosato wrote:
> On 1/20/21 8:21 AM, Niklas Schnelle wrote:
>>
>>
>> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>> .. snip ...
>>> +
>>> +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
>>> +                  char __user *buf, size_t count,
>>> +                  loff_t *ppos, bool iswrite)
>>> +{
>> ... snip ...
>>> +    /*
>>> +     * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
>>> +     * and cannot exceed a page boundary - so a single page is enough.  The
>>> +     * guest should have validated this but let's double-check that the
>>> +     * request will not cross a page boundary.
>>> +     */
>>> +    if (((region->req.gaddr & ~PAGE_MASK)
>>> +            + region->req.len - 1) & PAGE_MASK) {
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    mutex_lock(&zdev->lock);
>>
>> I plan on using the zdev->lock for preventing concurrent zPCI devices
>> removal/configuration state changes between zPCI availability/error
>> events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.
>>
>> With that use in place, using it here causes a deadlock when doing
>> "echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
>> attached to a guest.
>>
>> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
>> sends a deconfiguration request to the guest, which then tries to
>> gracefully shutdown the device. During that shutdown the device will
>> be accessed, running into this code path which then blocks on
>> the lock held by the disable_slot() code which waits on vfio
>> releasing the device.
>>
> 
> Oof, good catch.  The primary purpose of acquiring the zdev lock here was to ensure that the region only processes one operation at a time, and when I initially wrote this the zdev lock seemed like a reasonable candidate :)

Oh ok, I thought it was to guard against removal. What's the reason
we can't have multiple concurrent users of the region though?
The PCISTB/PCILG instructions themselves should be executable
concurrently, but I'm probably missing something.

> 
>> Alex may correct me if I'm wrong but I think instead vfio should
>> be holding the PCI device lock via pci_device_lock(pdev).
>>
> 
> OK, I can have a look at this.  Thanks.

Hmm, ok, that idea was based on my assumption that it guards
against removal.  My bad; of course without my patch it wouldn't,
and this came before...

> 
> 
>> The annotated trace with my test code looks as follows:
>>
>> ... snip trace (quoted above) ...
>>
>>
>>> +
>>> +    ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
>>> +    if (ret <= 0) {
>>> +        count = -EIO;
>> ... snip ...
>>
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-20 17:28       ` Niklas Schnelle
@ 2021-01-20 17:40         ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-20 17:40 UTC (permalink / raw)
  To: Niklas Schnelle, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

On 1/20/21 12:28 PM, Niklas Schnelle wrote:
> 
> 
> On 1/20/21 6:10 PM, Matthew Rosato wrote:
>> On 1/20/21 8:21 AM, Niklas Schnelle wrote:
>>>
>>>
>>> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>>> .. snip ...
>>>> +
>>>> +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
>>>> +                  char __user *buf, size_t count,
>>>> +                  loff_t *ppos, bool iswrite)
>>>> +{
>>> ... snip ...
>>>> +    /*
>>>> +     * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
>>>> +     * and cannot exceed a page boundary - so a single page is enough.  The
>>>> +     * guest should have validated this but let's double-check that the
>>>> +     * request will not cross a page boundary.
>>>> +     */
>>>> +    if (((region->req.gaddr & ~PAGE_MASK)
>>>> +            + region->req.len - 1) & PAGE_MASK) {
>>>> +        return -EIO;
>>>> +    }
>>>> +
>>>> +    mutex_lock(&zdev->lock);
>>>
>>> I plan on using the zdev->lock for preventing concurrent zPCI devices
>>> removal/configuration state changes between zPCI availability/error
>>> events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.
>>>
>>> With that use in place, using it here causes a deadlock when doing
>>> "echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
>>> attached to a guest.
>>>
>>> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
>>> sends a deconfiguration request to the guest, which then tries to
>>> gracefully shutdown the device. During that shutdown the device will
>>> be accessed, running into this code path which then blocks on
>>> the lock held by the disable_slot() code which waits on vfio
>>> releasing the device.
>>>
>>
>> Oof, good catch.  The primary purpose of acquiring the zdev lock here was to ensure that the region only processes one operation at a time, and when I initially wrote this the zdev lock seemed like a reasonable candidate :)
> 
> Oh ok, I thought it was to guard against removal. What's the reason
> we can't have multiple concurrent users of the region though?
> The PCISTB/PCILG instructions themselves should be executable
> concurrently, but I'm probably missing something.
> 

Yes, we are protecting the vfio region data, not PCISTB/PCILG execution.

When we enter vfio_pci_zdev_io_rw() with an I/O operation (a 'write' 
operation from the guest), we read the data in from the guest and place 
it into vdev->region[].data.  That data effectively instructs us which 
I/O operation to perform.  Now, if two threads performed a write 
operation to this function at the same time (and, furthermore, at the 
same or overlapping positions), they would collide with each other, 
leaving unpredictable contents in vdev->region[].data and thus likely 
causing the wrong I/O operations to be issued.

It so happens that the QEMU implementation ensures this won't happen, 
but the kernel interface also needs to guarantee it.

>>
>>> Alex may correct me if I'm wrong but I think instead vfio should
>>> be holding the PCI device lock via pci_device_lock(pdev).
>>>
>>
>> OK, I can have a look at this.  Thanks.
> 
> Hmm, ok, that idea was based on my assumption that it guards
> against removal.  My bad; of course without my patch it wouldn't,
> and this came before...
> 

That would be a nice added bonus (we discussed it offline late last year 
so that's probably why you're thinking this).

>>
>>
>>> The annotated trace with my test code looks as follows:
>>>
>>> ... snip trace (quoted above) ...
>>>
>>>
>>>> +
>>>> +    ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
>>>> +    if (ret <= 0) {
>>>> +        count = -EIO;
>>> ... snip ...
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
  2021-01-20 13:21   ` Niklas Schnelle
@ 2021-01-21 10:01   ` Niklas Schnelle
  2021-01-21 15:57     ` Matthew Rosato
  2021-01-22 23:48   ` Alex Williamson
  2 siblings, 1 reply; 24+ messages in thread
From: Niklas Schnelle @ 2021-01-21 10:01 UTC (permalink / raw)
  To: Matthew Rosato, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel



On 1/19/21 9:02 PM, Matthew Rosato wrote:
> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> specific requirements in terms of alignment as well as the patterns in
> which the data is read/written. Allowing these to proceed through the
> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> way that these requirements can't be guaranteed. In addition, ISM devices
> do not support the MIO codepaths that might be triggered on vfio I/O coming
> from userspace; we must be able to ensure that these devices use the
> non-MIO instructions.  To facilitate this, provide a new vfio region by
> which non-MIO instructions can be passed directly to the host kernel s390
> PCI layer, to be reliably issued as non-MIO instructions.
> 
> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> and implements the ability to pass PCISTB and PCILG instructions over it,
> as these are what is required for ISM devices.
> 
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |   8 ++
>  drivers/vfio/pci/vfio_pci_private.h |   6 ++
>  drivers/vfio/pci/vfio_pci_zdev.c    | 158 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h           |   4 +
>  include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>  5 files changed, 209 insertions(+)

Related to the discussion on the QEMU side, if we have a check
to make sure this is only used for ISM, then this patch should
make that clear in its wording and also in the paths
(drivers/vfio/pci/vfio_pci_ism.c instead of vfio_pci_zdev.c).
This also has precedent with the region for IGD in
drivers/vfio/pci/vfio_pci_igd.c.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-21 10:01   ` Niklas Schnelle
@ 2021-01-21 15:57     ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-21 15:57 UTC (permalink / raw)
  To: Niklas Schnelle, alex.williamson, cohuck
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

On 1/21/21 5:01 AM, Niklas Schnelle wrote:
> 
> 
> On 1/19/21 9:02 PM, Matthew Rosato wrote:
>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>> specific requirements in terms of alignment as well as the patterns in
>> which the data is read/written. Allowing these to proceed through the
>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
>> way that these requirements can't be guaranteed. In addition, ISM devices
>> do not support the MIO codepaths that might be triggered on vfio I/O coming
>> from userspace; we must be able to ensure that these devices use the
>> non-MIO instructions.  To facilitate this, provide a new vfio region by
>> which non-MIO instructions can be passed directly to the host kernel s390
>> PCI layer, to be reliably issued as non-MIO instructions.
>>
>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
>> and implements the ability to pass PCISTB and PCILG instructions over it,
>> as these are what is required for ISM devices.
>>
>> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> ---
>>   drivers/vfio/pci/vfio_pci.c         |   8 ++
>>   drivers/vfio/pci/vfio_pci_private.h |   6 ++
>>   drivers/vfio/pci/vfio_pci_zdev.c    | 158 ++++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/vfio.h           |   4 +
>>   include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>>   5 files changed, 209 insertions(+)
> 
> Related to the discussion on the QEMU side, if we have a check
> to make sure this is only used for ISM, then this patch should
> make that clear in its wording and also in the paths
> (drivers/vfio/pci/vfio_pci_ism.c instead of vfio_pci_zdev.c).
> This also has precedent with the region for IGD in
> drivers/vfio/pci/vfio_pci_igd.c.
> 

This is a fair point, but just to tie up threads here -- the QEMU 
discussion has since moved towards making the use-case less restrictive 
rather than ISM-only (though getting ISM working is still the motivating 
factor here).  So as such I think I'd prefer to keep this in 
vfio_pci_zdev.c as other hardware could use the region if they meet the 
necessary criteria.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-20 13:21   ` Niklas Schnelle
  2021-01-20 17:10     ` Matthew Rosato
@ 2021-01-21 20:50     ` Alex Williamson
  1 sibling, 0 replies; 24+ messages in thread
From: Alex Williamson @ 2021-01-21 20:50 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, cohuck, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Wed, 20 Jan 2021 14:21:59 +0100
Niklas Schnelle <schnelle@linux.ibm.com> wrote:

> On 1/19/21 9:02 PM, Matthew Rosato wrote:
> > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very  
> .. snip ...
> > +
> > +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
> > +				  char __user *buf, size_t count,
> > +				  loff_t *ppos, bool iswrite)
> > +{  
> ... snip ...
> > +	/*
> > +	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
> > +	 * and cannot exceed a page boundary - so a single page is enough.  The
> > +	 * guest should have validated this but let's double-check that the
> > +	 * request will not cross a page boundary.
> > +	 */
> > +	if (((region->req.gaddr & ~PAGE_MASK)
> > +			+ region->req.len - 1) & PAGE_MASK) {
> > +		return -EIO;
> > +	}
> > +
> > +	mutex_lock(&zdev->lock);  
> 
> I plan on using the zdev->lock for preventing concurrent zPCI devices
> removal/configuration state changes between zPCI availability/error
> events and enable_slot()/disable_slot() and /sys/bus/pci/devices/<dev>/recover.
> 
> With that use in place, using it here causes a deadlock when doing
> "echo 0 > /sys/bus/pci/slots/<fid>/power" from the host for an ISM device
> attached to a guest.
> 
> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
> sends a deconfiguration request to the guest, which then tries to
> gracefully shutdown the device. During that shutdown the device will
> be accessed, running into this code path which then blocks on
> the lock held by the disable_slot() code which waits on vfio
> releasing the device.
> 
> Alex may correct me if I'm wrong but I think instead vfio should
> be holding the PCI device lock via pci_device_lock(pdev).

There be dragons in device_lock, which is why we have all the crude
try-lock semantics in reset paths.  Don't use it trivially.  The device
lock is not otherwise necessary here; the device is bound to a driver
and has an open userspace file descriptor through which the user is
performing this access.  The device can't be removed without unbinding
the driver, which can't happen while the user still has open files to
it.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
  2021-01-20 13:21   ` Niklas Schnelle
  2021-01-21 10:01   ` Niklas Schnelle
@ 2021-01-22 23:48   ` Alex Williamson
  2021-01-25 14:40     ` Matthew Rosato
  2 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2021-01-22 23:48 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: cohuck, schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On Tue, 19 Jan 2021 15:02:30 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> specific requirements in terms of alignment as well as the patterns in
> which the data is read/written. Allowing these to proceed through the
> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> way that these requirements can't be guaranteed. In addition, ISM devices
> do not support the MIO codepaths that might be triggered on vfio I/O coming
> from userspace; we must be able to ensure that these devices use the
> non-MIO instructions.  To facilitate this, provide a new vfio region by
> which non-MIO instructions can be passed directly to the host kernel s390
> PCI layer, to be reliably issued as non-MIO instructions.
> 
> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> and implements the ability to pass PCISTB and PCILG instructions over it,
> as these are what is required for ISM devices.

There have been various discussions about splitting vfio-pci to allow
more device-specific drivers rather than adding duct tape and baling wire
for various device-specific features to extend vfio-pci.  The latest
iteration is here[1].  Is it possible that such a solution could simply
provide the standard BAR region indexes, but with an implementation that
works on s390, rather than creating new device specific regions to
perform the same task?  Thanks,

Alex

[1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/

> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c         |   8 ++
>  drivers/vfio/pci/vfio_pci_private.h |   6 ++
>  drivers/vfio/pci/vfio_pci_zdev.c    | 158 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h           |   4 +
>  include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>  5 files changed, 209 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 706de3e..e1c156e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -407,6 +407,14 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
> +		ret = vfio_pci_zdev_io_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			pci_warn(pdev, "Failed to setup zPCI I/O region\n");
> +			return ret;
> +		}
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 5c90e56..bc49980 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -217,12 +217,18 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>  #ifdef CONFIG_VFIO_PCI_ZDEV
>  extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
>  				       struct vfio_info_cap *caps);
> +extern int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev);
>  #else
>  static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
>  					      struct vfio_info_cap *caps)
>  {
>  	return -ENODEV;
>  }
> +
> +static inline int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
>  #endif
>  
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 57e19ff..a962043 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -18,6 +18,7 @@
>  #include <linux/vfio_zdev.h>
>  #include <asm/pci_clp.h>
>  #include <asm/pci_io.h>
> +#include <asm/pci_insn.h>
>  
>  #include "vfio_pci_private.h"
>  
> @@ -143,3 +144,160 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
>  
>  	return ret;
>  }
> +
> +static ssize_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
> +				  char __user *buf, size_t count,
> +				  loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	struct vfio_region_zpci_io *region = vdev->region[i].data;
> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	void *base = region;
> +	struct page *gpage;
> +	void *gaddr;
> +	u64 *data;
> +	int ret;
> +	u64 req;
> +
> +	if ((!vdev->pdev->bus) || (!zdev))
> +		return -ENODEV;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (!iswrite) {
> +		/* Only allow reads to the _hdr area */
> +		if (pos + count > offsetof(struct vfio_region_zpci_io, req))
> +			return -EFAULT;
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +		return count;
> +	}
> +
> +	/* Only allow writes to the _req area */
> +	if (pos < offsetof(struct vfio_region_zpci_io, req))
> +		return -EFAULT;
> +	if (copy_from_user(base + pos, buf, count))
> +		return -EFAULT;
> +
> +	/*
> +	 * Read operations are limited to 8B
> +	 */
> +	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_READ) &&
> +		(region->req.len > 8)) {
> +		return -EIO;
> +	}
> +
> +	/*
> +	 * Block write operations are limited to hardware-reported max
> +	 */
> +	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) &&
> +		(region->req.len > zdev->maxstbl)) {
> +		return -EIO;
> +	}
> +
> +	/*
> +	 * While some devices may allow relaxed alignment for the PCISTB
> +	 * instruction, the VFIO region requires the input buffer to be on a
> +	 * DWORD boundary for all operations for simplicity.
> +	 */
> +	if (!IS_ALIGNED(region->req.gaddr, sizeof(uint64_t)))
> +		return -EIO;
> +
> +	/*
> +	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
> +	 * and cannot exceed a page boundary - so a single page is enough.  The
> +	 * guest should have validated this but let's double-check that the
> +	 * request will not cross a page boundary.
> +	 */
> +	if (((region->req.gaddr & ~PAGE_MASK)
> +			+ region->req.len - 1) & PAGE_MASK) {
> +		return -EIO;
> +	}
> +
> +	mutex_lock(&zdev->lock);
> +
> +	ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
> +	if (ret <= 0) {
> +		count = -EIO;
> +		goto out;
> +	}
> +	gaddr = page_address(gpage);
> +	gaddr += (region->req.gaddr & ~PAGE_MASK);
> +	data = (u64 *)gaddr;
> +
> +	req = ZPCI_CREATE_REQ(zdev->fh, region->req.pcias, region->req.len);
> +
> +	/* Perform the requested I/O operation */
> +	if (region->req.flags & VFIO_ZPCI_IO_FLAG_READ) {
> +		/* PCILG */
> +		ret = __zpci_load(data, req,
> +				region->req.offset);
> +	} else if (region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) {
> +		/* PCISTB */
> +		ret = __zpci_store_block(data, req,
> +					region->req.offset);
> +	} else {
> +		/* Undefined Operation or none provided */
> +		count = -EIO;
> +	}
> +	if (ret < 0)
> +		count = -EIO;
> +
> +	put_page(gpage);
> +
> +out:
> +	mutex_unlock(&zdev->lock);
> +	return count;
> +}
> +
> +static void vfio_pci_zdev_io_release(struct vfio_pci_device *vdev,
> +				     struct vfio_pci_region *region)
> +{
> +	kfree(region->data);
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_zdev_io_regops = {
> +	.rw		= vfio_pci_zdev_io_rw,
> +	.release	= vfio_pci_zdev_io_release,
> +};
> +
> +int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
> +{
> +	struct vfio_region_zpci_io *region;
> +	struct zpci_dev *zdev;
> +	int ret;
> +
> +	if (!vdev->pdev->bus)
> +		return -ENODEV;
> +
> +	zdev = to_zpci(vdev->pdev);
> +	if (!zdev)
> +		return -ENODEV;
> +
> +	region = kmalloc(sizeof(struct vfio_region_zpci_io), GFP_KERNEL);
> +	if (!region)
> +		return -ENOMEM;
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +		PCI_VENDOR_ID_IBM | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +		VFIO_REGION_SUBTYPE_IBM_ZPCI_IO,
> +		&vfio_pci_zdev_io_regops,
> +		sizeof(struct vfio_region_zpci_io),
> +		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> +		region);
> +
> +	if (ret) {
> +		kfree(region);
> +		return ret;
> +	}
> +
> +	/* Setup the initial header information */
> +	region->hdr.flags = 0;
> +	region->hdr.max = zdev->maxstbl;
> +	region->hdr.reserved = 0;
> +	region->hdr.reserved2 = 0;
> +
> +	return ret;
> +}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d181277..5547f9b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -338,6 +338,10 @@ struct vfio_region_info_cap_type {
>   * to do TLB invalidation on a GPU.
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> +/*
> + * IBM zPCI I/O region
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_ZPCI_IO		(2)
>  
>  /* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
> diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
> index b0b6596..830acca4 100644
> --- a/include/uapi/linux/vfio_zdev.h
> +++ b/include/uapi/linux/vfio_zdev.h
> @@ -76,4 +76,37 @@ struct vfio_device_info_cap_zpci_pfip {
>  	__u8 pfip[];
>  };
>  
> +/**
> + * VFIO_REGION_SUBTYPE_IBM_ZPCI_IO - VFIO zPCI PCI Direct I/O Region
> + *
> + * This region is used to transfer I/O operations from the guest directly
> + * to the host zPCI I/O layer.  The same instruction requested by the guest
> + * (e.g. PCISTB) will always be used, even if the MIO variant is available.
> + *
> + * The _hdr area is user-readable and is used to provide setup information.
> + * The _req area is user-writable and is used to provide the I/O operation.
> + */
> +struct vfio_zpci_io_hdr {
> +	__u64 flags;
> +	__u16 max;		/* Max block operation size allowed */
> +	__u16 reserved;
> +	__u32 reserved2;
> +};
> +
> +struct vfio_zpci_io_req {
> +	__u64 flags;
> +#define VFIO_ZPCI_IO_FLAG_READ (1 << 0) /* Read Operation Specified */
> +#define VFIO_ZPCI_IO_FLAG_BLOCKW (1 << 1) /* Block Write Operation Specified */
> +	__u64 gaddr;		/* Address of guest data */
> +	__u64 offset;		/* Offset into target PCI Address Space */
> +	__u32 reserved;
> +	__u16 len;		/* Length of guest operation */
> +	__u8 pcias;		/* Target PCI Address Space */
> +	__u8 reserved2;
> +};
> +
> +struct vfio_region_zpci_io {
> +	struct vfio_zpci_io_hdr hdr;
> +	struct vfio_zpci_io_req req;
> +};
>  #endif


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-22 23:48   ` Alex Williamson
@ 2021-01-25 14:40     ` Matthew Rosato
  2021-01-25 15:42       ` Cornelia Huck
  2021-01-26 23:18       ` Alex Williamson
  0 siblings, 2 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-25 14:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On 1/22/21 6:48 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 15:02:30 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>> specific requirements in terms of alignment as well as the patterns in
>> which the data is read/written. Allowing these to proceed through the
>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
>> way that these requirements can't be guaranteed. In addition, ISM devices
>> do not support the MIO codepaths that might be triggered on vfio I/O coming
>> from userspace; we must be able to ensure that these devices use the
>> non-MIO instructions.  To facilitate this, provide a new vfio region by
>> which non-MIO instructions can be passed directly to the host kernel s390
>> PCI layer, to be reliably issued as non-MIO instructions.
>>
>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
>> and implements the ability to pass PCISTB and PCILG instructions over it,
>> as these are what is required for ISM devices.
> 
> There have been various discussions about splitting vfio-pci to allow
> more device specific drivers rather than adding duct tape and baling wire
> for various device specific features to extend vfio-pci.  The latest
> iteration is here[1].  Is it possible that such a solution could simply
> provide the standard BAR region indexes, but with an implementation that
> works on s390, rather than creating new device specific regions to
> perform the same task?  Thanks,
> 
> Alex
> 
> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
> 

Thanks for the pointer, I'll have to keep an eye on this.  An approach 
like this could solve some issues, but I think a main issue that still 
remains with relying on the standard BAR region indexes (whether using 
the current vfio-pci driver or a device-specific driver) is that QEMU 
writes to said BAR memory region are happening in, at most, 8B chunks 
(which then, in the current general-purpose vfio-pci code get further 
split up into 4B iowrite operations).  The alternate approach I'm 
proposing here is allowing for the whole payload (4K) in a single 
operation, which is significantly faster.  So, I suspect even with a 
device specific driver we'd want this sort of a region anyhow..


* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-25 14:40     ` Matthew Rosato
@ 2021-01-25 15:42       ` Cornelia Huck
  2021-01-25 15:52         ` Matthew Rosato
  2021-01-26 23:18       ` Alex Williamson
  1 sibling, 1 reply; 24+ messages in thread
From: Cornelia Huck @ 2021-01-25 15:42 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Alex Williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Mon, 25 Jan 2021 09:40:38 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 1/22/21 6:48 PM, Alex Williamson wrote:
> > On Tue, 19 Jan 2021 15:02:30 -0500
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >   
> >> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> >> specific requirements in terms of alignment as well as the patterns in
> >> which the data is read/written. Allowing these to proceed through the
> >> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> >> way that these requirements can't be guaranteed. In addition, ISM devices
> >> do not support the MIO codepaths that might be triggered on vfio I/O coming
> >> from userspace; we must be able to ensure that these devices use the
> >> non-MIO instructions.  To facilitate this, provide a new vfio region by
> >> which non-MIO instructions can be passed directly to the host kernel s390
> >> PCI layer, to be reliably issued as non-MIO instructions.
> >>
> >> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> >> and implements the ability to pass PCISTB and PCILG instructions over it,
> >> as these are what is required for ISM devices.  
> > 
> > There have been various discussions about splitting vfio-pci to allow
> > more device specific drivers rather than adding duct tape and baling wire
> > for various device specific features to extend vfio-pci.  The latest
> > iteration is here[1].  Is it possible that such a solution could simply
> > provide the standard BAR region indexes, but with an implementation that
> > works on s390, rather than creating new device specific regions to
> > perform the same task?  Thanks,
> > 
> > Alex
> > 
> > [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
> >   
> 
> Thanks for the pointer, I'll have to keep an eye on this.  An approach 
> like this could solve some issues, but I think a main issue that still 
> remains with relying on the standard BAR region indexes (whether using 
> the current vfio-pci driver or a device-specific driver) is that QEMU 
> writes to said BAR memory region are happening in, at most, 8B chunks 
> (which then, in the current general-purpose vfio-pci code get further 
> split up into 4B iowrite operations).  The alternate approach I'm 
> proposing here is allowing for the whole payload (4K) in a single 
> operation, which is significantly faster.  So, I suspect even with a 
> device specific driver we'd want this sort of a region anyhow..

I'm also wondering about device specific vs architecture/platform
specific handling.

If we're trying to support ISM devices, that's device specific
handling; but if we're trying to add more generic things like the large
payload support, that's not necessarily tied to a device, is it? For
example, could a device support large payload if plugged into a z, but
not if plugged into another machine?



* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-25 15:42       ` Cornelia Huck
@ 2021-01-25 15:52         ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-25 15:52 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 1/25/21 10:42 AM, Cornelia Huck wrote:
> On Mon, 25 Jan 2021 09:40:38 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 1/22/21 6:48 PM, Alex Williamson wrote:
>>> On Tue, 19 Jan 2021 15:02:30 -0500
>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>    
>>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>>>> specific requirements in terms of alignment as well as the patterns in
>>>> which the data is read/written. Allowing these to proceed through the
>>>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
>>>> way that these requirements can't be guaranteed. In addition, ISM devices
>>>> do not support the MIO codepaths that might be triggered on vfio I/O coming
>>>> from userspace; we must be able to ensure that these devices use the
>>>> non-MIO instructions.  To facilitate this, provide a new vfio region by
>>>> which non-MIO instructions can be passed directly to the host kernel s390
>>>> PCI layer, to be reliably issued as non-MIO instructions.
>>>>
>>>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
>>>> and implements the ability to pass PCISTB and PCILG instructions over it,
>>>> as these are what is required for ISM devices.
>>>
>>> There have been various discussions about splitting vfio-pci to allow
>>> more device specific drivers rather than adding duct tape and baling wire
>>> for various device specific features to extend vfio-pci.  The latest
>>> iteration is here[1].  Is it possible that such a solution could simply
>>> provide the standard BAR region indexes, but with an implementation that
>>> works on s390, rather than creating new device specific regions to
>>> perform the same task?  Thanks,
>>>
>>> Alex
>>>
>>> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
>>>    
>>
>> Thanks for the pointer, I'll have to keep an eye on this.  An approach
>> like this could solve some issues, but I think a main issue that still
>> remains with relying on the standard BAR region indexes (whether using
>> the current vfio-pci driver or a device-specific driver) is that QEMU
>> writes to said BAR memory region are happening in, at most, 8B chunks
>> (which then, in the current general-purpose vfio-pci code get further
>> split up into 4B iowrite operations).  The alternate approach I'm
>> proposing here is allowing for the whole payload (4K) in a single
>> operation, which is significantly faster.  So, I suspect even with a
>> device specific driver we'd want this sort of a region anyhow..
> 
> I'm also wondering about device specific vs architecture/platform
> specific handling.
> 
> If we're trying to support ISM devices, that's device specific
> handling; but if we're trying to add more generic things like the large
> payload support, that's not necessarily tied to a device, is it? For
> example, could a device support large payload if plugged into a z, but
> not if plugged into another machine?

Yes, that's correct -- While ISM is providing the impetus and has a hard 
requirement for some of this due to the MIO instruction quirk, the 
mechanism being implemented here is definitely not ISM-specific -- it's 
more like an s390-wide quirk that could really benefit any device that 
wants to do large payloads (PCISTB).

And I think that ultimately goes back to why Pierre wanted to have QEMU 
be as permissive as possible in using the region vs limiting it only to 
ISM.


* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-25 14:40     ` Matthew Rosato
  2021-01-25 15:42       ` Cornelia Huck
@ 2021-01-26 23:18       ` Alex Williamson
  2021-01-27 14:23         ` Matthew Rosato
  1 sibling, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2021-01-26 23:18 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: cohuck, schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On Mon, 25 Jan 2021 09:40:38 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 1/22/21 6:48 PM, Alex Williamson wrote:
> > On Tue, 19 Jan 2021 15:02:30 -0500
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >   
> >> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> >> specific requirements in terms of alignment as well as the patterns in
> >> which the data is read/written. Allowing these to proceed through the
> >> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> >> way that these requirements can't be guaranteed. In addition, ISM devices
> >> do not support the MIO codepaths that might be triggered on vfio I/O coming
> >> from userspace; we must be able to ensure that these devices use the
> >> non-MIO instructions.  To facilitate this, provide a new vfio region by
> >> which non-MIO instructions can be passed directly to the host kernel s390
> >> PCI layer, to be reliably issued as non-MIO instructions.
> >>
> >> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> >> and implements the ability to pass PCISTB and PCILG instructions over it,
> >> as these are what is required for ISM devices.  
> > 
> > There have been various discussions about splitting vfio-pci to allow
> > more device specific drivers rather than adding duct tape and baling wire
> > for various device specific features to extend vfio-pci.  The latest
> > iteration is here[1].  Is it possible that such a solution could simply
> > provide the standard BAR region indexes, but with an implementation that
> > works on s390, rather than creating new device specific regions to
> > perform the same task?  Thanks,
> > 
> > Alex
> > 
> > [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
> >   
> 
> Thanks for the pointer, I'll have to keep an eye on this.  An approach 
> like this could solve some issues, but I think a main issue that still 
> remains with relying on the standard BAR region indexes (whether using 
> the current vfio-pci driver or a device-specific driver) is that QEMU 
> writes to said BAR memory region are happening in, at most, 8B chunks 
> (which then, in the current general-purpose vfio-pci code get further 
> split up into 4B iowrite operations).  The alternate approach I'm 
> proposing here is allowing for the whole payload (4K) in a single 
> operation, which is significantly faster.  So, I suspect even with a 
> device specific driver we'd want this sort of a region anyhow..

Why is this device specific behavior?  It would be a fair argument that
acceptable device access widths for MMIO are always device specific, so
we should never break them down.  Looking at the PCI spec, a TLP
requires a dword (4-byte) aligned address with a 10-bit length field
indicating the number of dwords, so up to 4K data as you suggest is the
whole payload.  It's quite possible that the reason we don't have more
access width problems is that MMIO is typically mmap'd on other
platforms.  We get away with using the x-no-mmap=on flag for debugging,
but it's not unheard of that the device also doesn't work quite
correctly with that flag, which could be due to access width or timing
difference.

So really, I don't see why we wouldn't want to maintain the guest
access width through QEMU and the kernel interface for all devices.  It
seems like that should be our default vfio-pci implementation.  I think
we chose the current width based on the QEMU implementation that was
already splitting accesses, and it (mostly) worked.  Thanks,

Alex



* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-26 23:18       ` Alex Williamson
@ 2021-01-27 14:23         ` Matthew Rosato
  2021-01-27 15:53           ` Alex Williamson
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Rosato @ 2021-01-27 14:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On 1/26/21 6:18 PM, Alex Williamson wrote:
> On Mon, 25 Jan 2021 09:40:38 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 1/22/21 6:48 PM, Alex Williamson wrote:
>>> On Tue, 19 Jan 2021 15:02:30 -0500
>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>    
>>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>>>> specific requirements in terms of alignment as well as the patterns in
>>>> which the data is read/written. Allowing these to proceed through the
>>>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
>>>> way that these requirements can't be guaranteed. In addition, ISM devices
>>>> do not support the MIO codepaths that might be triggered on vfio I/O coming
>>>> from userspace; we must be able to ensure that these devices use the
>>>> non-MIO instructions.  To facilitate this, provide a new vfio region by
>>>> which non-MIO instructions can be passed directly to the host kernel s390
>>>> PCI layer, to be reliably issued as non-MIO instructions.
>>>>
>>>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
>>>> and implements the ability to pass PCISTB and PCILG instructions over it,
>>>> as these are what is required for ISM devices.
>>>
>>> There have been various discussions about splitting vfio-pci to allow
>>> more device specific drivers rather than adding duct tape and baling wire
>>> for various device specific features to extend vfio-pci.  The latest
>>> iteration is here[1].  Is it possible that such a solution could simply
>>> provide the standard BAR region indexes, but with an implementation that
>>> works on s390, rather than creating new device specific regions to
>>> perform the same task?  Thanks,
>>>
>>> Alex
>>>
>>> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
>>>    
>>
>> Thanks for the pointer, I'll have to keep an eye on this.  An approach
>> like this could solve some issues, but I think a main issue that still
>> remains with relying on the standard BAR region indexes (whether using
>> the current vfio-pci driver or a device-specific driver) is that QEMU
>> writes to said BAR memory region are happening in, at most, 8B chunks
>> (which then, in the current general-purpose vfio-pci code get further
>> split up into 4B iowrite operations).  The alternate approach I'm
>> proposing here is allowing for the whole payload (4K) in a single
>> operation, which is significantly faster.  So, I suspect even with a
>> device specific driver we'd want this sort of a region anyhow..
> 
> Why is this device specific behavior?  It would be a fair argument that
> acceptable device access widths for MMIO are always device specific, so
> we should never break them down.  Looking at the PCI spec, a TLP
> requires a dword (4-byte) aligned address with a 10-bit length field
> indicating the number of dwords, so up to 4K data as you suggest is the

Well, as I mentioned in a different thread, it's not really device 
specific behavior but rather architecture/s390x specific behavior; 
PCISTB is an interface available to all PCI devices on s390x, it just so 
happens that ISM is the first device type that is running into the 
nuance.  The architecture is designed in such a way that other devices 
can use the same interface in the same manner.

To drill down a bit, the PCISTB payload can either be 'strict' or 
'relaxed', which via the s390x architecture 'relaxed' means it's not 
dword-aligned but rather byte-aligned up to 4K.  We find out at init 
time which interface a device supports --  So, for a device that 
supports 'relaxed' PCISTB like ISM, an example would be a PCISTB of 11 
bytes coming from a non-dword-aligned address is permissible, which 
doesn't match the TLP from the spec as you described...  I believe this 
'relaxed' operation that steps outside of the spec is unique to s390x. 
(Conversely, devices that use 'strict' PCISTB conform to the typical TLP 
definition)

This deviation from spec is, to my mind, another argument to treat 
these particular PCISTB separately.

> whole payload.  It's quite possible that the reason we don't have more
> access width problems is that MMIO is typically mmap'd on other
> platforms.  We get away with using the x-no-mmap=on flag for debugging,
> but it's not unheard of that the device also doesn't work quite
> correctly with that flag, which could be due to access width or timing
> difference.
> 
> So really, I don't see why we wouldn't want to maintain the guest
> access width through QEMU and the kernel interface for all devices.  It
> seems like that should be our default vfio-pci implementation.  I think
> we chose the current width based on the QEMU implementation that was
> already splitting accesses, and it (mostly) worked.  Thanks,
> 

But unless you think that allowing more flexibility than the PCI spec 
dictates (byte-aligned up to 4K rather than dword-aligned up to 4K, see 
above) is acceptable, this still wouldn't allow me to solve the issue 
I'm trying to address with this patch set.

If you DO think allowing byte-aligned access up to 4K is OK, then I'm 
still left with a further issue (@Niklas):  I'm also using this 
region-based approach to ensure that the host uses a matching interface 
when talking to the host device (basically, s390x has two different 
versions of access to PCI devices, and for ISM at least we need to 
ensure that the same operation intercepted from the guest is being used 
on the host vs attempting to 'upgrade', which always happens via the 
standard s390x kernel PCI interfaces).


* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-27 14:23         ` Matthew Rosato
@ 2021-01-27 15:53           ` Alex Williamson
  2021-01-27 17:45             ` Cornelia Huck
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2021-01-27 15:53 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: cohuck, schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On Wed, 27 Jan 2021 09:23:04 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 1/26/21 6:18 PM, Alex Williamson wrote:
> > On Mon, 25 Jan 2021 09:40:38 -0500
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >   
> >> On 1/22/21 6:48 PM, Alex Williamson wrote:  
> >>> On Tue, 19 Jan 2021 15:02:30 -0500
> >>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >>>      
> >>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> >>>> specific requirements in terms of alignment as well as the patterns in
> >>>> which the data is read/written. Allowing these to proceed through the
> >>>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> >>>> way that these requirements can't be guaranteed. In addition, ISM devices
> >>>> do not support the MIO codepaths that might be triggered on vfio I/O coming
> >>>> from userspace; we must be able to ensure that these devices use the
> >>>> non-MIO instructions.  To facilitate this, provide a new vfio region by
> >>>> which non-MIO instructions can be passed directly to the host kernel s390
> >>>> PCI layer, to be reliably issued as non-MIO instructions.
> >>>>
> >>>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> >>>> and implements the ability to pass PCISTB and PCILG instructions over it,
> >>>> as these are what is required for ISM devices.  
> >>>
> >>> There have been various discussions about splitting vfio-pci to allow
> >>> more device specific drivers rather than adding duct tape and baling wire
> >>> for various device specific features to extend vfio-pci.  The latest
> >>> iteration is here[1].  Is it possible that such a solution could simply
> >>> provide the standard BAR region indexes, but with an implementation that
> >>> works on s390, rather than creating new device specific regions to
> >>> perform the same task?  Thanks,
> >>>
> >>> Alex
> >>>
> >>> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
> >>>      
> >>
> >> Thanks for the pointer, I'll have to keep an eye on this.  An approach
> >> like this could solve some issues, but I think a main issue that still
> >> remains with relying on the standard BAR region indexes (whether using
> >> the current vfio-pci driver or a device-specific driver) is that QEMU
> >> writes to said BAR memory region are happening in, at most, 8B chunks
> >> (which then, in the current general-purpose vfio-pci code get further
> >> split up into 4B iowrite operations).  The alternate approach I'm
> >> proposing here is allowing for the whole payload (4K) in a single
> >> operation, which is significantly faster.  So, I suspect even with a
> >> device specific driver we'd want this sort of a region anyhow..  
> > 
> > Why is this device specific behavior?  It would be a fair argument that
> > acceptable device access widths for MMIO are always device specific, so
> > we should never break them down.  Looking at the PCI spec, a TLP
> > requires a dword (4-byte) aligned address with a 10-bit length field
> > indicating the number of dwords, so up to 4K data as you suggest is the
> 
> Well, as I mentioned in a different thread, it's not really device 

Sorry, I tried to follow the thread, not sure it's possible w/o lots of
preexisting s390 knowledge.

> specific behavior but rather architecture/s390x specific behavior; 
> PCISTB is an interface available to all PCI devices on s390x, it just so 
> happens that ISM is the first device type that is running into the 
> nuance.  The architecture is designed in such a way that other devices 
> can use the same interface in the same manner.

As a platform access mechanism, this leans towards a platform specific
implementation of the PCI BAR regions.
 
> To drill down a bit, the PCISTB payload can either be 'strict' or 
> 'relaxed'; per the s390x architecture, 'relaxed' means it's not 
> dword-aligned but rather byte-aligned up to 4K.  We find out at init 
> time which interface a device supports -- so, for a device that 
> supports 'relaxed' PCISTB like ISM, a PCISTB of, say, 11 bytes 
> coming from a non-dword-aligned address is permissible, which 
> doesn't match the TLP from the spec as you described...  I believe this 
> 'relaxed' operation that steps outside the spec is unique to s390x. 
> (Conversely, devices that use 'strict' PCISTB conform to the typical TLP 
> definition.)
> 
> This deviation from spec is, to my mind, another argument to treat 
> these particular PCISTB separately.

I think that's just an accessor abstraction, we're not asking users to
generate TLPs.  If we expose a byte granularity interface, some
platforms might pass that directly to the PCISTB command, while others
might align the address, perform a dword access, and return the
requested byte.  AFAICT, both conventional and express PCI use dword
alignment on the bus, so this should be valid, and at best it questions
whether ISM is really PCI or not.
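As a minimal userspace sketch of that kind of accessor (hypothetical names, little-endian byte ordering assumed; the dword read merely stands in for the real bus access):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration of the accessor abstraction described
 * above (names and layout are made up, not from any real driver): a
 * platform whose bus only performs dword-aligned accesses can still
 * service a byte-granularity read by rounding the address down,
 * issuing one dword access, and extracting the requested byte.
 */
static uint32_t dword_read(const uint8_t *base, uintptr_t aligned_off)
{
	uint32_t v;

	/* Stand-in for a real dword-wide bus access at an aligned offset. */
	memcpy(&v, base + aligned_off, sizeof(v));
	return v;
}

static uint8_t byte_read(const uint8_t *base, uintptr_t off)
{
	uintptr_t aligned = off & ~(uintptr_t)3;	/* round down to dword */
	uint32_t dw = dword_read(base, aligned);
	unsigned int shift = (unsigned int)(off & 3) * 8;	/* LE byte lane */

	return (uint8_t)(dw >> shift);
}

/* Demo "BAR" contents used below. */
static const uint8_t demo_bar[8] = {
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
};
```

Any byte offset is serviced by exactly one aligned dword access, which is the point of the abstraction: the user sees byte granularity regardless of what the bus actually carries.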

> > whole payload.  It's quite possible that the reason we don't have more
> > access width problems is that MMIO is typically mmap'd on other
> > platforms.  We get away with using the x-no-mmap=on flag for debugging,
> > but it's not unheard of that the device also doesn't work quite
> > correctly with that flag, which could be due to access width or timing
> > difference.
> > 
> > So really, I don't see why we wouldn't want to maintain the guest
> > access width through QEMU and the kernel interface for all devices.  It
> > seems like that should be our default vfio-pci implementation.  I think
> > we chose the current width based on the QEMU implementation that was
> > already splitting accesses, and it (mostly) worked.  Thanks,
> >   
> 
> But unless you think that allowing more flexibility than the PCI spec 
> dictates (byte-aligned up to 4K rather than dword-aligned up to 4K, see 
> above) is acceptable, this still wouldn't allow me to solve the issue 
> I'm trying to address with this patch set.

As above, it still seems like an improvement to honor user access width
to the extent the platform or bus/device interface allows.  If ISM is
really that different from PCI in this respect, it only strengthens the
argument to make a separate bus driver derived from vfio-pci(-core) imo.

> If you DO think allowing byte-aligned access up to 4K is OK, then I'm 
> still left with a further issue (@Niklas):  I'm also using this 
> region-based approach to ensure that the host uses a matching interface 
> when talking to the host device (basically, s390x has two different 
> versions of access to PCI devices, and for ISM at least we need to 
> ensure that the same operation intercepted from the guest is being used 
> on the host vs attempting to 'upgrade', which always happens via the 
> standard s390 kernel PCI interfaces).

In the proposed vfio-pci-core library model, devices would be attached
to the most appropriate vfio bus driver, an ISM device might be bound
to a vfio-zpci-ism (heh, "-ism") driver on the host, standard device
might simply be attached to vfio-pci.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-27 15:53           ` Alex Williamson
@ 2021-01-27 17:45             ` Cornelia Huck
  2021-01-27 18:27               ` Matthew Rosato
  0 siblings, 1 reply; 24+ messages in thread
From: Cornelia Huck @ 2021-01-27 17:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matthew Rosato, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Wed, 27 Jan 2021 08:53:05 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Jan 2021 09:23:04 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
> > On 1/26/21 6:18 PM, Alex Williamson wrote:  
> > > On Mon, 25 Jan 2021 09:40:38 -0500
> > > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> > >     
> > >> On 1/22/21 6:48 PM, Alex Williamson wrote:    
> > >>> On Tue, 19 Jan 2021 15:02:30 -0500
> > >>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> > >>>        
> > >>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> > >>>> specific requirements in terms of alignment as well as the patterns in
> > >>>> which the data is read/written. Allowing these to proceed through the
> > >>>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
> > >>>> way that these requirements can't be guaranteed. In addition, ISM devices
> > >>>> do not support the MIO codepaths that might be triggered on vfio I/O coming
> > >>>> from userspace; we must be able to ensure that these devices use the
> > >>>> non-MIO instructions.  To facilitate this, provide a new vfio region by
> > >>>> which non-MIO instructions can be passed directly to the host kernel s390
> > >>>> PCI layer, to be reliably issued as non-MIO instructions.
> > >>>>
> > >>>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> > >>>> and implements the ability to pass PCISTB and PCILG instructions over it,
> > >>>> as these are what is required for ISM devices.    
> > >>>
> > >>> There have been various discussions about splitting vfio-pci to allow
> > >>> more device specific drivers rather than adding duct tape and baling wire
> > >>> for various device specific features to extend vfio-pci.  The latest
> > >>> iteration is here[1].  Is it possible that such a solution could simply
> > >>> provide the standard BAR region indexes, but with an implementation that
> > >>> works on s390, rather than creating new device specific regions to
> > >>> perform the same task?  Thanks,
> > >>>
> > >>> Alex
> > >>>
> > >>> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
> > >>>        
> > >>
> > >> Thanks for the pointer, I'll have to keep an eye on this.  An approach
> > >> like this could solve some issues, but I think a main issue that still
> > >> remains with relying on the standard BAR region indexes (whether using
> > >> the current vfio-pci driver or a device-specific driver) is that QEMU
> > >> writes to said BAR memory region are happening in, at most, 8B chunks
> > >> (which then, in the current general-purpose vfio-pci code get further
> > >> split up into 4B iowrite operations).  The alternate approach I'm
> > >> proposing here is allowing for the whole payload (4K) in a single
> > >> operation, which is significantly faster.  So, I suspect even with a
> > >> device specific driver we'd want this sort of a region anyhow..    
> > > 
> > > Why is this device specific behavior?  It would be a fair argument that
> > > acceptable device access widths for MMIO are always device specific, so
> > > we should never break them down.  Looking at the PCI spec, a TLP
> > > requires a dword (4-byte) aligned address with a 10-bit length field
> > > indicating the number of dwords, so up to 4K data as you suggest is the
> > 
> > Well, as I mentioned in a different thread, it's not really device   
> 
> Sorry, I tried to follow the thread, not sure it's possible w/o lots of
> preexisting s390 knowledge.
> 
> > specific behavior but rather architecture/s390x specific behavior; 
> > PCISTB is an interface available to all PCI devices on s390x, it just so 
> > happens that ISM is the first device type that is running into the 
> > nuance.  The architecture is designed in such a way that other devices 
> > can use the same interface in the same manner.  
> 
> As a platform access mechanism, this leans towards a platform specific
> implementation of the PCI BAR regions.
>  
> > To drill down a bit, the PCISTB payload can either be 'strict' or 
> > 'relaxed'; per the s390x architecture, 'relaxed' means it's not 
> > dword-aligned but rather byte-aligned up to 4K.  We find out at init 
> > time which interface a device supports -- so, for a device that 
> > supports 'relaxed' PCISTB like ISM, a PCISTB of, say, 11 bytes 
> > coming from a non-dword-aligned address is permissible, which 
> > doesn't match the TLP from the spec as you described...  I believe this 
> > 'relaxed' operation that steps outside the spec is unique to s390x. 
> > (Conversely, devices that use 'strict' PCISTB conform to the typical TLP 
> > definition.)
> > 
> > This deviation from spec is, to my mind, another argument to treat 
> > these particular PCISTB separately.
> 
> I think that's just an accessor abstraction, we're not asking users to
> generate TLPs.  If we expose a byte granularity interface, some
> platforms might pass that directly to the PCISTB command, while others
> might align the address, perform a dword access, and return the
> requested byte.  AFAICT, both conventional and express PCI use dword
> alignment on the bus, so this should be valid, and at best it questions
> whether ISM is really PCI or not.

The vibes I'm getting from ISM are that it is mostly a construct using
(one set of) the s390 pci instructions, which ends up being something
not entirely unlike a pci device... the question is to what extent things
like the 'relaxed' payload may also be supported by 'real' pci devices
plugged into an s390.

> 
> > > whole payload.  It's quite possible that the reason we don't have more
> > > access width problems is that MMIO is typically mmap'd on other
> > > platforms.  We get away with using the x-no-mmap=on flag for debugging,
> > > but it's not unheard of that the device also doesn't work quite
> > > correctly with that flag, which could be due to access width or timing
> > > difference.
> > > 
> > > So really, I don't see why we wouldn't want to maintain the guest
> > > access width through QEMU and the kernel interface for all devices.  It
> > > seems like that should be our default vfio-pci implementation.  I think
> > > we chose the current width based on the QEMU implementation that was
> > > already splitting accesses, and it (mostly) worked.  Thanks,
> > >     
> > 
> > But unless you think that allowing more flexibility than the PCI spec 
> > dictates (byte-aligned up to 4K rather than dword-aligned up to 4K, see 
> > above) is acceptable, this still wouldn't allow me to solve the issue 
> > I'm trying to address with this patch set.  
> 
> As above, it still seems like an improvement to honor user access width
> to the extent the platform or bus/device interface allows.  If ISM is
> really that different from PCI in this respect, it only strengthens the
> argument to make a separate bus driver derived from vfio-pci(-core) imo.
> 
> > If you DO think allowing byte-aligned access up to 4K is OK, then I'm 
> > still left with a further issue (@Niklas):  I'm also using this 
> > region-based approach to ensure that the host uses a matching interface 
> > when talking to the host device (basically, s390x has two different 
> > versions of access to PCI devices, and for ISM at least we need to 
> > ensure that the same operation intercepted from the guest is being used 
> > on the host vs attempting to 'upgrade', which always happens via the 
> > standard s390 kernel PCI interfaces).  
> 
> In the proposed vfio-pci-core library model, devices would be attached
> to the most appropriate vfio bus driver, an ISM device might be bound
> to a vfio-zpci-ism (heh, "-ism") driver on the host, standard device

That would be a nice name, at least :)

> might simply be attached to vfio-pci.

I'm wondering what a good split would be there. ISTM that the devices
the vfio-pci-core split is written with in mind are still normal pci
devices, just with some extra needs that can be fulfilled via an aux
driver. ISM seems to need special treatment for normal accesses, but
does not hook into a separate framework. (If other zpci devices need
special accesses, would that then be a zpci framework?)



* Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2021-01-27 17:45             ` Cornelia Huck
@ 2021-01-27 18:27               ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2021-01-27 18:27 UTC (permalink / raw)
  To: Cornelia Huck, Alex Williamson
  Cc: schnelle, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel

On 1/27/21 12:45 PM, Cornelia Huck wrote:
> On Wed, 27 Jan 2021 08:53:05 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Wed, 27 Jan 2021 09:23:04 -0500
>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>
>>> On 1/26/21 6:18 PM, Alex Williamson wrote:
>>>> On Mon, 25 Jan 2021 09:40:38 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>      
>>>>> On 1/22/21 6:48 PM, Alex Williamson wrote:
>>>>>> On Tue, 19 Jan 2021 15:02:30 -0500
>>>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>>>         
>>>>>>> Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
>>>>>>> specific requirements in terms of alignment as well as the patterns in
>>>>>>> which the data is read/written. Allowing these to proceed through the
>>>>>>> typical vfio_pci_bar_rw path will cause them to be broken up in such a
>>>>>>> way that these requirements can't be guaranteed. In addition, ISM devices
>>>>>>> do not support the MIO codepaths that might be triggered on vfio I/O coming
>>>>>>> from userspace; we must be able to ensure that these devices use the
>>>>>>> non-MIO instructions.  To facilitate this, provide a new vfio region by
>>>>>>> which non-MIO instructions can be passed directly to the host kernel s390
>>>>>>> PCI layer, to be reliably issued as non-MIO instructions.
>>>>>>>
>>>>>>> This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
>>>>>>> and implements the ability to pass PCISTB and PCILG instructions over it,
>>>>>>> as these are what is required for ISM devices.
>>>>>>
>>>>>> There have been various discussions about splitting vfio-pci to allow
>>>>>> more device specific drivers rather than adding duct tape and baling wire
>>>>>> for various device specific features to extend vfio-pci.  The latest
>>>>>> iteration is here[1].  Is it possible that such a solution could simply
>>>>>> provide the standard BAR region indexes, but with an implementation that
>>>>>> works on s390, rather than creating new device specific regions to
>>>>>> perform the same task?  Thanks,
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> [1]https://lore.kernel.org/lkml/20210117181534.65724-1-mgurtovoy@nvidia.com/
>>>>>>         
>>>>>
>>>>> Thanks for the pointer, I'll have to keep an eye on this.  An approach
>>>>> like this could solve some issues, but I think a main issue that still
>>>>> remains with relying on the standard BAR region indexes (whether using
>>>>> the current vfio-pci driver or a device-specific driver) is that QEMU
>>>>> writes to said BAR memory region are happening in, at most, 8B chunks
>>>>> (which then, in the current general-purpose vfio-pci code get further
>>>>> split up into 4B iowrite operations).  The alternate approach I'm
>>>>> proposing here is allowing for the whole payload (4K) in a single
>>>>> operation, which is significantly faster.  So, I suspect even with a
>>>>> device specific driver we'd want this sort of a region anyhow..
>>>>
>>>> Why is this device specific behavior?  It would be a fair argument that
>>>> acceptable device access widths for MMIO are always device specific, so
>>>> we should never break them down.  Looking at the PCI spec, a TLP
>>>> requires a dword (4-byte) aligned address with a 10-bit length field
>>>> indicating the number of dwords, so up to 4K data as you suggest is the
>>>
>>> Well, as I mentioned in a different thread, it's not really device
>>
>> Sorry, I tried to follow the thread, not sure it's possible w/o lots of
>> preexisting s390 knowledge.
>>
>>> specific behavior but rather architecture/s390x specific behavior;
>>> PCISTB is an interface available to all PCI devices on s390x, it just so
>>> happens that ISM is the first device type that is running into the
>>> nuance.  The architecture is designed in such a way that other devices
>>> can use the same interface in the same manner.
>>
>> As a platform access mechanism, this leans towards a platform specific
>> implementation of the PCI BAR regions.
>>   
>>> To drill down a bit, the PCISTB payload can either be 'strict' or
>>> 'relaxed'; per the s390x architecture, 'relaxed' means it's not
>>> dword-aligned but rather byte-aligned up to 4K.  We find out at init
>>> time which interface a device supports -- so, for a device that
>>> supports 'relaxed' PCISTB like ISM, a PCISTB of, say, 11 bytes
>>> coming from a non-dword-aligned address is permissible, which
>>> doesn't match the TLP from the spec as you described...  I believe this
>>> 'relaxed' operation that steps outside the spec is unique to s390x.
>>> (Conversely, devices that use 'strict' PCISTB conform to the typical TLP
>>> definition.)
>>>
>>> This deviation from spec is, to my mind, another argument to treat
>>> these particular PCISTB separately.
>>
>> I think that's just an accessor abstraction, we're not asking users to
>> generate TLPs.  If we expose a byte granularity interface, some
>> platforms might pass that directly to the PCISTB command, while others
>> might align the address, perform a dword access, and return the
>> requested byte.  AFAICT, both conventional and express PCI use dword
>> alignment on the bus, so this should be valid, and at best it questions
>> whether ISM is really PCI or not.
> 
> The vibes I'm getting from ISM are that it is mostly a construct using
> (one set of) the s390 pci instructions, which ends up being something
> not entirely unlike a pci device... the question is to what extent things

Yep, that's a fair assessment.

> like the 'relaxed' payload may also be supported by 'real' pci devices
> plugged into an s390.

The architecture allows for it, but in practice PCI devices being 
used on s390x whose drivers are meant to run on additional platforms 
besides s390x (e.g. mlx) aren't going to use something like that unless 
1) an s390x-specific exploitation is tacked on to the driver for some 
reason, and 2) support for such behavior is implemented in the kernel -- 
as it stands, our s390 kernel PCI logic today makes no special 
accommodations for relaxed-alignment I/O coming from drivers.  This is 
one of the reasons the host ISM driver makes s390 PCI calls directly 
rather than going through the standard kernel PCI interface (and 
subsequently why using an alternate delivery vehicle for passthrough 
ISM that connects directly with the s390 PCI interface made sense to me).

'Relaxed' is very s390-specific and, as far as I understand, is only 
expected from a device type that is unique to s390 (like ISM) because of 
the way it deviates from standard PCI.  And today, the only device that 
behaves in that way (relaxed) and lacks support for certain PCI 
operations / requires a special access method (e.g. no PCISTG allowed) 
is ISM -- otherwise we'd see more devices that require direct invocation 
of s390 PCI interfaces in the kernel.
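To make the strict/relaxed distinction concrete, the acceptance rules could be sketched roughly like this (illustrative names only, not from the actual s390 kernel code; the 4K bound is the maximum payload discussed above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the alignment/length rules: a 'strict'
 * device follows the usual TLP-style constraints (dword-aligned
 * offset, dword-multiple length), while a 'relaxed' device such as
 * ISM accepts byte-aligned payloads of any length up to 4K.
 */
static bool pcistb_args_ok(uint64_t offset, uint64_t len, bool relaxed)
{
	if (len == 0 || len > 4096)
		return false;	/* payload is bounded at 4K either way */
	if (relaxed)
		return true;	/* any byte alignment, any length */
	return (offset % 4 == 0) && (len % 4 == 0);	/* strict: dword rules */
}
```

So the 11-byte, non-dword-aligned PCISTB from the earlier example passes only on a 'relaxed' device.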


> 
>>
>>>> whole payload.  It's quite possible that the reason we don't have more
>>>> access width problems is that MMIO is typically mmap'd on other
>>>> platforms.  We get away with using the x-no-mmap=on flag for debugging,
>>>> but it's not unheard of that the device also doesn't work quite
>>>> correctly with that flag, which could be due to access width or timing
>>>> difference.
>>>>
>>>> So really, I don't see why we wouldn't want to maintain the guest
>>>> access width through QEMU and the kernel interface for all devices.  It
>>>> seems like that should be our default vfio-pci implementation.  I think
>>>> we chose the current width based on the QEMU implementation that was
>>>> already splitting accesses, and it (mostly) worked.  Thanks,
>>>>      
>>>
>>> But unless you think that allowing more flexibility than the PCI spec
>>> dictates (byte-aligned up to 4K rather than dword-aligned up to 4K, see
>>> above) is acceptable, this still wouldn't allow me to solve the issue
>>> I'm trying to address with this patch set.
>>
>> As above, it still seems like an improvement to honor user access width
>> to the extent the platform or bus/device interface allows.  If ISM is
>> really that different from PCI in this respect, it only strengthens the
>> argument to make a separate bus driver derived from vfio-pci(-core) imo.
>>
>>> If you DO think allowing byte-aligned access up to 4K is OK, then I'm
>>> still left with a further issue (@Niklas):  I'm also using this
>>> region-based approach to ensure that the host uses a matching interface
>>> when talking to the host device (basically, s390x has two different
>>> versions of access to PCI devices, and for ISM at least we need to
>>> ensure that the same operation intercepted from the guest is being used
>>> on the host vs attempting to 'upgrade', which always happens via the
>>> standard s390 kernel PCI interfaces).
>>
>> In the proposed vfio-pci-core library model, devices would be attached
>> to the most appropriate vfio bus driver, an ISM device might be bound
>> to a vfio-zpci-ism (heh, "-ism") driver on the host, standard device
> 
> That would be a nice name, at least :)
> 
>> might simply be attached to vfio-pci.
> 
> I'm wondering what a good split would be there. ISTM that the devices
> the vfio-pci-core split is written with in mind are still normal pci
> devices, just with some extra needs that can be fulfilled via an aux
> driver. ISM seems to need special treatment for normal accesses, but
> does not hook into a separate framework. (If other zpci devices need
> special accesses, would that then be a zpci framework?)
> 

As of today, I don't believe we have other devices that -require- 
special access, as I mentioned above (no future guarantee, I suppose). 
However, the notion of an improved PCISTB payload size beyond ISM devices 
(regardless of 'strict' or 'relaxed') is desirable, though not strictly 
necessary: it would be applicable to any PCI device whose driver 
wants to do writes >8B on s390x and would allow that operation to be 
done in "one shot."
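For a rough sense of the cost, a hypothetical helper counting the accesses needed when a write is split into fixed-size chunks (as in the generic vfio-pci path) versus one full-payload PCISTB:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: number of accesses needed to move len bytes when
 * each access carries at most max_access bytes, as in the generic
 * vfio-pci write path (8B chunks, further split into 4B iowrites),
 * versus one PCISTB carrying the whole payload.
 */
static size_t split_ops(size_t len, size_t max_access)
{
	return (len + max_access - 1) / max_access;	/* ceiling division */
}
```

A 4K payload then costs 512 8-byte operations (or 1024 4-byte iowrites) through the split path, versus a single PCISTB.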




end of thread, other threads:[~2021-01-27 18:28 UTC | newest]

Thread overview: 24+ messages
2021-01-19 20:02 [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
2021-01-19 20:02 ` [PATCH 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
2021-01-19 20:02 ` [PATCH 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
2021-01-19 20:02 ` [PATCH 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
2021-01-19 20:02 ` [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
2021-01-20 13:21   ` Niklas Schnelle
2021-01-20 17:10     ` Matthew Rosato
2021-01-20 17:28       ` Niklas Schnelle
2021-01-20 17:40         ` Matthew Rosato
2021-01-21 20:50     ` Alex Williamson
2021-01-21 10:01   ` Niklas Schnelle
2021-01-21 15:57     ` Matthew Rosato
2021-01-22 23:48   ` Alex Williamson
2021-01-25 14:40     ` Matthew Rosato
2021-01-25 15:42       ` Cornelia Huck
2021-01-25 15:52         ` Matthew Rosato
2021-01-26 23:18       ` Alex Williamson
2021-01-27 14:23         ` Matthew Rosato
2021-01-27 15:53           ` Alex Williamson
2021-01-27 17:45             ` Cornelia Huck
2021-01-27 18:27               ` Matthew Rosato
2021-01-19 20:50 ` [PATCH 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
2021-01-20  9:02 ` Pierre Morel
2021-01-20 14:02   ` Matthew Rosato
