kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
@ 2020-12-09 20:27 Matthew Rosato
  2020-12-09 20:27 ` [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:27 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Today, ISM devices are completely disallowed for vfio-pci passthrough as
QEMU will reject the device due to an (inappropriate) MSI-X check.
However, in an effort to enable ISM device passthrough, I realized that the
manner in which ISM performs block write operations is highly incompatible
with the way that QEMU s390 PCI instruction interception and
vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
devices have particular requirements in regards to the alignment, size and
order of writes performed.  Furthermore, they require that legacy/non-MIO
s390 PCI instructions are used, which is also not guaranteed when the I/O
is passed through the typical userspace channels.

As a result, this patchset proposes a new VFIO region to allow a guest to
pass certain PCI instruction intercepts directly to the s390 host kernel
PCI layer for exeuction, pinning the guest buffer in memory briefly in
order to execute the requested PCI instruction.

Matthew Rosato (4):
  s390/pci: track alignment/length strictness for zpci_dev
  vfio-pci/zdev: Pass the relaxed alignment flag
  s390/pci: Get hardware-reported max store block length
  vfio-pci/zdev: Introduce the zPCI I/O vfio region

 arch/s390/include/asm/pci.h         |   4 +-
 arch/s390/include/asm/pci_clp.h     |   7 +-
 arch/s390/pci/pci_clp.c             |   2 +
 drivers/vfio/pci/vfio_pci.c         |   8 ++
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           |   4 +
 include/uapi/linux/vfio_zdev.h      |  33 ++++++++
 8 files changed, 221 insertions(+), 3 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
@ 2020-12-09 20:27 ` Matthew Rosato
  2020-12-10 10:33   ` Cornelia Huck
  2020-12-09 20:27 ` [RFC 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:27 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Some zpci device types (e.g., ISM) follow different rules for length
and alignment of pci instructions.  Recognize this and keep track of
it in the zpci_dev.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 arch/s390/include/asm/pci.h     | 3 ++-
 arch/s390/include/asm/pci_clp.h | 4 +++-
 arch/s390/pci/pci_clp.c         | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 2126289..f16ffba 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -133,7 +133,8 @@ struct zpci_dev {
 	u8		has_hp_slot	: 1;
 	u8		is_physfn	: 1;
 	u8		util_str_avail	: 1;
-	u8		reserved	: 4;
+	u8		relaxed_align	: 1;
+	u8		reserved	: 3;
 	unsigned int	devfn;		/* DEVFN part of the RID*/
 
 	struct mutex lock;
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 1f4b666..9fb7cbf 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -150,7 +150,9 @@ struct clp_rsp_query_pci_grp {
 	u16			:  4;
 	u16 noi			: 12;	/* number of interrupts */
 	u8 version;
-	u8			:  6;
+	u8			:  4;
+	u8 relaxed_align	:  1;	/* Relax length and alignment rules */
+	u8			:  1;
 	u8 frame		:  1;
 	u8 refresh		:  1;	/* TLB refresh mode */
 	u16 reserved2;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index 153720d..630f8fc 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -103,6 +103,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
 	zdev->max_msi = response->noi;
 	zdev->fmb_update = response->mui;
 	zdev->version = response->version;
+	zdev->relaxed_align = response->relaxed_align;
 
 	switch (response->version) {
 	case 1:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC 2/4] vfio-pci/zdev: Pass the relaxed alignment flag
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2020-12-09 20:27 ` [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
@ 2020-12-09 20:27 ` Matthew Rosato
  2020-12-09 20:27 ` [RFC 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:27 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Use an additional bit of the VFIO_DEVICE_INFO_CAP_ZPCI_GROUP flags
field to pass whether or not the associated device supports relaxed
length and alignment for some I/O operations.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_zdev.c | 2 ++
 include/uapi/linux/vfio_zdev.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 2296856..57e19ff 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -60,6 +60,8 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 		.version = zdev->version
 	};
 
+	if (zdev->relaxed_align)
+		cap.flags |= VFIO_DEVICE_INFO_ZPCI_FLAG_RELAXED;
 	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
 }
 
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b430939..b0b6596 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -43,6 +43,7 @@ struct vfio_device_info_cap_zpci_group {
 	__u64 msi_addr;		/* MSI address */
 	__u64 flags;
 #define VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH 1 /* Program-specified TLB refresh */
+#define VFIO_DEVICE_INFO_ZPCI_FLAG_RELAXED 2 /* Relaxed Length and Alignment */
 	__u16 mui;		/* Measurement Block Update Interval */
 	__u16 noi;		/* Maximum number of MSIs */
 	__u16 maxstbl;		/* Maximum Store Block Length */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC 3/4] s390/pci: Get hardware-reported max store block length
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2020-12-09 20:27 ` [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
  2020-12-09 20:27 ` [RFC 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
@ 2020-12-09 20:27 ` Matthew Rosato
  2020-12-09 20:27 ` [RFC 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:27 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

We'll need to know this information for vfio passthrough.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
---
 arch/s390/include/asm/pci.h     | 1 +
 arch/s390/include/asm/pci_clp.h | 3 ++-
 arch/s390/pci/pci_clp.c         | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index f16ffba..04f1f48 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -181,6 +181,7 @@ struct zpci_dev {
 	atomic64_t mapped_pages;
 	atomic64_t unmapped_pages;
 
+	u16		maxstbl;
 	u8		version;
 	enum pci_bus_speed max_bus_speed;
 
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 9fb7cbf..ddd1604 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -155,7 +155,8 @@ struct clp_rsp_query_pci_grp {
 	u8			:  1;
 	u8 frame		:  1;
 	u8 refresh		:  1;	/* TLB refresh mode */
-	u16 reserved2;
+	u16			:  3;
+	u16 maxstbl		: 13;
 	u16 mui;
 	u16			: 16;
 	u16 maxfaal;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index 630f8fc..99b64fe 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -104,6 +104,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
 	zdev->fmb_update = response->mui;
 	zdev->version = response->version;
 	zdev->relaxed_align = response->relaxed_align;
+	zdev->maxstbl = response->maxstbl;
 
 	switch (response->version) {
 	case 1:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (2 preceding siblings ...)
  2020-12-09 20:27 ` [RFC 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
@ 2020-12-09 20:27 ` Matthew Rosato
  2020-12-09 20:52 ` [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
  2020-12-10 12:33 ` Cornelia Huck
  5 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:27 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
specific requirements in terms of alignment as well as the patterns in
which the data is read/written. Allowing these to proceed through the
typical vfio_pci_bar_rw path will cause them to be broken in up in such a
way that these requirements can't be guaranteed. In addition, ISM devices
do not support the MIO codepaths that might be triggered on vfio I/O coming
from userspace; we must be able to ensure that these devices use the
non-MIO instructions.  To facilitate this, provide a new vfio region by
which non-MIO instructions can be passed directly to the host kernel s390
PCI layer, to be reliably issued as non-MIO instructions.

This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
and implements the ability to pass PCISTB and PCILG instructions over it,
as these are what is required for ISM devices.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci.c         |   8 ++
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 drivers/vfio/pci/vfio_pci_zdev.c    | 158 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           |   4 +
 include/uapi/linux/vfio_zdev.h      |  32 ++++++++
 5 files changed, 208 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index e619017..241b6fb 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -409,6 +409,14 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+		ret = vfio_pci_zdev_io_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup zPCI I/O region\n");
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e56..bc49980 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -217,12 +217,18 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 #ifdef CONFIG_VFIO_PCI_ZDEV
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
+extern int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 					      struct vfio_info_cap *caps)
 {
 	return -ENODEV;
 }
+
+static inline int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
 #endif
 
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 57e19ff..a962043 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -18,6 +18,7 @@
 #include <linux/vfio_zdev.h>
 #include <asm/pci_clp.h>
 #include <asm/pci_io.h>
+#include <asm/pci_insn.h>
 
 #include "vfio_pci_private.h"
 
@@ -143,3 +144,160 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 
 	return ret;
 }
+
+static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_region_zpci_io *region = vdev->region[i].data;
+	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *base = region;
+	struct page *gpage;
+	void *gaddr;
+	u64 *data;
+	int ret;
+	u64 req;
+
+	if ((!vdev->pdev->bus) || (!zdev))
+		return -ENODEV;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (!iswrite) {
+		/* Only allow reads to the _hdr area */
+		if (pos + count > offsetof(struct vfio_region_zpci_io, req))
+			return -EFAULT;
+		if (copy_to_user(buf, base + pos, count))
+			return -EFAULT;
+		return count;
+	}
+
+	/* Only allow writes to the _req area */
+	if (pos < offsetof(struct vfio_region_zpci_io, req))
+		return -EFAULT;
+	if (copy_from_user(base + pos, buf, count))
+		return -EFAULT;
+
+	/*
+	 * Read operations are limited to 8B
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_READ) &&
+		(region->req.len > 8)) {
+		return -EIO;
+	}
+
+	/*
+	 * Block write operations are limited to hardware-reported max
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) &&
+		(region->req.len > zdev->maxstbl)) {
+		return -EIO;
+	}
+
+	/*
+	 * While some devices may allow relaxed alignment for the PCISTB
+	 * instruction, the VFIO region requires the input buffer to be on a
+	 * DWORD boundary for all operations for simplicity.
+	 */
+	if (!IS_ALIGNED(region->req.gaddr, sizeof(uint64_t)))
+		return -EIO;
+
+	/*
+	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
+	 * and cannot exceed a page boundary - so a single page is enough.  The
+	 * guest should have validated this but let's double-check that the
+	 * request will not cross a page boundary.
+	 */
+	if (((region->req.gaddr & ~PAGE_MASK)
+			+ region->req.len - 1) & PAGE_MASK) {
+		return -EIO;
+	}
+
+	mutex_lock(&zdev->lock);
+
+	ret = get_user_pages_fast(region->req.gaddr & PAGE_MASK, 1, 0, &gpage);
+	if (ret <= 0) {
+		count = -EIO;
+		goto out;
+	}
+	gaddr = page_address(gpage);
+	gaddr += (region->req.gaddr & ~PAGE_MASK);
+	data = (u64 *)gaddr;
+
+	req = ZPCI_CREATE_REQ(zdev->fh, region->req.pcias, region->req.len);
+
+	/* Perform the requested I/O operation */
+	if (region->req.flags & VFIO_ZPCI_IO_FLAG_READ) {
+		/* PCILG */
+		ret = __zpci_load(data, req,
+				region->req.offset);
+	} else if (region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) {
+		/* PCISTB */
+		ret = __zpci_store_block(data, req,
+					region->req.offset);
+	} else {
+		/* Undefined Operation or none provided */
+		count = -EIO;
+	}
+	if (ret < 0)
+		count = -EIO;
+
+	put_page(gpage);
+
+out:
+	mutex_unlock(&zdev->lock);
+	return count;
+}
+
+static void vfio_pci_zdev_io_release(struct vfio_pci_device *vdev,
+				     struct vfio_pci_region *region)
+{
+	kfree(region->data);
+}
+
+static const struct vfio_pci_regops vfio_pci_zdev_io_regops = {
+	.rw		= vfio_pci_zdev_io_rw,
+	.release	= vfio_pci_zdev_io_release,
+};
+
+int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
+{
+	struct vfio_region_zpci_io *region;
+	struct zpci_dev *zdev;
+	int ret;
+
+	if (!vdev->pdev->bus)
+		return -ENODEV;
+
+	zdev = to_zpci(vdev->pdev);
+	if (!zdev)
+		return -ENODEV;
+
+	region = kmalloc(sizeof(struct vfio_region_zpci_io), GFP_KERNEL);
+
+	ret = vfio_pci_register_dev_region(vdev,
+		PCI_VENDOR_ID_IBM | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+		VFIO_REGION_SUBTYPE_IBM_ZPCI_IO,
+		&vfio_pci_zdev_io_regops,
+		sizeof(struct vfio_region_zpci_io),
+		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+		region);
+
+	if (ret) {
+		kfree(region);
+		return ret;
+	}
+
+	/* Setup the initial header information */
+	region->hdr.flags = 0;
+	region->hdr.max = zdev->maxstbl;
+	region->hdr.reserved = 0;
+	region->hdr.reserved2 = 0;
+
+	return ret;
+}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2f313a2..6fbaec3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -338,6 +338,10 @@ struct vfio_region_info_cap_type {
  * to do TLB invalidation on a GPU.
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
+/*
+ * IBM zPCI I/O region
+ */
+#define VFIO_REGION_SUBTYPE_IBM_ZPCI_IO		(2)
 
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b0b6596..22d3408 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -76,4 +76,36 @@ struct vfio_device_info_cap_zpci_pfip {
 	__u8 pfip[];
 };
 
+/**
+ * VFIO_REGION_SUBTYPE_IBM_ZPCI_IO - VFIO zPCI PCI Direct I/O Region
+ *
+ * This region is used to transfer I/O operations from the guest directly
+ * to the host zPCI I/O layer.
+ *
+ * The _hdr area is user-readable and is used to provide setup information.
+ * The _req area is user-writable and is used to provide the I/O operation.
+ */
+struct vfio_zpci_io_hdr {
+	__u64 flags;
+	__u16 max;		/* Max block operation size allowed */
+	__u16 reserved;
+	__u32 reserved2;
+};
+
+struct vfio_zpci_io_req {
+	__u64 flags;
+#define VFIO_ZPCI_IO_FLAG_READ (1 << 0) /* Read Operation Specified */
+#define VFIO_ZPCI_IO_FLAG_BLOCKW (1 << 1) /* Block Write Operation Specified */
+	__u64 gaddr;		/* Address of guest data */
+	__u64 offset;		/* Offset into target PCI Address Space */
+	__u32 reserved;
+	__u16 len;		/* Length of guest operation */
+	__u8 pcias;		/* Target PCI Address Space */
+	__u8 reserved2;
+};
+
+struct vfio_region_zpci_io {
+	struct vfio_zpci_io_hdr hdr;
+	struct vfio_zpci_io_req req;
+};
 #endif
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (3 preceding siblings ...)
  2020-12-09 20:27 ` [RFC 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
@ 2020-12-09 20:52 ` Matthew Rosato
  2020-12-10 12:33 ` Cornelia Huck
  5 siblings, 0 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-09 20:52 UTC (permalink / raw)
  To: alex.williamson, cohuck, schnelle
  Cc: pmorel, borntraeger, hca, gor, gerald.schaefer, linux-s390, kvm,
	linux-kernel

On 12/9/20 3:27 PM, Matthew Rosato wrote:
> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> QEMU will reject the device due to an (inappropriate) MSI-X check.
> However, in an effort to enable ISM device passthrough, I realized that the
> manner in which ISM performs block write operations is highly incompatible
> with the way that QEMU s390 PCI instruction interception and
> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> devices have particular requirements in regards to the alignment, size and
> order of writes performed.  Furthermore, they require that legacy/non-MIO
> s390 PCI instructions are used, which is also not guaranteed when the I/O
> is passed through the typical userspace channels.
> 
> As a result, this patchset proposes a new VFIO region to allow a guest to
> pass certain PCI instruction intercepts directly to the s390 host kernel
> PCI layer for exeuction, pinning the guest buffer in memory briefly in
> order to execute the requested PCI instruction.
> 
> Matthew Rosato (4):
>    s390/pci: track alignment/length strictness for zpci_dev
>    vfio-pci/zdev: Pass the relaxed alignment flag
>    s390/pci: Get hardware-reported max store block length
>    vfio-pci/zdev: Introduce the zPCI I/O vfio region
> 
>   arch/s390/include/asm/pci.h         |   4 +-
>   arch/s390/include/asm/pci_clp.h     |   7 +-
>   arch/s390/pci/pci_clp.c             |   2 +
>   drivers/vfio/pci/vfio_pci.c         |   8 ++
>   drivers/vfio/pci/vfio_pci_private.h |   6 ++
>   drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/vfio.h           |   4 +
>   include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>   8 files changed, 221 insertions(+), 3 deletions(-)
> 

Associated qemu patchset:
https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02377.html


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev
  2020-12-09 20:27 ` [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
@ 2020-12-10 10:33   ` Cornelia Huck
  2020-12-10 15:26     ` Matthew Rosato
  0 siblings, 1 reply; 19+ messages in thread
From: Cornelia Huck @ 2020-12-10 10:33 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Wed,  9 Dec 2020 15:27:47 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> Some zpci device types (e.g., ISM) follow different rules for length
> and alignment of pci instructions.  Recognize this and keep track of
> it in the zpci_dev.
> 
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
> ---
>  arch/s390/include/asm/pci.h     | 3 ++-
>  arch/s390/include/asm/pci_clp.h | 4 +++-
>  arch/s390/pci/pci_clp.c         | 1 +
>  3 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 2126289..f16ffba 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -133,7 +133,8 @@ struct zpci_dev {
>  	u8		has_hp_slot	: 1;
>  	u8		is_physfn	: 1;
>  	u8		util_str_avail	: 1;
> -	u8		reserved	: 4;
> +	u8		relaxed_align	: 1;
> +	u8		reserved	: 3;
>  	unsigned int	devfn;		/* DEVFN part of the RID*/
>  
>  	struct mutex lock;
> diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
> index 1f4b666..9fb7cbf 100644
> --- a/arch/s390/include/asm/pci_clp.h
> +++ b/arch/s390/include/asm/pci_clp.h
> @@ -150,7 +150,9 @@ struct clp_rsp_query_pci_grp {
>  	u16			:  4;
>  	u16 noi			: 12;	/* number of interrupts */
>  	u8 version;
> -	u8			:  6;
> +	u8			:  4;
> +	u8 relaxed_align	:  1;	/* Relax length and alignment rules */
> +	u8			:  1;
>  	u8 frame		:  1;
>  	u8 refresh		:  1;	/* TLB refresh mode */
>  	u16 reserved2;
> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
> index 153720d..630f8fc 100644
> --- a/arch/s390/pci/pci_clp.c
> +++ b/arch/s390/pci/pci_clp.c
> @@ -103,6 +103,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
>  	zdev->max_msi = response->noi;
>  	zdev->fmb_update = response->mui;
>  	zdev->version = response->version;
> +	zdev->relaxed_align = response->relaxed_align;
>  
>  	switch (response->version) {
>  	case 1:

Hm, what does that 'relaxed alignment' imply? Is that something that
can apply to emulated devices as well?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
                   ` (4 preceding siblings ...)
  2020-12-09 20:52 ` [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
@ 2020-12-10 12:33 ` Cornelia Huck
  2020-12-10 15:51   ` Matthew Rosato
  5 siblings, 1 reply; 19+ messages in thread
From: Cornelia Huck @ 2020-12-10 12:33 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Wed,  9 Dec 2020 15:27:46 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> QEMU will reject the device due to an (inappropriate) MSI-X check.
> However, in an effort to enable ISM device passthrough, I realized that the
> manner in which ISM performs block write operations is highly incompatible
> with the way that QEMU s390 PCI instruction interception and
> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> devices have particular requirements in regards to the alignment, size and
> order of writes performed.  Furthermore, they require that legacy/non-MIO
> s390 PCI instructions are used, which is also not guaranteed when the I/O
> is passed through the typical userspace channels.

The part about the non-MIO instructions confuses me. How can MIO
instructions be generated with the current code, and why does changing
the write pattern help?

> 
> As a result, this patchset proposes a new VFIO region to allow a guest to
> pass certain PCI instruction intercepts directly to the s390 host kernel
> PCI layer for exeuction, pinning the guest buffer in memory briefly in
> order to execute the requested PCI instruction.
> 
> Matthew Rosato (4):
>   s390/pci: track alignment/length strictness for zpci_dev
>   vfio-pci/zdev: Pass the relaxed alignment flag
>   s390/pci: Get hardware-reported max store block length
>   vfio-pci/zdev: Introduce the zPCI I/O vfio region
> 
>  arch/s390/include/asm/pci.h         |   4 +-
>  arch/s390/include/asm/pci_clp.h     |   7 +-
>  arch/s390/pci/pci_clp.c             |   2 +
>  drivers/vfio/pci/vfio_pci.c         |   8 ++
>  drivers/vfio/pci/vfio_pci_private.h |   6 ++
>  drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h           |   4 +
>  include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>  8 files changed, 221 insertions(+), 3 deletions(-)
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev
  2020-12-10 10:33   ` Cornelia Huck
@ 2020-12-10 15:26     ` Matthew Rosato
  2020-12-11 11:37       ` Cornelia Huck
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2020-12-10 15:26 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 12/10/20 5:33 AM, Cornelia Huck wrote:
> On Wed,  9 Dec 2020 15:27:47 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> Some zpci device types (e.g., ISM) follow different rules for length
>> and alignment of pci instructions.  Recognize this and keep track of
>> it in the zpci_dev.
>>
>> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pci.h     | 3 ++-
>>   arch/s390/include/asm/pci_clp.h | 4 +++-
>>   arch/s390/pci/pci_clp.c         | 1 +
>>   3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
>> index 2126289..f16ffba 100644
>> --- a/arch/s390/include/asm/pci.h
>> +++ b/arch/s390/include/asm/pci.h
>> @@ -133,7 +133,8 @@ struct zpci_dev {
>>   	u8		has_hp_slot	: 1;
>>   	u8		is_physfn	: 1;
>>   	u8		util_str_avail	: 1;
>> -	u8		reserved	: 4;
>> +	u8		relaxed_align	: 1;
>> +	u8		reserved	: 3;
>>   	unsigned int	devfn;		/* DEVFN part of the RID*/
>>   
>>   	struct mutex lock;
>> diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
>> index 1f4b666..9fb7cbf 100644
>> --- a/arch/s390/include/asm/pci_clp.h
>> +++ b/arch/s390/include/asm/pci_clp.h
>> @@ -150,7 +150,9 @@ struct clp_rsp_query_pci_grp {
>>   	u16			:  4;
>>   	u16 noi			: 12;	/* number of interrupts */
>>   	u8 version;
>> -	u8			:  6;
>> +	u8			:  4;
>> +	u8 relaxed_align	:  1;	/* Relax length and alignment rules */
>> +	u8			:  1;
>>   	u8 frame		:  1;
>>   	u8 refresh		:  1;	/* TLB refresh mode */
>>   	u16 reserved2;
>> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
>> index 153720d..630f8fc 100644
>> --- a/arch/s390/pci/pci_clp.c
>> +++ b/arch/s390/pci/pci_clp.c
>> @@ -103,6 +103,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
>>   	zdev->max_msi = response->noi;
>>   	zdev->fmb_update = response->mui;
>>   	zdev->version = response->version;
>> +	zdev->relaxed_align = response->relaxed_align;
>>   
>>   	switch (response->version) {
>>   	case 1:
> 
> Hm, what does that 'relaxed alignment' imply? Is that something that
> can apply to emulated devices as well?
> 
The relaxed alignment simply loosens the rules on the PCISTB instruction 
so that it doesn't have to be on particular boundaries / have a minimum 
length restriction, these effectively allow ISM devices to use PCISTB 
instead of PCISTG for just about everything.  If you have a look at the 
patch "s390x/pci: Handle devices that support relaxed alignment" from 
the linked qemu set, you can get an idea of what the bit changes via the 
way qemu has to be more permissive of what the guest provides for PCISTB.

Re: emulated devices...  The S390 PCI I/O layer in the guest is always 
issuing strict? aligned I/O for PCISTB, and if it decided to later 
change that behavior it would need to look at this CLP bit to decide 
whether that would be a valid operation for a given PCI function anyway. 
  This bit will remain off in the CLP response we give for emulated 
devices, ensuring that should such a change occur in the guest s390 PCI 
I/O layer, we'd just continue getting strictly-aligned PCISTB.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-10 12:33 ` Cornelia Huck
@ 2020-12-10 15:51   ` Matthew Rosato
  2020-12-10 16:14     ` Niklas Schnelle
  2020-12-11 14:35     ` Cornelia Huck
  0 siblings, 2 replies; 19+ messages in thread
From: Matthew Rosato @ 2020-12-10 15:51 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 12/10/20 7:33 AM, Cornelia Huck wrote:
> On Wed,  9 Dec 2020 15:27:46 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>> QEMU will reject the device due to an (inappropriate) MSI-X check.
>> However, in an effort to enable ISM device passthrough, I realized that the
>> manner in which ISM performs block write operations is highly incompatible
>> with the way that QEMU s390 PCI instruction interception and
>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
>> devices have particular requirements in regards to the alignment, size and
>> order of writes performed.  Furthermore, they require that legacy/non-MIO
>> s390 PCI instructions are used, which is also not guaranteed when the I/O
>> is passed through the typical userspace channels.
> 
> The part about the non-MIO instructions confuses me. How can MIO
> instructions be generated with the current code, and why does changing

So to be clear, they are not being generated at all in the guest as the 
necessary facility is reported as unavailable.

Let's talk about Linux in LPAR / the host kernel:  When hardware that 
supports MIO instructions is available, all userspace I/O traffic is 
going to be routed through the MIO variants of the s390 PCI 
instructions.  This is working well for other device types, but does not 
work for ISM which does not support these variants.  However, the ISM 
driver also does not invoke the userspace I/O routines for the kernel, 
it invokes the s390 PCI layer directly, which in turn ensures the proper 
PCI instructions are used -- This approach falls apart when the guest 
ISM driver invokes those routines in the guest -- we (qemu) pass those 
non-MIO instructions from the guest as memory operations through 
vfio-pci, traversing through the vfio I/O layer in the guest 
(vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI 
layer -- where the MIO variant is used because the facility is available.

Per conversations with Niklas (on CC), it's not trivial to decide by the 
time we reach the s390 PCI I/O layer to switch gears and use the non-MIO 
instruction set.

> the write pattern help?

The write pattern is a separate issue from non-MIO instruction 
requirements...  Certain address spaces require specific instructions to 
be used (so, no substituting PCISTG for PCISTB - that happens too by 
default for any writes coming into the host s390 PCI layer that are 
<=8B, and they all are when the PCISTB is broken up into 8B memory 
operations that travel through vfio_pci_bar_rw, which further breaks 
those up into 4B operations).  There's also a requirement for some 
writes that the data, if broken up, be written in a certain order in 
order to properly trigger events. :(  The ability to pass the entire 
PCISTB payload vs breaking it into 8B chunks is also significantly faster.

> 
>>
>> As a result, this patchset proposes a new VFIO region to allow a guest to
>> pass certain PCI instruction intercepts directly to the s390 host kernel
>> PCI layer for exeuction, pinning the guest buffer in memory briefly in
>> order to execute the requested PCI instruction.
>>
>> Matthew Rosato (4):
>>    s390/pci: track alignment/length strictness for zpci_dev
>>    vfio-pci/zdev: Pass the relaxed alignment flag
>>    s390/pci: Get hardware-reported max store block length
>>    vfio-pci/zdev: Introduce the zPCI I/O vfio region
>>
>>   arch/s390/include/asm/pci.h         |   4 +-
>>   arch/s390/include/asm/pci_clp.h     |   7 +-
>>   arch/s390/pci/pci_clp.c             |   2 +
>>   drivers/vfio/pci/vfio_pci.c         |   8 ++
>>   drivers/vfio/pci/vfio_pci_private.h |   6 ++
>>   drivers/vfio/pci/vfio_pci_zdev.c    | 160 ++++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/vfio.h           |   4 +
>>   include/uapi/linux/vfio_zdev.h      |  33 ++++++++
>>   8 files changed, 221 insertions(+), 3 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-10 15:51   ` Matthew Rosato
@ 2020-12-10 16:14     ` Niklas Schnelle
  2020-12-11 14:14       ` Cornelia Huck
  2020-12-11 14:35     ` Cornelia Huck
  1 sibling, 1 reply; 19+ messages in thread
From: Niklas Schnelle @ 2020-12-10 16:14 UTC (permalink / raw)
  To: Matthew Rosato, Cornelia Huck
  Cc: alex.williamson, pmorel, borntraeger, hca, gor, gerald.schaefer,
	linux-s390, kvm, linux-kernel



On 12/10/20 4:51 PM, Matthew Rosato wrote:
> On 12/10/20 7:33 AM, Cornelia Huck wrote:
>> On Wed,  9 Dec 2020 15:27:46 -0500
>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>
>>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>>> QEMU will reject the device due to an (inappropriate) MSI-X check.
>>> However, in an effort to enable ISM device passthrough, I realized that the
>>> manner in which ISM performs block write operations is highly incompatible
>>> with the way that QEMU s390 PCI instruction interception and
>>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
>>> devices have particular requirements in regards to the alignment, size and
>>> order of writes performed.  Furthermore, they require that legacy/non-MIO
>>> s390 PCI instructions are used, which is also not guaranteed when the I/O
>>> is passed through the typical userspace channels.
>>
>> The part about the non-MIO instructions confuses me. How can MIO
>> instructions be generated with the current code, and why does changing
> 
> So to be clear, they are not being generated at all in the guest as the necessary facility is reported as unavailable.
> 
> Let's talk about Linux in LPAR / the host kernel:  When hardware that supports MIO instructions is available, all userspace I/O traffic is going to be routed through the MIO variants of the s390 PCI instructions.  This is working well for other device types, but does not work for ISM which does not support these variants.  However, the ISM driver also does not invoke the userspace I/O routines for the kernel, it invokes the s390 PCI layer directly, which in turn ensures the proper PCI instructions are used -- This approach falls apart when the guest ISM driver invokes those routines in the guest -- we (qemu) pass those non-MIO instructions from the guest as memory operations through vfio-pci, traversing through the vfio I/O layer in the guest (vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI layer -- where the MIO variant is used because the facility is available.

Slight clarification since I think the word "userspace" is a bit overloaded as
KVM folks often use it to talk about the guest even when that calls through vfio.
Application userspace (i.e. things like DPDK) can use PCI MIO Load/Stores
directly on mmap()ed/ioremap()ed memory these don't go through the Kernel at
all.
QEMU while also in userspace on the other hand goes through the vfio_bar_rw()
region which uses the common code _Kernel_ ioread()/iowrite() API. This Kernel
ioread()/iowrite() API uses PCI MIO Load/Stores by default on machines that
support them (z15 currently).  The ISM driver, knowing that its device does not
support MIO, goes around this API and directly calls zpci_store()/zpci_load().


> 
> Per conversations with Niklas (on CC), it's not trivial to decide by the time we reach the s390 PCI I/O layer to switch gears and use the non-MIO instruction set.

Yes, we have some ideas about dynamically switching to legacy PCI stores in
ioread()/iowrite() for devices that are set up for it but since that only gets
an ioremap()ed address, a value and a size it would evolve such nasty things as
looking at this virtual address to determine if it includes a ZPCI_ADDR()
cookie that we use to get to the function handle needed for the legacy PCI
Load/Stores, while MIO PCI Load/Stores directly work on virtual addresses.

Now purely for the Kernel API we think this could work since that always
allocates between VMALLOC_START and VMALLOC_END and we control where we put the
ZPCI_ADDR() cookie but I'm very hesitant to add something like that.

As for application userspace (DPDK) we do have a syscall
(arch/s390/pci/pci_mmio.c) API that had a similar problem but we could make use
of the fact that our Architecture is pretty nifty with address spaces and just
execute the MIO PCI Load/Store in the syscall _as if_ by the calling userspace
application.


> 
>> the write pattern help?
> 
... snip ...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev
  2020-12-10 15:26     ` Matthew Rosato
@ 2020-12-11 11:37       ` Cornelia Huck
  0 siblings, 0 replies; 19+ messages in thread
From: Cornelia Huck @ 2020-12-11 11:37 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Thu, 10 Dec 2020 10:26:22 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 12/10/20 5:33 AM, Cornelia Huck wrote:
> > On Wed,  9 Dec 2020 15:27:47 -0500
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >   
> >> Some zpci device types (e.g., ISM) follow different rules for length
> >> and alignment of pci instructions.  Recognize this and keep track of
> >> it in the zpci_dev.
> >>
> >> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> >> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
> >> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
> >> ---
> >>   arch/s390/include/asm/pci.h     | 3 ++-
> >>   arch/s390/include/asm/pci_clp.h | 4 +++-
> >>   arch/s390/pci/pci_clp.c         | 1 +
> >>   3 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> >> index 2126289..f16ffba 100644
> >> --- a/arch/s390/include/asm/pci.h
> >> +++ b/arch/s390/include/asm/pci.h
> >> @@ -133,7 +133,8 @@ struct zpci_dev {
> >>   	u8		has_hp_slot	: 1;
> >>   	u8		is_physfn	: 1;
> >>   	u8		util_str_avail	: 1;
> >> -	u8		reserved	: 4;
> >> +	u8		relaxed_align	: 1;
> >> +	u8		reserved	: 3;
> >>   	unsigned int	devfn;		/* DEVFN part of the RID*/
> >>   
> >>   	struct mutex lock;
> >> diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
> >> index 1f4b666..9fb7cbf 100644
> >> --- a/arch/s390/include/asm/pci_clp.h
> >> +++ b/arch/s390/include/asm/pci_clp.h
> >> @@ -150,7 +150,9 @@ struct clp_rsp_query_pci_grp {
> >>   	u16			:  4;
> >>   	u16 noi			: 12;	/* number of interrupts */
> >>   	u8 version;
> >> -	u8			:  6;
> >> +	u8			:  4;
> >> +	u8 relaxed_align	:  1;	/* Relax length and alignment rules */
> >> +	u8			:  1;
> >>   	u8 frame		:  1;
> >>   	u8 refresh		:  1;	/* TLB refresh mode */
> >>   	u16 reserved2;
> >> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
> >> index 153720d..630f8fc 100644
> >> --- a/arch/s390/pci/pci_clp.c
> >> +++ b/arch/s390/pci/pci_clp.c
> >> @@ -103,6 +103,7 @@ static void clp_store_query_pci_fngrp(struct zpci_dev *zdev,
> >>   	zdev->max_msi = response->noi;
> >>   	zdev->fmb_update = response->mui;
> >>   	zdev->version = response->version;
> >> +	zdev->relaxed_align = response->relaxed_align;
> >>   
> >>   	switch (response->version) {
> >>   	case 1:  
> > 
> > Hm, what does that 'relaxed alignment' imply? Is that something that
> > can apply to emulated devices as well?
> >   
> The relaxed alignment simply loosens the rules on the PCISTB instruction 
> so that it doesn't have to be on particular boundaries / have a minimum 
> length restriction, these effectively allow ISM devices to use PCISTB 
> instead of PCISTG for just about everything.  If you have a look at the 
> patch "s390x/pci: Handle devices that support relaxed alignment" from 
> the linked qemu set, you can get an idea of what the bit changes via the 
> way qemu has to be more permissive of what the guest provides for PCISTB.

Ok, so it is basically a characteristic of a specific device that
changes the rules of what pcistb will accept.

> 
> Re: emulated devices...  The S390 PCI I/O layer in the guest is always 
> issuing strict? aligned I/O for PCISTB, and if it decided to later 
> change that behavior it would need to look at this CLP bit to decide 
> whether that would be a valid operation for a given PCI function anyway. 
>   This bit will remain off in the CLP response we give for emulated 
> devices, ensuring that should such a change occur in the guest s390 PCI 
> I/O layer, we'd just continue getting strictly-aligned PCISTB.

My question was more whether that was a feature that might make sense
to emulate on the hypervisor side for fully emulated devices. I'd like
to leave the door open for emulated devices to advertise this and make
it possible for guests using those devices to use pcistb with the
relaxed rules, if it makes sense.

I guess we can easily retrofit this if we come up with a use case for it.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-10 16:14     ` Niklas Schnelle
@ 2020-12-11 14:14       ` Cornelia Huck
  0 siblings, 0 replies; 19+ messages in thread
From: Cornelia Huck @ 2020-12-11 14:14 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, alex.williamson, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Thu, 10 Dec 2020 17:14:24 +0100
Niklas Schnelle <schnelle@linux.ibm.com> wrote:

> On 12/10/20 4:51 PM, Matthew Rosato wrote:
> > On 12/10/20 7:33 AM, Cornelia Huck wrote:  
> >> On Wed,  9 Dec 2020 15:27:46 -0500
> >> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >>  
> >>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> >>> QEMU will reject the device due to an (inappropriate) MSI-X check.
> >>> However, in an effort to enable ISM device passthrough, I realized that the
> >>> manner in which ISM performs block write operations is highly incompatible
> >>> with the way that QEMU s390 PCI instruction interception and
> >>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> >>> devices have particular requirements in regards to the alignment, size and
> >>> order of writes performed.  Furthermore, they require that legacy/non-MIO
> >>> s390 PCI instructions are used, which is also not guaranteed when the I/O
> >>> is passed through the typical userspace channels.  
> >>
> >> The part about the non-MIO instructions confuses me. How can MIO
> >> instructions be generated with the current code, and why does changing  
> > 
> > So to be clear, they are not being generated at all in the guest as the necessary facility is reported as unavailable.
> > 
> > Let's talk about Linux in LPAR / the host kernel:  When hardware that supports MIO instructions is available, all userspace I/O traffic is going to be routed through the MIO variants of the s390 PCI instructions.  This is working well for other device types, but does not work for ISM which does not support these variants.  However, the ISM driver also does not invoke the userspace I/O routines for the kernel, it invokes the s390 PCI layer directly, which in turn ensures the proper PCI instructions are used -- This approach falls apart when the guest ISM driver invokes those routines in the guest -- we (qemu) pass those non-MIO instructions from the guest as memory operations through vfio-pci, traversing through the vfio I/O layer in the guest (vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI layer -- where the MIO variant is used because the facility is available.  
> 
> Slight clarification since I think the word "userspace" is a bit overloaded as
> KVM folks often use it to talk about the guest even when that calls through vfio.
> Application userspace (i.e. things like DPDK) can use PCI MIO Load/Stores
> directly on mmap()ed/ioremap()ed memory these don't go through the Kernel at
> all.
> QEMU while also in userspace on the other hand goes through the vfio_bar_rw()
> region which uses the common code _Kernel_ ioread()/iowrite() API. This Kernel
> ioread()/iowrite() API uses PCI MIO Load/Stores by default on machines that
> support them (z15 currently).  The ISM driver, knowing that its device does not
> support MIO, goes around this API and directly calls zpci_store()/zpci_load().

Ok, thanks for the explanation.

> 
> 
> > 
> > Per conversations with Niklas (on CC), it's not trivial to decide by the time we reach the s390 PCI I/O layer to switch gears and use the non-MIO instruction set.  
> 
> Yes, we have some ideas about dynamically switching to legacy PCI stores in
> ioread()/iowrite() for devices that are set up for it but since that only gets
> an ioremap()ed address, a value and a size it would evolve such nasty things as
> looking at this virtual address to determine if it includes a ZPCI_ADDR()
> cookie that we use to get to the function handle needed for the legacy PCI
> Load/Stores, while MIO PCI Load/Stores directly work on virtual addresses.
> 
> Now purely for the Kernel API we think this could work since that always
> allocates between VMALLOC_START and VMALLOC_END and we control where we put the
> ZPCI_ADDR() cookie but I'm very hesitant to add something like that.
> 
> As for application userspace (DPDK) we do have a syscall
> (arch/s390/pci/pci_mmio.c) API that had a similar problem but we could make use
> of the fact that our Architecture is pretty nifty with address spaces and just
> execute the MIO PCI Load/Store in the syscall _as if_ by the calling userspace
> application.

Is ISM (currently) the only device that needs to use the non-MIO
instructions, or are there others as well? Is there any characteristic
that a meta driver like vfio could discover, or is it a device quirk
you just need to know about?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-10 15:51   ` Matthew Rosato
  2020-12-10 16:14     ` Niklas Schnelle
@ 2020-12-11 14:35     ` Cornelia Huck
  2020-12-11 15:01       ` Matthew Rosato
  1 sibling, 1 reply; 19+ messages in thread
From: Cornelia Huck @ 2020-12-11 14:35 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Thu, 10 Dec 2020 10:51:23 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 12/10/20 7:33 AM, Cornelia Huck wrote:
> > On Wed,  9 Dec 2020 15:27:46 -0500
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >   
> >> Today, ISM devices are completely disallowed for vfio-pci passthrough as
> >> QEMU will reject the device due to an (inappropriate) MSI-X check.
> >> However, in an effort to enable ISM device passthrough, I realized that the
> >> manner in which ISM performs block write operations is highly incompatible
> >> with the way that QEMU s390 PCI instruction interception and
> >> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
> >> devices have particular requirements in regards to the alignment, size and
> >> order of writes performed.  Furthermore, they require that legacy/non-MIO
> >> s390 PCI instructions are used, which is also not guaranteed when the I/O
> >> is passed through the typical userspace channels.  
> > 
> > The part about the non-MIO instructions confuses me. How can MIO
> > instructions be generated with the current code, and why does changing  
> 
> So to be clear, they are not being generated at all in the guest as the 
> necessary facility is reported as unavailable.
> 
> Let's talk about Linux in LPAR / the host kernel:  When hardware that 
> supports MIO instructions is available, all userspace I/O traffic is 
> going to be routed through the MIO variants of the s390 PCI 
> instructions.  This is working well for other device types, but does not 
> work for ISM which does not support these variants.  However, the ISM 
> driver also does not invoke the userspace I/O routines for the kernel, 
> it invokes the s390 PCI layer directly, which in turn ensures the proper 
> PCI instructions are used -- This approach falls apart when the guest 
> ISM driver invokes those routines in the guest -- we (qemu) pass those 
> non-MIO instructions from the guest as memory operations through 
> vfio-pci, traversing through the vfio I/O layer in the guest 
> (vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI 
> layer -- where the MIO variant is used because the facility is available.
> 
> Per conversations with Niklas (on CC), it's not trivial to decide by the 
> time we reach the s390 PCI I/O layer to switch gears and use the non-MIO 
> instruction set.
> 
> > the write pattern help?  
> 
> The write pattern is a separate issue from non-MIO instruction 
> requirements...  Certain address spaces require specific instructions to 
> be used (so, no substituting PCISTG for PCISTB - that happens too by 
> default for any writes coming into the host s390 PCI layer that are 
> <=8B, and they all are when the PCISTB is broken up into 8B memory 
> operations that travel through vfio_pci_bar_rw, which further breaks 
> those up into 4B operations).  There's also a requirement for some 
> writes that the data, if broken up, be written in a certain order in 
> order to properly trigger events. :(  The ability to pass the entire 
> PCISTB payload vs breaking it into 8B chunks is also significantly faster.

Let me summarize this to make sure I understand this new region
correctly:

- some devices may have relaxed alignment/length requirements for
  pcistb (and friends?)
- some devices may actually require writes to be done in a large chunk
  instead of being broken up (is that a strict subset of the devices
  above?)
- some devices do not support the new MIO instructions (is that a
  subset of the relaxed alignment devices? I'm not familiar with the
  MIO instructions)

The patchsets introduce a new region that (a) is used by QEMU to submit
writes in one go, and (b) makes sure to call into the non-MIO
instructions directly; it's basically killing two birds with one stone
for ISM devices. Are these two requirements (large writes and non-MIO)
always going hand-in-hand, or is ISM just an odd device?

If there's an expectation that the new region will always use the
non-MIO instructions (in addition to the changed write handling), it
should be noted in the description for the region as well.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-11 14:35     ` Cornelia Huck
@ 2020-12-11 15:01       ` Matthew Rosato
  2020-12-11 15:04         ` Matthew Rosato
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2020-12-11 15:01 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 12/11/20 9:35 AM, Cornelia Huck wrote:
> On Thu, 10 Dec 2020 10:51:23 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 12/10/20 7:33 AM, Cornelia Huck wrote:
>>> On Wed,  9 Dec 2020 15:27:46 -0500
>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>    
>>>> Today, ISM devices are completely disallowed for vfio-pci passthrough as
>>>> QEMU will reject the device due to an (inappropriate) MSI-X check.
>>>> However, in an effort to enable ISM device passthrough, I realized that the
>>>> manner in which ISM performs block write operations is highly incompatible
>>>> with the way that QEMU s390 PCI instruction interception and
>>>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations -- ISM
>>>> devices have particular requirements in regards to the alignment, size and
>>>> order of writes performed.  Furthermore, they require that legacy/non-MIO
>>>> s390 PCI instructions are used, which is also not guaranteed when the I/O
>>>> is passed through the typical userspace channels.
>>>
>>> The part about the non-MIO instructions confuses me. How can MIO
>>> instructions be generated with the current code, and why does changing
>>
>> So to be clear, they are not being generated at all in the guest as the
>> necessary facility is reported as unavailable.
>>
>> Let's talk about Linux in LPAR / the host kernel:  When hardware that
>> supports MIO instructions is available, all userspace I/O traffic is
>> going to be routed through the MIO variants of the s390 PCI
>> instructions.  This is working well for other device types, but does not
>> work for ISM which does not support these variants.  However, the ISM
>> driver also does not invoke the userspace I/O routines for the kernel,
>> it invokes the s390 PCI layer directly, which in turn ensures the proper
>> PCI instructions are used -- This approach falls apart when the guest
>> ISM driver invokes those routines in the guest -- we (qemu) pass those
>> non-MIO instructions from the guest as memory operations through
>> vfio-pci, traversing through the vfio I/O layer in the guest
>> (vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI
>> layer -- where the MIO variant is used because the facility is available.
>>
>> Per conversations with Niklas (on CC), it's not trivial to decide by the
>> time we reach the s390 PCI I/O layer to switch gears and use the non-MIO
>> instruction set.
>>
>>> the write pattern help?
>>
>> The write pattern is a separate issue from non-MIO instruction
>> requirements...  Certain address spaces require specific instructions to
>> be used (so, no substituting PCISTG for PCISTB - that happens too by
>> default for any writes coming into the host s390 PCI layer that are
>> <=8B, and they all are when the PCISTB is broken up into 8B memory
>> operations that travel through vfio_pci_bar_rw, which further breaks
>> those up into 4B operations).  There's also a requirement for some
>> writes that the data, if broken up, be written in a certain order in
>> order to properly trigger events. :(  The ability to pass the entire
>> PCISTB payload vs breaking it into 8B chunks is also significantly faster.
> 
> Let me summarize this to make sure I understand this new region
> correctly:
> 
> - some devices may have relaxed alignment/length requirements for
>    pcistb (and friends?)

The relaxed alignment bit is really specific to PCISTB behavior, so the 
"and friends" doesn't apply there.

> - some devices may actually require writes to be done in a large chunk
>    instead of being broken up (is that a strict subset of the devices
>    above?)

Yes, this is specific to ISM devices, which are always a relaxed 
alignment/length device.

The inverse is an interesting question though (relaxed alignment devices 
that are not ISM, which you've posed as a possible future extension for 
emulated devices).  I'm not sure that any (real devices) exist where 
(relaxed_alignment && !ism), but 'what if' -- I guess the right approach 
would mean additional code in QEMU to handle relaxed alignment for the 
vfio mmio path as well (seen as pcistb_default in my qemu patchset) and 
being very specific in QEMU to only enable the region for an ism device.

> - some devices do not support the new MIO instructions (is that a
>    subset of the relaxed alignment devices? I'm not familiar with the
>    MIO instructions)
> 

The non-MIO requirement is again specific to ISM, which is a subset of 
the relaxed alignment devices.  In this case, the requirement is not 
limited to PCISTB, and that's why PCILG is also included here.  The ISM 
driver does not use PCISTG, and the only PCISTG instructions coming from 
the guest against an ISM device would be against the config space and 
those are OK to go through vfio still; so what was provided via the 
region is effectively the bare-minimum requirement to allow ISM to 
function properly in the guest.

> The patchsets introduce a new region that (a) is used by QEMU to submit
> writes in one go, and (b) makes sure to call into the non-MIO
> instructions directly; it's basically killing two birds with one stone
> for ISM devices. Are these two requirements (large writes and non-MIO)
> always going hand-in-hand, or is ISM just an odd device?

I would say that ISM is definitely a special-case device, even just 
looking at the way it's implemented in the base kernel (interacting 
directly with the s390 kernel PCI layer in order to avoid use of MIO 
instructions -- no other driver does this).  But that said, having the 
two requirements hand-in-hand I think is not bad, though -- This 
approach ensures the specific instruction the guest wanted (or in this 
case, needed) is actually executed on the underlying host.

That said, the ability to re-use the large write for other devices would 
be nice -- but as hinted in the QEMU cover letter, this approach only 
works because ISM does not support MSI-X; using this approach for 
MSI-X-enabled devices breaks the MSI-X masking that vfio-pci does in 
QEMU (I tried an approach that used this region approach for all 3 
instructions as a test, PCISTG/PCISTB/PCILG, and threw it against mlx -- 
any writes against an MSI-X enabled bar will miss the msi-x notifiers 
since we aren't performing memory operations against the typical 
vfio-pci bar).

> 
> If there's an expectation that the new region will always use the
> non-MIO instructions (in addition to the changed write handling), it
> should be noted in the description for the region as well.
> 

Yes, this is indeed the expectation; I can clarify that.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-11 15:01       ` Matthew Rosato
@ 2020-12-11 15:04         ` Matthew Rosato
  2020-12-17 12:59           ` Cornelia Huck
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2020-12-11 15:04 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 12/11/20 10:01 AM, Matthew Rosato wrote:
> On 12/11/20 9:35 AM, Cornelia Huck wrote:
>> On Thu, 10 Dec 2020 10:51:23 -0500
>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>
>>> On 12/10/20 7:33 AM, Cornelia Huck wrote:
>>>> On Wed,  9 Dec 2020 15:27:46 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>> Today, ISM devices are completely disallowed for vfio-pci 
>>>>> passthrough as
>>>>> QEMU will reject the device due to an (inappropriate) MSI-X check.
>>>>> However, in an effort to enable ISM device passthrough, I realized 
>>>>> that the
>>>>> manner in which ISM performs block write operations is highly 
>>>>> incompatible
>>>>> with the way that QEMU s390 PCI instruction interception and
>>>>> vfio_pci_bar_rw break up I/O operations into 8B and 4B operations 
>>>>> -- ISM
>>>>> devices have particular requirements in regards to the alignment, 
>>>>> size and
>>>>> order of writes performed.  Furthermore, they require that 
>>>>> legacy/non-MIO
>>>>> s390 PCI instructions are used, which is also not guaranteed when 
>>>>> the I/O
>>>>> is passed through the typical userspace channels.
>>>>
>>>> The part about the non-MIO instructions confuses me. How can MIO
>>>> instructions be generated with the current code, and why does changing
>>>
>>> So to be clear, they are not being generated at all in the guest as the
>>> necessary facility is reported as unavailable.
>>>
>>> Let's talk about Linux in LPAR / the host kernel:  When hardware that
>>> supports MIO instructions is available, all userspace I/O traffic is
>>> going to be routed through the MIO variants of the s390 PCI
>>> instructions.  This is working well for other device types, but does not
>>> work for ISM which does not support these variants.  However, the ISM
>>> driver also does not invoke the userspace I/O routines for the kernel,
>>> it invokes the s390 PCI layer directly, which in turn ensures the proper
>>> PCI instructions are used -- This approach falls apart when the guest
>>> ISM driver invokes those routines in the guest -- we (qemu) pass those
>>> non-MIO instructions from the guest as memory operations through
>>> vfio-pci, traversing through the vfio I/O layer in the guest
>>> (vfio_pci_bar_rw and friends), where we then arrive in the host s390 PCI
>>> layer -- where the MIO variant is used because the facility is 
>>> available.
>>>
>>> Per conversations with Niklas (on CC), it's not trivial to decide by the
>>> time we reach the s390 PCI I/O layer to switch gears and use the non-MIO
>>> instruction set.
>>>
>>>> the write pattern help?
>>>
>>> The write pattern is a separate issue from non-MIO instruction
>>> requirements...  Certain address spaces require specific instructions to
>>> be used (so, no substituting PCISTG for PCISTB - that happens too by
>>> default for any writes coming into the host s390 PCI layer that are
>>> <=8B, and they all are when the PCISTB is broken up into 8B memory
>>> operations that travel through vfio_pci_bar_rw, which further breaks
>>> those up into 4B operations).  There's also a requirement for some
>>> writes that the data, if broken up, be written in a certain order in
>>> order to properly trigger events. :(  The ability to pass the entire
>>> PCISTB payload vs breaking it into 8B chunks is also significantly 
>>> faster.
>>
>> Let me summarize this to make sure I understand this new region
>> correctly:
>>
>> - some devices may have relaxed alignment/length requirements for
>>    pcistb (and friends?)
> 
> The relaxed alignment bit is really specific to PCISTB behavior, so the 
> "and friends" doesn't apply there.
> 
>> - some devices may actually require writes to be done in a large chunk
>>    instead of being broken up (is that a strict subset of the devices
>>    above?)
> 
> Yes, this is specific to ISM devices, which are always a relaxed 
> alignment/length device.
> 
> The inverse is an interesting question though (relaxed alignment devices 
> that are not ISM, which you've posed as a possible future extension for 
> emulated devices).  I'm not sure that any (real devices) exist where 
> (relaxed_alignment && !ism), but 'what if' -- I guess the right approach 
> would mean additional code in QEMU to handle relaxed alignment for the 
> vfio mmio path as well (seen as pcistb_default in my qemu patchset) and 
> being very specific in QEMU to only enable the region for an ism device.

Let me be more precise there...  It would be additional code to handle 
relaxed alignment for the default pcistb path (pcistb_default) which 
would include BOTH emulated devices (should we ever surface the relaxed 
alignment CLP bit and the guest kernel honor it) as well as any s390x 
vfio-pci device that doesn't use this new I/O region described here.

> 
>> - some devices do not support the new MIO instructions (is that a
>>    subset of the relaxed alignment devices? I'm not familiar with the
>>    MIO instructions)
>>
> 
> The non-MIO requirement is again specific to ISM, which is a subset of 
> the relaxed alignment devices.  In this case, the requirement is not 
> limited to PCISTB, and that's why PCILG is also included here.  The ISM 
> driver does not use PCISTG, and the only PCISTG instructions coming from 
> the guest against an ISM device would be against the config space and 
> those are OK to go through vfio still; so what was provided via the 
> region is effectively the bare-minimum requirement to allow ISM to 
> function properly in the guest.
> 
>> The patchsets introduce a new region that (a) is used by QEMU to submit
>> writes in one go, and (b) makes sure to call into the non-MIO
>> instructions directly; it's basically killing two birds with one stone
>> for ISM devices. Are these two requirements (large writes and non-MIO)
>> always going hand-in-hand, or is ISM just an odd device?
> 
> I would say that ISM is definitely a special-case device, even just 
> looking at the way it's implemented in the base kernel (interacting 
> directly with the s390 kernel PCI layer in order to avoid use of MIO 
> instructions -- no other driver does this).  But that said, having the 
> two requirements hand-in-hand I think is not bad, though -- This 
> approach ensures the specific instruction the guest wanted (or in this 
> case, needed) is actually executed on the underlying host.
> 
> That said, the ability to re-use the large write for other devices would 
> be nice -- but as hinted in the QEMU cover letter, this approach only 
> works because ISM does not support MSI-X; using this approach for 
> MSI-X-enabled devices breaks the MSI-X masking that vfio-pci does in 
> QEMU (I tried an approach that used this region approach for all 3 
> instructions as a test, PCISTG/PCISTB/PCILG, and threw it against mlx -- 
> any writes against an MSI-X enabled bar will miss the msi-x notifiers 
> since we aren't performing memory operations against the typical 
> vfio-pci bar).
> 
>>
>> If there's an expectation that the new region will always use the
>> non-MIO instructions (in addition to the changed write handling), it
>> should be noted in the description for the region as well.
>>
> 
> Yes, this is indeed the expectation; I can clarify that.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-11 15:04         ` Matthew Rosato
@ 2020-12-17 12:59           ` Cornelia Huck
  2020-12-17 16:04             ` Matthew Rosato
  0 siblings, 1 reply; 19+ messages in thread
From: Cornelia Huck @ 2020-12-17 12:59 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Fri, 11 Dec 2020 10:04:43 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 12/11/20 10:01 AM, Matthew Rosato wrote:
> > On 12/11/20 9:35 AM, Cornelia Huck wrote:  

> >> Let me summarize this to make sure I understand this new region
> >> correctly:
> >>
> >> - some devices may have relaxed alignment/length requirements for
> >>    pcistb (and friends?)  
> > 
> > The relaxed alignment bit is really specific to PCISTB behavior, so the 
> > "and friends" doesn't apply there.

Ok.

> >   
> >> - some devices may actually require writes to be done in a large chunk
> >>    instead of being broken up (is that a strict subset of the devices
> >>    above?)  
> > 
> > Yes, this is specific to ISM devices, which are always a relaxed 
> > alignment/length device.
> > 
> > The inverse is an interesting question though (relaxed alignment devices 
> > that are not ISM, which you've posed as a possible future extension for 
> > emulated devices).  I'm not sure that any (real devices) exist where 
> > (relaxed_alignment && !ism), but 'what if' -- I guess the right approach 
> > would mean additional code in QEMU to handle relaxed alignment for the 
> > vfio mmio path as well (seen as pcistb_default in my qemu patchset) and 
> > being very specific in QEMU to only enable the region for an ism device.  
> 
> Let me be more precise there...  It would be additional code to handle 
> relaxed alignment for the default pcistb path (pcistb_default) which 
> would include BOTH emulated devices (should we ever surface the relaxed 
> alignment CLP bit and the guest kernel honor it) as well as any s390x 
> vfio-pci device that doesn't use this new I/O region described here.

Understood. Not sure if it is useful, but the important part is that we
could extend it if we wanted.

> 
> >   
> >> - some devices do not support the new MIO instructions (is that a
> >>    subset of the relaxed alignment devices? I'm not familiar with the
> >>    MIO instructions)
> >>  
> > 
> > The non-MIO requirement is again specific to ISM, which is a subset of 
> > the relaxed alignment devices.  In this case, the requirement is not 
> > limited to PCISTB, and that's why PCILG is also included here.  The ISM 
> > driver does not use PCISTG, and the only PCISTG instructions coming from 
> > the guest against an ISM device would be against the config space and 
> > those are OK to go through vfio still; so what was provided via the 
> > region is effectively the bare-minimum requirement to allow ISM to 
> > function properly in the guest.
> >   
> >> The patchsets introduce a new region that (a) is used by QEMU to submit
> >> writes in one go, and (b) makes sure to call into the non-MIO
> >> instructions directly; it's basically killing two birds with one stone
> >> for ISM devices. Are these two requirements (large writes and non-MIO)
> >> always going hand-in-hand, or is ISM just an odd device?  
> > 
> > I would say that ISM is definitely a special-case device, even just 
> > looking at the way it's implemented in the base kernel (interacting 
> > directly with the s390 kernel PCI layer in order to avoid use of MIO 
> > instructions -- no other driver does this).  But that said, having the 
> > two requirements hand-in-hand I think is not bad, though -- This 
> > approach ensures the specific instruction the guest wanted (or in this 
> > case, needed) is actually executed on the underlying host.

The basic question I have is whether it makes sense to specialcase the
ISM device (can we even find out that we're dealing with an ISM device
here?) to force the non-MIO instructions, as it is just a device with
some special requirements, or tie non-MIO to relaxed alignment. (Could
relaxed alignment devices in theory be served by MIO instructions as
well?)

Another thing that came to my mind is whether we consider the guest to
be using a pci device and needing weird instructions to do that because
it's on s390, or whether it is issuing instructions for a device that
happens to be a pci device (sorry if that sounds a bit meta :)

> > 
> > That said, the ability to re-use the large write for other devices would 
> > be nice -- but as hinted in the QEMU cover letter, this approach only 
> > works because ISM does not support MSI-X; using this approach for 
> > MSI-X-enabled devices breaks the MSI-X masking that vfio-pci does in 
> > QEMU (I tried an approach that used this region approach for all 3 
> > instructions as a test, PCISTG/PCISTB/PCILG, and threw it against mlx -- 
> > any writes against an MSI-X enabled bar will miss the msi-x notifiers 
> > since we aren't performing memory operations against the typical 
> > vfio-pci bar).

Ugh. I wonder why ISM is so different from anything else.

> >   
> >>
> >> If there's an expectation that the new region will always use the
> >> non-MIO instructions (in addition to the changed write handling), it
> >> should be noted in the description for the region as well.
> >>  
> > 
> > Yes, this is indeed the expectation; I can clarify that.
> >   

Thanks!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-17 12:59           ` Cornelia Huck
@ 2020-12-17 16:04             ` Matthew Rosato
  2020-12-22 16:18               ` Cornelia Huck
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Rosato @ 2020-12-17 16:04 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On 12/17/20 7:59 AM, Cornelia Huck wrote:
> On Fri, 11 Dec 2020 10:04:43 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 12/11/20 10:01 AM, Matthew Rosato wrote:
>>> On 12/11/20 9:35 AM, Cornelia Huck wrote:
> 
>>>> Let me summarize this to make sure I understand this new region
>>>> correctly:
>>>>
>>>> - some devices may have relaxed alignment/length requirements for
>>>>     pcistb (and friends?)
>>>
>>> The relaxed alignment bit is really specific to PCISTB behavior, so the
>>> "and friends" doesn't apply there.
> 
> Ok.
> 
>>>    
>>>> - some devices may actually require writes to be done in a large chunk
>>>>     instead of being broken up (is that a strict subset of the devices
>>>>     above?)
>>>
>>> Yes, this is specific to ISM devices, which are always a relaxed
>>> alignment/length device.
>>>
>>> The inverse is an interesting question though (relaxed alignment devices
>>> that are not ISM, which you've posed as a possible future extension for
>>> emulated devices).  I'm not sure that any (real devices) exist where
>>> (relaxed_alignment && !ism), but 'what if' -- I guess the right approach
>>> would mean additional code in QEMU to handle relaxed alignment for the
>>> vfio mmio path as well (seen as pcistb_default in my qemu patchset) and
>>> being very specific in QEMU to only enable the region for an ism device.
>>
>> Let me be more precise there...  It would be additional code to handle
>> relaxed alignment for the default pcistb path (pcistb_default) which
>> would include BOTH emulated devices (should we ever surface the relaxed
>> alignment CLP bit and the guest kernel honor it) as well as any s390x
>> vfio-pci device that doesn't use this new I/O region described here.
> 
> Understood. Not sure if it is useful, but the important part is that we
> could extend it if we wanted.
> 
>>
>>>    
>>>> - some devices do not support the new MIO instructions (is that a
>>>>     subset of the relaxed alignment devices? I'm not familiar with the
>>>>     MIO instructions)
>>>>   
>>>
>>> The non-MIO requirement is again specific to ISM, which is a subset of
>>> the relaxed alignment devices.  In this case, the requirement is not
>>> limited to PCISTB, and that's why PCILG is also included here.  The ISM
>>> driver does not use PCISTG, and the only PCISTG instructions coming from
>>> the guest against an ISM device would be against the config space and
>>> those are OK to go through vfio still; so what was provided via the
>>> region is effectively the bare-minimum requirement to allow ISM to
>>> function properly in the guest.
>>>    
>>>> The patchsets introduce a new region that (a) is used by QEMU to submit
>>>> writes in one go, and (b) makes sure to call into the non-MIO
>>>> instructions directly; it's basically killing two birds with one stone
>>>> for ISM devices. Are these two requirements (large writes and non-MIO)
>>>> always going hand-in-hand, or is ISM just an odd device?
>>>
>>> I would say that ISM is definitely a special-case device, even just
>>> looking at the way it's implemented in the base kernel (interacting
>>> directly with the s390 kernel PCI layer in order to avoid use of MIO
>>> instructions -- no other driver does this).  But that said, having the
>>> two requirements hand-in-hand I think is not bad, though -- This
>>> approach ensures the specific instruction the guest wanted (or in this
>>> case, needed) is actually executed on the underlying host.
> 
> The basic question I have is whether it makes sense to specialcase the
> ISM device (can we even find out that we're dealing with an ISM device
> here?) to force the non-MIO instructions, as it is just a device with

Yes, with the addition of the CLP data passed from the host via vfio 
capabilities, we can tell this is an ISM device specifically via the 
'pft' field in VFOI_DEVICE_INFO_CAP_ZPCI_BASE.  We don't actually 
surface that field to the guest itself in the Q PCI FN clp rsponse (has 
to do with Function Measurement Block requirements) but we can certainly 
use that information in QEMU to restrict this behavior to only ISM devices.

> some special requirements, or tie non-MIO to relaxed alignment. (Could
> relaxed alignment devices in theory be served by MIO instructions as
> well?)

In practice, I think there are none today, but per the architecture it 
IS possible to have relaxed alignment devices served by MIO 
instructions, so we shouldn't rely on that bit alone as I'm doing in 
this RFC.  I think instead relying on the pft value as I mention above 
is what we have to do.

> 
> Another thing that came to my mind is whether we consider the guest to
> be using a pci device and needing weird instructions to do that because
> it's on s390, or whether it is issuing instructions for a device that
> happens to be a pci device (sorry if that sounds a bit meta :)
> 

Typically, I'd classify things as the former but I think ISM seems more 
like the latter -- To me, ISM seems like less a classic PCI device and 
more a device that happens to be using s390 PCI interfaces to accomplish 
its goal.  But it's probably more of a case of this particular device 
(and it's driver) are s390-specific and therefore built with the unique 
s390 interface in-mind (and in fact invokes it directly rather than 
through the general PCI layer), rather than fitting the typical PCI 
device architecture on top of the s390 interface.


>>>
>>> That said, the ability to re-use the large write for other devices would
>>> be nice -- but as hinted in the QEMU cover letter, this approach only
>>> works because ISM does not support MSI-X; using this approach for
>>> MSI-X-enabled devices breaks the MSI-X masking that vfio-pci does in
>>> QEMU (I tried an approach that used this region approach for all 3
>>> instructions as a test, PCISTG/PCISTB/PCILG, and threw it against mlx --
>>> any writes against an MSI-X enabled bar will miss the msi-x notifiers
>>> since we aren't performing memory operations against the typical
>>> vfio-pci bar).
> 
> Ugh. I wonder why ISM is so different from anything else.
> 

...  I've asked myself that alot lately :)

>>>    
>>>>
>>>> If there's an expectation that the new region will always use the
>>>> non-MIO instructions (in addition to the changed write handling), it
>>>> should be noted in the description for the region as well.
>>>>   
>>>
>>> Yes, this is indeed the expectation; I can clarify that.
>>>    
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support
  2020-12-17 16:04             ` Matthew Rosato
@ 2020-12-22 16:18               ` Cornelia Huck
  0 siblings, 0 replies; 19+ messages in thread
From: Cornelia Huck @ 2020-12-22 16:18 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: alex.williamson, schnelle, pmorel, borntraeger, hca, gor,
	gerald.schaefer, linux-s390, kvm, linux-kernel

On Thu, 17 Dec 2020 11:04:48 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 12/17/20 7:59 AM, Cornelia Huck wrote:
> > The basic question I have is whether it makes sense to specialcase the
> > ISM device (can we even find out that we're dealing with an ISM device
> > here?) to force the non-MIO instructions, as it is just a device with  
> 
> Yes, with the addition of the CLP data passed from the host via vfio 
> capabilities, we can tell this is an ISM device specifically via the 
> 'pft' field in VFOI_DEVICE_INFO_CAP_ZPCI_BASE.  We don't actually 
> surface that field to the guest itself in the Q PCI FN clp rsponse (has 
> to do with Function Measurement Block requirements) but we can certainly 
> use that information in QEMU to restrict this behavior to only ISM devices.
> 
> > some special requirements, or tie non-MIO to relaxed alignment. (Could
> > relaxed alignment devices in theory be served by MIO instructions as
> > well?)  
> 
> In practice, I think there are none today, but per the architecture it 
> IS possible to have relaxed alignment devices served by MIO 
> instructions, so we shouldn't rely on that bit alone as I'm doing in 
> this RFC.  I think instead relying on the pft value as I mention above 
> is what we have to do.

From what you write this looks like the best way to me as well.

> 
> > 
> > Another thing that came to my mind is whether we consider the guest to
> > be using a pci device and needing weird instructions to do that because
> > it's on s390, or whether it is issuing instructions for a device that
> > happens to be a pci device (sorry if that sounds a bit meta :)
> >   
> 
> Typically, I'd classify things as the former but I think ISM seems more 
> like the latter -- To me, ISM seems like less a classic PCI device and 
> more a device that happens to be using s390 PCI interfaces to accomplish 
> its goal.  But it's probably more of a case of this particular device 
> (and it's driver) are s390-specific and therefore built with the unique 
> s390 interface in-mind (and in fact invokes it directly rather than 
> through the general PCI layer), rather than fitting the typical PCI 
> device architecture on top of the s390 interface.

Nod, it certainly feels like that.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-12-22 16:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-09 20:27 [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
2020-12-09 20:27 ` [RFC 1/4] s390/pci: track alignment/length strictness for zpci_dev Matthew Rosato
2020-12-10 10:33   ` Cornelia Huck
2020-12-10 15:26     ` Matthew Rosato
2020-12-11 11:37       ` Cornelia Huck
2020-12-09 20:27 ` [RFC 2/4] vfio-pci/zdev: Pass the relaxed alignment flag Matthew Rosato
2020-12-09 20:27 ` [RFC 3/4] s390/pci: Get hardware-reported max store block length Matthew Rosato
2020-12-09 20:27 ` [RFC 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region Matthew Rosato
2020-12-09 20:52 ` [RFC 0/4] vfio-pci/zdev: Fixing s390 vfio-pci ISM support Matthew Rosato
2020-12-10 12:33 ` Cornelia Huck
2020-12-10 15:51   ` Matthew Rosato
2020-12-10 16:14     ` Niklas Schnelle
2020-12-11 14:14       ` Cornelia Huck
2020-12-11 14:35     ` Cornelia Huck
2020-12-11 15:01       ` Matthew Rosato
2020-12-11 15:04         ` Matthew Rosato
2020-12-17 12:59           ` Cornelia Huck
2020-12-17 16:04             ` Matthew Rosato
2020-12-22 16:18               ` Cornelia Huck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).