linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/3] EEH Support for VFIO PCI Device
@ 2014-05-27  8:40 Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 1/3] powerpc/eeh: Avoid event on passed PE Gavin Shan
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-27  8:40 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The series of patches intends to support EEH for PCI devices, which are
passed through to PowerKVM based guest via VFIO. The implementation is
straightforward based on the issues or problems we have to resolve to
support EEH for PowerKVM based guest.

- Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly.
  If QEMU can't handle it, the request will be sent to host via newly introduced
  VFIO container IOCTL command (VFIO_EEH_OP) and gets handled in host kernel.

The series of patches requires corresponding QEMU changes.

Change log
==========
v1 -> v2:
        * EEH RTAS requests are routed to QEMU, and then possiblly to host kerenl.
          The mechanism KVM in-kernel handling is dropped.
        * Error injection is reimplemented based syscall, instead of KVM in-kerenl
          handling. The logic for error injection token management is moved to
          QEMU. The error injection request is routed to QEMU and then possiblly
          to host kernel.
v2 -> v3:
        * Make the fields in struct eeh_vfio_pci_addr, struct vfio_eeh_info based
          on the comments from Alexey.
        * Define macros for EEH VFIO operations (Alexey).
        * Clear frozen state after successful PE reset. 
        * Merge original [PATCH 1/2/3] to one.
v3 -> v4:
        * Remove the error injection from the patchset. Mike or I will work on that
          later.
        * Rename CONFIG_VFIO_EEH to VFIO_PCI_EEH.
        * Rename the IOCTL command to VFIO_EEH_OP and it's handled by VFIO-PCI device
          instead of VFIO container.
        * Rename the IOCTL argument structure to "vfio_eeh_op" accordingly. Also, more
          fields added to hold return values for RTAS requests.
        * The address mapping stuff is totally removed. When opening or releasing VFIO
          PCI device, notification sent to EEH to update the flags indicates the device
          is passed to guest or not.
        * Change pr_warn() to pr_debug() to avoid DOS as pointed by Alex.W
        * Argument size check issue pointed by Alex.W.
v4 -> v5:
        * Functions for VFIO PCI EEH support are moved to eeh.c and exported from there.
          VFIO PCI driver just uses those functions to tackle IOCTL command VFIO_EEH_OP.
          All of this is to make the code organized in a good way as suggested by Alex.G.
          Another potential benefit is PowerNV/pSeries are sharing "eeh_ops" and same
          infrastructure could possiblly work for KVM_PR and KVM_HV mode at the same time.
        * Don't clear error injection registers after finishing PE reset as the patchset
          is doing nothing related to error injection.
        * Amending Documentation/vfio.txt, which was missed in last revision.
        * No QEMU changes for this revision. "v4" works well. Also, remove "RFC" from the
          subject as the design is basically recognized.
v5 -> v6:
        * CONFIG_VFIO_PCI_EEH removed. Instead to use CONFIG_EEH.
        * Split one ioctl command to 5.
        * In eeh.c, description has been added for those exported functions. Also, the
          functions have negative return values for error and information with other values.
          All digital numbers have been replaced by macros defined in eeh.h. The comments,
          including the function names have been amended not to mention "guest" or "vfio".
        * Add one mutex to protect flag in eeh_dev_open()/release().
        * More information on how to use those ioctl commands to Documentation/vfio.txt.
v6 -> v7:
	* Remove ioctl command VFIO_EEH_PE_GET_ADDR, the PE address will be figured out
	  in userland (e.g. QEMU) as Alex.G suggested.
	* Let sPAPR VFIO container process the ioctl commands as VFIO container is naturally
	  corresponds to IOMMU group (aka PE on sPAPR platform).
	* All VFIO PCI EEH ioctl commands have "argsz+flags" for its companion data struct.
	* For VFIO PCI EEH ioctl commands, ioctl() returns negative number to indicate error
	  or zero for success. Additinal output information is transported by the companion
	  data struct.
	* Explaining PE in Documentation/vfio.txt, typo fixes, more comments suggested by
	  Alex.G.
	* Split/merge patches according to suggestions from Alex.G and Alex.W.
	* To have EEH stub in drivers/vfio/pci/, which was suggested by Alex.W.
	* Define various EEH options as macros in vfio.h for userland to use.

Gavin Shan (3):
  powerpc/eeh: Avoid event on passed PE
  powerpc/eeh: EEH support for VFIO PCI device
  drivers/vfio: EEH support for VFIO PCI device

Documentation/vfio.txt                    |  92 +++++++++++++++++++++++++++++++++++++++++++-
arch/powerpc/include/asm/eeh.h            |  47 +++++++++++++++++++++++
arch/powerpc/kernel/eeh.c                 | 294 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/eeh-ioda.c |   3 +-
drivers/vfio/pci/Makefile                 |   1 +
drivers/vfio/pci/vfio_pci.c               |  20 +++++++---
drivers/vfio/pci/vfio_pci_eeh.c           |  46 ++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h       |   5 +++
drivers/vfio/vfio_iommu_spapr_tce.c       |  85 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h                 |  66 ++++++++++++++++++++++++++++++++
10 files changed, 651 insertions(+), 8 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v7 1/3] powerpc/eeh: Avoid event on passed PE
  2014-05-27  8:40 [PATCH v7 0/3] EEH Support for VFIO PCI Device Gavin Shan
@ 2014-05-27  8:40 ` Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 2/3] powerpc/eeh: EEH support for VFIO PCI device Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 3/3] drivers/vfio: " Gavin Shan
  2 siblings, 0 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-27  8:40 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

If we detects frozen state on PE that has been passed through to somebody
else. we needn't handle it. Instead, we rely on the device's owner to
detect and recover it. The patch avoid EEH event on the frozen passed PE
so that the device's owner can have chance to handle that.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h            | 32 +++++++++++++++++++++++++++++++
 arch/powerpc/kernel/eeh.c                 |  8 ++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c |  3 ++-
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..34a2d83 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct device_node;
 #define EEH_PE_RESET		(1 << 2)	/* PE reset in progress	*/
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
+#define EEH_PE_PASSTHROUGH	(1 << 9)	/* PE owned by guest	*/
 
 struct eeh_pe {
 	int type;			/* PE type: PHB/Bus/Device	*/
@@ -93,6 +94,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
 		list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+	return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+	if (pe) {
+		if (passed)
+			pe->state |= EEH_PE_PASSTHROUGH;
+		else
+			pe->state &= ~EEH_PE_PASSTHROUGH;
+	}
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +126,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 #define EEH_DEV_REMOVED		(1 << 10)	/* Removed permanently	*/
 #define EEH_DEV_FRESET		(1 << 11)	/* Fundamental reset	*/
+#define EEH_DEV_PASSTHROUGH	(1 << 12)	/* Owned by guest	*/
 
 struct eeh_dev {
 	int mode;			/* EEH mode			*/
@@ -138,6 +155,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct eeh_dev *edev)
 	return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+	return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+	if (dev) {
+		if (passed)
+			dev->mode |= EEH_DEV_PASSTHROUGH;
+		else
+			dev->mode &= ~EEH_DEV_PASSTHROUGH;
+	}
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
 	EEH_NEXT_ERR_NONE = 0,
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..3bc8b12 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	if (ret > 0)
 		return ret;
 
+	/*
+	 * If the PE isn't owned by us, we shouldn't check the
+	 * state. Instead, let the owner handle it if the PE has
+	 * been frozen.
+	 */
+	if (eeh_pe_passed(pe))
+		return 0;
+
 	/* If we already have a pending isolation event for this
 	 * slot, we know it's bad already, we don't need to check.
 	 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index cab3e62..79193eb 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -892,7 +892,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
 					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
 				ret = EEH_NEXT_ERR_NONE;
-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
+				   eeh_pe_passed(*pe)) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
 				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v7 2/3] powerpc/eeh: EEH support for VFIO PCI device
  2014-05-27  8:40 [PATCH v7 0/3] EEH Support for VFIO PCI Device Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 1/3] powerpc/eeh: Avoid event on passed PE Gavin Shan
@ 2014-05-27  8:40 ` Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 3/3] drivers/vfio: " Gavin Shan
  2 siblings, 0 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-27  8:40 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The patch exports functions to be used by new ioctl commands, which
will be introduced in subsequent patch, to support EEH functinality
for VFIO PCI device.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  15 +++
 arch/powerpc/kernel/eeh.c      | 286 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 301 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 34a2d83..ffc95e7 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -26,6 +26,7 @@
 #include <linux/string.h>
 #include <linux/time.h>
 
+struct iommu_table;
 struct pci_dev;
 struct pci_bus;
 struct device_node;
@@ -191,6 +192,8 @@ enum {
 #define EEH_OPT_ENABLE		1	/* EEH enable	*/
 #define EEH_OPT_THAW_MMIO	2	/* MMIO enable	*/
 #define EEH_OPT_THAW_DMA	3	/* DMA enable	*/
+#define EEH_OPT_GET_PE_ADDR	0	/* Get PE addr	*/
+#define EEH_OPT_GET_PE_MODE	1	/* Get PE mode	*/
 #define EEH_STATE_UNAVAILABLE	(1 << 0)	/* State unavailable	*/
 #define EEH_STATE_NOT_SUPPORT	(1 << 1)	/* EEH not supported	*/
 #define EEH_STATE_RESET_ACTIVE	(1 << 2)	/* Active reset		*/
@@ -198,6 +201,11 @@ enum {
 #define EEH_STATE_DMA_ACTIVE	(1 << 4)	/* Active DMA		*/
 #define EEH_STATE_MMIO_ENABLED	(1 << 5)	/* MMIO enabled		*/
 #define EEH_STATE_DMA_ENABLED	(1 << 6)	/* DMA enabled		*/
+#define EEH_PE_STATE_NORMAL		0	/* Normal state		*/
+#define EEH_PE_STATE_RESET		1		/* PE reset	*/
+#define EEH_PE_STATE_STOPPED_IO_DMA	2		/* Stopped	*/
+#define EEH_PE_STATE_STOPPED_DMA	4		/* Stopped DMA	*/
+#define EEH_PE_STATE_UNAVAIL		5		/* Unavailable	*/
 #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
 #define EEH_RESET_HOT		1	/* Hot reset			*/
 #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
@@ -305,6 +313,13 @@ void eeh_add_device_late(struct pci_dev *);
 void eeh_add_device_tree_late(struct pci_bus *);
 void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_device(struct pci_dev *);
+int eeh_dev_open(struct pci_dev *pdev);
+void eeh_dev_release(struct pci_dev *pdev);
+struct eeh_pe *eeh_iommu_table_to_pe(struct iommu_table *tbl);
+int eeh_pe_set_option(struct eeh_pe *pe, int option);
+int eeh_pe_get_state(struct eeh_pe *pe);
+int eeh_pe_reset(struct eeh_pe *pe, int option);
+int eeh_pe_configure(struct eeh_pe *pe);
 
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3bc8b12..30693c1 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -40,6 +40,7 @@
 #include <asm/eeh.h>
 #include <asm/eeh_event.h>
 #include <asm/io.h>
+#include <asm/iommu.h>
 #include <asm/machdep.h>
 #include <asm/ppc-pci.h>
 #include <asm/rtas.h>
@@ -108,6 +109,9 @@ struct eeh_ops *eeh_ops = NULL;
 /* Lock to avoid races due to multiple reports of an error */
 DEFINE_RAW_SPINLOCK(confirm_error_lock);
 
+/* Lock to protect passed flags */
+static DEFINE_MUTEX(eeh_dev_mutex);
+
 /* Buffer for reporting pci register dumps. Its here in BSS, and
  * not dynamically alloced, so that it ends up in RMO where RTAS
  * can access it.
@@ -1106,6 +1110,288 @@ void eeh_remove_device(struct pci_dev *dev)
 	edev->mode &= ~EEH_DEV_SYSFS;
 }
 
+/**
+ * eeh_dev_open - Mark EEH device and PE as passed through
+ * @pdev: PCI device
+ *
+ * Mark the indicated EEH device and PE as passed through.
+ * In the result, the EEH errors detected on the PE won't be
+ * reported. The owner of the device will be responsible for
+ * detection and recovery.
+ */
+int eeh_dev_open(struct pci_dev *pdev)
+{
+	struct eeh_dev *edev;
+
+	mutex_lock(&eeh_dev_mutex);
+
+	/* No PCI device ? */
+	if (!pdev) {
+		mutex_unlock(&eeh_dev_mutex);
+		return -ENODEV;
+	}
+
+	/* No EEH device ? */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !edev->pe) {
+		mutex_unlock(&eeh_dev_mutex);
+		return -ENODEV;
+	}
+
+	eeh_dev_set_passed(edev, true);
+	eeh_pe_set_passed(edev->pe, true);
+	mutex_unlock(&eeh_dev_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eeh_dev_open);
+
+/**
+ * eeh_dev_release - Reclaim the ownership of EEH device
+ * @pdev: PCI device
+ *
+ * Reclaim ownership of EEH device, potentially the corresponding
+ * PE. In the result, the EEH errors detected on the PE will be
+ * reported and handled as usual.
+ */
+void eeh_dev_release(struct pci_dev *pdev)
+{
+	bool release_pe = true;
+	struct eeh_pe *pe = NULL;
+	struct eeh_dev *tmp, *edev;
+
+	mutex_lock(&eeh_dev_mutex);
+
+	/* No PCI device ? */
+	if (!pdev) {
+		mutex_unlock(&eeh_dev_mutex);
+		return;
+	}
+
+	/* No EEH device ? */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !eeh_dev_passed(edev) ||
+	    !edev->pe || !eeh_pe_passed(pe)) {
+		mutex_unlock(&eeh_dev_mutex);
+		return;
+	}
+
+	/* Release device */
+	pe = edev->pe;
+	eeh_dev_set_passed(edev, false);
+
+	/* Release PE */
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (eeh_dev_passed(edev)) {
+			release_pe = false;
+			break;
+		}
+	}
+
+	if (release_pe)
+		eeh_pe_set_passed(pe, false);
+
+	mutex_unlock(&eeh_dev_mutex);
+}
+EXPORT_SYMBOL(eeh_dev_release);
+
+/**
+ * eeh_iommu_table_to_pe - Convert IOMMU table to EEH PE
+ * @tbl: IOMMU table
+ *
+ * The routine is called to convert IOMMU table to EEH PE.
+ */
+struct eeh_pe *eeh_iommu_table_to_pe(struct iommu_table *tbl)
+{
+	struct pci_dev *pdev = NULL;
+	struct eeh_dev *edev;
+	bool found = false;
+
+	/* No IOMMU table ? */
+	if (!tbl)
+		return NULL;
+
+	/* No PCI device ? */
+	for_each_pci_dev(pdev) {
+		if (get_iommu_table_base(&pdev->dev) == tbl) {
+			found = true;
+			break;
+		}
+	}
+	if (!found)
+		return NULL;
+
+	/* No EEH device or PE ? */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !edev->pe)
+		return NULL;
+
+	return edev->pe;
+}
+
+/**
+ * eeh_pe_set_option - Set options for the indicated PE
+ * @pe: EEH PE
+ * @option: requested option
+ *
+ * The routine is called to enable or disable EEH functionality
+ * on the indicated PE, to enable IO or DMA for the frozen PE.
+ */
+int eeh_pe_set_option(struct eeh_pe *pe, int option)
+{
+	int ret = 0;
+
+	/* Invalid PE ? */
+	if (!pe)
+		return -ENODEV;
+
+	/*
+	 * EEH functionality could possibly be disabled, just
+	 * return error for the case. And the EEH functinality
+	 * isn't expected to be disabled on one specific PE.
+	 */
+	switch (option) {
+	case EEH_OPT_ENABLE:
+		if (eeh_enabled())
+			break;
+		ret = -EIO;
+		break;
+	case EEH_OPT_DISABLE:
+		break;
+	case EEH_OPT_THAW_MMIO:
+	case EEH_OPT_THAW_DMA:
+		if (!eeh_ops || !eeh_ops->set_option) {
+			ret = -ENOENT;
+			break;
+		}
+
+		ret = eeh_ops->set_option(pe, option);
+		break;
+	default:
+		pr_debug("%s: Option %d out of range (%d, %d)\n",
+			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_set_option);
+
+/**
+ * eeh_pe_get_state - Retrieve PE's state
+ * @pe: EEH PE
+ *
+ * Retrieve the PE's state, which includes 3 aspects: enabled
+ * DMA, enabled IO and asserted reset.
+ */
+int eeh_pe_get_state(struct eeh_pe *pe)
+{
+	int result, ret = 0;
+
+	/* Existing PE ? */
+	if (!pe)
+		return -ENODEV;
+
+	if (!eeh_ops || !eeh_ops->get_state)
+		return -ENOENT;
+
+	result = eeh_ops->get_state(pe, NULL);
+	if (!(result & EEH_STATE_RESET_ACTIVE) &&
+	     (result & EEH_STATE_DMA_ENABLED) &&
+	     (result & EEH_STATE_MMIO_ENABLED))
+		ret = EEH_PE_STATE_NORMAL;
+	else if (result & EEH_STATE_RESET_ACTIVE)
+		ret = EEH_PE_STATE_RESET;
+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+		 !(result & EEH_STATE_DMA_ENABLED) &&
+		 !(result & EEH_STATE_MMIO_ENABLED))
+		ret = EEH_PE_STATE_STOPPED_IO_DMA;
+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+		 (result & EEH_STATE_DMA_ENABLED) &&
+		 !(result & EEH_STATE_MMIO_ENABLED))
+		ret = EEH_PE_STATE_STOPPED_DMA;
+	else
+		ret = EEH_PE_STATE_UNAVAIL;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_get_state);
+
+/**
+ * eeh_pe_reset - Issue PE reset according to specified type
+ * @pe: EEH PE
+ * @option: reset type
+ *
+ * The routine is called to reset the specified PE with the
+ * indicated type, either fundamental reset or hot reset.
+ * PE reset is the most important part for error recovery.
+ */
+int eeh_pe_reset(struct eeh_pe *pe, int option)
+{
+	int ret = 0;
+
+	/* Invalid PE ? */
+	if (!pe)
+		return -ENODEV;
+
+	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset)
+		return -ENOENT;
+
+	switch (option) {
+	case EEH_RESET_DEACTIVATE:
+		ret = eeh_ops->reset(pe, option);
+		if (ret)
+			break;
+
+		/*
+		 * The PE is still in frozen state and we need to clear
+		 * that. It's good to clear frozen state after deassert
+		 * to avoid messy IO access during reset, which might
+		 * cause recursive frozen PE.
+		 */
+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
+		if (!ret)
+			ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
+		if (!ret)
+			eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+		break;
+	case EEH_RESET_HOT:
+	case EEH_RESET_FUNDAMENTAL:
+		ret = eeh_ops->reset(pe, option);
+		break;
+	default:
+		pr_debug("%s: Unsupported option %d\n",
+			__func__, option);
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_reset);
+
+/**
+ * eeh_pe_configure - Configure PCI bridges after PE reset
+ * @pe: EEH PE
+ *
+ * The routine is called to restore the PCI config space for
+ * those PCI devices, especially PCI bridges affected by PE
+ * reset issued previously.
+ */
+int eeh_pe_configure(struct eeh_pe *pe)
+{
+	int ret = 0;
+
+	/* Invalid PE ? */
+	if (!pe)
+		return -ENODEV;
+
+	/* Restore config space for the affected devices */
+	eeh_pe_restore_bars(pe);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_configure);
+
 static int proc_eeh_show(struct seq_file *m, void *v)
 {
 	if (!eeh_enabled()) {
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27  8:40 [PATCH v7 0/3] EEH Support for VFIO PCI Device Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 1/3] powerpc/eeh: Avoid event on passed PE Gavin Shan
  2014-05-27  8:40 ` [PATCH v7 2/3] powerpc/eeh: EEH support for VFIO PCI device Gavin Shan
@ 2014-05-27  8:40 ` Gavin Shan
  2014-05-27 18:15   ` Alex Williamson
  2 siblings, 1 reply; 31+ messages in thread
From: Gavin Shan @ 2014-05-27  8:40 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
 drivers/vfio/pci/Makefile           |  1 +
 drivers/vfio/pci/vfio_pci.c         | 20 +++++---
 drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index b9ca023..d890fed 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented in real mode which provides
 an excellent performance which has limitations such as inability to do
 locked pages accounting in real time.
 
-So 3 additional ioctls have been added:
+4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
+subtree that can be treated as a unit for the purposes of partitioning and
+error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
+function of a multi-function IOA, or multiple IOAs (possibly including switch
+and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
+and recover from them via EEH RTAS services, which works on the basis of
+additional ioctl commands.
+
+So 7 additional ioctls have been added:
 
 	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
 		of the DMA window on the PCI bus.
@@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
 
 	VFIO_IOMMU_DISABLE - disables the container.
 
+	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
+		specified device. Also, it can be used to remove IO or DMA
+		stopped state on the frozen PE.
+
+	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
+
+	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
+		error recovering.
+
+	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
+		one of the major steps for error recoverying.
 
 The code flow from the example above should be slightly changed:
 
@@ -346,6 +365,77 @@ The code flow from the example above should be slightly changed:
 	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
 	.....
 
+Based on the initial example we have, the following piece of code could be
+reference for EEH setup and error handling:
+
+	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
+	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
+	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
+	struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
+
+	....
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Enable the EEH functionality on the device */
+	option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
+	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
+
+	/* You're suggested to create additional data struct to represent
+	 * PE, and put child devices belonging to same IOMMU group to the
+	 * PE instance for later reference.
+	 */
+
+	/* Check the PE's state and make sure it's in functional state */
+	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
+
+	/* Save device's state. pci_save_state() would be good enough
+	 * as an example.
+	 */
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	....
+
+	/* When 0xFF's returned from reading PCI config space or IO BARs
+	 * of the PCI device. Check the PE state to see if that has been
+	 * frozen.
+	 */
+	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
+
+	/* Waiting for pending PCI transactions to be completed and don't
+	 * produce any more PCI traffic from/to the affected PE until
+	 * recovery is finished.
+	 */
+
+	/* Enable IO for the affected PE and collect logs. Usually, the
+	 * standard part of PCI config space, AER registers are dumped
+	 * as logs for further analysis.
+	 */
+	option.option = VFIO_EEH_PE_SET_OPT_IO;
+	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
+
+	/* Issue PE reset */
+	reset.option = VFIO_EEH_PE_RESET_HOT;
+	ioctl(container, VFIO_EEH_PE_RESET, &reset);
+	reset.option = VFIO_EEH_PE_RESET_DEACTIVATE;
+	ioctl(container, VFIO_EEH_PE_RESET, &reset);
+
+	/* Configure the PCI bridges for the affected PE */
+	ioctl(container, VFIO_EEH_PE_CONFIGURE, &configure);
+
+	/* Restored state we saved at initialization time. pci_restore_state()
+	 * is good enough as an example.
+	 */
+
+	/* Hopefully, error is recovered successfully. Now, you can resume to
+	 * start PCI traffic to/from the affected PE.
+	 */
+
+	....
+
 -------------------------------------------------------------------------------
 
 [1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 1310792..faad885 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,4 +1,5 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y += vfio_pci_eeh.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7ba0424..7c8d26a 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -156,8 +156,10 @@ static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
-	if (atomic_dec_and_test(&vdev->refcnt))
+	if (atomic_dec_and_test(&vdev->refcnt)) {
+		vfio_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);
+	}
 
 	module_put(THIS_MODULE);
 }
@@ -165,19 +167,25 @@ static void vfio_pci_release(void *device_data)
 static int vfio_pci_open(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
+	int ret;
 
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
 	if (atomic_inc_return(&vdev->refcnt) == 1) {
-		int ret = vfio_pci_enable(vdev);
-		if (ret) {
-			module_put(THIS_MODULE);
-			return ret;
-		}
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		ret = vfio_pci_eeh_open(vdev->pdev);
+		if (ret)
+			goto error;
 	}
 
 	return 0;
+error:
+	module_put(THIS_MODULE);
+	return ret;
 }
 
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
diff --git a/drivers/vfio/pci/vfio_pci_eeh.c b/drivers/vfio/pci/vfio_pci_eeh.c
new file mode 100644
index 0000000..9c25207
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_eeh.c
@@ -0,0 +1,46 @@
+/*
+ * EEH functionality support for VFIO PCI devices.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#ifdef CONFIG_EEH
+#include <asm/eeh.h>
+#endif
+
+#include "vfio_pci_private.h"
+
+int vfio_pci_eeh_open(struct pci_dev *pdev)
+{
+	int ret = 0;
+
+#ifdef CONFIG_EEH
+	ret = eeh_dev_open(pdev);
+#endif
+	return ret;
+}
+
+void vfio_pci_eeh_release(struct pci_dev *pdev)
+{
+#ifdef CONFIG_EEH
+	eeh_dev_release(pdev);
+#endif
+}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9c6d5d0..c3cbe40 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -90,4 +90,9 @@ extern void vfio_pci_virqfd_exit(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
+
+/* EEH stub */
+extern int vfio_pci_eeh_open(struct pci_dev *pdev);
+extern void vfio_pci_eeh_release(struct pci_dev *pdev);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a84788b..666691b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -21,6 +21,9 @@
 #include <linux/vfio.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#ifdef CONFIG_EEH
+#include <asm/eeh.h>
+#endif
 
 #define DRIVER_VERSION  "0.1"
 #define DRIVER_AUTHOR   "aik@ozlabs.ru"
@@ -147,6 +150,83 @@ static void tce_iommu_release(void *iommu_data)
 	kfree(container);
 }
 
+static long tce_iommu_eeh_ioctl(void *iommu_data,
+				unsigned int cmd, unsigned long arg)
+{
+	struct tce_container *container = iommu_data;
+	unsigned long minsz;
+	int ret = 0;
+
+#ifdef CONFIG_EEH
+	switch (cmd) {
+	case VFIO_EEH_PE_SET_OPTION: {
+		struct vfio_eeh_pe_set_option option;
+
+		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
+		if (copy_from_user(&option, (void __user *)arg, minsz))
+			return -EFAULT;
+		if (option.argsz < minsz)
+			return -EINVAL;
+
+		ret = eeh_pe_set_option(eeh_iommu_table_to_pe(container->tbl),
+					option.option);
+		break;
+	}
+	case VFIO_EEH_PE_GET_STATE: {
+		struct vfio_eeh_pe_get_state state;
+
+		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
+		if (copy_from_user(&state, (void __user *)arg, minsz))
+			return -EFAULT;
+		if (state.argsz < minsz)
+			return -EINVAL;
+
+		ret = eeh_pe_get_state(eeh_iommu_table_to_pe(container->tbl));
+		if (ret >= 0) {
+			state.state = ret;
+			if (copy_to_user((void __user *)arg, &state, minsz))
+				return -EFAULT;
+			ret = 0;
+		}
+		break;
+	}
+	case VFIO_EEH_PE_RESET: {
+		struct vfio_eeh_pe_reset reset;
+
+		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
+		if (copy_from_user(&reset, (void __user *)arg, minsz))
+			return -EFAULT;
+		if (reset.argsz < minsz)
+			return -EINVAL;
+
+		ret = eeh_pe_reset(eeh_iommu_table_to_pe(container->tbl),
+				   reset.option);
+		break;
+	}
+	case VFIO_EEH_PE_CONFIGURE: {
+		struct vfio_eeh_pe_configure configure;
+
+		minsz = offsetofend(struct vfio_eeh_pe_configure, flags);
+		if (copy_from_user(&configure, (void __user *)arg, minsz))
+			return -EFAULT;
+		if (configure.argsz < minsz)
+			return -EINVAL;
+
+		ret = eeh_pe_configure(eeh_iommu_table_to_pe(container->tbl));
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		pr_debug("%s: Cannot handle command %d\n",
+			__func__, cmd);
+	}
+#else
+	ret = -ENOENT;
+#endif
+
+	return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
+	case VFIO_EEH_PE_SET_OPTION:
+	case VFIO_EEH_PE_GET_STATE:
+	case VFIO_EEH_PE_RESET:
+	case VFIO_EEH_PE_CONFIGURE:
+		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+	__u32 argsz;
+	__u32 flags;
+	__u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
+#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
+#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
+#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
+};
+
+#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
+#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
+#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
+#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
+#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+	__u32 argsz;
+	__u32 flags;
+	__u32 state;
+};
+
+#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/*
+ * Reset is the major step to recover problematic PE. The following
+ * command helps on that.
+ */
+struct vfio_eeh_pe_reset {
+	__u32 argsz;
+	__u32 flags;
+	__u32 option;
+#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
+#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
+#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
+};
+
+#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/*
+ * One of the steps for recovery after PE reset is to configure the
+ * PCI bridges affected by the PE reset.
+ */
+struct vfio_eeh_pe_configure {
+	__u32 argsz;
+	__u32 flags;
+};
+
+#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27  8:40 ` [PATCH v7 3/3] drivers/vfio: " Gavin Shan
@ 2014-05-27 18:15   ` Alex Williamson
  2014-05-27 20:30     ` Benjamin Herrenschmidt
                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Alex Williamson @ 2014-05-27 18:15 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, qiudayu, linuxppc-dev

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> The patch adds new IOCTL commands for sPAPR VFIO container device
> to support EEH functionality for PCI devices, which have been passed
> through from host to somebody else via VFIO.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>  drivers/vfio/pci/Makefile           |  1 +
>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>  7 files changed, 308 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index b9ca023..d890fed 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented in real mode which provides
>  an excellent performance which has limitations such as inability to do
>  locked pages accounting in real time.
>  
> -So 3 additional ioctls have been added:
> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
> +subtree that can be treated as a unit for the purposes of partitioning and
> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
> +function of a multi-function IOA, or multiple IOAs (possibly including switch
> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
> +and recover from them via EEH RTAS services, which works on the basis of
> +additional ioctl commands.
> +
> +So 7 additional ioctls have been added:
>  
>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>  		of the DMA window on the PCI bus.
> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
>  
>  	VFIO_IOMMU_DISABLE - disables the container.
>  
> +	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
> +		specified device. Also, it can be used to remove IO or DMA
> +		stopped state on the frozen PE.
> +
> +	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
> +
> +	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
> +		error recovering.
> +
> +	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
> +		one of the major steps for error recoverying.
>  
>  The code flow from the example above should be slightly changed:
>  
> @@ -346,6 +365,77 @@ The code flow from the example above should be slightly changed:
>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>  	.....
>  
> +Based on the initial example we have, the following piece of code could be
> +reference for EEH setup and error handling:
> +
> +	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
> +	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
> +	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
> +	struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
> +
> +	....
> +
> +	/* Get a file descriptor for the device */
> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> +
> +	/* Enable the EEH functionality on the device */
> +	option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
> +
> +	/* You're suggested to create additional data struct to represent
> +	 * PE, and put child devices belonging to same IOMMU group to the
> +	 * PE instance for later reference.
> +	 */
> +
> +	/* Check the PE's state and make sure it's in functional state */
> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> +
> +	/* Save device's state. pci_save_state() would be good enough
> +	 * as an example.
> +	 */
> +
> +	/* Test and setup the device */
> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> +
> +	....
> +
> +	/* When 0xFF's returned from reading PCI config space or IO BARs
> +	 * of the PCI device. Check the PE state to see if that has been
> +	 * frozen.
> +	 */
> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> +
> +	/* Waiting for pending PCI transactions to be completed and don't
> +	 * produce any more PCI traffic from/to the affected PE until
> +	 * recovery is finished.
> +	 */
> +
> +	/* Enable IO for the affected PE and collect logs. Usually, the
> +	 * standard part of PCI config space, AER registers are dumped
> +	 * as logs for further analysis.
> +	 */
> +	option.option = VFIO_EEH_PE_SET_OPT_IO;
> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
> +
> +	/* Issue PE reset */
> +	reset.option = VFIO_EEH_PE_RESET_HOT;
> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
> +	reset.option = VFIO_EEH_PE_RESET_DEACTIVATE;
> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
> +
> +	/* Configure the PCI bridges for the affected PE */
> +	ioctl(container, VFIO_EEH_PE_CONFIGURE, &configure);
> +
> +	/* Restored state we saved at initialization time. pci_restore_state()
> +	 * is good enough as an example.
> +	 */
> +
> +	/* Hopefully, error is recovered successfully. Now, you can resume to
> +	 * start PCI traffic to/from the affected PE.
> +	 */
> +
> +	....
> +
>  -------------------------------------------------------------------------------
>  
>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 1310792..faad885 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,4 +1,5 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> +vfio-pci-y += vfio_pci_eeh.o

Why do we build this w/o CONFIG_EEH?  Can't we define the functions as
static inline in the header in that case?

>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7ba0424..7c8d26a 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -156,8 +156,10 @@ static void vfio_pci_release(void *device_data)
>  {
>  	struct vfio_pci_device *vdev = device_data;
>  
> -	if (atomic_dec_and_test(&vdev->refcnt))
> +	if (atomic_dec_and_test(&vdev->refcnt)) {
> +		vfio_pci_eeh_release(vdev->pdev);
>  		vfio_pci_disable(vdev);
> +	}
>  
>  	module_put(THIS_MODULE);
>  }
> @@ -165,19 +167,25 @@ static void vfio_pci_release(void *device_data)
>  static int vfio_pci_open(void *device_data)
>  {
>  	struct vfio_pci_device *vdev = device_data;
> +	int ret;
>  
>  	if (!try_module_get(THIS_MODULE))
>  		return -ENODEV;
>  
>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
> -		int ret = vfio_pci_enable(vdev);
> -		if (ret) {
> -			module_put(THIS_MODULE);
> -			return ret;
> -		}
> +		ret = vfio_pci_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		ret = vfio_pci_eeh_open(vdev->pdev);
> +		if (ret)
> +			goto error;
>  	}
>  
>  	return 0;
> +error:
> +	module_put(THIS_MODULE);
> +	return ret;
>  }
>  
>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
> diff --git a/drivers/vfio/pci/vfio_pci_eeh.c b/drivers/vfio/pci/vfio_pci_eeh.c
> new file mode 100644
> index 0000000..9c25207
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_eeh.c
> @@ -0,0 +1,46 @@
> +/*
> + * EEH functionality support for VFIO PCI devices.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2014.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>

Cleanup the includes

> +#ifdef CONFIG_EEH
> +#include <asm/eeh.h>
> +#endif

This shouldn't even be compiles w/o CONFIG_EEH

> +
> +#include "vfio_pci_private.h"
> +
> +int vfio_pci_eeh_open(struct pci_dev *pdev)
> +{
> +	int ret = 0;
> +
> +#ifdef CONFIG_EEH
> +	ret = eeh_dev_open(pdev);
> +#endif
> +	return ret;
> +}
> +
> +void vfio_pci_eeh_release(struct pci_dev *pdev)
> +{
> +#ifdef CONFIG_EEH
> +	eeh_dev_release(pdev);
> +#endif
> +}
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 9c6d5d0..c3cbe40 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -90,4 +90,9 @@ extern void vfio_pci_virqfd_exit(void);
>  
>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>  extern void vfio_config_free(struct vfio_pci_device *vdev);
> +
> +/* EEH stub */
> +extern int vfio_pci_eeh_open(struct pci_dev *pdev);
> +extern void vfio_pci_eeh_release(struct pci_dev *pdev);

The #ifdef with static inlines should be here.

> +
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a84788b..666691b 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -21,6 +21,9 @@
>  #include <linux/vfio.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#ifdef CONFIG_EEH
> +#include <asm/eeh.h>
> +#endif

Now we're infecting another file with EEH, can't we put it in a central
location for vfio?
>  
>  #define DRIVER_VERSION  "0.1"
>  #define DRIVER_AUTHOR   "aik@ozlabs.ru"
> @@ -147,6 +150,83 @@ static void tce_iommu_release(void *iommu_data)
>  	kfree(container);
>  }
>  
> +static long tce_iommu_eeh_ioctl(void *iommu_data,
> +				unsigned int cmd, unsigned long arg)
> +{
> +	struct tce_container *container = iommu_data;
> +	unsigned long minsz;
> +	int ret = 0;
> +
> +#ifdef CONFIG_EEH
> +	switch (cmd) {
> +	case VFIO_EEH_PE_SET_OPTION: {
> +		struct vfio_eeh_pe_set_option option;
> +
> +		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
> +		if (copy_from_user(&option, (void __user *)arg, minsz))
> +			return -EFAULT;
> +		if (option.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = eeh_pe_set_option(eeh_iommu_table_to_pe(container->tbl),
> +					option.option);
> +		break;
> +	}
> +	case VFIO_EEH_PE_GET_STATE: {
> +		struct vfio_eeh_pe_get_state state;
> +
> +		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
> +		if (copy_from_user(&state, (void __user *)arg, minsz))
> +			return -EFAULT;
> +		if (state.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = eeh_pe_get_state(eeh_iommu_table_to_pe(container->tbl));
> +		if (ret >= 0) {
> +			state.state = ret;
> +			if (copy_to_user((void __user *)arg, &state, minsz))
> +				return -EFAULT;
> +			ret = 0;
> +		}

This looks like one of those cases where you should just use the ioctl
return value.

> +		break;
> +	}
> +	case VFIO_EEH_PE_RESET: {
> +		struct vfio_eeh_pe_reset reset;
> +
> +		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
> +		if (copy_from_user(&reset, (void __user *)arg, minsz))
> +			return -EFAULT;
> +		if (reset.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = eeh_pe_reset(eeh_iommu_table_to_pe(container->tbl),
> +				   reset.option);
> +		break;
> +	}
> +	case VFIO_EEH_PE_CONFIGURE: {
> +		struct vfio_eeh_pe_configure configure;
> +
> +		minsz = offsetofend(struct vfio_eeh_pe_configure, flags);
> +		if (copy_from_user(&configure, (void __user *)arg, minsz))
> +			return -EFAULT;
> +		if (configure.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = eeh_pe_configure(eeh_iommu_table_to_pe(container->tbl));
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		pr_debug("%s: Cannot handle command %d\n",
> +			__func__, cmd);
> +	}
> +#else
> +	ret = -ENOENT;
> +#endif

Hmm, more like a BUG in the default case the way it's coded here (not
even sure it's worth the default entry).  The #else case should probably
be -ENOTTY like other unimplemented ioctls.

> +
> +	return ret;
> +}
> +
>  static long tce_iommu_ioctl(void *iommu_data,
>  				 unsigned int cmd, unsigned long arg)
>  {
> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		tce_iommu_disable(container);
>  		mutex_unlock(&container->lock);
>  		return 0;
> +	case VFIO_EEH_PE_SET_OPTION:
> +	case VFIO_EEH_PE_GET_STATE:
> +	case VFIO_EEH_PE_RESET:
> +	case VFIO_EEH_PE_CONFIGURE:
> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index cb9023d..c5fac36 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>  
> +/*
> + * EEH functionality can be enabled or disabled on one specific device.
> + * Also, the DMA or IO frozen state can be removed from the frozen PE
> + * if required.
> + */
> +struct vfio_eeh_pe_set_option {
> +	__u32 argsz;
> +	__u32 flags;
> +	__u32 option;
> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/

This is more of a "command" than an "option" isn't it?  Each of these
probably needs a more significant description.

> +};
> +
> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
> +
> +/*
> + * Each EEH PE should have unique address to be identified. PE's
> + * sharing mode is also useful information as well.
> + */
> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
> +
> +/*
> + * EEH PE might have been frozen because of PCI errors. Also, it might
> + * be experiencing reset for error revoery. The following command helps
> + * to get the state.
> + */
> +struct vfio_eeh_pe_get_state {
> +	__u32 argsz;
> +	__u32 flags;
> +	__u32 state;
> +};

Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?

> +
> +#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +/*
> + * Reset is the major step to recover problematic PE. The following
> + * command helps on that.
> + */
> +struct vfio_eeh_pe_reset {
> +	__u32 argsz;
> +	__u32 flags;
> +	__u32 option;
> +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
> +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
> +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/

How does a user know which of these to use?

> +};
> +
> +#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
> +/*
> + * One of the steps for recovery after PE reset is to configure the
> + * PCI bridges affected by the PE reset.
> + */
> +struct vfio_eeh_pe_configure {
> +	__u32 argsz;
> +	__u32 flags;
> +};

Parameters are probably not necessary here.

> +
> +#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 24)
> +
>  /* ***************************************************************** */
>  
>  #endif /* _UAPIVFIO_H */

How does a user learn that a device supports EEH?  Do they just start
making ioctl calls and getting a failure?  Shouldn't we make use of one
of the flag bits on the device or add a capability on the container for
the user to query?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 18:15   ` Alex Williamson
@ 2014-05-27 20:30     ` Benjamin Herrenschmidt
  2014-05-27 20:37       ` Alex Williamson
  2014-05-27 22:49     ` Alexander Graf
  2014-05-28  0:55     ` Gavin Shan
  2 siblings, 1 reply; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-27 20:30 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, agraf, qiudayu, linuxppc-dev

On Tue, 2014-05-27 at 12:15 -0600, Alex Williamson wrote:

> > +/*
> > + * Reset is the major step to recover problematic PE. The following
> > + * command helps on that.
> > + */
> > +struct vfio_eeh_pe_reset {
> > +	__u32 argsz;
> > +	__u32 flags;
> > +	__u32 option;
> > +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
> > +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
> > +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
> 
> How does a user know which of these to use?

The usual way is the driver asks for one or the other, this plumbs back
into the guest EEH code which itself plumbs into the PCIe error recovery
framework in Linux.

However I do have a question for Gavin here: Why do we expose an
explicit "deactivate" ? The reset functions should do the whole
reset sequence (assertion, delay, deassertion). In fact the firmware
doesn't really give you a choice for PERST right ? Or do we have
a requirement to expose both phases for RTAS? (In that case I'm
happy to ignore the deassertion there too).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 20:30     ` Benjamin Herrenschmidt
@ 2014-05-27 20:37       ` Alex Williamson
  2014-05-27 20:41         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2014-05-27 20:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, Gavin Shan, kvm-ppc, agraf, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 06:30 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-27 at 12:15 -0600, Alex Williamson wrote:
> 
> > > +/*
> > > + * Reset is the major step to recover problematic PE. The following
> > > + * command helps on that.
> > > + */
> > > +struct vfio_eeh_pe_reset {
> > > +	__u32 argsz;
> > > +	__u32 flags;
> > > +	__u32 option;
> > > +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
> > > +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
> > > +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
> > 
> > How does a user know which of these to use?
> 
> The usual way is the driver asks for one or the other, this plumbs back
> into the guest EEH code which itself plumbs into the PCIe error recovery
> framework in Linux.

So magic?

> 
> However I do have a question for Gavin here: Why do we expose an
> explicit "deactivate" ? The reset functions should do the whole
> reset sequence (assertion, delay, deassertion). In fact the firmware
> doesn't really give you a choice for PERST right ? Or do we have
> a requirement to expose both phases for RTAS? (In that case I'm
> happy to ignore the deassertion there too).
> 
> Cheers,
> Ben.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 20:37       ` Alex Williamson
@ 2014-05-27 20:41         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-27 20:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, agraf, qiudayu, linuxppc-dev

On Tue, 2014-05-27 at 14:37 -0600, Alex Williamson wrote:

> > The usual way is the driver asks for one or the other, this plumbs back
> > into the guest EEH code which itself plumbs into the PCIe error recovery
> > framework in Linux.
> 
> So magic?

Yes. The driver is expected to more or less knows what kind of reset it
wants for its device. Ideally hot reset is sufficient but some drivers
knows that the device they drive is crappy enough that it mostly ignores
hot reset and really needs a PERST for example...

Also we have other reasons to expose those interfaces outside of EEH. 

For example, some drivers might want to specifically trigger a PERST
after a microcode update. IE. There are path outside of EEH error
recovery where drivers in the guest might want to trigger a reset
to the device and they have control under some circumstances on
which kind of reset they are doing (and the guest Linux does  have
different code path to do a hot reset vs. a fundamental reset).

So we need to expose that distinction to be able to honor the guest
decision.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 18:15   ` Alex Williamson
  2014-05-27 20:30     ` Benjamin Herrenschmidt
@ 2014-05-27 22:49     ` Alexander Graf
  2014-05-28  0:39       ` Alex Williamson
  2014-05-28  0:55     ` Gavin Shan
  2 siblings, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-27 22:49 UTC (permalink / raw)
  To: Alex Williamson, Gavin Shan; +Cc: aik, qiudayu, linuxppc-dev, kvm-ppc


On 27.05.14 20:15, Alex Williamson wrote:
> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>> The patch adds new IOCTL commands for sPAPR VFIO container device
>> to support EEH functionality for PCI devices, which have been passed
>> through from host to somebody else via VFIO.
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>   Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>   drivers/vfio/pci/Makefile           |  1 +
>>   drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>   drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>   drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>   drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>   7 files changed, 308 insertions(+), 7 deletions(-)
>>   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]

>> +
>> +	return ret;
>> +}
>> +
>>   static long tce_iommu_ioctl(void *iommu_data,
>>   				 unsigned int cmd, unsigned long arg)
>>   {
>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>   		tce_iommu_disable(container);
>>   		mutex_unlock(&container->lock);
>>   		return 0;
>> +	case VFIO_EEH_PE_SET_OPTION:
>> +	case VFIO_EEH_PE_GET_STATE:
>> +	case VFIO_EEH_PE_RESET:
>> +	case VFIO_EEH_PE_CONFIGURE:
>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> This is where it would have really made sense to have a single
> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> AlexG, are you really attached to splitting these out into separate
> ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece 
of code, so we forward 4 ioctls to a separate piece of code :). Putting 
them into one ioctl just moves the switch() into another function.


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 22:49     ` Alexander Graf
@ 2014-05-28  0:39       ` Alex Williamson
  2014-05-28  0:44         ` Alexander Graf
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2014-05-28  0:39 UTC (permalink / raw)
  To: Alexander Graf; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
> On 27.05.14 20:15, Alex Williamson wrote:
> > On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >> The patch adds new IOCTL commands for sPAPR VFIO container device
> >> to support EEH functionality for PCI devices, which have been passed
> >> through from host to somebody else via VFIO.
> >>
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >>   Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
> >>   drivers/vfio/pci/Makefile           |  1 +
> >>   drivers/vfio/pci/vfio_pci.c         | 20 +++++---
> >>   drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
> >>   drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >>   drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
> >>   include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
> >>   7 files changed, 308 insertions(+), 7 deletions(-)
> >>   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> 
> [...]
> 
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>   static long tce_iommu_ioctl(void *iommu_data,
> >>   				 unsigned int cmd, unsigned long arg)
> >>   {
> >> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>   		tce_iommu_disable(container);
> >>   		mutex_unlock(&container->lock);
> >>   		return 0;
> >> +	case VFIO_EEH_PE_SET_OPTION:
> >> +	case VFIO_EEH_PE_GET_STATE:
> >> +	case VFIO_EEH_PE_RESET:
> >> +	case VFIO_EEH_PE_CONFIGURE:
> >> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> > This is where it would have really made sense to have a single
> > VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> > AlexG, are you really attached to splitting these out into separate
> > ioctls?
> 
> I don't see the problem. We need to forward 4 ioctls to a separate piece 
> of code, so we forward 4 ioctls to a separate piece of code :). Putting 
> them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28  0:39       ` Alex Williamson
@ 2014-05-28  0:44         ` Alexander Graf
  2014-05-28  0:57           ` Alex Williamson
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28  0:44 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev


On 28.05.14 02:39, Alex Williamson wrote:
> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>> On 27.05.14 20:15, Alex Williamson wrote:
>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>>> to support EEH functionality for PCI devices, which have been passed
>>>> through from host to somebody else via VFIO.
>>>>
>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>> ---
>>>>    Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>    drivers/vfio/pci/Makefile           |  1 +
>>>>    drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>    drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>    drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>    drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>    include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>    7 files changed, 308 insertions(+), 7 deletions(-)
>>>>    create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>> [...]
>>
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>    static long tce_iommu_ioctl(void *iommu_data,
>>>>    				 unsigned int cmd, unsigned long arg)
>>>>    {
>>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>    		tce_iommu_disable(container);
>>>>    		mutex_unlock(&container->lock);
>>>>    		return 0;
>>>> +	case VFIO_EEH_PE_SET_OPTION:
>>>> +	case VFIO_EEH_PE_GET_STATE:
>>>> +	case VFIO_EEH_PE_RESET:
>>>> +	case VFIO_EEH_PE_CONFIGURE:
>>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>> This is where it would have really made sense to have a single
>>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>> AlexG, are you really attached to splitting these out into separate
>>> ioctls?
>> I don't see the problem. We need to forward 4 ioctls to a separate piece
>> of code, so we forward 4 ioctls to a separate piece of code :). Putting
>> them into one ioctl just moves the switch() into another function.
> And uses an extra 3 ioctl numbers and gives us extra things to update if
> we ever need to add more ioctls, etc.  ioctl numbers are an address
> space, how much address space do we really want to give to EEH?  It's
> not a big difference, but I don't think it's completely even either.
> Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone 
who introduces useless ioctls rather than have a separate EEH number 
space that people can just throw anything in they like ;).


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-27 18:15   ` Alex Williamson
  2014-05-27 20:30     ` Benjamin Herrenschmidt
  2014-05-27 22:49     ` Alexander Graf
@ 2014-05-28  0:55     ` Gavin Shan
  2014-05-28 11:41       ` Alexander Graf
  2014-05-28 16:32       ` Alex Williamson
  2 siblings, 2 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-28  0:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, agraf, qiudayu, linuxppc-dev

On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>> The patch adds new IOCTL commands for sPAPR VFIO container device
>> to support EEH functionality for PCI devices, which have been passed
>> through from host to somebody else via VFIO.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>  drivers/vfio/pci/Makefile           |  1 +
>>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>  7 files changed, 308 insertions(+), 7 deletions(-)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>> 
>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> index b9ca023..d890fed 100644
>> --- a/Documentation/vfio.txt
>> +++ b/Documentation/vfio.txt
>> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented in real mode which provides
>>  an excellent performance which has limitations such as inability to do
>>  locked pages accounting in real time.
>>  
>> -So 3 additional ioctls have been added:
>> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
>> +subtree that can be treated as a unit for the purposes of partitioning and
>> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
>> +function of a multi-function IOA, or multiple IOAs (possibly including switch
>> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
>> +and recover from them via EEH RTAS services, which works on the basis of
>> +additional ioctl commands.
>> +
>> +So 7 additional ioctls have been added:
>>  
>>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>>  		of the DMA window on the PCI bus.
>> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
>>  
>>  	VFIO_IOMMU_DISABLE - disables the container.
>>  
>> +	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
>> +		specified device. Also, it can be used to remove IO or DMA
>> +		stopped state on the frozen PE.
>> +
>> +	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
>> +
>> +	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
>> +		error recovering.
>> +
>> +	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
>> +		one of the major steps for error recoverying.
>>  
>>  The code flow from the example above should be slightly changed:
>>  
>> @@ -346,6 +365,77 @@ The code flow from the example above should be slightly changed:
>>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>>  	.....
>>  
>> +Based on the initial example we have, the following piece of code could be
>> +reference for EEH setup and error handling:
>> +
>> +	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
>> +	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
>> +	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
>> +	struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
>> +
>> +	....
>> +
>> +	/* Get a file descriptor for the device */
>> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
>> +
>> +	/* Enable the EEH functionality on the device */
>> +	option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
>> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
>> +
>> +	/* You're suggested to create additional data struct to represent
>> +	 * PE, and put child devices belonging to same IOMMU group to the
>> +	 * PE instance for later reference.
>> +	 */
>> +
>> +	/* Check the PE's state and make sure it's in functional state */
>> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> +
>> +	/* Save device's state. pci_save_state() would be good enough
>> +	 * as an example.
>> +	 */
>> +
>> +	/* Test and setup the device */
>> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
>> +
>> +	....
>> +
>> +	/* When 0xFF's returned from reading PCI config space or IO BARs
>> +	 * of the PCI device. Check the PE state to see if that has been
>> +	 * frozen.
>> +	 */
>> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> +
>> +	/* Waiting for pending PCI transactions to be completed and don't
>> +	 * produce any more PCI traffic from/to the affected PE until
>> +	 * recovery is finished.
>> +	 */
>> +
>> +	/* Enable IO for the affected PE and collect logs. Usually, the
>> +	 * standard part of PCI config space, AER registers are dumped
>> +	 * as logs for further analysis.
>> +	 */
>> +	option.option = VFIO_EEH_PE_SET_OPT_IO;
>> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
>> +
>> +	/* Issue PE reset */
>> +	reset.option = VFIO_EEH_PE_RESET_HOT;
>> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
>> +	reset.option = VFIO_EEH_PE_RESET_DEACTIVATE;
>> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
>> +
>> +	/* Configure the PCI bridges for the affected PE */
>> +	ioctl(container, VFIO_EEH_PE_CONFIGURE, &configure);
>> +
>> +	/* Restored state we saved at initialization time. pci_restore_state()
>> +	 * is good enough as an example.
>> +	 */
>> +
>> +	/* Hopefully, error is recovered successfully. Now, you can resume to
>> +	 * start PCI traffic to/from the affected PE.
>> +	 */
>> +
>> +	....
>> +
>>  -------------------------------------------------------------------------------
>>  
>>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 1310792..faad885 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,4 +1,5 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>> +vfio-pci-y += vfio_pci_eeh.o
>
>Why do we build this w/o CONFIG_EEH?  Can't we define the functions as
>static inline in the header in that case?
>

Ok. Will do in next revision.

>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7ba0424..7c8d26a 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -156,8 +156,10 @@ static void vfio_pci_release(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>>  
>> -	if (atomic_dec_and_test(&vdev->refcnt))
>> +	if (atomic_dec_and_test(&vdev->refcnt)) {
>> +		vfio_pci_eeh_release(vdev->pdev);
>>  		vfio_pci_disable(vdev);
>> +	}
>>  
>>  	module_put(THIS_MODULE);
>>  }
>> @@ -165,19 +167,25 @@ static void vfio_pci_release(void *device_data)
>>  static int vfio_pci_open(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>> +	int ret;
>>  
>>  	if (!try_module_get(THIS_MODULE))
>>  		return -ENODEV;
>>  
>>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
>> -		int ret = vfio_pci_enable(vdev);
>> -		if (ret) {
>> -			module_put(THIS_MODULE);
>> -			return ret;
>> -		}
>> +		ret = vfio_pci_enable(vdev);
>> +		if (ret)
>> +			goto error;
>> +
>> +		ret = vfio_pci_eeh_open(vdev->pdev);
>> +		if (ret)
>> +			goto error;
>>  	}
>>  
>>  	return 0;
>> +error:
>> +	module_put(THIS_MODULE);
>> +	return ret;
>>  }
>>  
>>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>> diff --git a/drivers/vfio/pci/vfio_pci_eeh.c b/drivers/vfio/pci/vfio_pci_eeh.c
>> new file mode 100644
>> index 0000000..9c25207
>> --- /dev/null
>> +++ b/drivers/vfio/pci/vfio_pci_eeh.c
>> @@ -0,0 +1,46 @@
>> +/*
>> + * EEH functionality support for VFIO PCI devices.
>> + *
>> + * Copyright Gavin Shan, IBM Corporation 2014.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/device.h>
>> +#include <linux/eventfd.h>
>> +#include <linux/file.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/notifier.h>
>> +#include <linux/pci.h>
>> +#include <linux/pm_runtime.h>
>> +#include <linux/slab.h>
>> +#include <linux/types.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>
>Cleanup the includes
>

Ok. Will do in next revision.

>> +#ifdef CONFIG_EEH
>> +#include <asm/eeh.h>
>> +#endif
>
>This shouldn't even be compiles w/o CONFIG_EEH
>

Yeah. Will fix in next revision as you suggested.

>> +
>> +#include "vfio_pci_private.h"
>> +
>> +int vfio_pci_eeh_open(struct pci_dev *pdev)
>> +{
>> +	int ret = 0;
>> +
>> +#ifdef CONFIG_EEH
>> +	ret = eeh_dev_open(pdev);
>> +#endif
>> +	return ret;
>> +}
>> +
>> +void vfio_pci_eeh_release(struct pci_dev *pdev)
>> +{
>> +#ifdef CONFIG_EEH
>> +	eeh_dev_release(pdev);
>> +#endif
>> +}
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 9c6d5d0..c3cbe40 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -90,4 +90,9 @@ extern void vfio_pci_virqfd_exit(void);
>>  
>>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>>  extern void vfio_config_free(struct vfio_pci_device *vdev);
>> +
>> +/* EEH stub */
>> +extern int vfio_pci_eeh_open(struct pci_dev *pdev);
>> +extern void vfio_pci_eeh_release(struct pci_dev *pdev);
>
>The #ifdef with static inlines should be here.
>

thanks for the detailed comments.

>> +
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index a84788b..666691b 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -21,6 +21,9 @@
>>  #include <linux/vfio.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#ifdef CONFIG_EEH
>> +#include <asm/eeh.h>
>> +#endif
>
>Now we're infecting another file with EEH, can't we put it in a central
>location for vfio?

How about something like this:

- Create drivers/vfio/vfio_eeh.c and it will be built when having CONFIG_EEH
- Declare all EEH functions in include/linux/vfio.h (it's the only central
  header file I can find except you have a better one).

#ifdef CONFIG_EEH
extern int vfio_pci_eeh_open(struct pci_dev *pdev);
extern void vfio_pci_eeh_release(struct pci_dev *pdev);
extern long vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
				       unsigned int cmd, unsigned long arg); 
#else
static int vfio_pci_eeh_open(struct pci_dev *pdev)
{
	return 0;
}

static void vfio_pci_eeh_release(struct pci_dev *pdev)
{
}

static inline vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
					 unsigned int cmd, unsigned long arg);
{
	return -ENOTTY;
}
#endif

>>  
>>  #define DRIVER_VERSION  "0.1"
>>  #define DRIVER_AUTHOR   "aik@ozlabs.ru"
>> @@ -147,6 +150,83 @@ static void tce_iommu_release(void *iommu_data)
>>  	kfree(container);
>>  }
>>  
>> +static long tce_iommu_eeh_ioctl(void *iommu_data,
>> +				unsigned int cmd, unsigned long arg)
>> +{
>> +	struct tce_container *container = iommu_data;
>> +	unsigned long minsz;
>> +	int ret = 0;
>> +
>> +#ifdef CONFIG_EEH
>> +	switch (cmd) {
>> +	case VFIO_EEH_PE_SET_OPTION: {
>> +		struct vfio_eeh_pe_set_option option;
>> +
>> +		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
>> +		if (copy_from_user(&option, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +		if (option.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		ret = eeh_pe_set_option(eeh_iommu_table_to_pe(container->tbl),
>> +					option.option);
>> +		break;
>> +	}
>> +	case VFIO_EEH_PE_GET_STATE: {
>> +		struct vfio_eeh_pe_get_state state;
>> +
>> +		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
>> +		if (copy_from_user(&state, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +		if (state.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		ret = eeh_pe_get_state(eeh_iommu_table_to_pe(container->tbl));
>> +		if (ret >= 0) {
>> +			state.state = ret;
>> +			if (copy_to_user((void __user *)arg, &state, minsz))
>> +				return -EFAULT;
>> +			ret = 0;
>> +		}
>
>This looks like one of those cases where you should just use the ioctl
>return value.
>

Ok. I'll use ioctl return value to carry output information. Note it might
have "0" as the output information.

>> +		break;
>> +	}
>> +	case VFIO_EEH_PE_RESET: {
>> +		struct vfio_eeh_pe_reset reset;
>> +
>> +		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
>> +		if (copy_from_user(&reset, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +		if (reset.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		ret = eeh_pe_reset(eeh_iommu_table_to_pe(container->tbl),
>> +				   reset.option);
>> +		break;
>> +	}
>> +	case VFIO_EEH_PE_CONFIGURE: {
>> +		struct vfio_eeh_pe_configure configure;
>> +
>> +		minsz = offsetofend(struct vfio_eeh_pe_configure, flags);
>> +		if (copy_from_user(&configure, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +		if (configure.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		ret = eeh_pe_configure(eeh_iommu_table_to_pe(container->tbl));
>> +		break;
>> +	}
>> +	default:
>> +		ret = -EINVAL;
>> +		pr_debug("%s: Cannot handle command %d\n",
>> +			__func__, cmd);
>> +	}
>> +#else
>> +	ret = -ENOENT;
>> +#endif
>
>Hmm, more like a BUG in the default case the way it's coded here (not
>even sure it's worth the default entry).  The #else case should probably
>be -ENOTTY like other unimplemented ioctls.
>

Yeah, we don't need default case here and I'll remove it in next revision.
Also, "-ENOTTY" will be used in future for unimplemented functions.

>> +
>> +	return ret;
>> +}
>> +
>>  static long tce_iommu_ioctl(void *iommu_data,
>>  				 unsigned int cmd, unsigned long arg)
>>  {
>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		tce_iommu_disable(container);
>>  		mutex_unlock(&container->lock);
>>  		return 0;
>> +	case VFIO_EEH_PE_SET_OPTION:
>> +	case VFIO_EEH_PE_GET_STATE:
>> +	case VFIO_EEH_PE_RESET:
>> +	case VFIO_EEH_PE_CONFIGURE:
>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>
>This is where it would have really made sense to have a single
>VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>AlexG, are you really attached to splitting these out into separate
>ioctls?
>

I really don't know. Alex.G want separate ioctl commands, but you
suggested to combine them. Could you guys please just tell me which
one (separate vs combined) I need to have in next revision? :-)

>>  	}
>>  
>>  	return -ENOTTY;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index cb9023d..c5fac36 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>  
>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>  
>> +/*
>> + * EEH functionality can be enabled or disabled on one specific device.
>> + * Also, the DMA or IO frozen state can be removed from the frozen PE
>> + * if required.
>> + */
>> +struct vfio_eeh_pe_set_option {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +	__u32 option;
>> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>
>This is more of a "command" than an "option" isn't it?  Each of these
>probably needs a more significant description.
>

Yeah, it would be regarded as "opcode" and I'll add more description about
them in next revision.

>> +};
>> +
>> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>> +
>> +/*
>> + * Each EEH PE should have unique address to be identified. PE's
>> + * sharing mode is also useful information as well.
>> + */
>> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>> +
>> +/*
>> + * EEH PE might have been frozen because of PCI errors. Also, it might
>> + * be experiencing reset for error revoery. The following command helps
>> + * to get the state.
>> + */
>> +struct vfio_eeh_pe_get_state {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +	__u32 state;
>> +};
>
>Should state be a union to better describe the value returned?  What
>exactly is the address and why does the user need to know it?  Does this
>need user input or could we just return the address and mode regardless?
>

Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The "address"
here means PE address and user has to figure it out according to PE
segmentation.

>> +
>> +#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 22)
>> +
>> +/*
>> + * Reset is the major step to recover problematic PE. The following
>> + * command helps on that.
>> + */
>> +struct vfio_eeh_pe_reset {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +	__u32 option;
>> +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
>> +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
>> +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
>
>How does a user know which of these to use?
>

I think Ben already helped explaining it for a lot in the subsequent
replies. Thanks to Ben :-)

>> +};
>> +
>> +#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 23)
>> +
>> +/*
>> + * One of the steps for recovery after PE reset is to configure the
>> + * PCI bridges affected by the PE reset.
>> + */
>> +struct vfio_eeh_pe_configure {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +};
>
>Parameters are probably not necessary here.
>

Yep. I'll remove it in next revision. Thanks for your comments, Alex.

>> +
>> +#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 24)
>> +
>>  /* ***************************************************************** */
>>  
>>  #endif /* _UAPIVFIO_H */
>
>How does a user learn that a device supports EEH?  Do they just start
>making ioctl calls and getting a failure?  Shouldn't we make use of one
>of the flag bits on the device or add a capability on the container for
>the user to query?
>

User needs to make some ioctl calls to make sure EEH can be supported on
one specific device:

	struct vfio_eeh_pe_set_option set_option;

	/* User have to make sure the device isn't a bridge and there
	 * has one PE for it, which means that case of VFIO_EEH_PE_MODE_NONE.
	 */

	set_option.argsz = sizeof(set_option);
	set_option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
	ret = ioctl(container, VFIO_EEH_PE_SET_OPTION, &set_option);
	if (ret < 0) {
		/* EEH can't supported */
	}

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28  0:44         ` Alexander Graf
@ 2014-05-28  0:57           ` Alex Williamson
  2014-05-28 11:37             ` Alexander Graf
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2014-05-28  0:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
> On 28.05.14 02:39, Alex Williamson wrote:
> > On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
> >> On 27.05.14 20:15, Alex Williamson wrote:
> >>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >>>> The patch adds new IOCTL commands for sPAPR VFIO container device
> >>>> to support EEH functionality for PCI devices, which have been passed
> >>>> through from host to somebody else via VFIO.
> >>>>
> >>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >>>> ---
> >>>>    Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
> >>>>    drivers/vfio/pci/Makefile           |  1 +
> >>>>    drivers/vfio/pci/vfio_pci.c         | 20 +++++---
> >>>>    drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
> >>>>    drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >>>>    drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
> >>>>    include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
> >>>>    7 files changed, 308 insertions(+), 7 deletions(-)
> >>>>    create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> >> [...]
> >>
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>    static long tce_iommu_ioctl(void *iommu_data,
> >>>>    				 unsigned int cmd, unsigned long arg)
> >>>>    {
> >>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>>>    		tce_iommu_disable(container);
> >>>>    		mutex_unlock(&container->lock);
> >>>>    		return 0;
> >>>> +	case VFIO_EEH_PE_SET_OPTION:
> >>>> +	case VFIO_EEH_PE_GET_STATE:
> >>>> +	case VFIO_EEH_PE_RESET:
> >>>> +	case VFIO_EEH_PE_CONFIGURE:
> >>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> >>> This is where it would have really made sense to have a single
> >>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> >>> AlexG, are you really attached to splitting these out into separate
> >>> ioctls?
> >> I don't see the problem. We need to forward 4 ioctls to a separate piece
> >> of code, so we forward 4 ioctls to a separate piece of code :). Putting
> >> them into one ioctl just moves the switch() into another function.
> > And uses an extra 3 ioctl numbers and gives us extra things to update if
> > we ever need to add more ioctls, etc.  ioctl numbers are an address
> > space, how much address space do we really want to give to EEH?  It's
> > not a big difference, but I don't think it's completely even either.
> > Thanks,
> 
> Yes, that's the point. I by far prefer to have you push back on anyone 
> who introduces useless ioctls rather than have a separate EEH number 
> space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28  0:57           ` Alex Williamson
@ 2014-05-28 11:37             ` Alexander Graf
  2014-05-28 16:17               ` Alex Williamson
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 11:37 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev


On 28.05.14 02:57, Alex Williamson wrote:
> On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>> On 28.05.14 02:39, Alex Williamson wrote:
>>> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>> On 27.05.14 20:15, Alex Williamson wrote:
>>>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>> to support EEH functionality for PCI devices, which have been passed
>>>>>> through from host to somebody else via VFIO.
>>>>>>
>>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>> ---
>>>>>>     Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>     drivers/vfio/pci/Makefile           |  1 +
>>>>>>     drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>     drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>     drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>     drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>     include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>     7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>     create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>> [...]
>>>>
>>>>>> +
>>>>>> +	return ret;
>>>>>> +}
>>>>>> +
>>>>>>     static long tce_iommu_ioctl(void *iommu_data,
>>>>>>     				 unsigned int cmd, unsigned long arg)
>>>>>>     {
>>>>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>>>     		tce_iommu_disable(container);
>>>>>>     		mutex_unlock(&container->lock);
>>>>>>     		return 0;
>>>>>> +	case VFIO_EEH_PE_SET_OPTION:
>>>>>> +	case VFIO_EEH_PE_GET_STATE:
>>>>>> +	case VFIO_EEH_PE_RESET:
>>>>>> +	case VFIO_EEH_PE_CONFIGURE:
>>>>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>>>> This is where it would have really made sense to have a single
>>>>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>>>> AlexG, are you really attached to splitting these out into separate
>>>>> ioctls?
>>>> I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>> of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>> them into one ioctl just moves the switch() into another function.
>>> And uses an extra 3 ioctl numbers and gives us extra things to update if
>>> we ever need to add more ioctls, etc.  ioctl numbers are an address
>>> space, how much address space do we really want to give to EEH?  It's
>>> not a big difference, but I don't think it's completely even either.
>>> Thanks,
>> Yes, that's the point. I by far prefer to have you push back on anyone
>> who introduces useless ioctls rather than have a separate EEH number
>> space that people can just throw anything in they like ;).
> Well, I appreciate that, but having them as separate ioctls doesn't
> really prevent that either.  Any one of these 4 could be set to take a
> sub-option to extend and contort the EEH interface.  The only way to
> prevent that would be to avoid the argsz+flags hack that make the ioctl
> extendable.  Thanks,

Sure, that's what patch review is about. I'm really more concerned about 
whose court the number space is in - you or Gavin. If we're talking 
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's 
your call. I just personally cringe when I see an ioctl that gets an 
"opcode" and a "parameter" argument where the "parameter" argument is a 
union with one struct for each opcode.


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28  0:55     ` Gavin Shan
@ 2014-05-28 11:41       ` Alexander Graf
  2014-05-28 12:49         ` Gavin Shan
  2014-05-28 16:32       ` Alex Williamson
  1 sibling, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 11:41 UTC (permalink / raw)
  To: Gavin Shan, Alex Williamson; +Cc: aik, qiudayu, linuxppc-dev, kvm-ppc


On 28.05.14 02:55, Gavin Shan wrote:
> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>> to support EEH functionality for PCI devices, which have been passed
>>> through from host to somebody else via VFIO.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>   drivers/vfio/pci/Makefile           |  1 +
>>>   drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>   drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>   drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>   drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>   include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>   7 files changed, 308 insertions(+), 7 deletions(-)
>>>   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]

>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index cb9023d..c5fac36 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>>   
>>>   #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>>   
>>> +/*
>>> + * EEH functionality can be enabled or disabled on one specific device.
>>> + * Also, the DMA or IO frozen state can be removed from the frozen PE
>>> + * if required.
>>> + */
>>> +struct vfio_eeh_pe_set_option {
>>> +	__u32 argsz;
>>> +	__u32 flags;
>>> +	__u32 option;
>>> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>>> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>>> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>>> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>> This is more of a "command" than an "option" isn't it?  Each of these
>> probably needs a more significant description.
>>
> Yeah, it would be regarded as "opcode" and I'll add more description about
> them in next revision.

Please just call them commands.

>
>>> +};
>>> +
>>> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>>> +
>>> +/*
>>> + * Each EEH PE should have unique address to be identified. PE's
>>> + * sharing mode is also useful information as well.
>>> + */
>>> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>>> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>>> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>>> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>>> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>>> +
>>> +/*
>>> + * EEH PE might have been frozen because of PCI errors. Also, it might
>>> + * be experiencing reset for error revoery. The following command helps
>>> + * to get the state.
>>> + */
>>> +struct vfio_eeh_pe_get_state {
>>> +	__u32 argsz;
>>> +	__u32 flags;
>>> +	__u32 state;
>>> +};
>> Should state be a union to better describe the value returned?  What
>> exactly is the address and why does the user need to know it?  Does this
>> need user input or could we just return the address and mode regardless?
>>
> Ok. I think you want enum (not union) for state. I'll have macros for the
> state in next revision as I did that for other cases.
>
> Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
> There isn't corresponding ioctl command for host to get address any more
> because QEMU (user) will have to figure it out by himself. The "address"
> here means PE address and user has to figure it out according to PE
> segmentation.

Why would the user ever need the address?


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 11:41       ` Alexander Graf
@ 2014-05-28 12:49         ` Gavin Shan
  2014-05-28 13:12           ` Alexander Graf
  2014-05-28 21:58           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-28 12:49 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
>
>On 28.05.14 02:55, Gavin Shan wrote:
>>On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>to support EEH functionality for PCI devices, which have been passed
>>>>through from host to somebody else via VFIO.
>>>>
>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>---
>>>>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>  drivers/vfio/pci/Makefile           |  1 +
>>>>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>  7 files changed, 308 insertions(+), 7 deletions(-)
>>>>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>
>[...]
>
>>>>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>index cb9023d..c5fac36 100644
>>>>--- a/include/uapi/linux/vfio.h
>>>>+++ b/include/uapi/linux/vfio.h
>>>>@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>>>+/*
>>>>+ * EEH functionality can be enabled or disabled on one specific device.
>>>>+ * Also, the DMA or IO frozen state can be removed from the frozen PE
>>>>+ * if required.
>>>>+ */
>>>>+struct vfio_eeh_pe_set_option {
>>>>+	__u32 argsz;
>>>>+	__u32 flags;
>>>>+	__u32 option;
>>>>+#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>>>>+#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>>>>+#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>>>>+#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>>>This is more of a "command" than an "option" isn't it?  Each of these
>>>probably needs a more significant description.
>>>
>>Yeah, it would be regarded as "opcode" and I'll add more description about
>>them in next revision.
>
>Please just call them commands.
>

Ok. I guess you want me to change the macro names like this ?

#define VFIO_EEH_CMD_DISABLE	0	/* Disable EEH functionality	*/
#define VFIO_EEH_CMD_ENABLE	1	/* Enable EEH functionality	*/
#define VFIO_EEH_CMD_ENABLE_IO	2	/* Enable IO for frozen PE	*/
#define VFIO_EEH_CMD_ENABLE_DMA	3	/* Enable DMA for frozen PE	*/

>>
>>>>+};
>>>>+
>>>>+#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>>>>+
>>>>+/*
>>>>+ * Each EEH PE should have unique address to be identified. PE's
>>>>+ * sharing mode is also useful information as well.
>>>>+ */
>>>>+#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>>>>+#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>>>>+#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>>>>+#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>>>>+#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>>>>+
>>>>+/*
>>>>+ * EEH PE might have been frozen because of PCI errors. Also, it might
>>>>+ * be experiencing reset for error revoery. The following command helps
>>>>+ * to get the state.
>>>>+ */
>>>>+struct vfio_eeh_pe_get_state {
>>>>+	__u32 argsz;
>>>>+	__u32 flags;
>>>>+	__u32 state;
>>>>+};
>>>Should state be a union to better describe the value returned?  What
>>>exactly is the address and why does the user need to know it?  Does this
>>>need user input or could we just return the address and mode regardless?
>>>
>>Ok. I think you want enum (not union) for state. I'll have macros for the
>>state in next revision as I did that for other cases.
>>
>>Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
>>There isn't corresponding ioctl command for host to get address any more
>>because QEMU (user) will have to figure it out by himself. The "address"
>>here means PE address and user has to figure it out according to PE
>>segmentation.
>
>Why would the user ever need the address?
>

I will remove those "address" related macros in next revision because it's
user-level bussiness, not related to host kernel any more.

If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without supporting
RTAS calls. We don't need care about PE address. 

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 12:49         ` Gavin Shan
@ 2014-05-28 13:12           ` Alexander Graf
  2014-05-28 23:13             ` Gavin Shan
  2014-05-28 21:58           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 13:12 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev


On 28.05.14 14:49, Gavin Shan wrote:
> On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
>> On 28.05.14 02:55, Gavin Shan wrote:
>>> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>> to support EEH functionality for PCI devices, which have been passed
>>>>> through from host to somebody else via VFIO.
>>>>>
>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>> ---
>>>>>   Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>   drivers/vfio/pci/Makefile           |  1 +
>>>>>   drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>   drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>   drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>   drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>   include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>   7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>> [...]
>>
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index cb9023d..c5fac36 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>>>>   #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>>>> +/*
>>>>> + * EEH functionality can be enabled or disabled on one specific device.
>>>>> + * Also, the DMA or IO frozen state can be removed from the frozen PE
>>>>> + * if required.
>>>>> + */
>>>>> +struct vfio_eeh_pe_set_option {
>>>>> +	__u32 argsz;
>>>>> +	__u32 flags;
>>>>> +	__u32 option;
>>>>> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>>>>> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>>>>> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>>>>> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>>>> This is more of a "command" than an "option" isn't it?  Each of these
>>>> probably needs a more significant description.
>>>>
>>> Yeah, it would be regarded as "opcode" and I'll add more description about
>>> them in next revision.
>> Please just call them commands.
>>
> Ok. I guess you want me to change the macro names like this ?
>
> #define VFIO_EEH_CMD_DISABLE	0	/* Disable EEH functionality	*/
> #define VFIO_EEH_CMD_ENABLE	1	/* Enable EEH functionality	*/
> #define VFIO_EEH_CMD_ENABLE_IO	2	/* Enable IO for frozen PE	*/
> #define VFIO_EEH_CMD_ENABLE_DMA	3	/* Enable DMA for frozen PE	*/

Yes, the ioctl name too.

>
>>>>> +};
>>>>> +
>>>>> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>>>>> +
>>>>> +/*
>>>>> + * Each EEH PE should have unique address to be identified. PE's
>>>>> + * sharing mode is also useful information as well.
>>>>> + */
>>>>> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>>>>> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>>>>> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>>>>> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>>>>> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>>>>> +
>>>>> +/*
>>>>> + * EEH PE might have been frozen because of PCI errors. Also, it might
>>>>> + * be experiencing reset for error revoery. The following command helps
>>>>> + * to get the state.
>>>>> + */
>>>>> +struct vfio_eeh_pe_get_state {
>>>>> +	__u32 argsz;
>>>>> +	__u32 flags;
>>>>> +	__u32 state;
>>>>> +};
>>>> Should state be a union to better describe the value returned?  What
>>>> exactly is the address and why does the user need to know it?  Does this
>>>> need user input or could we just return the address and mode regardless?
>>>>
>>> Ok. I think you want enum (not union) for state. I'll have macros for the
>>> state in next revision as I did that for other cases.
>>>
>>> Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
>>> There isn't corresponding ioctl command for host to get address any more
>>> because QEMU (user) will have to figure it out by himself. The "address"
>>> here means PE address and user has to figure it out according to PE
>>> segmentation.
>> Why would the user ever need the address?
>>
> I will remove those "address" related macros in next revision because it's
> user-level bussiness, not related to host kernel any more.

Ok, so the next question is whether there will be any state outside of 
GET_MODE left in the future.


Alex

> If the user is QEMU + guest, we need the address to identify the PE though PHB
> BUID could be used as same purpose to get PHB, which is one-to-one mapping with
> IOMMU group on sPAPR platform. However, once the PE address is built and returned
> to guest, guest will use the PE address as input parameter in subsequent RTAS
> calls.
>
> If the user is some kind of application who just uses the ioctl() without supporting
> RTAS calls. We don't need care about PE address.
>
> Thanks,
> Gavin
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 11:37             ` Alexander Graf
@ 2014-05-28 16:17               ` Alex Williamson
  2014-05-28 22:40                 ` Alexander Graf
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2014-05-28 16:17 UTC (permalink / raw)
  To: Alexander Graf; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
> On 28.05.14 02:57, Alex Williamson wrote:
> > On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
> >> On 28.05.14 02:39, Alex Williamson wrote:
> >>> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
> >>>> On 27.05.14 20:15, Alex Williamson wrote:
> >>>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >>>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
> >>>>>> to support EEH functionality for PCI devices, which have been passed
> >>>>>> through from host to somebody else via VFIO.
> >>>>>>
> >>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >>>>>> ---
> >>>>>>     Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
> >>>>>>     drivers/vfio/pci/Makefile           |  1 +
> >>>>>>     drivers/vfio/pci/vfio_pci.c         | 20 +++++---
> >>>>>>     drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
> >>>>>>     drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >>>>>>     drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
> >>>>>>     include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
> >>>>>>     7 files changed, 308 insertions(+), 7 deletions(-)
> >>>>>>     create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> >>>> [...]
> >>>>
> >>>>>> +
> >>>>>> +	return ret;
> >>>>>> +}
> >>>>>> +
> >>>>>>     static long tce_iommu_ioctl(void *iommu_data,
> >>>>>>     				 unsigned int cmd, unsigned long arg)
> >>>>>>     {
> >>>>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>>>>>     		tce_iommu_disable(container);
> >>>>>>     		mutex_unlock(&container->lock);
> >>>>>>     		return 0;
> >>>>>> +	case VFIO_EEH_PE_SET_OPTION:
> >>>>>> +	case VFIO_EEH_PE_GET_STATE:
> >>>>>> +	case VFIO_EEH_PE_RESET:
> >>>>>> +	case VFIO_EEH_PE_CONFIGURE:
> >>>>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> >>>>> This is where it would have really made sense to have a single
> >>>>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> >>>>> AlexG, are you really attached to splitting these out into separate
> >>>>> ioctls?
> >>>> I don't see the problem. We need to forward 4 ioctls to a separate piece
> >>>> of code, so we forward 4 ioctls to a separate piece of code :). Putting
> >>>> them into one ioctl just moves the switch() into another function.
> >>> And uses an extra 3 ioctl numbers and gives us extra things to update if
> >>> we ever need to add more ioctls, etc.  ioctl numbers are an address
> >>> space, how much address space do we really want to give to EEH?  It's
> >>> not a big difference, but I don't think it's completely even either.
> >>> Thanks,
> >> Yes, that's the point. I by far prefer to have you push back on anyone
> >> who introduces useless ioctls rather than have a separate EEH number
> >> space that people can just throw anything in they like ;).
> > Well, I appreciate that, but having them as separate ioctls doesn't
> > really prevent that either.  Any one of these 4 could be set to take a
> > sub-option to extend and contort the EEH interface.  The only way to
> > prevent that would be to avoid the argsz+flags hack that make the ioctl
> > extendable.  Thanks,
> 
> Sure, that's what patch review is about. I'm really more concerned about 
> whose court the number space is in - you or Gavin. If we're talking 
> about top level ioctls you will care a lot more.
> 
> But I'm not religious about this. You're the VFIO maintainer, so it's 
> your call. I just personally cringe when I see an ioctl that gets an 
> "opcode" and a "parameter" argument where the "parameter" argument is a 
> union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
	__u32 argsz;
	__u32 flags;
	__u32 op;
};

Couldn't every single one of these be a separate "op"?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28  0:55     ` Gavin Shan
  2014-05-28 11:41       ` Alexander Graf
@ 2014-05-28 16:32       ` Alex Williamson
  2014-05-29  0:05         ` Gavin Shan
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2014-05-28 16:32 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
> >On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
> >> The patch adds new IOCTL commands for sPAPR VFIO container device
> >> to support EEH functionality for PCI devices, which have been passed
> >> through from host to somebody else via VFIO.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
> >>  drivers/vfio/pci/Makefile           |  1 +
> >>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
> >>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
> >>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
> >>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
> >>  7 files changed, 308 insertions(+), 7 deletions(-)
> >>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
> >> 
> >> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >> index b9ca023..d890fed 100644
> >> --- a/Documentation/vfio.txt
> >> +++ b/Documentation/vfio.txt
> >> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented in real mode which provides
> >>  an excellent performance which has limitations such as inability to do
> >>  locked pages accounting in real time.
> >>  
> >> -So 3 additional ioctls have been added:
> >> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
> >> +subtree that can be treated as a unit for the purposes of partitioning and
> >> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
> >> +function of a multi-function IOA, or multiple IOAs (possibly including switch
> >> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
> >> +and recover from them via EEH RTAS services, which works on the basis of
> >> +additional ioctl commands.
> >> +
> >> +So 7 additional ioctls have been added:
> >>  
> >>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
> >>  		of the DMA window on the PCI bus.
> >> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
> >>  
> >>  	VFIO_IOMMU_DISABLE - disables the container.
> >>  
> >> +	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
> >> +		specified device. Also, it can be used to remove IO or DMA
> >> +		stopped state on the frozen PE.
> >> +
> >> +	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
> >> +
> >> +	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
> >> +		error recovering.
> >> +
> >> +	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
> >> +		one of the major steps for error recoverying.
> >>  
> >>  The code flow from the example above should be slightly changed:
> >>  
> >> @@ -346,6 +365,77 @@ The code flow from the example above should be slightly changed:
> >>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> >>  	.....
> >>  
> >> +Based on the initial example we have, the following piece of code could be
> >> +reference for EEH setup and error handling:
> >> +
> >> +	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
> >> +	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
> >> +	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
> >> +	struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
> >> +
> >> +	....
> >> +
> >> +	/* Get a file descriptor for the device */
> >> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> >> +
> >> +	/* Enable the EEH functionality on the device */
> >> +	option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
> >> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
> >> +
> >> +	/* You're suggested to create additional data struct to represent
> >> +	 * PE, and put child devices belonging to same IOMMU group to the
> >> +	 * PE instance for later reference.
> >> +	 */
> >> +
> >> +	/* Check the PE's state and make sure it's in functional state */
> >> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> >> +
> >> +	/* Save device's state. pci_save_state() would be good enough
> >> +	 * as an example.
> >> +	 */
> >> +
> >> +	/* Test and setup the device */
> >> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> >> +
> >> +	....
> >> +
> >> +	/* When 0xFF's returned from reading PCI config space or IO BARs
> >> +	 * of the PCI device. Check the PE state to see if that has been
> >> +	 * frozen.
> >> +	 */
> >> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
> >> +
> >> +	/* Waiting for pending PCI transactions to be completed and don't
> >> +	 * produce any more PCI traffic from/to the affected PE until
> >> +	 * recovery is finished.
> >> +	 */
> >> +
> >> +	/* Enable IO for the affected PE and collect logs. Usually, the
> >> +	 * standard part of PCI config space, AER registers are dumped
> >> +	 * as logs for further analysis.
> >> +	 */
> >> +	option.option = VFIO_EEH_PE_SET_OPT_IO;
> >> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
> >> +
> >> +	/* Issue PE reset */
> >> +	reset.option = VFIO_EEH_PE_RESET_HOT;
> >> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
> >> +	reset.option = VFIO_EEH_PE_RESET_DEACTIVATE;
> >> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
> >> +
> >> +	/* Configure the PCI bridges for the affected PE */
> >> +	ioctl(container, VFIO_EEH_PE_CONFIGURE, &configure);
> >> +
> >> +	/* Restored state we saved at initialization time. pci_restore_state()
> >> +	 * is good enough as an example.
> >> +	 */
> >> +
> >> +	/* Hopefully, error is recovered successfully. Now, you can resume to
> >> +	 * start PCI traffic to/from the affected PE.
> >> +	 */
> >> +
> >> +	....
> >> +
> >>  -------------------------------------------------------------------------------
> >>  
> >>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> >> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> >> index 1310792..faad885 100644
> >> --- a/drivers/vfio/pci/Makefile
> >> +++ b/drivers/vfio/pci/Makefile
> >> @@ -1,4 +1,5 @@
> >>  
> >>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
> >> +vfio-pci-y += vfio_pci_eeh.o
> >
> >Why do we build this w/o CONFIG_EEH?  Can't we define the functions as
> >static inline in the header in that case?
> >
> 
> Ok. Will do in next revision.
> 
> >>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7ba0424..7c8d26a 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -156,8 +156,10 @@ static void vfio_pci_release(void *device_data)
> >>  {
> >>  	struct vfio_pci_device *vdev = device_data;
> >>  
> >> -	if (atomic_dec_and_test(&vdev->refcnt))
> >> +	if (atomic_dec_and_test(&vdev->refcnt)) {
> >> +		vfio_pci_eeh_release(vdev->pdev);
> >>  		vfio_pci_disable(vdev);
> >> +	}
> >>  
> >>  	module_put(THIS_MODULE);
> >>  }
> >> @@ -165,19 +167,25 @@ static void vfio_pci_release(void *device_data)
> >>  static int vfio_pci_open(void *device_data)
> >>  {
> >>  	struct vfio_pci_device *vdev = device_data;
> >> +	int ret;
> >>  
> >>  	if (!try_module_get(THIS_MODULE))
> >>  		return -ENODEV;
> >>  
> >>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
> >> -		int ret = vfio_pci_enable(vdev);
> >> -		if (ret) {
> >> -			module_put(THIS_MODULE);
> >> -			return ret;
> >> -		}
> >> +		ret = vfio_pci_enable(vdev);
> >> +		if (ret)
> >> +			goto error;
> >> +
> >> +		ret = vfio_pci_eeh_open(vdev->pdev);
> >> +		if (ret)
> >> +			goto error;
> >>  	}
> >>  
> >>  	return 0;
> >> +error:
> >> +	module_put(THIS_MODULE);
> >> +	return ret;
> >>  }
> >>  
> >>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
> >> diff --git a/drivers/vfio/pci/vfio_pci_eeh.c b/drivers/vfio/pci/vfio_pci_eeh.c
> >> new file mode 100644
> >> index 0000000..9c25207
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/vfio_pci_eeh.c
> >> @@ -0,0 +1,46 @@
> >> +/*
> >> + * EEH functionality support for VFIO PCI devices.
> >> + *
> >> + * Copyright Gavin Shan, IBM Corporation 2014.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License version 2 as
> >> + * published by the Free Software Foundation.
> >> + */
> >> +
> >> +#include <linux/device.h>
> >> +#include <linux/eventfd.h>
> >> +#include <linux/file.h>
> >> +#include <linux/interrupt.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >> +#include <linux/mutex.h>
> >> +#include <linux/notifier.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/pm_runtime.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/types.h>
> >> +#include <linux/uaccess.h>
> >> +#include <linux/vfio.h>
> >
> >Cleanup the includes
> >
> 
> Ok. Will do in next revision.
> 
> >> +#ifdef CONFIG_EEH
> >> +#include <asm/eeh.h>
> >> +#endif
> >
> >This shouldn't even be compiles w/o CONFIG_EEH
> >
> 
> Yeah. Will fix in next revision as you suggested.
> 
> >> +
> >> +#include "vfio_pci_private.h"
> >> +
> >> +int vfio_pci_eeh_open(struct pci_dev *pdev)
> >> +{
> >> +	int ret = 0;
> >> +
> >> +#ifdef CONFIG_EEH
> >> +	ret = eeh_dev_open(pdev);
> >> +#endif
> >> +	return ret;
> >> +}
> >> +
> >> +void vfio_pci_eeh_release(struct pci_dev *pdev)
> >> +{
> >> +#ifdef CONFIG_EEH
> >> +	eeh_dev_release(pdev);
> >> +#endif
> >> +}
> >> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> >> index 9c6d5d0..c3cbe40 100644
> >> --- a/drivers/vfio/pci/vfio_pci_private.h
> >> +++ b/drivers/vfio/pci/vfio_pci_private.h
> >> @@ -90,4 +90,9 @@ extern void vfio_pci_virqfd_exit(void);
> >>  
> >>  extern int vfio_config_init(struct vfio_pci_device *vdev);
> >>  extern void vfio_config_free(struct vfio_pci_device *vdev);
> >> +
> >> +/* EEH stub */
> >> +extern int vfio_pci_eeh_open(struct pci_dev *pdev);
> >> +extern void vfio_pci_eeh_release(struct pci_dev *pdev);
> >
> >The #ifdef with static inlines should be here.
> >
> 
> thanks for the detailed comments.
> 
> >> +
> >>  #endif /* VFIO_PCI_PRIVATE_H */
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index a84788b..666691b 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -21,6 +21,9 @@
> >>  #include <linux/vfio.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#ifdef CONFIG_EEH
> >> +#include <asm/eeh.h>
> >> +#endif
> >
> >Now we're infecting another file with EEH, can't we put it in a central
> >location for vfio?
> 
> How about something like this:
> 
> - Create drivers/vfio/vfio_eeh.c and it will be built when having CONFIG_EEH
> - Declare all EEH functions in include/linux/vfio.h (it's the only central
>   header file I can find except you have a better one).
> 
> #ifdef CONFIG_EEH
> extern int vfio_pci_eeh_open(struct pci_dev *pdev);
> extern void vfio_pci_eeh_release(struct pci_dev *pdev);
> extern long vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
> 				       unsigned int cmd, unsigned long arg); 
> #else
> static int vfio_pci_eeh_open(struct pci_dev *pdev)
> {
> 	return 0;
> }
> 
> static void vfio_pci_eeh_release(struct pci_dev *pdev)
> {
> }
> 
> static inline vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
> 					 unsigned int cmd, unsigned long arg);
> {
> 	return -ENOTTY;
> }
> #endif

Seems like an improvement

> >>  
> >>  #define DRIVER_VERSION  "0.1"
> >>  #define DRIVER_AUTHOR   "aik@ozlabs.ru"
> >> @@ -147,6 +150,83 @@ static void tce_iommu_release(void *iommu_data)
> >>  	kfree(container);
> >>  }
> >>  
> >> +static long tce_iommu_eeh_ioctl(void *iommu_data,
> >> +				unsigned int cmd, unsigned long arg)
> >> +{
> >> +	struct tce_container *container = iommu_data;
> >> +	unsigned long minsz;
> >> +	int ret = 0;
> >> +
> >> +#ifdef CONFIG_EEH
> >> +	switch (cmd) {
> >> +	case VFIO_EEH_PE_SET_OPTION: {
> >> +		struct vfio_eeh_pe_set_option option;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
> >> +		if (copy_from_user(&option, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (option.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_set_option(eeh_iommu_table_to_pe(container->tbl),
> >> +					option.option);
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_GET_STATE: {
> >> +		struct vfio_eeh_pe_get_state state;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
> >> +		if (copy_from_user(&state, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (state.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_get_state(eeh_iommu_table_to_pe(container->tbl));
> >> +		if (ret >= 0) {
> >> +			state.state = ret;
> >> +			if (copy_to_user((void __user *)arg, &state, minsz))
> >> +				return -EFAULT;
> >> +			ret = 0;
> >> +		}
> >
> >This looks like one of those cases where you should just use the ioctl
> >return value.
> >
> 
> Ok. I'll use ioctl return value to carry output information. Note it might
> have "0" as the output information.

I think that's fine.


> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_RESET: {
> >> +		struct vfio_eeh_pe_reset reset;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
> >> +		if (copy_from_user(&reset, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (reset.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_reset(eeh_iommu_table_to_pe(container->tbl),
> >> +				   reset.option);
> >> +		break;
> >> +	}
> >> +	case VFIO_EEH_PE_CONFIGURE: {
> >> +		struct vfio_eeh_pe_configure configure;
> >> +
> >> +		minsz = offsetofend(struct vfio_eeh_pe_configure, flags);
> >> +		if (copy_from_user(&configure, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +		if (configure.argsz < minsz)
> >> +			return -EINVAL;
> >> +
> >> +		ret = eeh_pe_configure(eeh_iommu_table_to_pe(container->tbl));
> >> +		break;
> >> +	}
> >> +	default:
> >> +		ret = -EINVAL;
> >> +		pr_debug("%s: Cannot handle command %d\n",
> >> +			__func__, cmd);
> >> +	}
> >> +#else
> >> +	ret = -ENOENT;
> >> +#endif
> >
> >Hmm, more like a BUG in the default case the way it's coded here (not
> >even sure it's worth the default entry).  The #else case should probably
> >be -ENOTTY like other unimplemented ioctls.
> >
> 
> Yeah, we don't need default case here and I'll remove it in next revision.
> Also, "-ENOTTY" will be used in future for unimplemented functions.
> 
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static long tce_iommu_ioctl(void *iommu_data,
> >>  				 unsigned int cmd, unsigned long arg)
> >>  {
> >> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>  		tce_iommu_disable(container);
> >>  		mutex_unlock(&container->lock);
> >>  		return 0;
> >> +	case VFIO_EEH_PE_SET_OPTION:
> >> +	case VFIO_EEH_PE_GET_STATE:
> >> +	case VFIO_EEH_PE_RESET:
> >> +	case VFIO_EEH_PE_CONFIGURE:
> >> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
> >
> >This is where it would have really made sense to have a single
> >VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
> >AlexG, are you really attached to splitting these out into separate
> >ioctls?
> >
> 
> I really don't know. Alex.G want separate ioctl commands, but you
> suggested to combine them. Could you guys please just tell me which
> one (separate vs combined) I need to have in next revision? :-)
> 
> >>  	}
> >>  
> >>  	return -ENOTTY;
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index cb9023d..c5fac36 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
> >>  
> >>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> >>  
> >> +/*
> >> + * EEH functionality can be enabled or disabled on one specific device.
> >> + * Also, the DMA or IO frozen state can be removed from the frozen PE
> >> + * if required.
> >> + */
> >> +struct vfio_eeh_pe_set_option {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +	__u32 option;
> >> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
> >> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
> >> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
> >> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
> >
> >This is more of a "command" than an "option" isn't it?  Each of these
> >probably needs a more significant description.
> >
> 
> Yeah, it would be regarded as "opcode" and I'll add more description about
> them in next revision.
> 
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
> >> +
> >> +/*
> >> + * Each EEH PE should have unique address to be identified. PE's
> >> + * sharing mode is also useful information as well.
> >> + */
> >> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
> >> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
> >> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
> >> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
> >> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
> >> +
> >> +/*
> >> + * EEH PE might have been frozen because of PCI errors. Also, it might
> >> + * be experiencing reset for error revoery. The following command helps
> >> + * to get the state.
> >> + */
> >> +struct vfio_eeh_pe_get_state {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +	__u32 state;
> >> +};
> >
> >Should state be a union to better describe the value returned?  What
> >exactly is the address and why does the user need to know it?  Does this
> >need user input or could we just return the address and mode regardless?
> >
> 
> Ok. I think you want enum (not union) for state. I'll have macros for the
> state in next revision as I did that for other cases.
> 
> Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
> There isn't corresponding ioctl command for host to get address any more
> because QEMU (user) will have to figure it out by himself. The "address"
> here means PE address and user has to figure it out according to PE
> segmentation.
> 
> >> +
> >> +#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 22)
> >> +
> >> +/*
> >> + * Reset is the major step to recover problematic PE. The following
> >> + * command helps on that.
> >> + */
> >> +struct vfio_eeh_pe_reset {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +	__u32 option;
> >> +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
> >> +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
> >> +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
> >
> >How does a user know which of these to use?
> >
> 
> I think Ben already helped explaining it for a lot in the subsequent
> replies. Thanks to Ben :-)
> 
> >> +};
> >> +
> >> +#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 23)
> >> +
> >> +/*
> >> + * One of the steps for recovery after PE reset is to configure the
> >> + * PCI bridges affected by the PE reset.
> >> + */
> >> +struct vfio_eeh_pe_configure {
> >> +	__u32 argsz;
> >> +	__u32 flags;
> >> +};
> >
> >Parameters are probably not necessary here.
> >
> 
> Yep. I'll remove it in next revision. Thanks for your comments, Alex.
> 
> >> +
> >> +#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 24)
> >> +
> >>  /* ***************************************************************** */
> >>  
> >>  #endif /* _UAPIVFIO_H */
> >
> >How does a user learn that a device supports EEH?  Do they just start
> >making ioctl calls and getting a failure?  Shouldn't we make use of one
> >of the flag bits on the device or add a capability on the container for
> >the user to query?
> >
> 
> User needs to make some ioctl calls to make sure EEH can be supported on
> one specific device:
> 
> 	struct vfio_eeh_pe_set_option set_option;
> 
> 	/* User have to make sure the device isn't a bridge and there
> 	 * has one PE for it, which means that case of VFIO_EEH_PE_MODE_NONE.
> 	 */
> 
> 	set_option.argsz = sizeof(set_option);
> 	set_option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
> 	ret = ioctl(container, VFIO_EEH_PE_SET_OPTION, &set_option);
> 	if (ret < 0) {
> 		/* EEH can't supported */
> 	}

No, the user should be able to query whether it's available with
VFIO_CHECK_EXTENSION.

I think I'm also still looking for an explanation of how the user learns
about the error that occurred, not just that it occurred, but what it
was.  The description of the recovery process indicates a lot retrieval
step, but there's no interface here for the user to get logs.  Is that a
TBD?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 12:49         ` Gavin Shan
  2014-05-28 13:12           ` Alexander Graf
@ 2014-05-28 21:58           ` Benjamin Herrenschmidt
  2014-05-28 22:46             ` Alexander Graf
  2014-05-30  3:44             ` Alexey Kardashevskiy
  1 sibling, 2 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-28 21:58 UTC (permalink / raw)
  To: Gavin Shan
  Cc: aik, Alexander Graf, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
> 
> I will remove those "address" related macros in next revision because it's
> user-level bussiness, not related to host kernel any more.
> 
> If the user is QEMU + guest, we need the address to identify the PE though PHB
> BUID could be used as same purpose to get PHB, which is one-to-one mapping with
> IOMMU group on sPAPR platform. However, once the PE address is built and returned
> to guest, guest will use the PE address as input parameter in subsequent RTAS
> calls.
> 
> If the user is some kind of application who just uses the ioctl() without supporting
> RTAS calls. We don't need care about PE address. 

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?

It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 16:17               ` Alex Williamson
@ 2014-05-28 22:40                 ` Alexander Graf
  2014-05-28 23:37                   ` Gavin Shan
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 22:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, qiudayu, linuxppc-dev


On 28.05.14 18:17, Alex Williamson wrote:
> On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>> On 28.05.14 02:57, Alex Williamson wrote:
>>> On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>>>> On 28.05.14 02:39, Alex Williamson wrote:
>>>>> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>>>> On 27.05.14 20:15, Alex Williamson wrote:
>>>>>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>>>> to support EEH functionality for PCI devices, which have been passed
>>>>>>>> through from host to somebody else via VFIO.
>>>>>>>>
>>>>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>>> ---
>>>>>>>>      Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>      drivers/vfio/pci/Makefile           |  1 +
>>>>>>>>      drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>>>      drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>>>      drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>>>      drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>>>      include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>>>      7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>>>      create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>>>> [...]
>>>>>>
>>>>>>>> +
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>      				 unsigned int cmd, unsigned long arg)
>>>>>>>>      {
>>>>>>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>      		tce_iommu_disable(container);
>>>>>>>>      		mutex_unlock(&container->lock);
>>>>>>>>      		return 0;
>>>>>>>> +	case VFIO_EEH_PE_SET_OPTION:
>>>>>>>> +	case VFIO_EEH_PE_GET_STATE:
>>>>>>>> +	case VFIO_EEH_PE_RESET:
>>>>>>>> +	case VFIO_EEH_PE_CONFIGURE:
>>>>>>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>>>>>> This is where it would have really made sense to have a single
>>>>>>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>>>>>> AlexG, are you really attached to splitting these out into separate
>>>>>>> ioctls?
>>>>>> I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>>>> of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>>>> them into one ioctl just moves the switch() into another function.
>>>>> And uses an extra 3 ioctl numbers and gives us extra things to update if
>>>>> we ever need to add more ioctls, etc.  ioctl numbers are an address
>>>>> space, how much address space do we really want to give to EEH?  It's
>>>>> not a big difference, but I don't think it's completely even either.
>>>>> Thanks,
>>>> Yes, that's the point. I by far prefer to have you push back on anyone
>>>> who introduces useless ioctls rather than have a separate EEH number
>>>> space that people can just throw anything in they like ;).
>>> Well, I appreciate that, but having them as separate ioctls doesn't
>>> really prevent that either.  Any one of these 4 could be set to take a
>>> sub-option to extend and contort the EEH interface.  The only way to
>>> prevent that would be to avoid the argsz+flags hack that make the ioctl
>>> extendable.  Thanks,
>> Sure, that's what patch review is about. I'm really more concerned about
>> whose court the number space is in - you or Gavin. If we're talking
>> about top level ioctls you will care a lot more.
>>
>> But I'm not religious about this. You're the VFIO maintainer, so it's
>> your call. I just personally cringe when I see an ioctl that gets an
>> "opcode" and a "parameter" argument where the "parameter" argument is a
>> union with one struct for each opcode.
> Well, what would it look like...
>
> struct vfio_eeh_pe_op {
> 	__u32 argsz;
> 	__u32 flags;
> 	__u32 op;
> };
>
> Couldn't every single one of these be a separate "op"?  Are there any
> cases where we can't use the ioctl return value?
>
> VFIO_EEH_PE_DISABLE
> VFIO_EEH_PE_ENABLE
> VFIO_EEH_PE_UNFREEZE_IO
> VFIO_EEH_PE_UNFREEZE_DMA
> VFIO_EEH_PE_GET_MODE
> VFIO_EEH_PE_RESET_DEACTIVATE
> VFIO_EEH_PE_RESET_HOT
> VFIO_EEH_PE_RESET_FUNDAMENTAL
> VFIO_EEH_PE_CONFIGURE
>
> It doesn't look that bad to me, what am I missing?  Thanks,

Yup, that looks well to me as well :)


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 21:58           ` Benjamin Herrenschmidt
@ 2014-05-28 22:46             ` Alexander Graf
  2014-05-28 23:18               ` Benjamin Herrenschmidt
  2014-05-30  3:44             ` Alexey Kardashevskiy
  1 sibling, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 22:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Gavin Shan
  Cc: aik, qiudayu, Alex Williamson, linuxppc-dev, kvm-ppc


On 28.05.14 23:58, Benjamin Herrenschmidt wrote:
> On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
>> I will remove those "address" related macros in next revision because it's
>> user-level bussiness, not related to host kernel any more.
>>
>> If the user is QEMU + guest, we need the address to identify the PE though PHB
>> BUID could be used as same purpose to get PHB, which is one-to-one mapping with
>> IOMMU group on sPAPR platform. However, once the PE address is built and returned
>> to guest, guest will use the PE address as input parameter in subsequent RTAS
>> calls.
>>
>> If the user is some kind of application who just uses the ioctl() without supporting
>> RTAS calls. We don't need care about PE address.
> I am a bit reluctant with that PE==PHB equation we seem to be introducing.
>
> This isn't the case in HW. It's possible that this is how we handle VFIO *today*
> in qemu but it doesn't have to be does it ?

Right, but that's pure QEMU internals. From the VFIO kernel interface's 
point of view, a VFIO group is a PE context, as that's what gets IOMMU 
controlled together too.

> It also paints us into a corner if we want to start implementing some kind of
> emulated EEH for selected emulated devices and/or virtio.

I don't think so :). In QEMU the PHB emulation would have to notify the 
"container" (IOMMU emulation layer -> PE) that a PE operation happened. 
It's that emulation code's responsibility to broadcast operations across 
its own emulated operations (recover config space access, reconfigure 
BARs, etc) and the VFIO PE operations.

So from a kernel interface point of view, I think leaving out any 
address information is the right way to go. Whether we managed to get 
all QEMU internal interfaces modeled correctly yet has to be seen on the 
next patch set revision :).


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 13:12           ` Alexander Graf
@ 2014-05-28 23:13             ` Gavin Shan
  0 siblings, 0 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-28 23:13 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Wed, May 28, 2014 at 03:12:35PM +0200, Alexander Graf wrote:
>
>On 28.05.14 14:49, Gavin Shan wrote:
>>On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
>>>On 28.05.14 02:55, Gavin Shan wrote:
>>>>On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>>>>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>>to support EEH functionality for PCI devices, which have been passed
>>>>>>through from host to somebody else via VFIO.
>>>>>>
>>>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>---
>>>>>>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>  drivers/vfio/pci/Makefile           |  1 +
>>>>>>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>  7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>[...]
>>>
>>>>>>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>>index cb9023d..c5fac36 100644
>>>>>>--- a/include/uapi/linux/vfio.h
>>>>>>+++ b/include/uapi/linux/vfio.h
>>>>>>@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>>>>>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>>>>>+/*
>>>>>>+ * EEH functionality can be enabled or disabled on one specific device.
>>>>>>+ * Also, the DMA or IO frozen state can be removed from the frozen PE
>>>>>>+ * if required.
>>>>>>+ */
>>>>>>+struct vfio_eeh_pe_set_option {
>>>>>>+	__u32 argsz;
>>>>>>+	__u32 flags;
>>>>>>+	__u32 option;
>>>>>>+#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>>>>>>+#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>>>>>>+#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>>>>>>+#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>>>>>This is more of a "command" than an "option" isn't it?  Each of these
>>>>>probably needs a more significant description.
>>>>>
>>>>Yeah, it would be regarded as "opcode" and I'll add more description about
>>>>them in next revision.
>>>Please just call them commands.
>>>
>>Ok. I guess you want me to change the macro names like this ?
>>
>>#define VFIO_EEH_CMD_DISABLE	0	/* Disable EEH functionality	*/
>>#define VFIO_EEH_CMD_ENABLE	1	/* Enable EEH functionality	*/
>>#define VFIO_EEH_CMD_ENABLE_IO	2	/* Enable IO for frozen PE	*/
>>#define VFIO_EEH_CMD_ENABLE_DMA	3	/* Enable DMA for frozen PE	*/
>
>Yes, the ioctl name too.
>

Ok. Thanks. I will also to rename those "option" / "command" related macros
to "VFIO_EEH_CMD_xxxx" in next revision.

>>
>>>>>>+};
>>>>>>+
>>>>>>+#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>>>>>>+
>>>>>>+/*
>>>>>>+ * Each EEH PE should have unique address to be identified. PE's
>>>>>>+ * sharing mode is also useful information as well.
>>>>>>+ */
>>>>>>+#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>>>>>>+#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>>>>>>+#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>>>>>>+#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>>>>>>+#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>>>>>>+
>>>>>>+/*
>>>>>>+ * EEH PE might have been frozen because of PCI errors. Also, it might
>>>>>>+ * be experiencing reset for error revoery. The following command helps
>>>>>>+ * to get the state.
>>>>>>+ */
>>>>>>+struct vfio_eeh_pe_get_state {
>>>>>>+	__u32 argsz;
>>>>>>+	__u32 flags;
>>>>>>+	__u32 state;
>>>>>>+};
>>>>>Should state be a union to better describe the value returned?  What
>>>>>exactly is the address and why does the user need to know it?  Does this
>>>>>need user input or could we just return the address and mode regardless?
>>>>>
>>>>Ok. I think you want enum (not union) for state. I'll have macros for the
>>>>state in next revision as I did that for other cases.
>>>>
>>>>Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
>>>>There isn't corresponding ioctl command for host to get address any more
>>>>because QEMU (user) will have to figure it out by himself. The "address"
>>>>here means PE address and user has to figure it out according to PE
>>>>segmentation.
>>>Why would the user ever need the address?
>>>
>>I will remove those "address" related macros in next revision because it's
>>user-level bussiness, not related to host kernel any more.
>
>Ok, so the next question is whether there will be any state outside
>of GET_MODE left in the future.
>

That's also user-level business and those macros should be removed as well :-)

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 22:46             ` Alexander Graf
@ 2014-05-28 23:18               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-28 23:18 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Thu, 2014-05-29 at 00:46 +0200, Alexander Graf wrote:
> I don't think so :). In QEMU the PHB emulation would have to notify the 
> "container" (IOMMU emulation layer -> PE) that a PE operation happened. 
> It's that emulation code's responsibility to broadcast operations across 
> its own emulated operations (recover config space access, reconfigure 
> BARs, etc) and the VFIO PE operations.
> 
> So from a kernel interface point of view, I think leaving out any 
> address information is the right way to go. Whether we managed to get 
> all QEMU internal interfaces modeled correctly yet has to be seen on the 
> next patch set revision :).

For a kernel interface I absolutely agree, all we care about is the group fd.

>From a qemu internal, I suppose it makes things easier to operate at the PHB
level for now and we can revisit later if the need arises.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 22:40                 ` Alexander Graf
@ 2014-05-28 23:37                   ` Gavin Shan
  2014-05-28 23:38                     ` Alexander Graf
  0 siblings, 1 reply; 31+ messages in thread
From: Gavin Shan @ 2014-05-28 23:37 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
>On 28.05.14 18:17, Alex Williamson wrote:
>>On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>>>On 28.05.14 02:57, Alex Williamson wrote:
>>>>On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>>>>>On 28.05.14 02:39, Alex Williamson wrote:
>>>>>>On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>>>>>On 27.05.14 20:15, Alex Williamson wrote:
>>>>>>>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>>>>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>>>>>to support EEH functionality for PCI devices, which have been passed
>>>>>>>>>through from host to somebody else via VFIO.
>>>>>>>>>
>>>>>>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>>>>---
>>>>>>>>>     Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>>     drivers/vfio/pci/Makefile           |  1 +
>>>>>>>>>     drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>>>>     drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>>>>     drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>>>>     drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>>>>     include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>>>>     7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>>>>     create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>>>>>[...]
>>>>>>>
>>>>>>>>>+
>>>>>>>>>+	return ret;
>>>>>>>>>+}
>>>>>>>>>+
>>>>>>>>>     static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>     				 unsigned int cmd, unsigned long arg)
>>>>>>>>>     {
>>>>>>>>>@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>     		tce_iommu_disable(container);
>>>>>>>>>     		mutex_unlock(&container->lock);
>>>>>>>>>     		return 0;
>>>>>>>>>+	case VFIO_EEH_PE_SET_OPTION:
>>>>>>>>>+	case VFIO_EEH_PE_GET_STATE:
>>>>>>>>>+	case VFIO_EEH_PE_RESET:
>>>>>>>>>+	case VFIO_EEH_PE_CONFIGURE:
>>>>>>>>>+		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>>>>>>>This is where it would have really made sense to have a single
>>>>>>>>VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>>>>>>>AlexG, are you really attached to splitting these out into separate
>>>>>>>>ioctls?
>>>>>>>I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>>>>>of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>>>>>them into one ioctl just moves the switch() into another function.
>>>>>>And uses an extra 3 ioctl numbers and gives us extra things to update if
>>>>>>we ever need to add more ioctls, etc.  ioctl numbers are an address
>>>>>>space, how much address space do we really want to give to EEH?  It's
>>>>>>not a big difference, but I don't think it's completely even either.
>>>>>>Thanks,
>>>>>Yes, that's the point. I by far prefer to have you push back on anyone
>>>>>who introduces useless ioctls rather than have a separate EEH number
>>>>>space that people can just throw anything in they like ;).
>>>>Well, I appreciate that, but having them as separate ioctls doesn't
>>>>really prevent that either.  Any one of these 4 could be set to take a
>>>>sub-option to extend and contort the EEH interface.  The only way to
>>>>prevent that would be to avoid the argsz+flags hack that make the ioctl
>>>>extendable.  Thanks,
>>>Sure, that's what patch review is about. I'm really more concerned about
>>>whose court the number space is in - you or Gavin. If we're talking
>>>about top level ioctls you will care a lot more.
>>>
>>>But I'm not religious about this. You're the VFIO maintainer, so it's
>>>your call. I just personally cringe when I see an ioctl that gets an
>>>"opcode" and a "parameter" argument where the "parameter" argument is a
>>>union with one struct for each opcode.
>>Well, what would it look like...
>>
>>struct vfio_eeh_pe_op {
>>	__u32 argsz;
>>	__u32 flags;
>>	__u32 op;
>>};
>>
>>Couldn't every single one of these be a separate "op"?  Are there any
>>cases where we can't use the ioctl return value?
>>
>>VFIO_EEH_PE_DISABLE
>>VFIO_EEH_PE_ENABLE
>>VFIO_EEH_PE_UNFREEZE_IO
>>VFIO_EEH_PE_UNFREEZE_DMA
>>VFIO_EEH_PE_GET_MODE
>>VFIO_EEH_PE_RESET_DEACTIVATE
>>VFIO_EEH_PE_RESET_HOT
>>VFIO_EEH_PE_RESET_FUNDAMENTAL
>>VFIO_EEH_PE_CONFIGURE
>>
>>It doesn't look that bad to me, what am I missing?  Thanks,
>
>Yup, that looks well to me as well :)
>

s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.

I'll include this in next revision. Thanks, Alex.

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 23:37                   ` Gavin Shan
@ 2014-05-28 23:38                     ` Alexander Graf
  2014-05-28 23:41                       ` Gavin Shan
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Graf @ 2014-05-28 23:38 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev


On 29.05.14 01:37, Gavin Shan wrote:
> On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
>> On 28.05.14 18:17, Alex Williamson wrote:
>>> On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>>>> On 28.05.14 02:57, Alex Williamson wrote:
>>>>> On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>>>>>> On 28.05.14 02:39, Alex Williamson wrote:
>>>>>>> On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>>>>>> On 27.05.14 20:15, Alex Williamson wrote:
>>>>>>>>> On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>>>>>> The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>>>>>> to support EEH functionality for PCI devices, which have been passed
>>>>>>>>>> through from host to somebody else via VFIO.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>>>>> ---
>>>>>>>>>>      Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>>>      drivers/vfio/pci/Makefile           |  1 +
>>>>>>>>>>      drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>>>>>      drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>>>>>      drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>>>>>      drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>>>>>      include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>>>>>      7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>>>>>      create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>      static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>>      				 unsigned int cmd, unsigned long arg)
>>>>>>>>>>      {
>>>>>>>>>> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>>      		tce_iommu_disable(container);
>>>>>>>>>>      		mutex_unlock(&container->lock);
>>>>>>>>>>      		return 0;
>>>>>>>>>> +	case VFIO_EEH_PE_SET_OPTION:
>>>>>>>>>> +	case VFIO_EEH_PE_GET_STATE:
>>>>>>>>>> +	case VFIO_EEH_PE_RESET:
>>>>>>>>>> +	case VFIO_EEH_PE_CONFIGURE:
>>>>>>>>>> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>>>>>>>> This is where it would have really made sense to have a single
>>>>>>>>> VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>>>>>>>> AlexG, are you really attached to splitting these out into separate
>>>>>>>>> ioctls?
>>>>>>>> I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>>>>>> of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>>>>>> them into one ioctl just moves the switch() into another function.
>>>>>>> And uses an extra 3 ioctl numbers and gives us extra things to update if
>>>>>>> we ever need to add more ioctls, etc.  ioctl numbers are an address
>>>>>>> space, how much address space do we really want to give to EEH?  It's
>>>>>>> not a big difference, but I don't think it's completely even either.
>>>>>>> Thanks,
>>>>>> Yes, that's the point. I by far prefer to have you push back on anyone
>>>>>> who introduces useless ioctls rather than have a separate EEH number
>>>>>> space that people can just throw anything in they like ;).
>>>>> Well, I appreciate that, but having them as separate ioctls doesn't
>>>>> really prevent that either.  Any one of these 4 could be set to take a
>>>>> sub-option to extend and contort the EEH interface.  The only way to
>>>>> prevent that would be to avoid the argsz+flags hack that make the ioctl
>>>>> extendable.  Thanks,
>>>> Sure, that's what patch review is about. I'm really more concerned about
>>>> whose court the number space is in - you or Gavin. If we're talking
>>>> about top level ioctls you will care a lot more.
>>>>
>>>> But I'm not religious about this. You're the VFIO maintainer, so it's
>>>> your call. I just personally cringe when I see an ioctl that gets an
>>>> "opcode" and a "parameter" argument where the "parameter" argument is a
>>>> union with one struct for each opcode.
>>> Well, what would it look like...
>>>
>>> struct vfio_eeh_pe_op {
>>> 	__u32 argsz;
>>> 	__u32 flags;
>>> 	__u32 op;
>>> };
>>>
>>> Couldn't every single one of these be a separate "op"?  Are there any
>>> cases where we can't use the ioctl return value?
>>>
>>> VFIO_EEH_PE_DISABLE
>>> VFIO_EEH_PE_ENABLE
>>> VFIO_EEH_PE_UNFREEZE_IO
>>> VFIO_EEH_PE_UNFREEZE_DMA
>>> VFIO_EEH_PE_GET_MODE
>>> VFIO_EEH_PE_RESET_DEACTIVATE
>>> VFIO_EEH_PE_RESET_HOT
>>> VFIO_EEH_PE_RESET_FUNDAMENTAL
>>> VFIO_EEH_PE_CONFIGURE
>>>
>>> It doesn't look that bad to me, what am I missing?  Thanks,
>> Yup, that looks well to me as well :)
>>
> s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.
>
> I'll include this in next revision. Thanks, Alex.

Yup, no need for CMD anymore then either :)


Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 23:38                     ` Alexander Graf
@ 2014-05-28 23:41                       ` Gavin Shan
  0 siblings, 0 replies; 31+ messages in thread
From: Gavin Shan @ 2014-05-28 23:41 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Thu, May 29, 2014 at 01:38:46AM +0200, Alexander Graf wrote:
>On 29.05.14 01:37, Gavin Shan wrote:
>>On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
>>>On 28.05.14 18:17, Alex Williamson wrote:
>>>>On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
>>>>>On 28.05.14 02:57, Alex Williamson wrote:
>>>>>>On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
>>>>>>>On 28.05.14 02:39, Alex Williamson wrote:
>>>>>>>>On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
>>>>>>>>>On 27.05.14 20:15, Alex Williamson wrote:
>>>>>>>>>>On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>>>>>>>>>>>The patch adds new IOCTL commands for sPAPR VFIO container device
>>>>>>>>>>>to support EEH functionality for PCI devices, which have been passed
>>>>>>>>>>>through from host to somebody else via VFIO.
>>>>>>>>>>>
>>>>>>>>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>>>>>>---
>>>>>>>>>>>     Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>>>>     drivers/vfio/pci/Makefile           |  1 +
>>>>>>>>>>>     drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>>>>>>>>>>>     drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>>>>>>>>>>>     drivers/vfio/pci/vfio_pci_private.h |  5 ++
>>>>>>>>>>>     drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>>>>>>>>>>>     include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>>>>>>>>>>>     7 files changed, 308 insertions(+), 7 deletions(-)
>>>>>>>>>>>     create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>>>>>>>>>[...]
>>>>>>>>>
>>>>>>>>>>>+
>>>>>>>>>>>+	return ret;
>>>>>>>>>>>+}
>>>>>>>>>>>+
>>>>>>>>>>>     static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>>>     				 unsigned int cmd, unsigned long arg)
>>>>>>>>>>>     {
>>>>>>>>>>>@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>>>>>>>>>>     		tce_iommu_disable(container);
>>>>>>>>>>>     		mutex_unlock(&container->lock);
>>>>>>>>>>>     		return 0;
>>>>>>>>>>>+	case VFIO_EEH_PE_SET_OPTION:
>>>>>>>>>>>+	case VFIO_EEH_PE_GET_STATE:
>>>>>>>>>>>+	case VFIO_EEH_PE_RESET:
>>>>>>>>>>>+	case VFIO_EEH_PE_CONFIGURE:
>>>>>>>>>>>+		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>>>>>>>>>>This is where it would have really made sense to have a single
>>>>>>>>>>VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>>>>>>>>>>AlexG, are you really attached to splitting these out into separate
>>>>>>>>>>ioctls?
>>>>>>>>>I don't see the problem. We need to forward 4 ioctls to a separate piece
>>>>>>>>>of code, so we forward 4 ioctls to a separate piece of code :). Putting
>>>>>>>>>them into one ioctl just moves the switch() into another function.
>>>>>>>>And uses an extra 3 ioctl numbers and gives us extra things to update if
>>>>>>>>we ever need to add more ioctls, etc.  ioctl numbers are an address
>>>>>>>>space, how much address space do we really want to give to EEH?  It's
>>>>>>>>not a big difference, but I don't think it's completely even either.
>>>>>>>>Thanks,
>>>>>>>Yes, that's the point. I by far prefer to have you push back on anyone
>>>>>>>who introduces useless ioctls rather than have a separate EEH number
>>>>>>>space that people can just throw anything in they like ;).
>>>>>>Well, I appreciate that, but having them as separate ioctls doesn't
>>>>>>really prevent that either.  Any one of these 4 could be set to take a
>>>>>>sub-option to extend and contort the EEH interface.  The only way to
>>>>>>prevent that would be to avoid the argsz+flags hack that make the ioctl
>>>>>>extendable.  Thanks,
>>>>>Sure, that's what patch review is about. I'm really more concerned about
>>>>>whose court the number space is in - you or Gavin. If we're talking
>>>>>about top level ioctls you will care a lot more.
>>>>>
>>>>>But I'm not religious about this. You're the VFIO maintainer, so it's
>>>>>your call. I just personally cringe when I see an ioctl that gets an
>>>>>"opcode" and a "parameter" argument where the "parameter" argument is a
>>>>>union with one struct for each opcode.
>>>>Well, what would it look like...
>>>>
>>>>struct vfio_eeh_pe_op {
>>>>	__u32 argsz;
>>>>	__u32 flags;
>>>>	__u32 op;
>>>>};
>>>>
>>>>Couldn't every single one of these be a separate "op"?  Are there any
>>>>cases where we can't use the ioctl return value?
>>>>
>>>>VFIO_EEH_PE_DISABLE
>>>>VFIO_EEH_PE_ENABLE
>>>>VFIO_EEH_PE_UNFREEZE_IO
>>>>VFIO_EEH_PE_UNFREEZE_DMA
>>>>VFIO_EEH_PE_GET_MODE
>>>>VFIO_EEH_PE_RESET_DEACTIVATE
>>>>VFIO_EEH_PE_RESET_HOT
>>>>VFIO_EEH_PE_RESET_FUNDAMENTAL
>>>>VFIO_EEH_PE_CONFIGURE
>>>>
>>>>It doesn't look that bad to me, what am I missing?  Thanks,
>>>Yup, that looks well to me as well :)
>>>
>>s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.
>>
>>I'll include this in next revision. Thanks, Alex.
>
>Yup, no need for CMD anymore then either :)
>

Yep. Thanks, Guys :)

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 16:32       ` Alex Williamson
@ 2014-05-29  0:05         ` Gavin Shan
  2014-05-29  0:44           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 31+ messages in thread
From: Gavin Shan @ 2014-05-29  0:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: aik, Gavin Shan, kvm-ppc, agraf, qiudayu, linuxppc-dev

On Wed, May 28, 2014 at 10:32:11AM -0600, Alex Williamson wrote:
>On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
>> On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
>> >On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
>> >> The patch adds new IOCTL commands for sPAPR VFIO container device
>> >> to support EEH functionality for PCI devices, which have been passed
>> >> through from host to somebody else via VFIO.
>> >> 
>> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> >> ---
>> >>  Documentation/vfio.txt              | 92 ++++++++++++++++++++++++++++++++++++-
>> >>  drivers/vfio/pci/Makefile           |  1 +
>> >>  drivers/vfio/pci/vfio_pci.c         | 20 +++++---
>> >>  drivers/vfio/pci/vfio_pci_eeh.c     | 46 +++++++++++++++++++
>> >>  drivers/vfio/pci/vfio_pci_private.h |  5 ++
>> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++++++++++++++++++++++++++++++++++
>> >>  include/uapi/linux/vfio.h           | 66 ++++++++++++++++++++++++++
>> >>  7 files changed, 308 insertions(+), 7 deletions(-)
>> >>  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
>> >> 
>> >> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> >> index b9ca023..d890fed 100644
>> >> --- a/Documentation/vfio.txt
>> >> +++ b/Documentation/vfio.txt
>> >> @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented in real mode which provides
>> >>  an excellent performance which has limitations such as inability to do
>> >>  locked pages accounting in real time.
>> >>  
>> >> -So 3 additional ioctls have been added:
>> >> +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
>> >> +subtree that can be treated as a unit for the purposes of partitioning and
>> >> +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
>> >> +function of a multi-function IOA, or multiple IOAs (possibly including switch
>> >> +and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors
>> >> +and recover from them via EEH RTAS services, which works on the basis of
>> >> +additional ioctl commands.
>> >> +
>> >> +So 7 additional ioctls have been added:
>> >>  
>> >>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>> >>  		of the DMA window on the PCI bus.
>> >> @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
>> >>  
>> >>  	VFIO_IOMMU_DISABLE - disables the container.
>> >>  
>> >> +	VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
>> >> +		specified device. Also, it can be used to remove IO or DMA
>> >> +		stopped state on the frozen PE.
>> >> +
>> >> +	VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
>> >> +
>> >> +	VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
>> >> +		error recovering.
>> >> +
>> >> +	VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
>> >> +		one of the major steps for error recoverying.
>> >>  
>> >>  The code flow from the example above should be slightly changed:
>> >>  
>> >> @@ -346,6 +365,77 @@ The code flow from the example above should be slightly changed:
>> >>  	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
>> >>  	.....
>> >>  
>> >> +Based on the initial example we have, the following piece of code could be
>> >> +reference for EEH setup and error handling:
>> >> +
>> >> +	struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
>> >> +	struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
>> >> +	struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
>> >> +	struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
>> >> +
>> >> +	....
>> >> +
>> >> +	/* Get a file descriptor for the device */
>> >> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
>> >> +
>> >> +	/* Enable the EEH functionality on the device */
>> >> +	option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
>> >> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
>> >> +
>> >> +	/* You're suggested to create additional data struct to represent
>> >> +	 * PE, and put child devices belonging to same IOMMU group to the
>> >> +	 * PE instance for later reference.
>> >> +	 */
>> >> +
>> >> +	/* Check the PE's state and make sure it's in functional state */
>> >> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> >> +
>> >> +	/* Save device's state. pci_save_state() would be good enough
>> >> +	 * as an example.
>> >> +	 */
>> >> +
>> >> +	/* Test and setup the device */
>> >> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
>> >> +
>> >> +	....
>> >> +
>> >> +	/* When 0xFF's returned from reading PCI config space or IO BARs
>> >> +	 * of the PCI device. Check the PE state to see if that has been
>> >> +	 * frozen.
>> >> +	 */
>> >> +	ioctl(container, VFIO_EEH_PE_GET_STATE, &state);
>> >> +
>> >> +	/* Waiting for pending PCI transactions to be completed and don't
>> >> +	 * produce any more PCI traffic from/to the affected PE until
>> >> +	 * recovery is finished.
>> >> +	 */
>> >> +
>> >> +	/* Enable IO for the affected PE and collect logs. Usually, the
>> >> +	 * standard part of PCI config space, AER registers are dumped
>> >> +	 * as logs for further analysis.
>> >> +	 */
>> >> +	option.option = VFIO_EEH_PE_SET_OPT_IO;
>> >> +	ioctl(container, VFIO_EEH_PE_SET_OPTION, &option);
>> >> +
>> >> +	/* Issue PE reset */
>> >> +	reset.option = VFIO_EEH_PE_RESET_HOT;
>> >> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
>> >> +	reset.option = VFIO_EEH_PE_RESET_DEACTIVATE;
>> >> +	ioctl(container, VFIO_EEH_PE_RESET, &reset);
>> >> +
>> >> +	/* Configure the PCI bridges for the affected PE */
>> >> +	ioctl(container, VFIO_EEH_PE_CONFIGURE, &configure);
>> >> +
>> >> +	/* Restored state we saved at initialization time. pci_restore_state()
>> >> +	 * is good enough as an example.
>> >> +	 */
>> >> +
>> >> +	/* Hopefully, error is recovered successfully. Now, you can resume to
>> >> +	 * start PCI traffic to/from the affected PE.
>> >> +	 */
>> >> +
>> >> +	....
>> >> +
>> >>  -------------------------------------------------------------------------------
>> >>  
>> >>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>> >> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> >> index 1310792..faad885 100644
>> >> --- a/drivers/vfio/pci/Makefile
>> >> +++ b/drivers/vfio/pci/Makefile
>> >> @@ -1,4 +1,5 @@
>> >>  
>> >>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>> >> +vfio-pci-y += vfio_pci_eeh.o
>> >
>> >Why do we build this w/o CONFIG_EEH?  Can't we define the functions as
>> >static inline in the header in that case?
>> >
>> 
>> Ok. Will do in next revision.
>> 
>> >>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> >> index 7ba0424..7c8d26a 100644
>> >> --- a/drivers/vfio/pci/vfio_pci.c
>> >> +++ b/drivers/vfio/pci/vfio_pci.c
>> >> @@ -156,8 +156,10 @@ static void vfio_pci_release(void *device_data)
>> >>  {
>> >>  	struct vfio_pci_device *vdev = device_data;
>> >>  
>> >> -	if (atomic_dec_and_test(&vdev->refcnt))
>> >> +	if (atomic_dec_and_test(&vdev->refcnt)) {
>> >> +		vfio_pci_eeh_release(vdev->pdev);
>> >>  		vfio_pci_disable(vdev);
>> >> +	}
>> >>  
>> >>  	module_put(THIS_MODULE);
>> >>  }
>> >> @@ -165,19 +167,25 @@ static void vfio_pci_release(void *device_data)
>> >>  static int vfio_pci_open(void *device_data)
>> >>  {
>> >>  	struct vfio_pci_device *vdev = device_data;
>> >> +	int ret;
>> >>  
>> >>  	if (!try_module_get(THIS_MODULE))
>> >>  		return -ENODEV;
>> >>  
>> >>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
>> >> -		int ret = vfio_pci_enable(vdev);
>> >> -		if (ret) {
>> >> -			module_put(THIS_MODULE);
>> >> -			return ret;
>> >> -		}
>> >> +		ret = vfio_pci_enable(vdev);
>> >> +		if (ret)
>> >> +			goto error;
>> >> +
>> >> +		ret = vfio_pci_eeh_open(vdev->pdev);
>> >> +		if (ret)
>> >> +			goto error;
>> >>  	}
>> >>  
>> >>  	return 0;
>> >> +error:
>> >> +	module_put(THIS_MODULE);
>> >> +	return ret;
>> >>  }
>> >>  
>> >>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>> >> diff --git a/drivers/vfio/pci/vfio_pci_eeh.c b/drivers/vfio/pci/vfio_pci_eeh.c
>> >> new file mode 100644
>> >> index 0000000..9c25207
>> >> --- /dev/null
>> >> +++ b/drivers/vfio/pci/vfio_pci_eeh.c
>> >> @@ -0,0 +1,46 @@
>> >> +/*
>> >> + * EEH functionality support for VFIO PCI devices.
>> >> + *
>> >> + * Copyright Gavin Shan, IBM Corporation 2014.
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or modify
>> >> + * it under the terms of the GNU General Public License version 2 as
>> >> + * published by the Free Software Foundation.
>> >> + */
>> >> +
>> >> +#include <linux/device.h>
>> >> +#include <linux/eventfd.h>
>> >> +#include <linux/file.h>
>> >> +#include <linux/interrupt.h>
>> >> +#include <linux/iommu.h>
>> >> +#include <linux/module.h>
>> >> +#include <linux/mutex.h>
>> >> +#include <linux/notifier.h>
>> >> +#include <linux/pci.h>
>> >> +#include <linux/pm_runtime.h>
>> >> +#include <linux/slab.h>
>> >> +#include <linux/types.h>
>> >> +#include <linux/uaccess.h>
>> >> +#include <linux/vfio.h>
>> >
>> >Cleanup the includes
>> >
>> 
>> Ok. Will do in next revision.
>> 
>> >> +#ifdef CONFIG_EEH
>> >> +#include <asm/eeh.h>
>> >> +#endif
>> >
>> >This shouldn't even be compiles w/o CONFIG_EEH
>> >
>> 
>> Yeah. Will fix in next revision as you suggested.
>> 
>> >> +
>> >> +#include "vfio_pci_private.h"
>> >> +
>> >> +int vfio_pci_eeh_open(struct pci_dev *pdev)
>> >> +{
>> >> +	int ret = 0;
>> >> +
>> >> +#ifdef CONFIG_EEH
>> >> +	ret = eeh_dev_open(pdev);
>> >> +#endif
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +void vfio_pci_eeh_release(struct pci_dev *pdev)
>> >> +{
>> >> +#ifdef CONFIG_EEH
>> >> +	eeh_dev_release(pdev);
>> >> +#endif
>> >> +}
>> >> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> >> index 9c6d5d0..c3cbe40 100644
>> >> --- a/drivers/vfio/pci/vfio_pci_private.h
>> >> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> >> @@ -90,4 +90,9 @@ extern void vfio_pci_virqfd_exit(void);
>> >>  
>> >>  extern int vfio_config_init(struct vfio_pci_device *vdev);
>> >>  extern void vfio_config_free(struct vfio_pci_device *vdev);
>> >> +
>> >> +/* EEH stub */
>> >> +extern int vfio_pci_eeh_open(struct pci_dev *pdev);
>> >> +extern void vfio_pci_eeh_release(struct pci_dev *pdev);
>> >
>> >The #ifdef with static inlines should be here.
>> >
>> 
>> thanks for the detailed comments.
>> 
>> >> +
>> >>  #endif /* VFIO_PCI_PRIVATE_H */
>> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> >> index a84788b..666691b 100644
>> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> >> @@ -21,6 +21,9 @@
>> >>  #include <linux/vfio.h>
>> >>  #include <asm/iommu.h>
>> >>  #include <asm/tce.h>
>> >> +#ifdef CONFIG_EEH
>> >> +#include <asm/eeh.h>
>> >> +#endif
>> >
>> >Now we're infecting another file with EEH, can't we put it in a central
>> >location for vfio?
>> 
>> How about something like this:
>> 
>> - Create drivers/vfio/vfio_eeh.c and it will be built when having CONFIG_EEH
>> - Declare all EEH functions in include/linux/vfio.h (it's the only central
>>   header file I can find except you have a better one).
>> 
>> #ifdef CONFIG_EEH
>> extern int vfio_pci_eeh_open(struct pci_dev *pdev);
>> extern void vfio_pci_eeh_release(struct pci_dev *pdev);
>> extern long vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
>> 				       unsigned int cmd, unsigned long arg); 
>> #else
>> static int vfio_pci_eeh_open(struct pci_dev *pdev)
>> {
>> 	return 0;
>> }
>> 
>> static void vfio_pci_eeh_release(struct pci_dev *pdev)
>> {
>> }
>> 
>> static inline vfio_iommu_spapr_eeh_ioctl(struct iommu_table *tbl,
>> 					 unsigned int cmd, unsigned long arg);
>> {
>> 	return -ENOTTY;
>> }
>> #endif
>
>Seems like an improvement
>

So I'll include the change in next revision.

>> >>  
>> >>  #define DRIVER_VERSION  "0.1"
>> >>  #define DRIVER_AUTHOR   "aik@ozlabs.ru"
>> >> @@ -147,6 +150,83 @@ static void tce_iommu_release(void *iommu_data)
>> >>  	kfree(container);
>> >>  }
>> >>  
>> >> +static long tce_iommu_eeh_ioctl(void *iommu_data,
>> >> +				unsigned int cmd, unsigned long arg)
>> >> +{
>> >> +	struct tce_container *container = iommu_data;
>> >> +	unsigned long minsz;
>> >> +	int ret = 0;
>> >> +
>> >> +#ifdef CONFIG_EEH
>> >> +	switch (cmd) {
>> >> +	case VFIO_EEH_PE_SET_OPTION: {
>> >> +		struct vfio_eeh_pe_set_option option;
>> >> +
>> >> +		minsz = offsetofend(struct vfio_eeh_pe_set_option, option);
>> >> +		if (copy_from_user(&option, (void __user *)arg, minsz))
>> >> +			return -EFAULT;
>> >> +		if (option.argsz < minsz)
>> >> +			return -EINVAL;
>> >> +
>> >> +		ret = eeh_pe_set_option(eeh_iommu_table_to_pe(container->tbl),
>> >> +					option.option);
>> >> +		break;
>> >> +	}
>> >> +	case VFIO_EEH_PE_GET_STATE: {
>> >> +		struct vfio_eeh_pe_get_state state;
>> >> +
>> >> +		minsz = offsetofend(struct vfio_eeh_pe_get_state, state);
>> >> +		if (copy_from_user(&state, (void __user *)arg, minsz))
>> >> +			return -EFAULT;
>> >> +		if (state.argsz < minsz)
>> >> +			return -EINVAL;
>> >> +
>> >> +		ret = eeh_pe_get_state(eeh_iommu_table_to_pe(container->tbl));
>> >> +		if (ret >= 0) {
>> >> +			state.state = ret;
>> >> +			if (copy_to_user((void __user *)arg, &state, minsz))
>> >> +				return -EFAULT;
>> >> +			ret = 0;
>> >> +		}
>> >
>> >This looks like one of those cases where you should just use the ioctl
>> >return value.
>> >
>> 
>> Ok. I'll use ioctl return value to carry output information. Note it might
>> have "0" as the output information.
>
>I think that's fine.
>
>

Ok. Thanks. I'll going to have "combined" ioctl command as you just suggested.
Also to use the return value from ioctl() for informtion carriage.

>> >> +		break;
>> >> +	}
>> >> +	case VFIO_EEH_PE_RESET: {
>> >> +		struct vfio_eeh_pe_reset reset;
>> >> +
>> >> +		minsz = offsetofend(struct vfio_eeh_pe_reset, option);
>> >> +		if (copy_from_user(&reset, (void __user *)arg, minsz))
>> >> +			return -EFAULT;
>> >> +		if (reset.argsz < minsz)
>> >> +			return -EINVAL;
>> >> +
>> >> +		ret = eeh_pe_reset(eeh_iommu_table_to_pe(container->tbl),
>> >> +				   reset.option);
>> >> +		break;
>> >> +	}
>> >> +	case VFIO_EEH_PE_CONFIGURE: {
>> >> +		struct vfio_eeh_pe_configure configure;
>> >> +
>> >> +		minsz = offsetofend(struct vfio_eeh_pe_configure, flags);
>> >> +		if (copy_from_user(&configure, (void __user *)arg, minsz))
>> >> +			return -EFAULT;
>> >> +		if (configure.argsz < minsz)
>> >> +			return -EINVAL;
>> >> +
>> >> +		ret = eeh_pe_configure(eeh_iommu_table_to_pe(container->tbl));
>> >> +		break;
>> >> +	}
>> >> +	default:
>> >> +		ret = -EINVAL;
>> >> +		pr_debug("%s: Cannot handle command %d\n",
>> >> +			__func__, cmd);
>> >> +	}
>> >> +#else
>> >> +	ret = -ENOENT;
>> >> +#endif
>> >
>> >Hmm, more like a BUG in the default case the way it's coded here (not
>> >even sure it's worth the default entry).  The #else case should probably
>> >be -ENOTTY like other unimplemented ioctls.
>> >
>> 
>> Yeah, we don't need default case here and I'll remove it in next revision.
>> Also, "-ENOTTY" will be used in future for unimplemented functions.
>> 
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >>  static long tce_iommu_ioctl(void *iommu_data,
>> >>  				 unsigned int cmd, unsigned long arg)
>> >>  {
>> >> @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>> >>  		tce_iommu_disable(container);
>> >>  		mutex_unlock(&container->lock);
>> >>  		return 0;
>> >> +	case VFIO_EEH_PE_SET_OPTION:
>> >> +	case VFIO_EEH_PE_GET_STATE:
>> >> +	case VFIO_EEH_PE_RESET:
>> >> +	case VFIO_EEH_PE_CONFIGURE:
>> >> +		return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
>> >
>> >This is where it would have really made sense to have a single
>> >VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
>> >AlexG, are you really attached to splitting these out into separate
>> >ioctls?
>> >
>> 
>> I really don't know. Alex.G want separate ioctl commands, but you
>> suggested to combine them. Could you guys please just tell me which
>> one (separate vs combined) I need to have in next revision? :-)
>> 
>> >>  	}
>> >>  
>> >>  	return -ENOTTY;
>> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> >> index cb9023d..c5fac36 100644
>> >> --- a/include/uapi/linux/vfio.h
>> >> +++ b/include/uapi/linux/vfio.h
>> >> @@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
>> >>  
>> >>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>> >>  
>> >> +/*
>> >> + * EEH functionality can be enabled or disabled on one specific device.
>> >> + * Also, the DMA or IO frozen state can be removed from the frozen PE
>> >> + * if required.
>> >> + */
>> >> +struct vfio_eeh_pe_set_option {
>> >> +	__u32 argsz;
>> >> +	__u32 flags;
>> >> +	__u32 option;
>> >> +#define VFIO_EEH_PE_SET_OPT_DISABLE	0	/* Disable EEH	*/
>> >> +#define VFIO_EEH_PE_SET_OPT_ENABLE	1	/* Enable EEH	*/
>> >> +#define VFIO_EEH_PE_SET_OPT_IO		2	/* Enable IO	*/
>> >> +#define VFIO_EEH_PE_SET_OPT_DMA		3	/* Enable DMA	*/
>> >
>> >This is more of a "command" than an "option" isn't it?  Each of these
>> >probably needs a more significant description.
>> >
>> 
>> Yeah, it would be regarded as "opcode" and I'll add more description about
>> them in next revision.
>> 
>> >> +};
>> >> +
>> >> +#define VFIO_EEH_PE_SET_OPTION		_IO(VFIO_TYPE, VFIO_BASE + 21)
>> >> +
>> >> +/*
>> >> + * Each EEH PE should have unique address to be identified. PE's
>> >> + * sharing mode is also useful information as well.
>> >> + */
>> >> +#define VFIO_EEH_PE_GET_ADDRESS		0	/* Get address	*/
>> >> +#define VFIO_EEH_PE_GET_MODE		1	/* Query mode	*/
>> >> +#define VFIO_EEH_PE_MODE_NONE		0	/* Not a PE	*/
>> >> +#define VFIO_EEH_PE_MODE_NOT_SHARED	1	/* Exclusive	*/
>> >> +#define VFIO_EEH_PE_MODE_SHARED		2	/* Shared mode	*/
>> >> +
>> >> +/*
>> >> + * EEH PE might have been frozen because of PCI errors. Also, it might
>> >> + * be experiencing reset for error revoery. The following command helps
>> >> + * to get the state.
>> >> + */
>> >> +struct vfio_eeh_pe_get_state {
>> >> +	__u32 argsz;
>> >> +	__u32 flags;
>> >> +	__u32 state;
>> >> +};
>> >
>> >Should state be a union to better describe the value returned?  What
>> >exactly is the address and why does the user need to know it?  Does this
>> >need user input or could we just return the address and mode regardless?
>> >
>> 
>> Ok. I think you want enum (not union) for state. I'll have macros for the
>> state in next revision as I did that for other cases.
>> 
>> Those macros defined for "address" just for ABI stuff as Alex.G mentioned.
>> There isn't corresponding ioctl command for host to get address any more
>> because QEMU (user) will have to figure it out by himself. The "address"
>> here means PE address and user has to figure it out according to PE
>> segmentation.
>> 
>> >> +
>> >> +#define VFIO_EEH_PE_GET_STATE		_IO(VFIO_TYPE, VFIO_BASE + 22)
>> >> +
>> >> +/*
>> >> + * Reset is the major step to recover problematic PE. The following
>> >> + * command helps on that.
>> >> + */
>> >> +struct vfio_eeh_pe_reset {
>> >> +	__u32 argsz;
>> >> +	__u32 flags;
>> >> +	__u32 option;
>> >> +#define VFIO_EEH_PE_RESET_DEACTIVATE	0	/* Deactivate reset	*/
>> >> +#define VFIO_EEH_PE_RESET_HOT		1	/* Hot reset		*/
>> >> +#define VFIO_EEH_PE_RESET_FUNDAMENTAL	3	/* Fundamental reset	*/
>> >
>> >How does a user know which of these to use?
>> >
>> 
>> I think Ben already helped explaining it for a lot in the subsequent
>> replies. Thanks to Ben :-)
>> 
>> >> +};
>> >> +
>> >> +#define VFIO_EEH_PE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 23)
>> >> +
>> >> +/*
>> >> + * One of the steps for recovery after PE reset is to configure the
>> >> + * PCI bridges affected by the PE reset.
>> >> + */
>> >> +struct vfio_eeh_pe_configure {
>> >> +	__u32 argsz;
>> >> +	__u32 flags;
>> >> +};
>> >
>> >Parameters are probably not necessary here.
>> >
>> 
>> Yep. I'll remove it in next revision. Thanks for your comments, Alex.
>> 
>> >> +
>> >> +#define VFIO_EEH_PE_CONFIGURE		_IO(VFIO_TYPE, VFIO_BASE + 24)
>> >> +
>> >>  /* ***************************************************************** */
>> >>  
>> >>  #endif /* _UAPIVFIO_H */
>> >
>> >How does a user learn that a device supports EEH?  Do they just start
>> >making ioctl calls and getting a failure?  Shouldn't we make use of one
>> >of the flag bits on the device or add a capability on the container for
>> >the user to query?
>> >
>> 
>> User needs to make some ioctl calls to make sure EEH can be supported on
>> one specific device:
>> 
>> 	struct vfio_eeh_pe_set_option set_option;
>> 
>> 	/* User have to make sure the device isn't a bridge and there
>> 	 * has one PE for it, which means that case of VFIO_EEH_PE_MODE_NONE.
>> 	 */
>> 
>> 	set_option.argsz = sizeof(set_option);
>> 	set_option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
>> 	ret = ioctl(container, VFIO_EEH_PE_SET_OPTION, &set_option);
>> 	if (ret < 0) {
>> 		/* EEH can't supported */
>> 	}
>
>No, the user should be able to query whether it's available with
>VFIO_CHECK_EXTENSION.
>

Sounds like a good idea. If you agree, I'll have one more extention option
and use it in tce_iommu_ioctl():

#define VFIO_EEH_SUPPORT	5

>I think I'm also still looking for an explanation of how the user learns
>about the error that occurred, not just that it occurred, but what it
>was.  The description of the recovery process indicates a lot retrieval
>step, but there's no interface here for the user to get logs.  Is that a
>TBD?  Thanks,
>

The log stuff is TBD and I'll figure it out later.

About to what are the errors, there are a lot. Most of them are related
to hardware level, for example unstable PCI link. Usually, those error
bits defined in AER fatal error state register contribute to EEH errors.
It could be software related, e.g. violating IOMMU protection (read/write
permission etc), or even one PCI device isn't capable of DMAing. Hopefully,
it's the explaination you're looking for? :-)

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-29  0:05         ` Gavin Shan
@ 2014-05-29  0:44           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-29  0:44 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, Alex Williamson, qiudayu, linuxppc-dev

On Thu, 2014-05-29 at 10:05 +1000, Gavin Shan wrote:
> The log stuff is TBD and I'll figure it out later.
> 
> About to what are the errors, there are a lot. Most of them are related
> to hardware level, for example unstable PCI link. Usually, those error
> bits defined in AER fatal error state register contribute to EEH errors.
> It could be software related, e.g. violating IOMMU protection (read/write
> permission etc), or even one PCI device isn't capable of DMAing. Hopefully,
> it's the explaination you're looking for? :-)

Note to Alex('s) ...

The log we get from FW at the moment in the host is:

  - In the case of pHyp / RTAS host, opaque. Basically it's a blob that we store
and that can be sent to IBM service people :-) Not terribly practical.

  - On PowerNV, it's a IODA specific data structure (basically a dump of a 
bunch of register state and tables). IODA is our IO architecture (sadly the
document itself isn't public at this point) and we have two versions, IODA1
and IODA2. You can consider the structure as chipset specific basically.

What I want to do in the long run is:

  - In the case of pHyp/RTAS host, there's not much we can do, so basically
forward that log as-is.

  - In the case of PowerNV, forward the log *and* add a well-defined blob to
it that does some basic interpretation of it. In fact I want to do the latter
more generally in the host kernel for host kernel errors as well, but we
can forward it to the guest via VFIO too. What I mean by interpretation is
something along the lines of an error type "DMA IOMMU fault, MMIO error,
Link loss, PCIe UE, ..." among a list of well known error types that
represent the most common ones, with a little bit of added info when
available (for most DMA errors we can provide the DMA address that faulted
for example).

So Gavin and I need to work a bit on that, both in the context of the host
kernel to improve the reporting there, and in the context of what we pass to
user space.

However, no driver today cares about that information. The PCIe error recovery
system doesn't carry it and it has no impact on the EEH recovery procedures,
so EEH in that sense is useful without that reporting. It's useful for the
programmer (or user/admin) to identify what went wrong but it's not used by
the automated recovery process.

One last thing to look at is in the case of a VFIO device, we might want to
silence the host kernel printf's once we support guest EEH since otherwise
the guest has a path to flood the host kernel log by triggering a lot of
EEH errors purposefully.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-28 21:58           ` Benjamin Herrenschmidt
  2014-05-28 22:46             ` Alexander Graf
@ 2014-05-30  3:44             ` Alexey Kardashevskiy
  2014-05-30  3:49               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 31+ messages in thread
From: Alexey Kardashevskiy @ 2014-05-30  3:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Gavin Shan
  Cc: qiudayu, Alex Williamson, linuxppc-dev, Alexander Graf, kvm-ppc

On 05/29/2014 07:58 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
>>
>> I will remove those "address" related macros in next revision because it's
>> user-level bussiness, not related to host kernel any more.
>>
>> If the user is QEMU + guest, we need the address to identify the PE though PHB
>> BUID could be used as same purpose to get PHB, which is one-to-one mapping with
>> IOMMU group on sPAPR platform. However, once the PE address is built and returned
>> to guest, guest will use the PE address as input parameter in subsequent RTAS
>> calls.
>>
>> If the user is some kind of application who just uses the ioctl() without supporting
>> RTAS calls. We don't need care about PE address. 
> 
> I am a bit reluctant with that PE==PHB equation we seem to be introducing.
>
> This isn't the case in HW.

It is pseries, not real HW. Does phyp allow multiple real host PEs on the
same virtual PHB?


> It's possible that this is how we handle VFIO *today*
> in qemu but it doesn't have to be does it ?
> 
> It also paints us into a corner if we want to start implementing some kind of
> emulated EEH for selected emulated devices and/or virtio.




-- 
Alexey

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device
  2014-05-30  3:44             ` Alexey Kardashevskiy
@ 2014-05-30  3:49               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-30  3:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Gavin Shan, kvm-ppc, Alexander Graf, Alex Williamson, qiudayu,
	linuxppc-dev

On Fri, 2014-05-30 at 13:44 +1000, Alexey Kardashevskiy wrote:
> 
> It is pseries, not real HW. Does phyp allow multiple real host PEs on
> the same virtual PHB?

I don't know how recent pHyp versions do it. I think they pretty much
have a PE == 1 virtual PHB so you might be right there.

On older HW they definitely used P2P bridges as PEs.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2014-05-30  3:49 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-27  8:40 [PATCH v7 0/3] EEH Support for VFIO PCI Device Gavin Shan
2014-05-27  8:40 ` [PATCH v7 1/3] powerpc/eeh: Avoid event on passed PE Gavin Shan
2014-05-27  8:40 ` [PATCH v7 2/3] powerpc/eeh: EEH support for VFIO PCI device Gavin Shan
2014-05-27  8:40 ` [PATCH v7 3/3] drivers/vfio: " Gavin Shan
2014-05-27 18:15   ` Alex Williamson
2014-05-27 20:30     ` Benjamin Herrenschmidt
2014-05-27 20:37       ` Alex Williamson
2014-05-27 20:41         ` Benjamin Herrenschmidt
2014-05-27 22:49     ` Alexander Graf
2014-05-28  0:39       ` Alex Williamson
2014-05-28  0:44         ` Alexander Graf
2014-05-28  0:57           ` Alex Williamson
2014-05-28 11:37             ` Alexander Graf
2014-05-28 16:17               ` Alex Williamson
2014-05-28 22:40                 ` Alexander Graf
2014-05-28 23:37                   ` Gavin Shan
2014-05-28 23:38                     ` Alexander Graf
2014-05-28 23:41                       ` Gavin Shan
2014-05-28  0:55     ` Gavin Shan
2014-05-28 11:41       ` Alexander Graf
2014-05-28 12:49         ` Gavin Shan
2014-05-28 13:12           ` Alexander Graf
2014-05-28 23:13             ` Gavin Shan
2014-05-28 21:58           ` Benjamin Herrenschmidt
2014-05-28 22:46             ` Alexander Graf
2014-05-28 23:18               ` Benjamin Herrenschmidt
2014-05-30  3:44             ` Alexey Kardashevskiy
2014-05-30  3:49               ` Benjamin Herrenschmidt
2014-05-28 16:32       ` Alex Williamson
2014-05-29  0:05         ` Gavin Shan
2014-05-29  0:44           ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).