linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/4] EEH Support for VFIO PCI device
@ 2014-05-21  5:03 Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH Gavin Shan
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-21  5:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The series of patches intends to support EEH for PCI devices, which are
passed through to PowerKVM based guest via VFIO. The implementation is
straightforward based on the issues or problems we have to resolve to
support EEH for PowerKVM based guest.

- Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly.
  If QEMU can't handle it, the request will be sent to host via newly introduced
  VFIO container IOCTL command (VFIO_EEH_OP) and gets handled in host kernel.

The series of patches requires corresponding QEMU changes.

Change log
==========
v1 -> v2:
        * EEH RTAS requests are routed to QEMU, and then possiblly to host kerenl.
          The mechanism KVM in-kernel handling is dropped.
        * Error injection is reimplemented based syscall, instead of KVM in-kerenl
          handling. The logic for error injection token management is moved to
          QEMU. The error injection request is routed to QEMU and then possiblly
          to host kernel.
v2 -> v3:
        * Make the fields in struct eeh_vfio_pci_addr, struct vfio_eeh_info based
          on the comments from Alexey.
        * Define macros for EEH VFIO operations (Alexey).
        * Clear frozen state after successful PE reset.
        * Merge original [PATCH 1/2/3] to one.
v3 -> v4:
        * Remove the error injection from the patchset. Mike or I will work on that
          later.
        * Rename CONFIG_VFIO_EEH to VFIO_PCI_EEH.
        * Rename the IOCTL command to VFIO_EEH_OP and it's handled by VFIO-PCI device
          instead of VFIO container.
        * Rename the IOCTL argument structure to "vfio_eeh_op" accordingly. Also, more
          fields added to hold return values for RTAS requests.
        * The address mapping stuff is totally removed. When opening or releasing VFIO
          PCI device, notification sent to EEH to update the flags indicates the device
          is passed to guest or not.
        * Change pr_warn() to pr_debug() to avoid DOS as pointed by Alex.W
        * Argument size check issue pointed by Alex.W.
v4 -> v5:
	* Functions for VFIO PCI EEH support are moved to eeh.c and exported from there.
	  VFIO PCI driver just uses those functions to tackle IOCTL command VFIO_EEH_OP.
	  All of this is to make the code organized in a good way as suggested by Alex.G.
	  Another potential benefit is PowerNV/pSeries are sharing "eeh_ops" and same
	  infrastructure could possiblly work for KVM_PR and KVM_HV mode at the same time.
	* Don't clear error injection registers after finishing PE reset as the patchset
	  is doing nothing related to error injection.
	* Amending Documentation/vfio.txt, which was missed in last revision.
	* No QEMU changes for this revision. "v4" works well. Also, remove "RFC" from the
	  subject as the design is basically recognized.

----

Gavin Shan (4):
  drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH
  powerpc/eeh: Flags for passed device and PE
  drivers/vfio: EEH support for VFIO PCI device
  powerpc/eeh: Avoid event on passed PE

 Documentation/vfio.txt                    |   6 +-
 arch/powerpc/include/asm/eeh.h            |  42 ++++
 arch/powerpc/kernel/eeh.c                 | 331 ++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c |   3 +-
 drivers/vfio/pci/Kconfig                  |   6 +
 drivers/vfio/pci/vfio_pci.c               |  99 ++++++++-
 include/uapi/linux/vfio.h                 |  43 ++++
 7 files changed, 522 insertions(+), 8 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH
  2014-05-21  5:03 [PATCH v5 0/4] EEH Support for VFIO PCI device Gavin Shan
@ 2014-05-21  5:03 ` Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 2/4] powerpc/eeh: Flags for passed device and PE Gavin Shan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-21  5:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The patch introduces CONFIG_VFIO_PCI_EEH for more IOCTL commands
on VFIO PCI device support EEH funtionality for PCI devices that
are passed through from host to guest.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/vfio/pci/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c41b01e..dd0e0f0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,7 @@
 config VFIO_PCI
 	tristate "VFIO support for PCI devices"
 	depends on VFIO && PCI && EVENTFD
+	select VFIO_PCI_EEH if PPC_POWERNV
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.
@@ -16,3 +17,8 @@ config VFIO_PCI_VGA
 	  BIOS and generic video drivers.
 
 	  If you don't know what to do here, say N.
+
+config VFIO_PCI_EEH
+	tristate
+	depends on EEH && VFIO_IOMMU_SPAPR_TCE
+	default n
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 2/4] powerpc/eeh: Flags for passed device and PE
  2014-05-21  5:03 [PATCH v5 0/4] EEH Support for VFIO PCI device Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH Gavin Shan
@ 2014-05-21  5:03 ` Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE Gavin Shan
  3 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-21  5:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The patch introduces new flags for EEH device and PE to indicate
that the device or PE has been passed through to guest. In turn,
we will deliver EEH errors to guest for further handling, which
will be done in subsequent patches.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 7782056..34a2d83 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -72,6 +72,7 @@ struct device_node;
 #define EEH_PE_RESET		(1 << 2)	/* PE reset in progress	*/
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
+#define EEH_PE_PASSTHROUGH	(1 << 9)	/* PE owned by guest	*/
 
 struct eeh_pe {
 	int type;			/* PE type: PHB/Bus/Device	*/
@@ -93,6 +94,21 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
 		list_for_each_entry_safe(edev, tmp, &pe->edevs, list)
 
+static inline bool eeh_pe_passed(struct eeh_pe *pe)
+{
+	return pe ? !!(pe->state & EEH_PE_PASSTHROUGH) : false;
+}
+
+static inline void eeh_pe_set_passed(struct eeh_pe *pe, bool passed)
+{
+	if (pe) {
+		if (passed)
+			pe->state |= EEH_PE_PASSTHROUGH;
+		else
+			pe->state &= ~EEH_PE_PASSTHROUGH;
+	}
+}
+
 /*
  * The struct is used to trace EEH state for the associated
  * PCI device node or PCI device. In future, it might
@@ -110,6 +126,7 @@ struct eeh_pe {
 #define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 #define EEH_DEV_REMOVED		(1 << 10)	/* Removed permanently	*/
 #define EEH_DEV_FRESET		(1 << 11)	/* Fundamental reset	*/
+#define EEH_DEV_PASSTHROUGH	(1 << 12)	/* Owned by guest	*/
 
 struct eeh_dev {
 	int mode;			/* EEH mode			*/
@@ -138,6 +155,21 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct eeh_dev *edev)
 	return edev ? edev->pdev : NULL;
 }
 
+static inline bool eeh_dev_passed(struct eeh_dev *dev)
+{
+	return dev ? !!(dev->mode & EEH_DEV_PASSTHROUGH) : false;
+}
+
+static inline void eeh_dev_set_passed(struct eeh_dev *dev, bool passed)
+{
+	if (dev) {
+		if (passed)
+			dev->mode |= EEH_DEV_PASSTHROUGH;
+		else
+			dev->mode &= ~EEH_DEV_PASSTHROUGH;
+	}
+}
+
 /* Return values from eeh_ops::next_error */
 enum {
 	EEH_NEXT_ERR_NONE = 0,
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
  2014-05-21  5:03 [PATCH v5 0/4] EEH Support for VFIO PCI device Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH Gavin Shan
  2014-05-21  5:03 ` [PATCH v5 2/4] powerpc/eeh: Flags for passed device and PE Gavin Shan
@ 2014-05-21  5:03 ` Gavin Shan
  2014-05-21 13:07   ` Alexander Graf
  2014-05-21  5:03 ` [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE Gavin Shan
  3 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2014-05-21  5:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 Documentation/vfio.txt         |   6 +-
 arch/powerpc/include/asm/eeh.h |  10 ++
 arch/powerpc/kernel/eeh.c      | 323 +++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c    |  99 ++++++++++++-
 include/uapi/linux/vfio.h      |  43 ++++++
 5 files changed, 474 insertions(+), 7 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index b9ca023..bb17ec7 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
 an excellent performance which has limitations such as inability to do
 locked pages accounting in real time.
 
-So 3 additional ioctls have been added:
+4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
+which works on the basis of additional ioctl command VFIO_EEH_OP.
+
+So 4 additional ioctls have been added:
 
 	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
 		of the DMA window on the PCI bus.
@@ -316,6 +319,7 @@ So 3 additional ioctls have been added:
 
 	VFIO_IOMMU_DISABLE - disables the container.
 
+	VFIO_EEH_OP - EEH dependent operations
 
 The code flow from the example above should be slightly changed:
 
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 34a2d83..93922ef 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -305,6 +305,16 @@ void eeh_add_device_late(struct pci_dev *);
 void eeh_add_device_tree_late(struct pci_bus *);
 void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_device(struct pci_dev *);
+#ifdef CONFIG_VFIO_PCI_EEH
+int eeh_vfio_open(struct pci_dev *pdev);
+void eeh_vfio_release(struct pci_dev *pdev);
+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval);
+int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
+			 int *retval, int *info);
+int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state);
+int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval);
+int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval);
+#endif
 
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9c6b899..2aaf90e 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1098,6 +1098,329 @@ void eeh_remove_device(struct pci_dev *dev)
 	edev->mode &= ~EEH_DEV_SYSFS;
 }
 
+#ifdef CONFIG_VFIO_PCI_EEH
+int eeh_vfio_open(struct pci_dev *pdev)
+{
+	struct eeh_dev *edev;
+
+	/* No PCI device ? */
+	if (!pdev)
+		return -ENODEV;
+
+	/* No EEH device ? */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !edev->pe)
+		return -ENODEV;
+
+	eeh_dev_set_passed(edev, true);
+	eeh_pe_set_passed(edev->pe, true);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_open);
+
+void eeh_vfio_release(struct pci_dev *pdev)
+{
+	bool release_pe = true;
+	struct eeh_pe *pe = NULL;
+	struct eeh_dev *tmp, *edev;
+
+	/* No PCI device ? */
+	if (!pdev)
+		return;
+
+	/* No EEH device ? */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !eeh_dev_passed(edev) ||
+	    !edev->pe || !eeh_pe_passed(pe))
+		return;
+
+	/* Release device */
+	pe = edev->pe;
+	eeh_dev_set_passed(edev, false);
+
+	/* Release PE */
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (eeh_dev_passed(edev)) {
+			release_pe = false;
+			break;
+		}
+	}
+
+	if (release_pe)
+		eeh_pe_set_passed(pe, false);
+}
+EXPORT_SYMBOL(eeh_vfio_release);
+
+static int eeh_vfio_check_dev(struct pci_dev *pdev,
+			      struct eeh_dev **pedev,
+			      struct eeh_pe **ppe)
+{
+	struct eeh_dev *edev;
+
+	/* No device ? */
+	if (!pdev)
+		return -ENODEV;
+
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (!edev || !eeh_dev_passed(edev) ||
+	    !edev->pe || !eeh_pe_passed(edev->pe))
+		return -ENODEV;
+
+	if (pedev)
+		*pedev = edev;
+	if (ppe)
+		*ppe = edev->pe;
+
+	return 0;
+}
+
+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval)
+{
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int ret = 0;
+
+	/* Device existing ? */
+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+	if (ret) {
+		pr_debug("%s: Cannot find device %s\n",
+			__func__, pdev ? pci_name(pdev) : "NULL");
+		*retval = -7;
+		goto out;
+	}
+
+	/* Invalid option ? */
+	if (option < EEH_OPT_DISABLE ||
+	    option > EEH_OPT_THAW_DMA) {
+		pr_debug("%s: Option %d out of range (%d, %d)\n",
+			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
+		*retval = -3;
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (option == EEH_OPT_DISABLE ||
+	    option == EEH_OPT_ENABLE) {
+		*retval = 0;
+	} else {
+		if (!eeh_ops || !eeh_ops->set_option) {
+			*retval = -7;
+			ret = -ENOENT;
+			goto out;
+		}
+
+		ret = eeh_ops->set_option(pe, option);
+		if (ret) {
+			pr_debug("%s: Failure %d from backend\n",
+				__func__, ret);
+			*retval = -1;
+			goto out;
+		}
+
+		*retval = 0;
+	}
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_set_pe_option);
+
+int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
+			 int *retval, int *info)
+{
+	struct pci_bus *bus;
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int ret = 0;
+
+	/* Device existing ? */
+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+	if (ret) {
+		*retval = -3;
+		goto out;
+	}
+
+	/* Invalid option ? */
+	if (option != 0 && option != 1) {
+		pr_debug("%s: option %d out of range (0, 1)\n",
+			__func__, option);
+		*retval = -3;
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Fill result according to option. We don't differentiate
+	 * PCI bus and device dependent PE here. So all PEs are
+	 * built in "shared" mode. Also, the PE address has the format
+	 * of "00BBSS00".
+	 */
+	if (option == 0) {
+		bus = eeh_pe_bus_get(pe);
+		if (!bus) {
+			*retval = -3;
+			ret = -ENODEV;
+			goto out;
+		}
+
+		*retval = 0;
+		*info = bus->number << 16;
+	} else {
+		*retval = 0;
+		*info = 1;
+	}
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_addr);
+
+int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state)
+{
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int result, ret = 0;
+
+	/* Device existing ? */
+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+	if (ret) {
+		*retval = -3;
+		goto out;
+	}
+
+	if (!eeh_ops || !eeh_ops->get_state) {
+		pr_debug("%s: Unsupported request\n",
+			__func__);
+		ret = -ENOENT;
+		*retval = -3;
+		goto out;
+	}
+
+	result = eeh_ops->get_state(pe, NULL);
+	if (!(result & EEH_STATE_RESET_ACTIVE) &&
+	     (result & EEH_STATE_DMA_ENABLED) &&
+	     (result & EEH_STATE_MMIO_ENABLED))
+		*state = 0;
+	else if (result & EEH_STATE_RESET_ACTIVE)
+		*state = 1;
+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+		 !(result & EEH_STATE_DMA_ENABLED) &&
+		 !(result & EEH_STATE_MMIO_ENABLED))
+		*state = 2;
+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+		 (result & EEH_STATE_DMA_ENABLED) &&
+		 !(result & EEH_STATE_MMIO_ENABLED))
+		*state = 4;
+	else
+		*state = 5;
+
+	*retval = 0;
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_state);
+
+int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval)
+{
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int ret = 0;
+
+	/* Device existing ? */
+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+	if (ret) {
+		*retval = -3;
+		goto out;
+	}
+
+	/* Invalid option ? */
+	if (option != EEH_RESET_DEACTIVATE &&
+	    option != EEH_RESET_HOT &&
+	    option != EEH_RESET_FUNDAMENTAL) {
+		pr_debug("%s: Unsupported option %d\n",
+			__func__, option);
+		ret = -EINVAL;
+		*retval = -3;
+		goto out;
+	}
+
+	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset) {
+		pr_debug("%s: Unsupported request\n",
+			__func__);
+		ret = -ENOENT;
+		*retval = -7;
+		goto out;
+	}
+
+	ret = eeh_ops->reset(pe, option);
+	if (ret) {
+		pr_debug("%s: Failure %d from backend\n",
+			 __func__, ret);
+		*retval = -1;
+		goto out;
+	}
+
+	/*
+	 * The PE is still in frozen state and we need clear that.
+	 * It's good to clear frozen state after deassert to avoid
+	 * messy IO access during reset, which might cause recrusive
+	 * frozen PE.
+	 */
+	if (option == EEH_RESET_DEACTIVATE) {
+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
+		if (ret) {
+			pr_debug("%s: Cannot enable IO for PHB#%d-PE#%d (%d)\n",
+				__func__, pe->phb->global_number, pe->addr, ret);
+			*retval = -1;
+			goto out;
+		}
+
+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
+		if (ret) {
+			pr_debug("%s: Cannot enable DMA for PHB#%d-PE#%d (%d)\n",
+				__func__, pe->phb->global_number, pe->addr, ret);
+			*retval = -1;
+			goto out;
+		}
+
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+	}
+
+	*retval = 0;
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_reset_pe);
+
+int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval)
+{
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int ret = 0;
+
+	/* Device existing ? */
+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
+	if (ret) {
+		*retval = -3;
+		goto out;
+	}
+
+	/*
+	 * The access to PCI config space on VFIO device has some
+	 * limitations. Part of PCI config space, including BAR
+	 * registers are not readable and writable. So the guest
+	 * should have stale values for those registers and we have
+	 * to restore them in host side.
+	 */
+	eeh_pe_restore_bars(pe);
+	*retval = 0;
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_configure_pe);
+
+#endif /* CONFIG_VFIO_PCI_EEH */
+
 static int proc_eeh_show(struct seq_file *m, void *v)
 {
 	if (!eeh_enabled()) {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7ba0424..05c3dde 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -25,6 +25,9 @@
 #include <linux/types.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
+#ifdef CONFIG_VFIO_PCI_EEH
+#include <asm/eeh.h>
+#endif
 
 #include "vfio_pci_private.h"
 
@@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 	pci_restore_state(pdev);
 }
 
+static void vfio_eeh_pci_release(struct pci_dev *pdev)
+{
+#ifdef CONFIG_VFIO_PCI_EEH
+	eeh_vfio_release(pdev);
+#endif
+}
+
 static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
-	if (atomic_dec_and_test(&vdev->refcnt))
+	if (atomic_dec_and_test(&vdev->refcnt)) {
+		vfio_eeh_pci_release(vdev->pdev);
 		vfio_pci_disable(vdev);
+	}
 
 	module_put(THIS_MODULE);
 }
 
+static int vfio_eeh_pci_open(struct pci_dev *pdev)
+{
+	int ret = 0;
+
+#ifdef CONFIG_VFIO_PCI_EEH
+	ret = eeh_vfio_open(pdev);
+#endif
+	return ret;
+}
+
 static int vfio_pci_open(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
+	int ret;
 
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
 	if (atomic_inc_return(&vdev->refcnt) == 1) {
-		int ret = vfio_pci_enable(vdev);
-		if (ret) {
-			module_put(THIS_MODULE);
-			return ret;
-		}
+		ret = vfio_pci_enable(vdev);
+		if (ret)
+			goto error;
+
+		ret = vfio_eeh_pci_open(vdev->pdev);
+		if (ret)
+			goto error;
 	}
 
 	return 0;
+error:
+	module_put(THIS_MODULE);
+	return ret;
 }
 
 static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
@@ -321,6 +349,51 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
 	return walk.ret;
 }
 
+static int vfio_eeh_pci_ioctl(struct pci_dev *pdev, struct vfio_eeh_op *info)
+{
+	int ret = 0;
+
+#ifdef CONFIG_VFIO_PCI_EEH
+	switch (info->op) {
+	case VFIO_EEH_OP_SET_OPTION:
+		ret = eeh_vfio_set_pe_option(pdev,
+					     info->option.option,
+					     &info->option.ret);
+		break;
+	case VFIO_EEH_OP_GET_ADDR:
+		ret = eeh_vfio_get_pe_addr(pdev,
+					   info->addr.option,
+					   &info->addr.ret,
+					   &info->addr.info);
+		break;
+	case VFIO_EEH_OP_GET_STATE:
+		ret = eeh_vfio_get_pe_state(pdev,
+					    &info->state.ret,
+					    &info->state.reset_state);
+		info->state.cfg_cap = 1;
+		info->state.pe_unavail_info = 1000;
+		info->state.pe_recovery_info = 0;
+		break;
+	case VFIO_EEH_OP_PE_RESET:
+		ret = eeh_vfio_reset_pe(pdev,
+					info->reset.option,
+					&info->reset.ret);
+		break;
+	case VFIO_EEH_OP_PE_CONFIG:
+		ret = eeh_vfio_configure_pe(pdev,
+					    &info->config.ret);
+	default:
+		ret = -EINVAL;
+		pr_debug("%s: Cannot handle op#%d\n",
+			__func__, info->op);
+	}
+#else
+	ret = -ENOENT;
+#endif
+
+	return ret;
+}
+
 static long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
@@ -682,6 +755,20 @@ hot_reset_release:
 
 		kfree(groups);
 		return ret;
+	} else if (cmd == VFIO_EEH_OP) {
+		struct vfio_eeh_op info;
+		int ret = 0;
+
+		minsz = sizeof(info);
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = vfio_eeh_pci_ioctl(vdev->pdev, &info);
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			ret = -EFAULT;
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..518961d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,49 @@ struct vfio_iommu_spapr_tce_info {
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * The VFIO operation struct provides way to support EEH functionality
+ * for PCI device that is passed from host to guest via VFIO.
+ */
+#define VFIO_EEH_OP_SET_OPTION	0
+#define VFIO_EEH_OP_GET_ADDR	1
+#define VFIO_EEH_OP_GET_STATE	2
+#define VFIO_EEH_OP_PE_RESET	3
+#define VFIO_EEH_OP_PE_CONFIG	4
+
+struct vfio_eeh_op {
+	__u32 argsz;
+	__u32 op;
+
+	union {
+		struct vfio_eeh_set_option {
+			__u32 option;
+			__s32 ret;
+		} option;
+		struct vfio_eeh_pe_addr {
+			__u32 option;
+			__s32 ret;
+			__u32 info;
+		} addr;
+		struct vfio_eeh_pe_state {
+			__s32 ret;
+			__u32 reset_state;
+			__u32 cfg_cap;
+			__u32 pe_unavail_info;
+			__u32 pe_recovery_info;
+		} state;
+		struct vfio_eeh_reset {
+			__u32 option;
+			__s32 ret;
+		} reset;
+		struct vfio_eeh_config {
+			__s32 ret;
+		} config;
+	};
+};
+
+#define VFIO_EEH_OP	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE
  2014-05-21  5:03 [PATCH v5 0/4] EEH Support for VFIO PCI device Gavin Shan
                   ` (2 preceding siblings ...)
  2014-05-21  5:03 ` [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device Gavin Shan
@ 2014-05-21  5:03 ` Gavin Shan
  2014-05-21 13:13   ` Alexander Graf
  3 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2014-05-21  5:03 UTC (permalink / raw)
  To: kvm-ppc; +Cc: aik, agraf, Gavin Shan, alex.williamson, qiudayu, linuxppc-dev

If we detects frozen state on PE that has been passed to guest, we
needn't handle it. Instead, we rely on the guest to detect and recover
it. The patch avoid EEH event on the frozen passed PE so that the guest
can have chance to handle that.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh.c                 | 8 ++++++++
 arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 2aaf90e..25fd12d 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	if (ret > 0)
 		return ret;
 
+	/*
+	 * If the PE has been passed to guest, we won't check the
+	 * state. Instead, let the guest handle it if the PE has
+	 * been frozen.
+	 */
+	if (eeh_pe_passed(pe))
+		return 0;
+
 	/* If we already have a pending isolation event for this
 	 * slot, we know it's bad already, we don't need to check.
 	 * Do this checking under a lock; as multiple PCI devices
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 1b5982f..03a3ed2 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
 					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
 				ret = EEH_NEXT_ERR_NONE;
-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
+				   eeh_pe_passed(*pe)) {
 				ret = EEH_NEXT_ERR_NONE;
 			} else {
 				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
  2014-05-21  5:03 ` [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device Gavin Shan
@ 2014-05-21 13:07   ` Alexander Graf
  2014-05-21 21:56     ` Benjamin Herrenschmidt
  2014-05-21 23:48     ` Gavin Shan
  0 siblings, 2 replies; 11+ messages in thread
From: Alexander Graf @ 2014-05-21 13:07 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev


On 21.05.14 07:03, Gavin Shan wrote:
> The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
> to support EEH functionality for PCI devices, which have been
> passed from host to guest via VFIO.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   Documentation/vfio.txt         |   6 +-
>   arch/powerpc/include/asm/eeh.h |  10 ++
>   arch/powerpc/kernel/eeh.c      | 323 +++++++++++++++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci.c    |  99 ++++++++++++-
>   include/uapi/linux/vfio.h      |  43 ++++++
>   5 files changed, 474 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index b9ca023..bb17ec7 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
>   an excellent performance which has limitations such as inability to do
>   locked pages accounting in real time.
>   
> -So 3 additional ioctls have been added:
> +4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
> +which works on the basis of additional ioctl command VFIO_EEH_OP.
> +
> +So 4 additional ioctls have been added:
>   
>   	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>   		of the DMA window on the PCI bus.
> @@ -316,6 +319,7 @@ So 3 additional ioctls have been added:
>   
>   	VFIO_IOMMU_DISABLE - disables the container.
>   
> +	VFIO_EEH_OP - EEH dependent operations

Please document exactly what the ioctl does. In an ideal world, a VFIO 
user will just look at the documentation and be able to write a program 
against the API with it.

>   
>   The code flow from the example above should be slightly changed:
>   
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 34a2d83..93922ef 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -305,6 +305,16 @@ void eeh_add_device_late(struct pci_dev *);
>   void eeh_add_device_tree_late(struct pci_bus *);
>   void eeh_add_sysfs_files(struct pci_bus *);
>   void eeh_remove_device(struct pci_dev *);
> +#ifdef CONFIG_VFIO_PCI_EEH
> +int eeh_vfio_open(struct pci_dev *pdev);
> +void eeh_vfio_release(struct pci_dev *pdev);
> +int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval);
> +int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
> +			 int *retval, int *info);
> +int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state);
> +int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval);
> +int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval);
> +#endif
>   
>   /**
>    * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 9c6b899..2aaf90e 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1098,6 +1098,329 @@ void eeh_remove_device(struct pci_dev *dev)
>   	edev->mode &= ~EEH_DEV_SYSFS;
>   }
>   
> +#ifdef CONFIG_VFIO_PCI_EEH
> +int eeh_vfio_open(struct pci_dev *pdev)

Why vfio? Also that config option will not be set if vfio is compiled as 
a module.

> +{
> +	struct eeh_dev *edev;
> +
> +	/* No PCI device ? */
> +	if (!pdev)
> +		return -ENODEV;
> +
> +	/* No EEH device ? */
> +	edev = pci_dev_to_eeh_dev(pdev);
> +	if (!edev || !edev->pe)
> +		return -ENODEV;
> +
> +	eeh_dev_set_passed(edev, true);
> +	eeh_pe_set_passed(edev->pe, true);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_open);
> +
> +void eeh_vfio_release(struct pci_dev *pdev)
> +{
> +	bool release_pe = true;
> +	struct eeh_pe *pe = NULL;
> +	struct eeh_dev *tmp, *edev;
> +
> +	/* No PCI device ? */
> +	if (!pdev)
> +		return;
> +
> +	/* No EEH device ? */
> +	edev = pci_dev_to_eeh_dev(pdev);
> +	if (!edev || !eeh_dev_passed(edev) ||
> +	    !edev->pe || !eeh_pe_passed(pe))
> +		return;
> +
> +	/* Release device */
> +	pe = edev->pe;
> +	eeh_dev_set_passed(edev, false);
> +
> +	/* Release PE */
> +	eeh_pe_for_each_dev(pe, edev, tmp) {
> +		if (eeh_dev_passed(edev)) {
> +			release_pe = false;
> +			break;
> +		}
> +	}
> +
> +	if (release_pe)
> +		eeh_pe_set_passed(pe, false);
> +}
> +EXPORT_SYMBOL(eeh_vfio_release);
> +
> +static int eeh_vfio_check_dev(struct pci_dev *pdev,
> +			      struct eeh_dev **pedev,
> +			      struct eeh_pe **ppe)
> +{
> +	struct eeh_dev *edev;
> +
> +	/* No device ? */
> +	if (!pdev)
> +		return -ENODEV;
> +
> +	edev = pci_dev_to_eeh_dev(pdev);
> +	if (!edev || !eeh_dev_passed(edev) ||
> +	    !edev->pe || !eeh_pe_passed(edev->pe))
> +		return -ENODEV;
> +
> +	if (pedev)
> +		*pedev = edev;
> +	if (ppe)
> +		*ppe = edev->pe;
> +
> +	return 0;
> +}
> +
> +int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval)
> +{
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int ret = 0;
> +
> +	/* Device existing ? */
> +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> +	if (ret) {
> +		pr_debug("%s: Cannot find device %s\n",
> +			__func__, pdev ? pci_name(pdev) : "NULL");
> +		*retval = -7;

What are these? Please use proper kernel internal return values for 
errors. I don't want to see anything even remotely tied to RTAS in any 
of these patches.

> +		goto out;
> +	}
> +
> +	/* Invalid option ? */
> +	if (option < EEH_OPT_DISABLE ||
> +	    option > EEH_OPT_THAW_DMA) {

This is quite confusing to read because it's not obvious what is in 
between these. Just make this a switch() statement that lists the 
allowed options. Gcc will be smart enough to optimize that into a bounds 
check.

> +		pr_debug("%s: Option %d out of range (%d, %d)\n",
> +			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
> +		*retval = -3;
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (option == EEH_OPT_DISABLE ||
> +	    option == EEH_OPT_ENABLE) {
> +		*retval = 0;
> +	} else {
> +		if (!eeh_ops || !eeh_ops->set_option) {
> +			*retval = -7;
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +
> +		ret = eeh_ops->set_option(pe, option);
> +		if (ret) {
> +			pr_debug("%s: Failure %d from backend\n",
> +				__func__, ret);
> +			*retval = -1;
> +			goto out;
> +		}
> +
> +		*retval = 0;
> +	}
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_set_pe_option);
> +
> +int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
> +			 int *retval, int *info)
> +{
> +	struct pci_bus *bus;
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int ret = 0;
> +
> +	/* Device existing ? */
> +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> +	if (ret) {
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	/* Invalid option ? */
> +	if (option != 0 && option != 1) {

0? 1? What? Don't these have names? And again, please use a switch() for 
this function.

> +		pr_debug("%s: option %d out of range (0, 1)\n",
> +			__func__, option);
> +		*retval = -3;
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Fill result according to option. We don't differentiate
> +	 * PCI bus and device dependent PE here. So all PEs are
> +	 * built in "shared" mode. Also, the PE address has the format
> +	 * of "00BBSS00".
> +	 */
> +	if (option == 0) {
> +		bus = eeh_pe_bus_get(pe);
> +		if (!bus) {
> +			*retval = -3;
> +			ret = -ENODEV;
> +			goto out;
> +		}
> +
> +		*retval = 0;
> +		*info = bus->number << 16;

How about positive numbers for the number and negative ones for errors?

> +	} else {
> +		*retval = 0;
> +		*info = 1;
> +	}
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_addr);
> +
> +int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state)
> +{
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int result, ret = 0;
> +
> +	/* Device existing ? */
> +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> +	if (ret) {
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	if (!eeh_ops || !eeh_ops->get_state) {
> +		pr_debug("%s: Unsupported request\n",
> +			__func__);
> +		ret = -ENOENT;
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	result = eeh_ops->get_state(pe, NULL);
> +	if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +	     (result & EEH_STATE_DMA_ENABLED) &&
> +	     (result & EEH_STATE_MMIO_ENABLED))
> +		*state = 0;
> +	else if (result & EEH_STATE_RESET_ACTIVE)
> +		*state = 1;
> +	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +		 !(result & EEH_STATE_DMA_ENABLED) &&
> +		 !(result & EEH_STATE_MMIO_ENABLED))
> +		*state = 2;
> +	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +		 (result & EEH_STATE_DMA_ENABLED) &&
> +		 !(result & EEH_STATE_MMIO_ENABLED))
> +		*state = 4;
> +	else
> +		*state = 5;

What are these numbers?

> +
> +	*retval = 0;
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_state);
> +
> +int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval)
> +{
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int ret = 0;
> +
> +	/* Device existing ? */
> +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> +	if (ret) {
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	/* Invalid option ? */
> +	if (option != EEH_RESET_DEACTIVATE &&
> +	    option != EEH_RESET_HOT &&
> +	    option != EEH_RESET_FUNDAMENTAL) {
> +		pr_debug("%s: Unsupported option %d\n",
> +			__func__, option);
> +		ret = -EINVAL;
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset) {
> +		pr_debug("%s: Unsupported request\n",
> +			__func__);
> +		ret = -ENOENT;
> +		*retval = -7;
> +		goto out;
> +	}
> +
> +	ret = eeh_ops->reset(pe, option);
> +	if (ret) {
> +		pr_debug("%s: Failure %d from backend\n",
> +			 __func__, ret);
> +		*retval = -1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The PE is still in frozen state and we need clear that.
> +	 * It's good to clear frozen state after deassert to avoid
> +	 * messy IO access during reset, which might cause recrusive

recursive

> +	 * frozen PE.
> +	 */
> +	if (option == EEH_RESET_DEACTIVATE) {
> +		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
> +		if (ret) {
> +			pr_debug("%s: Cannot enable IO for PHB#%d-PE#%d (%d)\n",
> +				__func__, pe->phb->global_number, pe->addr, ret);
> +			*retval = -1;
> +			goto out;
> +		}
> +
> +		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
> +		if (ret) {
> +			pr_debug("%s: Cannot enable DMA for PHB#%d-PE#%d (%d)\n",
> +				__func__, pe->phb->global_number, pe->addr, ret);
> +			*retval = -1;
> +			goto out;
> +		}
> +
> +		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
> +	}
> +
> +	*retval = 0;
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_reset_pe);
> +
> +int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval)
> +{
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int ret = 0;
> +
> +	/* Device existing ? */
> +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> +	if (ret) {
> +		*retval = -3;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The access to PCI config space on VFIO device has some
> +	 * limitations. Part of PCI config space, including BAR
> +	 * registers are not readable and writable. So the guest
> +	 * should have stale values for those registers and we have
> +	 * to restore them in host side.

I don't understand this comment. When is "configure_pe" called in the 
first place? Please provide proper function descriptions for each of 
these exported functions that tell someone who may want to use them what 
they do.

Also, don't mention VFIO or guests in any function inside this file.

> +	 */
> +	eeh_pe_restore_bars(pe);
> +	*retval = 0;
> +
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_configure_pe);
> +
> +#endif /* CONFIG_VFIO_PCI_EEH */
> +
>   static int proc_eeh_show(struct seq_file *m, void *v)
>   {
>   	if (!eeh_enabled()) {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7ba0424..05c3dde 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -25,6 +25,9 @@
>   #include <linux/types.h>
>   #include <linux/uaccess.h>
>   #include <linux/vfio.h>
> +#ifdef CONFIG_VFIO_PCI_EEH
> +#include <asm/eeh.h>
> +#endif
>   
>   #include "vfio_pci_private.h"
>   
> @@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>   	pci_restore_state(pdev);
>   }
>   
> +static void vfio_eeh_pci_release(struct pci_dev *pdev)
> +{
> +#ifdef CONFIG_VFIO_PCI_EEH
> +	eeh_vfio_release(pdev);
> +#endif
> +}
> +
>   static void vfio_pci_release(void *device_data)
>   {
>   	struct vfio_pci_device *vdev = device_data;
>   
> -	if (atomic_dec_and_test(&vdev->refcnt))
> +	if (atomic_dec_and_test(&vdev->refcnt)) {
> +		vfio_eeh_pci_release(vdev->pdev);
>   		vfio_pci_disable(vdev);
> +	}
>   
>   	module_put(THIS_MODULE);
>   }
>   
> +static int vfio_eeh_pci_open(struct pci_dev *pdev)
> +{
> +	int ret = 0;
> +
> +#ifdef CONFIG_VFIO_PCI_EEH
> +	ret = eeh_vfio_open(pdev);
> +#endif
> +	return ret;
> +}
> +
>   static int vfio_pci_open(void *device_data)
>   {
>   	struct vfio_pci_device *vdev = device_data;
> +	int ret;
>   
>   	if (!try_module_get(THIS_MODULE))
>   		return -ENODEV;
>   
>   	if (atomic_inc_return(&vdev->refcnt) == 1) {
> -		int ret = vfio_pci_enable(vdev);
> -		if (ret) {
> -			module_put(THIS_MODULE);
> -			return ret;
> -		}
> +		ret = vfio_pci_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		ret = vfio_eeh_pci_open(vdev->pdev);
> +		if (ret)
> +			goto error;
>   	}
>   
>   	return 0;
> +error:
> +	module_put(THIS_MODULE);
> +	return ret;
>   }
>   
>   static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
> @@ -321,6 +349,51 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>   	return walk.ret;
>   }
>   
> +static int vfio_eeh_pci_ioctl(struct pci_dev *pdev, struct vfio_eeh_op *info)

I still don't like the idea of that multiplexing ioctl. I don't see any 
benefit in it whatsoever. Just create 5 individual ioctls with their own 
simple interfaces.

Also, this interface has nothing to do with RTAS. So don't sneak in RTAS 
error numbers anywhere ;). It's QEMU's task to convert from kernel error 
codes to RTAS error codes.


Alex

> +{
> +	int ret = 0;
> +
> +#ifdef CONFIG_VFIO_PCI_EEH
> +	switch (info->op) {
> +	case VFIO_EEH_OP_SET_OPTION:
> +		ret = eeh_vfio_set_pe_option(pdev,
> +					     info->option.option,
> +					     &info->option.ret);
> +		break;
> +	case VFIO_EEH_OP_GET_ADDR:
> +		ret = eeh_vfio_get_pe_addr(pdev,
> +					   info->addr.option,
> +					   &info->addr.ret,
> +					   &info->addr.info);
> +		break;
> +	case VFIO_EEH_OP_GET_STATE:
> +		ret = eeh_vfio_get_pe_state(pdev,
> +					    &info->state.ret,
> +					    &info->state.reset_state);
> +		info->state.cfg_cap = 1;
> +		info->state.pe_unavail_info = 1000;
> +		info->state.pe_recovery_info = 0;
> +		break;
> +	case VFIO_EEH_OP_PE_RESET:
> +		ret = eeh_vfio_reset_pe(pdev,
> +					info->reset.option,
> +					&info->reset.ret);
> +		break;
> +	case VFIO_EEH_OP_PE_CONFIG:
> +		ret = eeh_vfio_configure_pe(pdev,
> +					    &info->config.ret);
> +	default:
> +		ret = -EINVAL;
> +		pr_debug("%s: Cannot handle op#%d\n",
> +			__func__, info->op);
> +	}
> +#else
> +	ret = -ENOENT;
> +#endif
> +
> +	return ret;
> +}
> +
>   static long vfio_pci_ioctl(void *device_data,
>   			   unsigned int cmd, unsigned long arg)
>   {
> @@ -682,6 +755,20 @@ hot_reset_release:
>   
>   		kfree(groups);
>   		return ret;
> +	} else if (cmd == VFIO_EEH_OP) {
> +		struct vfio_eeh_op info;
> +		int ret = 0;
> +
> +		minsz = sizeof(info);
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = vfio_eeh_pci_ioctl(vdev->pdev, &info);
> +		if (copy_to_user((void __user *)arg, &info, minsz))
> +			ret = -EFAULT;
> +		return ret;
>   	}
>   
>   	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index cb9023d..518961d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -455,6 +455,49 @@ struct vfio_iommu_spapr_tce_info {
>   
>   #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>   
> +/*
> + * The VFIO operation struct provides way to support EEH functionality
> + * for PCI device that is passed from host to guest via VFIO.
> + */
> +#define VFIO_EEH_OP_SET_OPTION	0
> +#define VFIO_EEH_OP_GET_ADDR	1
> +#define VFIO_EEH_OP_GET_STATE	2
> +#define VFIO_EEH_OP_PE_RESET	3
> +#define VFIO_EEH_OP_PE_CONFIG	4
> +
> +struct vfio_eeh_op {
> +	__u32 argsz;
> +	__u32 op;
> +
> +	union {
> +		struct vfio_eeh_set_option {
> +			__u32 option;
> +			__s32 ret;
> +		} option;
> +		struct vfio_eeh_pe_addr {
> +			__u32 option;
> +			__s32 ret;
> +			__u32 info;
> +		} addr;
> +		struct vfio_eeh_pe_state {
> +			__s32 ret;
> +			__u32 reset_state;
> +			__u32 cfg_cap;
> +			__u32 pe_unavail_info;
> +			__u32 pe_recovery_info;
> +		} state;
> +		struct vfio_eeh_reset {
> +			__u32 option;
> +			__s32 ret;
> +		} reset;
> +		struct vfio_eeh_config {
> +			__s32 ret;
> +		} config;
> +	};
> +};
> +
> +#define VFIO_EEH_OP	_IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>   /* ***************************************************************** */
>   
>   #endif /* _UAPIVFIO_H */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE
  2014-05-21  5:03 ` [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE Gavin Shan
@ 2014-05-21 13:13   ` Alexander Graf
  2014-05-22  0:01     ` Gavin Shan
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Graf @ 2014-05-21 13:13 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev


On 21.05.14 07:03, Gavin Shan wrote:
> If we detects frozen state on PE that has been passed to guest, we
> needn't handle it. Instead, we rely on the guest to detect and recover
> it. The patch avoid EEH event on the frozen passed PE so that the guest
> can have chance to handle that.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/kernel/eeh.c                 | 8 ++++++++
>   arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
>   2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 2aaf90e..25fd12d 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>   	if (ret > 0)
>   		return ret;
>   
> +	/*
> +	 * If the PE has been passed to guest, we won't check the
> +	 * state. Instead, let the guest handle it if the PE has

What guest? The kernel doesn't care whether we use VFIO for a guest or not.


Alex

> +	 * been frozen.
> +	 */
> +	if (eeh_pe_passed(pe))
> +		return 0;
> +
>   	/* If we already have a pending isolation event for this
>   	 * slot, we know it's bad already, we don't need to check.
>   	 * Do this checking under a lock; as multiple PCI devices
> diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
> index 1b5982f..03a3ed2 100644
> --- a/arch/powerpc/platforms/powernv/eeh-ioda.c
> +++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
> @@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
>   				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
>   					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>   				ret = EEH_NEXT_ERR_NONE;
> -			} else if ((*pe)->state & EEH_PE_ISOLATED) {
> +			} else if ((*pe)->state & EEH_PE_ISOLATED ||
> +				   eeh_pe_passed(*pe)) {
>   				ret = EEH_NEXT_ERR_NONE;
>   			} else {
>   				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
  2014-05-21 13:07   ` Alexander Graf
@ 2014-05-21 21:56     ` Benjamin Herrenschmidt
  2014-05-22  8:11       ` Gavin Shan
  2014-05-21 23:48     ` Gavin Shan
  1 sibling, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-21 21:56 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev

On Wed, 2014-05-21 at 15:07 +0200, Alexander Graf wrote:

> > +#ifdef CONFIG_VFIO_PCI_EEH
> > +int eeh_vfio_open(struct pci_dev *pdev)
> 
> Why vfio? Also that config option will not be set if vfio is compiled as 
> a module.
> 
> > +{
> > +	struct eeh_dev *edev;
> > +
> > +	/* No PCI device ? */
> > +	if (!pdev)
> > +		return -ENODEV;
> > +
> > +	/* No EEH device ? */
> > +	edev = pci_dev_to_eeh_dev(pdev);
> > +	if (!edev || !edev->pe)
> > +		return -ENODEV;
> > +
> > +	eeh_dev_set_passed(edev, true);
> > +	eeh_pe_set_passed(edev->pe, true);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(eeh_vfio_open);

Additionally, shouldn't we have some locking here ? (and in release too)

I don't like relying on the caller locking (if it does it at all).

> > +	/* Device existing ? */
> > +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
> > +	if (ret) {
> > +		pr_debug("%s: Cannot find device %s\n",
> > +			__func__, pdev ? pci_name(pdev) : "NULL");
> > +		*retval = -7;
> 
> What are these? Please use proper kernel internal return values for 
> errors. I don't want to see anything even remotely tied to RTAS in any 
> of these patches.

Hint: -ENODEV

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
  2014-05-21 13:07   ` Alexander Graf
  2014-05-21 21:56     ` Benjamin Herrenschmidt
@ 2014-05-21 23:48     ` Gavin Shan
  1 sibling, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-21 23:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev

On Wed, May 21, 2014 at 03:07:26PM +0200, Alexander Graf wrote:
>
>On 21.05.14 07:03, Gavin Shan wrote:
>>The patch adds new IOCTL command VFIO_EEH_OP to VFIO PCI device
>>to support EEH functionality for PCI devices, which have been
>>passed from host to guest via VFIO.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  Documentation/vfio.txt         |   6 +-
>>  arch/powerpc/include/asm/eeh.h |  10 ++
>>  arch/powerpc/kernel/eeh.c      | 323 +++++++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci.c    |  99 ++++++++++++-
>>  include/uapi/linux/vfio.h      |  43 ++++++
>>  5 files changed, 474 insertions(+), 7 deletions(-)
>>
>>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>index b9ca023..bb17ec7 100644
>>--- a/Documentation/vfio.txt
>>+++ b/Documentation/vfio.txt
>>@@ -305,7 +305,10 @@ faster, the map/unmap handling has been implemented in real mode which provides
>>  an excellent performance which has limitations such as inability to do
>>  locked pages accounting in real time.
>>-So 3 additional ioctls have been added:
>>+4) PPC64 guests detect PCI errors and recover from them via EEH RTAS services,
>>+which works on the basis of additional ioctl command VFIO_EEH_OP.
>>+
>>+So 4 additional ioctls have been added:
>>  	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
>>  		of the DMA window on the PCI bus.
>>@@ -316,6 +319,7 @@ So 3 additional ioctls have been added:
>>  	VFIO_IOMMU_DISABLE - disables the container.
>>+	VFIO_EEH_OP - EEH dependent operations
>
>Please document exactly what the ioctl does. In an ideal world, a
>VFIO user will just look at the documentation and be able to write a
>program against the API with it.
>

Ok. I'll amend it. Also, I'll split it to 5 ioctl commands in next revision.

>>  The code flow from the example above should be slightly changed:
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 34a2d83..93922ef 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -305,6 +305,16 @@ void eeh_add_device_late(struct pci_dev *);
>>  void eeh_add_device_tree_late(struct pci_bus *);
>>  void eeh_add_sysfs_files(struct pci_bus *);
>>  void eeh_remove_device(struct pci_dev *);
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+int eeh_vfio_open(struct pci_dev *pdev);
>>+void eeh_vfio_release(struct pci_dev *pdev);
>>+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval);
>>+int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
>>+			 int *retval, int *info);
>>+int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state);
>>+int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval);
>>+int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval);
>>+#endif
>>  /**
>>   * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 9c6b899..2aaf90e 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -1098,6 +1098,329 @@ void eeh_remove_device(struct pci_dev *dev)
>>  	edev->mode &= ~EEH_DEV_SYSFS;
>>  }
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+int eeh_vfio_open(struct pci_dev *pdev)
>
>Why vfio? Also that config option will not be set if vfio is compiled
>as a module.
>

The interface is expected to be used by VFIO-PCI. I'll change the function
names to following ones in next revision. No "VFIO" will be seen :-)

eeh_dev_open();
eeh_dev_release();
static eeh_dev_check();
eeh_pe_set_option();
eeh_pe_get_addr();
eeh_pe_get_state();
eeh_pe_reset();
eeh_pe_configure(); 

Yeah, "#ifdef CONFIG_VFIO_PCI_EEH" can be removed safely in next revision.

>>+{
>>+	struct eeh_dev *edev;
>>+
>>+	/* No PCI device ? */
>>+	if (!pdev)
>>+		return -ENODEV;
>>+
>>+	/* No EEH device ? */
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !edev->pe)
>>+		return -ENODEV;
>>+
>>+	eeh_dev_set_passed(edev, true);
>>+	eeh_pe_set_passed(edev->pe, true);
>>+
>>+	return 0;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_open);
>>+
>>+void eeh_vfio_release(struct pci_dev *pdev)
>>+{
>>+	bool release_pe = true;
>>+	struct eeh_pe *pe = NULL;
>>+	struct eeh_dev *tmp, *edev;
>>+
>>+	/* No PCI device ? */
>>+	if (!pdev)
>>+		return;
>>+
>>+	/* No EEH device ? */
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !eeh_dev_passed(edev) ||
>>+	    !edev->pe || !eeh_pe_passed(pe))
>>+		return;
>>+
>>+	/* Release device */
>>+	pe = edev->pe;
>>+	eeh_dev_set_passed(edev, false);
>>+
>>+	/* Release PE */
>>+	eeh_pe_for_each_dev(pe, edev, tmp) {
>>+		if (eeh_dev_passed(edev)) {
>>+			release_pe = false;
>>+			break;
>>+		}
>>+	}
>>+
>>+	if (release_pe)
>>+		eeh_pe_set_passed(pe, false);
>>+}
>>+EXPORT_SYMBOL(eeh_vfio_release);
>>+
>>+static int eeh_vfio_check_dev(struct pci_dev *pdev,
>>+			      struct eeh_dev **pedev,
>>+			      struct eeh_pe **ppe)
>>+{
>>+	struct eeh_dev *edev;
>>+
>>+	/* No device ? */
>>+	if (!pdev)
>>+		return -ENODEV;
>>+
>>+	edev = pci_dev_to_eeh_dev(pdev);
>>+	if (!edev || !eeh_dev_passed(edev) ||
>>+	    !edev->pe || !eeh_pe_passed(edev->pe))
>>+		return -ENODEV;
>>+
>>+	if (pedev)
>>+		*pedev = edev;
>>+	if (ppe)
>>+		*ppe = edev->pe;
>>+
>>+	return 0;
>>+}
>>+
>>+int eeh_vfio_set_pe_option(struct pci_dev *pdev, int option, int *retval)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>>+	if (ret) {
>>+		pr_debug("%s: Cannot find device %s\n",
>>+			__func__, pdev ? pci_name(pdev) : "NULL");
>>+		*retval = -7;
>
>What are these? Please use proper kernel internal return values for
>errors. I don't want to see anything even remotely tied to RTAS in
>any of these patches.
>

It's the return value for RTAS call "ibm,set-eeh-option". Yeah, it
makes sense to return kernel internal values for errors and will
amend in next revision.

>>+		goto out;
>>+	}
>>+
>>+	/* Invalid option ? */
>>+	if (option < EEH_OPT_DISABLE ||
>>+	    option > EEH_OPT_THAW_DMA) {
>
>This is quite confusing to read because it's not obvious what is in
>between these. Just make this a switch() statement that lists the
>allowed options. Gcc will be smart enough to optimize that into a
>bounds check.
>

Ok. Will use "switch()" in next revision.

>>+		pr_debug("%s: Option %d out of range (%d, %d)\n",
>>+			__func__, option, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
>>+		*retval = -3;
>>+		ret = -EINVAL;
>>+		goto out;
>>+	}
>>+
>>+	if (option == EEH_OPT_DISABLE ||
>>+	    option == EEH_OPT_ENABLE) {
>>+		*retval = 0;
>>+	} else {
>>+		if (!eeh_ops || !eeh_ops->set_option) {
>>+			*retval = -7;
>>+			ret = -ENOENT;
>>+			goto out;
>>+		}
>>+
>>+		ret = eeh_ops->set_option(pe, option);
>>+		if (ret) {
>>+			pr_debug("%s: Failure %d from backend\n",
>>+				__func__, ret);
>>+			*retval = -1;
>>+			goto out;
>>+		}
>>+
>>+		*retval = 0;
>>+	}
>>+out:
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_set_pe_option);
>>+
>>+int eeh_vfio_get_pe_addr(struct pci_dev *pdev, int option,
>>+			 int *retval, int *info)
>>+{
>>+	struct pci_bus *bus;
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>>+	if (ret) {
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	/* Invalid option ? */
>>+	if (option != 0 && option != 1) {
>
>0? 1? What? Don't these have names? And again, please use a switch()
>for this function.
>

Will have names in next revision and use "switch()", thanks.

>>+		pr_debug("%s: option %d out of range (0, 1)\n",
>>+			__func__, option);
>>+		*retval = -3;
>>+		ret = -EINVAL;
>>+		goto out;
>>+	}
>>+
>>+	/*
>>+	 * Fill result according to option. We don't differentiate
>>+	 * PCI bus and device dependent PE here. So all PEs are
>>+	 * built in "shared" mode. Also, the PE address has the format
>>+	 * of "00BBSS00".
>>+	 */
>>+	if (option == 0) {
>>+		bus = eeh_pe_bus_get(pe);
>>+		if (!bus) {
>>+			*retval = -3;
>>+			ret = -ENODEV;
>>+			goto out;
>>+		}
>>+
>>+		*retval = 0;
>>+		*info = bus->number << 16;
>
>How about positive numbers for the number and negative ones for errors?
>

We needn't carry error numbers by "info" because "retval" or "ret" already
had that information :-)

>>+	} else {
>>+		*retval = 0;
>>+		*info = 1;
>>+	}
>>+out:
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_addr);
>>+
>>+int eeh_vfio_get_pe_state(struct pci_dev *pdev, int *retval, int *state)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int result, ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>>+	if (ret) {
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	if (!eeh_ops || !eeh_ops->get_state) {
>>+		pr_debug("%s: Unsupported request\n",
>>+			__func__);
>>+		ret = -ENOENT;
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	result = eeh_ops->get_state(pe, NULL);
>>+	if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+	     (result & EEH_STATE_DMA_ENABLED) &&
>>+	     (result & EEH_STATE_MMIO_ENABLED))
>>+		*state = 0;
>>+	else if (result & EEH_STATE_RESET_ACTIVE)
>>+		*state = 1;
>>+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+		 !(result & EEH_STATE_DMA_ENABLED) &&
>>+		 !(result & EEH_STATE_MMIO_ENABLED))
>>+		*state = 2;
>>+	else if (!(result & EEH_STATE_RESET_ACTIVE) &&
>>+		 (result & EEH_STATE_DMA_ENABLED) &&
>>+		 !(result & EEH_STATE_MMIO_ENABLED))
>>+		*state = 4;
>>+	else
>>+		*state = 5;
>
>What are these numbers?
>

Will have names in next revision :)

>>+
>>+	*retval = 0;
>>+out:
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_get_pe_state);
>>+
>>+int eeh_vfio_reset_pe(struct pci_dev *pdev, int option, int *retval)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>>+	if (ret) {
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	/* Invalid option ? */
>>+	if (option != EEH_RESET_DEACTIVATE &&
>>+	    option != EEH_RESET_HOT &&
>>+	    option != EEH_RESET_FUNDAMENTAL) {
>>+		pr_debug("%s: Unsupported option %d\n",
>>+			__func__, option);
>>+		ret = -EINVAL;
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	if (!eeh_ops || !eeh_ops->set_option || !eeh_ops->reset) {
>>+		pr_debug("%s: Unsupported request\n",
>>+			__func__);
>>+		ret = -ENOENT;
>>+		*retval = -7;
>>+		goto out;
>>+	}
>>+
>>+	ret = eeh_ops->reset(pe, option);
>>+	if (ret) {
>>+		pr_debug("%s: Failure %d from backend\n",
>>+			 __func__, ret);
>>+		*retval = -1;
>>+		goto out;
>>+	}
>>+
>>+	/*
>>+	 * The PE is still in frozen state and we need clear that.
>>+	 * It's good to clear frozen state after deassert to avoid
>>+	 * messy IO access during reset, which might cause recrusive
>
>recursive
>

Thanks.

>>+	 * frozen PE.
>>+	 */
>>+	if (option == EEH_RESET_DEACTIVATE) {
>>+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
>>+		if (ret) {
>>+			pr_debug("%s: Cannot enable IO for PHB#%d-PE#%d (%d)\n",
>>+				__func__, pe->phb->global_number, pe->addr, ret);
>>+			*retval = -1;
>>+			goto out;
>>+		}
>>+
>>+		ret = eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
>>+		if (ret) {
>>+			pr_debug("%s: Cannot enable DMA for PHB#%d-PE#%d (%d)\n",
>>+				__func__, pe->phb->global_number, pe->addr, ret);
>>+			*retval = -1;
>>+			goto out;
>>+		}
>>+
>>+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
>>+	}
>>+
>>+	*retval = 0;
>>+out:
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_reset_pe);
>>+
>>+int eeh_vfio_configure_pe(struct pci_dev *pdev, int *retval)
>>+{
>>+	struct eeh_dev *edev;
>>+	struct eeh_pe *pe;
>>+	int ret = 0;
>>+
>>+	/* Device existing ? */
>>+	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>>+	if (ret) {
>>+		*retval = -3;
>>+		goto out;
>>+	}
>>+
>>+	/*
>>+	 * The access to PCI config space on VFIO device has some
>>+	 * limitations. Part of PCI config space, including BAR
>>+	 * registers are not readable and writable. So the guest
>>+	 * should have stale values for those registers and we have
>>+	 * to restore them in host side.
>
>I don't understand this comment. When is "configure_pe" called in the
>first place? Please provide proper function descriptions for each of
>these exported functions that tell someone who may want to use them
>what they do.
>

Yeah, I'll add the description here (also in Documentation/vfio.txt).

>Also, don't mention VFIO or guests in any function inside this file.
>

Ok. I'll avoid mentioning "VFIO" and "guest".

>>+	 */
>>+	eeh_pe_restore_bars(pe);
>>+	*retval = 0;
>>+
>>+out:
>>+	return ret;
>>+}
>>+EXPORT_SYMBOL_GPL(eeh_vfio_configure_pe);
>>+
>>+#endif /* CONFIG_VFIO_PCI_EEH */
>>+
>>  static int proc_eeh_show(struct seq_file *m, void *v)
>>  {
>>  	if (!eeh_enabled()) {
>>diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>index 7ba0424..05c3dde 100644
>>--- a/drivers/vfio/pci/vfio_pci.c
>>+++ b/drivers/vfio/pci/vfio_pci.c
>>@@ -25,6 +25,9 @@
>>  #include <linux/types.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+#include <asm/eeh.h>
>>+#endif
>>  #include "vfio_pci_private.h"
>>@@ -152,32 +155,57 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
>>  	pci_restore_state(pdev);
>>  }
>>+static void vfio_eeh_pci_release(struct pci_dev *pdev)
>>+{
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+	eeh_vfio_release(pdev);
>>+#endif
>>+}
>>+
>>  static void vfio_pci_release(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>>-	if (atomic_dec_and_test(&vdev->refcnt))
>>+	if (atomic_dec_and_test(&vdev->refcnt)) {
>>+		vfio_eeh_pci_release(vdev->pdev);
>>  		vfio_pci_disable(vdev);
>>+	}
>>  	module_put(THIS_MODULE);
>>  }
>>+static int vfio_eeh_pci_open(struct pci_dev *pdev)
>>+{
>>+	int ret = 0;
>>+
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+	ret = eeh_vfio_open(pdev);
>>+#endif
>>+	return ret;
>>+}
>>+
>>  static int vfio_pci_open(void *device_data)
>>  {
>>  	struct vfio_pci_device *vdev = device_data;
>>+	int ret;
>>  	if (!try_module_get(THIS_MODULE))
>>  		return -ENODEV;
>>  	if (atomic_inc_return(&vdev->refcnt) == 1) {
>>-		int ret = vfio_pci_enable(vdev);
>>-		if (ret) {
>>-			module_put(THIS_MODULE);
>>-			return ret;
>>-		}
>>+		ret = vfio_pci_enable(vdev);
>>+		if (ret)
>>+			goto error;
>>+
>>+		ret = vfio_eeh_pci_open(vdev->pdev);
>>+		if (ret)
>>+			goto error;
>>  	}
>>  	return 0;
>>+error:
>>+	module_put(THIS_MODULE);
>>+	return ret;
>>  }
>>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>>@@ -321,6 +349,51 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>>  	return walk.ret;
>>  }
>>+static int vfio_eeh_pci_ioctl(struct pci_dev *pdev, struct vfio_eeh_op *info)
>
>I still don't like the idea of that multiplexing ioctl. I don't see
>any benefit in it whatsoever. Just create 5 individual ioctls with
>their own simple interfaces.
>

Ok. Will split in next revision. Thanks.

>Also, this interface has nothing to do with RTAS. So don't sneak in
>RTAS error numbers anywhere ;). It's QEMU's task to convert from
>kernel error codes to RTAS error codes.
>

Ok. Will do in next revision. Thanks for your comments and time, Alex :-)

Thanks,
Gavin

>>+{
>>+	int ret = 0;
>>+
>>+#ifdef CONFIG_VFIO_PCI_EEH
>>+	switch (info->op) {
>>+	case VFIO_EEH_OP_SET_OPTION:
>>+		ret = eeh_vfio_set_pe_option(pdev,
>>+					     info->option.option,
>>+					     &info->option.ret);
>>+		break;
>>+	case VFIO_EEH_OP_GET_ADDR:
>>+		ret = eeh_vfio_get_pe_addr(pdev,
>>+					   info->addr.option,
>>+					   &info->addr.ret,
>>+					   &info->addr.info);
>>+		break;
>>+	case VFIO_EEH_OP_GET_STATE:
>>+		ret = eeh_vfio_get_pe_state(pdev,
>>+					    &info->state.ret,
>>+					    &info->state.reset_state);
>>+		info->state.cfg_cap = 1;
>>+		info->state.pe_unavail_info = 1000;
>>+		info->state.pe_recovery_info = 0;
>>+		break;
>>+	case VFIO_EEH_OP_PE_RESET:
>>+		ret = eeh_vfio_reset_pe(pdev,
>>+					info->reset.option,
>>+					&info->reset.ret);
>>+		break;
>>+	case VFIO_EEH_OP_PE_CONFIG:
>>+		ret = eeh_vfio_configure_pe(pdev,
>>+					    &info->config.ret);
>>+	default:
>>+		ret = -EINVAL;
>>+		pr_debug("%s: Cannot handle op#%d\n",
>>+			__func__, info->op);
>>+	}
>>+#else
>>+	ret = -ENOENT;
>>+#endif
>>+
>>+	return ret;
>>+}
>>+
>>  static long vfio_pci_ioctl(void *device_data,
>>  			   unsigned int cmd, unsigned long arg)
>>  {
>>@@ -682,6 +755,20 @@ hot_reset_release:
>>  		kfree(groups);
>>  		return ret;
>>+	} else if (cmd == VFIO_EEH_OP) {
>>+		struct vfio_eeh_op info;
>>+		int ret = 0;
>>+
>>+		minsz = sizeof(info);
>>+		if (copy_from_user(&info, (void __user *)arg, minsz))
>>+			return -EFAULT;
>>+		if (info.argsz < minsz)
>>+			return -EINVAL;
>>+
>>+		ret = vfio_eeh_pci_ioctl(vdev->pdev, &info);
>>+		if (copy_to_user((void __user *)arg, &info, minsz))
>>+			ret = -EFAULT;
>>+		return ret;
>>  	}
>>  	return -ENOTTY;
>>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>index cb9023d..518961d 100644
>>--- a/include/uapi/linux/vfio.h
>>+++ b/include/uapi/linux/vfio.h
>>@@ -455,6 +455,49 @@ struct vfio_iommu_spapr_tce_info {
>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>>+/*
>>+ * The VFIO operation struct provides way to support EEH functionality
>>+ * for PCI device that is passed from host to guest via VFIO.
>>+ */
>>+#define VFIO_EEH_OP_SET_OPTION	0
>>+#define VFIO_EEH_OP_GET_ADDR	1
>>+#define VFIO_EEH_OP_GET_STATE	2
>>+#define VFIO_EEH_OP_PE_RESET	3
>>+#define VFIO_EEH_OP_PE_CONFIG	4
>>+
>>+struct vfio_eeh_op {
>>+	__u32 argsz;
>>+	__u32 op;
>>+
>>+	union {
>>+		struct vfio_eeh_set_option {
>>+			__u32 option;
>>+			__s32 ret;
>>+		} option;
>>+		struct vfio_eeh_pe_addr {
>>+			__u32 option;
>>+			__s32 ret;
>>+			__u32 info;
>>+		} addr;
>>+		struct vfio_eeh_pe_state {
>>+			__s32 ret;
>>+			__u32 reset_state;
>>+			__u32 cfg_cap;
>>+			__u32 pe_unavail_info;
>>+			__u32 pe_recovery_info;
>>+		} state;
>>+		struct vfio_eeh_reset {
>>+			__u32 option;
>>+			__s32 ret;
>>+		} reset;
>>+		struct vfio_eeh_config {
>>+			__s32 ret;
>>+		} config;
>>+	};
>>+};
>>+
>>+#define VFIO_EEH_OP	_IO(VFIO_TYPE, VFIO_BASE + 21)
>>+
>>  /* ***************************************************************** */
>>  #endif /* _UAPIVFIO_H */
>
>--
>To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE
  2014-05-21 13:13   ` Alexander Graf
@ 2014-05-22  0:01     ` Gavin Shan
  0 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-22  0:01 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev

On Wed, May 21, 2014 at 03:13:11PM +0200, Alexander Graf wrote:
>
>On 21.05.14 07:03, Gavin Shan wrote:
>>If we detects frozen state on PE that has been passed to guest, we
>>needn't handle it. Instead, we rely on the guest to detect and recover
>>it. The patch avoid EEH event on the frozen passed PE so that the guest
>>can have chance to handle that.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/kernel/eeh.c                 | 8 ++++++++
>>  arch/powerpc/platforms/powernv/eeh-ioda.c | 3 ++-
>>  2 files changed, 10 insertions(+), 1 deletion(-)
>>
>>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>index 2aaf90e..25fd12d 100644
>>--- a/arch/powerpc/kernel/eeh.c
>>+++ b/arch/powerpc/kernel/eeh.c
>>@@ -400,6 +400,14 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>>  	if (ret > 0)
>>  		return ret;
>>+	/*
>>+	 * If the PE has been passed to guest, we won't check the
>>+	 * state. Instead, let the guest handle it if the PE has
>
>What guest? The kernel doesn't care whether we use VFIO for a guest or not.
>

Ok. I'll not mention "guest" and "vfio" in next revision.

Thanks,
Gavin

>
>Alex
>
>>+	 * been frozen.
>>+	 */
>>+	if (eeh_pe_passed(pe))
>>+		return 0;
>>+
>>  	/* If we already have a pending isolation event for this
>>  	 * slot, we know it's bad already, we don't need to check.
>>  	 * Do this checking under a lock; as multiple PCI devices
>>diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>index 1b5982f..03a3ed2 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
>>@@ -890,7 +890,8 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
>>  				opal_pci_eeh_freeze_clear(phb->opal_id, frozen_pe_no,
>>  					OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>  				ret = EEH_NEXT_ERR_NONE;
>>-			} else if ((*pe)->state & EEH_PE_ISOLATED) {
>>+			} else if ((*pe)->state & EEH_PE_ISOLATED ||
>>+				   eeh_pe_passed(*pe)) {
>>  				ret = EEH_NEXT_ERR_NONE;
>>  			} else {
>>  				pr_err("EEH: Frozen PHB#%x-PE#%x (%s) detected\n",
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device
  2014-05-21 21:56     ` Benjamin Herrenschmidt
@ 2014-05-22  8:11       ` Gavin Shan
  0 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-05-22  8:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, Alexander Graf, kvm-ppc, Gavin Shan, alex.williamson,
	qiudayu, linuxppc-dev

On Thu, May 22, 2014 at 07:56:31AM +1000, Benjamin Herrenschmidt wrote:
>On Wed, 2014-05-21 at 15:07 +0200, Alexander Graf wrote:
>
>> > +#ifdef CONFIG_VFIO_PCI_EEH
>> > +int eeh_vfio_open(struct pci_dev *pdev)
>> 
>> Why vfio? Also that config option will not be set if vfio is compiled as 
>> a module.
>> 
>> > +{
>> > +	struct eeh_dev *edev;
>> > +
>> > +	/* No PCI device ? */
>> > +	if (!pdev)
>> > +		return -ENODEV;
>> > +
>> > +	/* No EEH device ? */
>> > +	edev = pci_dev_to_eeh_dev(pdev);
>> > +	if (!edev || !edev->pe)
>> > +		return -ENODEV;
>> > +
>> > +	eeh_dev_set_passed(edev, true);
>> > +	eeh_pe_set_passed(edev->pe, true);
>> > +
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(eeh_vfio_open);
>
>Additionally, shouldn't we have some locking here ? (and in release too)
>
>I don't like relying on the caller locking (if it does it at all).
>

Ok. I'll add one mutex for open() and release() in next revision.
Thanks for the comment.

>> > +	/* Device existing ? */
>> > +	ret = eeh_vfio_check_dev(pdev, &edev, &pe);
>> > +	if (ret) {
>> > +		pr_debug("%s: Cannot find device %s\n",
>> > +			__func__, pdev ? pci_name(pdev) : "NULL");
>> > +		*retval = -7;
>> 
>> What are these? Please use proper kernel internal return values for 
>> errors. I don't want to see anything even remotely tied to RTAS in any 
>> of these patches.
>
>Hint: -ENODEV
>

In next revision, Those exported functions will have return value as:

>= 0:	carrried information to caller.
<  0:   error number.

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-05-22  8:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-21  5:03 [PATCH v5 0/4] EEH Support for VFIO PCI device Gavin Shan
2014-05-21  5:03 ` [PATCH v5 1/4] drivers/vfio: Introduce CONFIG_VFIO_PCI_EEH Gavin Shan
2014-05-21  5:03 ` [PATCH v5 2/4] powerpc/eeh: Flags for passed device and PE Gavin Shan
2014-05-21  5:03 ` [PATCH v5 3/4] drivers/vfio: EEH support for VFIO PCI device Gavin Shan
2014-05-21 13:07   ` Alexander Graf
2014-05-21 21:56     ` Benjamin Herrenschmidt
2014-05-22  8:11       ` Gavin Shan
2014-05-21 23:48     ` Gavin Shan
2014-05-21  5:03 ` [PATCH v5 4/4] powerpc/eeh: Avoid event on passed PE Gavin Shan
2014-05-21 13:13   ` Alexander Graf
2014-05-22  0:01     ` Gavin Shan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).