* [PATCH 0/8] CXL EEH Handling
@ 2015-07-14  2:29 Daniel Axtens
  2015-07-14  2:29 ` [PATCH 1/8] cxl: Allow the kernel to trust that an image won't change on PERST Daniel Axtens
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

CXL accelerators are unfortunately not immune from failure. This patch
set enables them to participate in the Extended Error Handling process.

This series starts with a number of preparatory patches:

 - Patch 1 creates a kernel flag that allows us to confidently assert
   the hardware will not change when it's reset.
 
 - Patch 2 makes sure we don't touch the hardware when it has failed.
 
 - Patches 3-5 make the 'unplug' functions idempotent, so that if we
   get part way through recovery and then fail, being completely
   unplugged as part of removal doesn't cause us to oops out.

 - Patches 6 and 7 refactor init and teardown paths for the adapter
   and AFUs, so that they can be configured and deconfigured
   separately from their allocation and release.

Patch 8 enables EEH, both for the CXL card, and anything attached to
the virtual PHB. Only complete slot resets are supported.

Daniel Axtens (8):
  cxl: Allow the kernel to trust that an image won't change on PERST.
  cxl: Drop commands if the PCI channel is not in normal state
  cxl: Allocate and release the SPA with the AFU
  cxl: Make IRQ release idempotent
  cxl: Clean up adapter MMIO unmap path.
  cxl: Refactor adaptor init/teardown
  cxl: Refactor AFU init/teardown
  cxl: EEH support

 Documentation/ABI/testing/sysfs-class-cxl |  10 +
 drivers/misc/cxl/api.c                    |   7 +
 drivers/misc/cxl/cxl.h                    |  38 ++-
 drivers/misc/cxl/file.c                   |  20 ++
 drivers/misc/cxl/irq.c                    |   9 +
 drivers/misc/cxl/native.c                 | 100 +++++-
 drivers/misc/cxl/pci.c                    | 498 ++++++++++++++++++++++++------
 drivers/misc/cxl/sysfs.c                  |  26 ++
 include/misc/cxl.h                        |  10 +
 9 files changed, 602 insertions(+), 116 deletions(-)

-- 
2.1.4


* [PATCH 1/8] cxl: Allow the kernel to trust that an image won't change on PERST.
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 2/8] cxl: Drop commands if the PCI channel is not in normal state Daniel Axtens
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

Provide a kernel API and a sysfs entry which allow a user to specify
that when a card is PERSTed, its image will stay the same, allowing
it to participate in EEH.
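
As a rough userspace illustration of the 0/1 validation the new sysfs
attribute applies, the sketch below parses a buffer in the same way.
parse_same_image_flag is a hypothetical name and is not part of the
driver; the real handler runs in kernel context.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper (not driver code): accept only 0 or 1,
 * as the perst_reloads_same_image attribute does, and reject anything else.
 */
static int parse_same_image_flag(const char *buf, bool *same_image)
{
	int val;

	if (sscanf(buf, "%i", &val) != 1 || (val != 0 && val != 1))
		return -EINVAL;

	*same_image = (val == 1);
	return 0;
}
```

The flag is only updated on success, so a rejected write leaves the
previous setting in place.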

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 Documentation/ABI/testing/sysfs-class-cxl | 10 ++++++++++
 drivers/misc/cxl/api.c                    |  7 +++++++
 drivers/misc/cxl/cxl.h                    |  1 +
 drivers/misc/cxl/pci.c                    |  1 +
 drivers/misc/cxl/sysfs.c                  | 26 ++++++++++++++++++++++++++
 include/misc/cxl.h                        | 10 ++++++++++
 6 files changed, 55 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 7cd9f2ede369..c82f0c0a8e99 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -212,3 +212,13 @@ Description:    write only
                 Writing 1 will issue a PERST to card which may cause the card
                 to reload the FPGA depending on load_image_on_perst.
 Users:		https://github.com/ibm-capi/libcxl
+
+What:		/sys/class/cxl/<card>/perst_reloads_same_image
+Date:		July 2015
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	read/write
+		Trust that when an image is reloaded via PERST, it will not
+		have changed.
+		0 = don't trust, the image may be different (default)
+		1 = trust that the image will not change.
+Users:		https://github.com/ibm-capi/libcxl
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 0c77240ae2fc..a0b8d5bd50f8 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -329,3 +329,10 @@ int cxl_afu_reset(struct cxl_context *ctx)
 	return cxl_afu_check_and_enable(afu);
 }
 EXPORT_SYMBOL_GPL(cxl_afu_reset);
+
+void cxl_perst_reloads_same_image(struct cxl_afu *afu,
+				  bool perst_reloads_same_image)
+{
+	afu->adapter->perst_same_image = perst_reloads_same_image;
+}
+EXPORT_SYMBOL_GPL(cxl_perst_reloads_same_image);
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index b37d13887766..de611a246c10 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -491,6 +491,7 @@ struct cxl {
 	bool user_image_loaded;
 	bool perst_loads_image;
 	bool perst_select_user;
+	bool perst_same_image;
 };
 
 int cxl_alloc_one_irq(struct cxl *adapter);
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index e6e28ff83e74..5fad83653c9e 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -891,6 +891,7 @@ static int cxl_read_vsec(struct cxl *adapter, struct pci_dev *dev)
 	adapter->user_image_loaded = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
 	adapter->perst_loads_image = true;
 	adapter->perst_select_user = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
+	adapter->perst_same_image = false;
 
 	CXL_READ_VSEC_NAFUS(dev, vsec, &adapter->slices);
 	CXL_READ_VSEC_AFU_DESC_OFF(dev, vsec, &afu_desc_off);
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index 2d6e104c6a6a..b308d1f37edd 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -112,12 +112,38 @@ static ssize_t load_image_on_perst_store(struct device *device,
 	return count;
 }
 
+static ssize_t perst_reloads_same_image_show(struct device *device,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct cxl *adapter = to_cxl_adapter(device);
+
+	return scnprintf(buf, PAGE_SIZE, "%i\n", adapter->perst_same_image);
+}
+
+static ssize_t perst_reloads_same_image_store(struct device *device,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct cxl *adapter = to_cxl_adapter(device);
+	int rc;
+	int val;
+
+	rc = sscanf(buf, "%i", &val);
+	if ((rc != 1) || !(val == 1 || val == 0))
+		return -EINVAL;
+
+	adapter->perst_same_image = (val == 1 ? true : false);
+	return count;
+}
+
 static struct device_attribute adapter_attrs[] = {
 	__ATTR_RO(caia_version),
 	__ATTR_RO(psl_revision),
 	__ATTR_RO(base_image),
 	__ATTR_RO(image_loaded),
 	__ATTR_RW(load_image_on_perst),
+	__ATTR_RW(perst_reloads_same_image),
 	__ATTR(reset, S_IWUSR, NULL, reset_adapter_store),
 };
 
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index 7a6c1d6cc173..f2ffe5bd720d 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -200,4 +200,14 @@ unsigned int cxl_fd_poll(struct file *file, struct poll_table_struct *poll);
 ssize_t cxl_fd_read(struct file *file, char __user *buf, size_t count,
 			   loff_t *off);
 
+/*
+ * For EEH, a driver may want to assert a PERST will reload the same image
+ * from flash into the FPGA.
+ *
+ * This is a property of the entire adapter, not a single AFU, so drivers
+ * should set this property with care!
+ */
+void cxl_perst_reloads_same_image(struct cxl_afu *afu,
+				  bool perst_reloads_same_image);
+
 #endif /* _MISC_CXL_H */
-- 
2.1.4


* [PATCH 2/8] cxl: Drop commands if the PCI channel is not in normal state
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
  2015-07-14  2:29 ` [PATCH 1/8] cxl: Allow the kernel to trust that an image won't change on PERST Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 3/8] cxl: Allocate and release the SPA with the AFU Daniel Axtens
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

If the PCI channel has gone down, don't attempt to poke the hardware.

We need to guard every time cxl_whatever_(read|write) is called. This
is because a call to those functions will dereference an offset into an
mmio register, and the mmio mappings get invalidated in the EEH
teardown.

Check in the read/write functions in the header.
We give them the same semantics as usual PCI operations:
 - a write to a channel that is down is ignored.
 - a read from a channel that is down returns all fs.
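
The semantics above can be sketched in plain C. This is a minimal
userspace model, assuming a made-up struct fake_dev in place of the
real adapter and a boolean in place of the PCI channel state; none of
these names exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not the driver's struct cxl. */
struct fake_dev {
	bool link_ok;      /* models "PCI channel is in the normal state" */
	uint64_t regs[4];  /* models a mapped MMIO window */
};

/* A write to a channel that is down is silently ignored. */
static inline void fake_write64(struct fake_dev *d, unsigned int reg,
				uint64_t val)
{
	if (d->link_ok)
		d->regs[reg] = val;
}

/* A read from a channel that is down returns all ones. */
static inline uint64_t fake_read64(const struct fake_dev *d, unsigned int reg)
{
	return d->link_ok ? d->regs[reg] : ~0ULL;
}
```

The sketch uses inline functions rather than macros so that the guard
on the write path cannot capture a caller's else branch.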

As far as user visible warnings go:
 - Check link state in file ops, return -EIO if down.
 - Be reasonably quiet if there's an error in a teardown path.
   Detaching the hardware is usually pretty comprehensive at
   tearing things down.
 - Throw a big WARN if someone tries to start a CXL operation
   while the card is down. This gives a useful stacktrace for
   debugging whatever is doing that.

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/misc/cxl/cxl.h    | 34 +++++++++++++++++------
 drivers/misc/cxl/file.c   | 19 +++++++++++++
 drivers/misc/cxl/native.c | 71 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 113 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index de611a246c10..53c90f6cc4ab 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -530,6 +530,14 @@ struct cxl_process_element {
 	__be32 software_state;
 } __packed;
 
+static inline bool cxl_adapter_link_ok(struct cxl *cxl)
+{
+	struct pci_dev *pdev;
+
+	pdev = to_pci_dev(cxl->dev.parent);
+	return (pdev->error_state == pci_channel_io_normal);
+}
+
 static inline void __iomem *_cxl_p1_addr(struct cxl *cxl, cxl_p1_reg_t reg)
 {
 	WARN_ON(!cpu_has_feature(CPU_FTR_HVMODE));
@@ -537,9 +545,11 @@ static inline void __iomem *_cxl_p1_addr(struct cxl *cxl, cxl_p1_reg_t reg)
 }
 
 #define cxl_p1_write(cxl, reg, val) \
-	out_be64(_cxl_p1_addr(cxl, reg), val)
+	if (cxl_adapter_link_ok(cxl)) \
+		out_be64(_cxl_p1_addr(cxl, reg), val)
 #define cxl_p1_read(cxl, reg) \
-	in_be64(_cxl_p1_addr(cxl, reg))
+	(cxl_adapter_link_ok(cxl) ? in_be64(_cxl_p1_addr(cxl, reg)) \
+	 : (~0ULL))
 
 static inline void __iomem *_cxl_p1n_addr(struct cxl_afu *afu, cxl_p1n_reg_t reg)
 {
@@ -548,9 +558,11 @@ static inline void __iomem *_cxl_p1n_addr(struct cxl_afu *afu, cxl_p1n_reg_t reg
 }
 
 #define cxl_p1n_write(afu, reg, val) \
-	out_be64(_cxl_p1n_addr(afu, reg), val)
+	if (cxl_adapter_link_ok(afu->adapter)) \
+		out_be64(_cxl_p1n_addr(afu, reg), val)
 #define cxl_p1n_read(afu, reg) \
-	in_be64(_cxl_p1n_addr(afu, reg))
+	(cxl_adapter_link_ok(afu->adapter) ? in_be64(_cxl_p1n_addr(afu, reg)) \
+	 : (~0ULL))
 
 static inline void __iomem *_cxl_p2n_addr(struct cxl_afu *afu, cxl_p2n_reg_t reg)
 {
@@ -558,15 +570,21 @@ static inline void __iomem *_cxl_p2n_addr(struct cxl_afu *afu, cxl_p2n_reg_t reg
 }
 
 #define cxl_p2n_write(afu, reg, val) \
-	out_be64(_cxl_p2n_addr(afu, reg), val)
+	if (cxl_adapter_link_ok(afu->adapter)) \
+		out_be64(_cxl_p2n_addr(afu, reg), val)
 #define cxl_p2n_read(afu, reg) \
-	in_be64(_cxl_p2n_addr(afu, reg))
+	(cxl_adapter_link_ok(afu->adapter) ? in_be64(_cxl_p2n_addr(afu, reg)) \
+	 : (~0ULL))
 
 
 #define cxl_afu_cr_read64(afu, cr, off) \
-	in_le64((afu)->afu_desc_mmio + (afu)->crs_offset + ((cr) * (afu)->crs_len) + (off))
+	(cxl_adapter_link_ok(afu->adapter) ? \
+	 in_le64((afu)->afu_desc_mmio + (afu)->crs_offset + ((cr) * (afu)->crs_len) + (off)) : \
+	 (~0ULL))
 #define cxl_afu_cr_read32(afu, cr, off) \
-	in_le32((afu)->afu_desc_mmio + (afu)->crs_offset + ((cr) * (afu)->crs_len) + (off))
+	(cxl_adapter_link_ok(afu->adapter) ? \
+	 in_le32((afu)->afu_desc_mmio + (afu)->crs_offset + ((cr) * (afu)->crs_len) + (off)) : \
+	 0xffffffff)
 u16 cxl_afu_cr_read16(struct cxl_afu *afu, int cr, u64 off);
 u8 cxl_afu_cr_read8(struct cxl_afu *afu, int cr, u64 off);
 
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 72fe168b517d..35df0f32ac08 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -73,6 +73,11 @@ static int __afu_open(struct inode *inode, struct file *file, bool master)
 	if (!afu->current_mode)
 		goto err_put_afu;
 
+	if (!cxl_adapter_link_ok(adapter)) {
+		rc = -EIO;
+		goto err_put_afu;
+	}
+
 	if (!(ctx = cxl_context_alloc())) {
 		rc = -ENOMEM;
 		goto err_put_afu;
@@ -219,6 +224,9 @@ long afu_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	if (ctx->status == CLOSED)
 		return -EIO;
 
+	if (!cxl_adapter_link_ok(ctx->afu->adapter))
+		return -EIO;
+
 	pr_devel("afu_ioctl\n");
 	switch (cmd) {
 	case CXL_IOCTL_START_WORK:
@@ -243,6 +251,9 @@ int afu_mmap(struct file *file, struct vm_area_struct *vm)
 	if (ctx->status != STARTED)
 		return -EIO;
 
+	if (!cxl_adapter_link_ok(ctx->afu->adapter))
+		return -EIO;
+
 	return cxl_context_iomap(ctx, vm);
 }
 
@@ -287,6 +298,9 @@ ssize_t afu_read(struct file *file, char __user *buf, size_t count,
 	int rc;
 	DEFINE_WAIT(wait);
 
+	if (!cxl_adapter_link_ok(ctx->afu->adapter))
+		return -EIO;
+
 	if (count < CXL_READ_MIN_SIZE)
 		return -EINVAL;
 
@@ -297,6 +311,11 @@ ssize_t afu_read(struct file *file, char __user *buf, size_t count,
 		if (ctx_event_pending(ctx))
 			break;
 
+		if (!cxl_adapter_link_ok(ctx->afu->adapter)) {
+			rc = -EIO;
+			goto out;
+		}
+
 		if (file->f_flags & O_NONBLOCK) {
 			rc = -EAGAIN;
 			goto out;
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 10567f245818..16948915eb0d 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -41,6 +41,12 @@ static int afu_control(struct cxl_afu *afu, u64 command,
 			rc = -EBUSY;
 			goto out;
 		}
+		if (!cxl_adapter_link_ok(afu->adapter)) {
+			afu->enabled = enabled;
+			rc = -EIO;
+			goto out;
+		}
+
 		pr_devel_ratelimited("AFU control... (0x%.16llx)\n",
 				     AFU_Cntl | command);
 		cpu_relax();
@@ -85,6 +91,10 @@ int __cxl_afu_reset(struct cxl_afu *afu)
 
 int cxl_afu_check_and_enable(struct cxl_afu *afu)
 {
+	if (!cxl_adapter_link_ok(afu->adapter)) {
+		WARN(1, "Refusing to enable afu while link down!\n");
+		return -EIO;
+	}
 	if (afu->enabled)
 		return 0;
 	return afu_enable(afu);
@@ -103,6 +113,12 @@ int cxl_psl_purge(struct cxl_afu *afu)
 
 	pr_devel("PSL purge request\n");
 
+	if (!cxl_adapter_link_ok(afu->adapter)) {
+		dev_warn(&afu->dev, "PSL Purge called with link down, ignoring\n");
+		rc = -EIO;
+		goto out;
+	}
+
 	if ((AFU_Cntl & CXL_AFU_Cntl_An_ES_MASK) != CXL_AFU_Cntl_An_ES_Disabled) {
 		WARN(1, "psl_purge request while AFU not disabled!\n");
 		cxl_afu_disable(afu);
@@ -119,6 +135,11 @@ int cxl_psl_purge(struct cxl_afu *afu)
 			rc = -EBUSY;
 			goto out;
 		}
+		if (!cxl_adapter_link_ok(afu->adapter)) {
+			rc = -EIO;
+			goto out;
+		}
+
 		dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
 		pr_devel_ratelimited("PSL purging... PSL_CNTL: 0x%.16llx  PSL_DSISR: 0x%.16llx\n", PSL_CNTL, dsisr);
 		if (dsisr & CXL_PSL_DSISR_TRANS) {
@@ -215,6 +236,8 @@ int cxl_tlb_slb_invalidate(struct cxl *adapter)
 			dev_warn(&adapter->dev, "WARNING: CXL adapter wide TLBIA timed out!\n");
 			return -EBUSY;
 		}
+		if (!cxl_adapter_link_ok(adapter))
+			return -EIO;
 		cpu_relax();
 	}
 
@@ -224,6 +247,8 @@ int cxl_tlb_slb_invalidate(struct cxl *adapter)
 			dev_warn(&adapter->dev, "WARNING: CXL adapter wide SLBIA timed out!\n");
 			return -EBUSY;
 		}
+		if (!cxl_adapter_link_ok(adapter))
+			return -EIO;
 		cpu_relax();
 	}
 	return 0;
@@ -240,6 +265,11 @@ int cxl_afu_slbia(struct cxl_afu *afu)
 			dev_warn(&afu->dev, "WARNING: CXL AFU SLBIA timed out!\n");
 			return -EBUSY;
 		}
+		/* If the adapter has gone down, we can assume that we
+		 * will PERST it and that will invalidate everything.
+		 */
+		if (!cxl_adapter_link_ok(afu->adapter))
+			return -EIO;
 		cpu_relax();
 	}
 	return 0;
@@ -279,6 +309,8 @@ static void slb_invalid(struct cxl_context *ctx)
 	cxl_p1_write(adapter, CXL_PSL_SLBIA, CXL_TLB_SLB_IQ_LPIDPID);
 
 	while (1) {
+		if (!cxl_adapter_link_ok(adapter))
+			break;
 		slbia = cxl_p1_read(adapter, CXL_PSL_SLBIA);
 		if (!(slbia & CXL_TLB_SLB_P))
 			break;
@@ -308,6 +340,11 @@ static int do_process_element_cmd(struct cxl_context *ctx,
 			rc = -EBUSY;
 			goto out;
 		}
+		if (!cxl_adapter_link_ok(ctx->afu->adapter)) {
+			dev_warn(&ctx->afu->dev, "WARNING: Device link down, aborting Process Element Command!\n");
+			rc = -EIO;
+			goto out;
+		}
 		state = be64_to_cpup(ctx->afu->sw_command_status);
 		if (state == ~0ULL) {
 			pr_err("cxl: Error adding process element to AFU\n");
@@ -355,8 +392,13 @@ static int terminate_process_element(struct cxl_context *ctx)
 
 	mutex_lock(&ctx->afu->spa_mutex);
 	pr_devel("%s Terminate pe: %i started\n", __func__, ctx->pe);
-	rc = do_process_element_cmd(ctx, CXL_SPA_SW_CMD_TERMINATE,
-				    CXL_PE_SOFTWARE_STATE_V | CXL_PE_SOFTWARE_STATE_T);
+	/* We could be asked to terminate when the hw is down.  That
+	 * should always succeed: it's not running if the hw has gone
+	 * away and is being reset.
+	 */
+	if (cxl_adapter_link_ok(ctx->afu->adapter))
+		rc = do_process_element_cmd(ctx, CXL_SPA_SW_CMD_TERMINATE,
+					    CXL_PE_SOFTWARE_STATE_V | CXL_PE_SOFTWARE_STATE_T);
 	ctx->elem->software_state = 0;	/* Remove Valid bit */
 	pr_devel("%s Terminate pe: %i finished\n", __func__, ctx->pe);
 	mutex_unlock(&ctx->afu->spa_mutex);
@@ -369,7 +411,14 @@ static int remove_process_element(struct cxl_context *ctx)
 
 	mutex_lock(&ctx->afu->spa_mutex);
 	pr_devel("%s Remove pe: %i started\n", __func__, ctx->pe);
-	if (!(rc = do_process_element_cmd(ctx, CXL_SPA_SW_CMD_REMOVE, 0)))
+
+	/* We could be asked to remove when the hw is down.  Again, if
+	 * the hw is down, the PE is gone, so we succeed.
+	 */
+	if (cxl_adapter_link_ok(ctx->afu->adapter))
+		rc = do_process_element_cmd(ctx, CXL_SPA_SW_CMD_REMOVE, 0);
+
+	if (!rc)
 		ctx->pe_inserted = false;
 	slb_invalid(ctx);
 	pr_devel("%s Remove pe: %i finished\n", __func__, ctx->pe);
@@ -614,6 +663,11 @@ int cxl_afu_activate_mode(struct cxl_afu *afu, int mode)
 	if (!(mode & afu->modes_supported))
 		return -EINVAL;
 
+	if (!cxl_adapter_link_ok(afu->adapter)) {
+		WARN(1, "Device link is down, refusing to activate!\n");
+		return -EIO;
+	}
+
 	if (mode == CXL_MODE_DIRECTED)
 		return activate_afu_directed(afu);
 	if (mode == CXL_MODE_DEDICATED)
@@ -624,6 +678,11 @@ int cxl_afu_activate_mode(struct cxl_afu *afu, int mode)
 
 int cxl_attach_process(struct cxl_context *ctx, bool kernel, u64 wed, u64 amr)
 {
+	if (!cxl_adapter_link_ok(ctx->afu->adapter)) {
+		WARN(1, "Device link is down, refusing to attach process!\n");
+		return -EIO;
+	}
+
 	ctx->kernel = kernel;
 	if (ctx->afu->current_mode == CXL_MODE_DIRECTED)
 		return attach_afu_directed(ctx, wed, amr);
@@ -668,6 +727,12 @@ int cxl_get_irq(struct cxl_afu *afu, struct cxl_irq_info *info)
 {
 	u64 pidtid;
 
+	/* If the adapter has gone away, we can't get any meaningful
+	 * information.
+	 */
+	if (!cxl_adapter_link_ok(afu->adapter))
+		return -EIO;
+
 	info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
 	info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An);
 	info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An);
-- 
2.1.4


* [PATCH 3/8] cxl: Allocate and release the SPA with the AFU
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
  2015-07-14  2:29 ` [PATCH 1/8] cxl: Allow the kernel to trust that an image won't change on PERST Daniel Axtens
  2015-07-14  2:29 ` [PATCH 2/8] cxl: Drop commands if the PCI channel is not in normal state Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 4/8] cxl: Make IRQ release idempotent Daniel Axtens
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

Previously the SPA was allocated and freed upon entering and leaving
AFU-directed mode. This causes some issues for error recovery - contexts
hold a pointer inside the SPA, and they may persist after the AFU has
been detached.

We would ideally like to allocate the SPA when the AFU is allocated, and
release it only when the AFU is released. However, we don't know how big the
SPA needs to be until we read the AFU descriptor.

Therefore, restructure the code:

 - Allocate the SPA only once, on the first attach.

 - Release the SPA only when the entire AFU is being released (not
   detached). Guard the release with a NULL check, so we don't free
   if it was never allocated (e.g. dedicated mode).
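
The resulting lifetime can be modelled in userspace as follows. This is
only a sketch: struct fake_afu and the fake_* functions are invented
for illustration, and calloc/free stand in for the page allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of an AFU; only the SPA pointer matters here. */
struct fake_afu {
	void *spa;  /* NULL until the first directed-mode attach */
};

/* Attach: allocate the SPA on first use, reuse it afterwards. */
static int fake_activate_directed(struct fake_afu *afu)
{
	if (afu->spa == NULL) {
		afu->spa = calloc(1, 4096);
		if (afu->spa == NULL)
			return -1;
	}
	return 0;
}

/* Detach: the hardware stops using the SPA, but the memory is kept. */
static void fake_deactivate_directed(struct fake_afu *afu)
{
	(void)afu;
}

/* Release of the whole AFU: free only if the SPA was ever allocated. */
static void fake_release_afu(struct fake_afu *afu)
{
	if (afu->spa) {
		free(afu->spa);
		afu->spa = NULL;
	}
}
```

Because detach leaves the allocation alone, a context that still holds
a pointer into the SPA stays valid until the AFU itself is released.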

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/misc/cxl/cxl.h    |  3 +++
 drivers/misc/cxl/native.c | 28 ++++++++++++++++++----------
 drivers/misc/cxl/pci.c    |  3 +++
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 53c90f6cc4ab..dfd893331d2a 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -599,6 +599,9 @@ void unregister_cxl_calls(struct cxl_calls *calls);
 int cxl_alloc_adapter_nr(struct cxl *adapter);
 void cxl_remove_adapter_nr(struct cxl *adapter);
 
+int cxl_alloc_spa(struct cxl_afu *afu);
+void cxl_release_spa(struct cxl_afu *afu);
+
 int cxl_file_init(void);
 void cxl_file_exit(void);
 int cxl_register_adapter(struct cxl *adapter);
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 16948915eb0d..debd97147b58 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -182,10 +182,8 @@ static int spa_max_procs(int spa_size)
 	return ((spa_size / 8) - 96) / 17;
 }
 
-static int alloc_spa(struct cxl_afu *afu)
+int cxl_alloc_spa(struct cxl_afu *afu)
 {
-	u64 spap;
-
 	/* Work out how many pages to allocate */
 	afu->spa_order = 0;
 	do {
@@ -204,6 +202,13 @@ static int alloc_spa(struct cxl_afu *afu)
 	pr_devel("spa pages: %i afu->spa_max_procs: %i   afu->num_procs: %i\n",
 		 1<<afu->spa_order, afu->spa_max_procs, afu->num_procs);
 
+	return 0;
+}
+
+static void attach_spa(struct cxl_afu *afu)
+{
+	u64 spap;
+
 	afu->sw_command_status = (__be64 *)((char *)afu->spa +
 					    ((afu->spa_max_procs + 3) * 128));
 
@@ -212,13 +217,15 @@ static int alloc_spa(struct cxl_afu *afu)
 	spap |= CXL_PSL_SPAP_V;
 	pr_devel("cxl: SPA allocated at 0x%p. Max processes: %i, sw_command_status: 0x%p CXL_PSL_SPAP_An=0x%016llx\n", afu->spa, afu->spa_max_procs, afu->sw_command_status, spap);
 	cxl_p1n_write(afu, CXL_PSL_SPAP_An, spap);
-
-	return 0;
 }
 
-static void release_spa(struct cxl_afu *afu)
+static inline void detach_spa(struct cxl_afu *afu)
 {
 	cxl_p1n_write(afu, CXL_PSL_SPAP_An, 0);
+}
+
+void cxl_release_spa(struct cxl_afu *afu)
+{
 	free_pages((unsigned long) afu->spa, afu->spa_order);
 }
 
@@ -446,8 +453,11 @@ static int activate_afu_directed(struct cxl_afu *afu)
 
 	dev_info(&afu->dev, "Activating AFU directed mode\n");
 
-	if (alloc_spa(afu))
-		return -ENOMEM;
+	if (afu->spa == NULL) {
+		if (cxl_alloc_spa(afu))
+			return -ENOMEM;
+	}
+	attach_spa(afu);
 
 	cxl_p1n_write(afu, CXL_PSL_SCNTL_An, CXL_PSL_SCNTL_An_PM_AFU);
 	cxl_p1n_write(afu, CXL_PSL_AMOR_An, 0xFFFFFFFFFFFFFFFFULL);
@@ -560,8 +570,6 @@ static int deactivate_afu_directed(struct cxl_afu *afu)
 	cxl_afu_disable(afu);
 	cxl_psl_purge(afu);
 
-	release_spa(afu);
-
 	return 0;
 }
 
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 5fad83653c9e..e8d6d5560529 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -551,6 +551,9 @@ static void cxl_release_afu(struct device *dev)
 
 	pr_devel("cxl_release_afu\n");
 
+	if (afu->spa)
+		cxl_release_spa(afu);
+
 	kfree(afu);
 }
 
-- 
2.1.4


* [PATCH 4/8] cxl: Make IRQ release idempotent
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
                   ` (2 preceding siblings ...)
  2015-07-14  2:29 ` [PATCH 3/8] cxl: Allocate and release the SPA with the AFU Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 5/8] cxl: Clean up adapter MMIO unmap path Daniel Axtens
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

Check if an IRQ is mapped before releasing it.

This will simplify future EEH code by allowing unconditional unmapping
of IRQs.
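
A minimal userspace sketch of the idea, assuming a small lookup table
in place of the kernel's IRQ domain. All fake_* names are hypothetical.
Unlike the patch, the sketch also treats a virq of 0 as "not mapped",
which is an added assumption.

```c
#include <assert.h>

#define FAKE_NR_HWIRQS 8

/* Hypothetical hwirq -> virq table; 0 means "not mapped". */
static unsigned int fake_irq_map[FAKE_NR_HWIRQS];
static int fake_release_calls;

struct fake_irq {
	unsigned int hwirq;
	unsigned int virq;
};

static unsigned int fake_find_mapping(unsigned int hwirq)
{
	return hwirq < FAKE_NR_HWIRQS ? fake_irq_map[hwirq] : 0;
}

/* Safe to call any number of times: only a live mapping is torn down. */
static void fake_release_irq(struct fake_irq *irq)
{
	/* Assumption beyond the patch: virq 0 is never a valid mapping. */
	if (irq->virq == 0 || irq->virq != fake_find_mapping(irq->hwirq))
		return;

	fake_irq_map[irq->hwirq] = 0;
	fake_release_calls++;
}
```

A second call finds that the stored virq no longer matches the table
and returns early, so teardown paths can call it unconditionally.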

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/misc/cxl/irq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/misc/cxl/irq.c b/drivers/misc/cxl/irq.c
index 680cd263436d..121ec48f3ab4 100644
--- a/drivers/misc/cxl/irq.c
+++ b/drivers/misc/cxl/irq.c
@@ -341,6 +341,9 @@ int cxl_register_psl_err_irq(struct cxl *adapter)
 
 void cxl_release_psl_err_irq(struct cxl *adapter)
 {
+	if (adapter->err_virq != irq_find_mapping(NULL, adapter->err_hwirq))
+		return;
+
 	cxl_p1_write(adapter, CXL_PSL_ErrIVTE, 0x0000000000000000);
 	cxl_unmap_irq(adapter->err_virq, adapter);
 	cxl_release_one_irq(adapter, adapter->err_hwirq);
@@ -374,6 +377,9 @@ int cxl_register_serr_irq(struct cxl_afu *afu)
 
 void cxl_release_serr_irq(struct cxl_afu *afu)
 {
+	if (afu->serr_virq != irq_find_mapping(NULL, afu->serr_hwirq))
+		return;
+
 	cxl_p1n_write(afu, CXL_PSL_SERR_An, 0x0000000000000000);
 	cxl_unmap_irq(afu->serr_virq, afu);
 	cxl_release_one_irq(afu->adapter, afu->serr_hwirq);
@@ -400,6 +406,9 @@ int cxl_register_psl_irq(struct cxl_afu *afu)
 
 void cxl_release_psl_irq(struct cxl_afu *afu)
 {
+	if (afu->psl_virq != irq_find_mapping(NULL, afu->psl_hwirq))
+		return;
+
 	cxl_unmap_irq(afu->psl_virq, afu);
 	cxl_release_one_irq(afu->adapter, afu->psl_hwirq);
 	kfree(afu->psl_irq_name);
-- 
2.1.4


* [PATCH 5/8] cxl: Clean up adapter MMIO unmap path.
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
                   ` (3 preceding siblings ...)
  2015-07-14  2:29 ` [PATCH 4/8] cxl: Make IRQ release idempotent Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 6/8] cxl: Refactor adaptor init/teardown Daniel Axtens
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

 - MMIO pointer unmapping is guarded by a null pointer check.
   However, iounmap only invalidates the pointer; it doesn't null it.
   Therefore, explicitly null the pointer after unmapping.

 - afu_desc_mmio also needs to be unmapped.

 - PCI regions are allocated in cxl_map_adapter_regs.
   Therefore they should be released in unmap, not elsewhere.
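
The first point can be illustrated with a small userspace sketch, where
free stands in for iounmap. struct fake_regs and the fake_* helpers are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pair of mappings; free() models iounmap() here. */
struct fake_regs {
	void *p1;
	void *p2;
};

static int fake_unmap_calls;

static void fake_unmap(void *p)
{
	free(p);
	fake_unmap_calls++;
}

/* Null each pointer after unmapping so a repeat call is a no-op. */
static void fake_unmap_regs(struct fake_regs *r)
{
	if (r->p1) {
		fake_unmap(r->p1);
		r->p1 = NULL;
	}
	if (r->p2) {
		fake_unmap(r->p2);
		r->p2 = NULL;
	}
}
```

Without the explicit NULL assignments, the guard would still see a
stale non-NULL pointer on the second call and unmap it again.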

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/misc/cxl/pci.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index e8d6d5560529..f65bfac1c496 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -539,10 +539,18 @@ err:
 
 static void cxl_unmap_slice_regs(struct cxl_afu *afu)
 {
-	if (afu->p2n_mmio)
+	if (afu->p2n_mmio) {
 		iounmap(afu->p2n_mmio);
-	if (afu->p1n_mmio)
+		afu->p2n_mmio = NULL;
+	}
+	if (afu->p1n_mmio) {
 		iounmap(afu->p1n_mmio);
+		afu->p1n_mmio = NULL;
+	}
+	if (afu->afu_desc_mmio) {
+		iounmap(afu->afu_desc_mmio);
+		afu->afu_desc_mmio = NULL;
+	}
 }
 
 static void cxl_release_afu(struct device *dev)
@@ -860,10 +868,16 @@ err1:
 
 static void cxl_unmap_adapter_regs(struct cxl *adapter)
 {
-	if (adapter->p1_mmio)
+	if (adapter->p1_mmio) {
 		iounmap(adapter->p1_mmio);
-	if (adapter->p2_mmio)
+		adapter->p1_mmio = NULL;
+		pci_release_region(to_pci_dev(adapter->dev.parent), 2);
+	}
+	if (adapter->p2_mmio) {
 		iounmap(adapter->p2_mmio);
+		adapter->p2_mmio = NULL;
+		pci_release_region(to_pci_dev(adapter->dev.parent), 0);
+	}
 }
 
 static int cxl_read_vsec(struct cxl *adapter, struct pci_dev *dev)
@@ -1073,8 +1087,6 @@ static void cxl_remove_adapter(struct cxl *adapter)
 
 	device_unregister(&adapter->dev);
 
-	pci_release_region(pdev, 0);
-	pci_release_region(pdev, 2);
 	pci_disable_device(pdev);
 }
 
-- 
2.1.4


* [PATCH 6/8] cxl: Refactor adaptor init/teardown
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
                   ` (4 preceding siblings ...)
  2015-07-14  2:29 ` [PATCH 5/8] cxl: Clean up adapter MMIO unmap path Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 7/8] cxl: Refactor AFU init/teardown Daniel Axtens
  2015-07-14  2:29 ` [PATCH 8/8] cxl: EEH support Daniel Axtens
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

Some aspects of initialisation are done only once in the lifetime of
an adapter: for example, allocating memory for the adapter,
allocating the adapter number, or setting up sysfs/debugfs files.

However, we may want to be able to do some parts of the
initialisation multiple times: for example, in error recovery we
want to be able to tear down and then re-map IO memory and IRQs.

Therefore, refactor CXL init/teardown as follows.

 - Keep the overarching functions 'cxl_init_adapter' and its pair,
   'cxl_remove_adapter'.

 - Move all 'once only' allocation/freeing steps to the existing
   'cxl_alloc_adapter' function, and its pair 'cxl_release_adapter'
   (This involves moving allocation of the adapter number out of
   cxl_init_adapter.)

 - Create two new functions: 'cxl_configure_adapter', and its pair
   'cxl_deconfigure_adapter'. These two functions 'wire up' the
   hardware --- they (de)configure resources that do not need to
   last the entire lifetime of the adapter.
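
The split between once-only and repeatable steps can be modelled as
below. This is a userspace sketch under invented names (struct
fake_adapter and the fake_* functions); it mirrors the shape of the
refactoring, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of an adapter; field names are illustrative only. */
struct fake_adapter {
	int nr;               /* once-only resource: survives recovery */
	bool configured;      /* repeatable state: torn down and rebuilt */
	int configure_count;  /* counts configure calls, for illustration */
};

static int fake_next_nr;

/* Once-only steps: memory and the adapter number. */
static struct fake_adapter *fake_alloc_adapter(void)
{
	struct fake_adapter *a = calloc(1, sizeof(*a));

	if (a)
		a->nr = fake_next_nr++;
	return a;
}

static void fake_release_adapter(struct fake_adapter *a)
{
	free(a);
}

/* Repeatable steps: safe in both creation and recovery. */
static int fake_configure_adapter(struct fake_adapter *a)
{
	a->configured = true;
	a->configure_count++;
	return 0;
}

static void fake_deconfigure_adapter(struct fake_adapter *a)
{
	a->configured = false;
}

/* Overarching init: allocate once, then configure. */
static struct fake_adapter *fake_init_adapter(void)
{
	struct fake_adapter *a = fake_alloc_adapter();

	if (a && fake_configure_adapter(a)) {
		fake_release_adapter(a);
		return NULL;
	}
	return a;
}
```

A recovery path would then call the deconfigure and configure functions
again on the same object, leaving the adapter number untouched.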

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 drivers/misc/cxl/pci.c | 138 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 85 insertions(+), 53 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index f65bfac1c496..faddfad597a2 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -906,9 +906,7 @@ static int cxl_read_vsec(struct cxl *adapter, struct pci_dev *dev)
 	CXL_READ_VSEC_BASE_IMAGE(dev, vsec, &adapter->base_image);
 	CXL_READ_VSEC_IMAGE_STATE(dev, vsec, &image_state);
 	adapter->user_image_loaded = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
-	adapter->perst_loads_image = true;
 	adapter->perst_select_user = !!(image_state & CXL_VSEC_USER_IMAGE_LOADED);
-	adapter->perst_same_image = false;
 
 	CXL_READ_VSEC_NAFUS(dev, vsec, &adapter->slices);
 	CXL_READ_VSEC_AFU_DESC_OFF(dev, vsec, &afu_desc_off);
@@ -967,22 +965,34 @@ static void cxl_release_adapter(struct device *dev)
 
 	pr_devel("cxl_release_adapter\n");
 
+	cxl_remove_adapter_nr(adapter);
+
 	kfree(adapter);
 }
 
-static struct cxl *cxl_alloc_adapter(struct pci_dev *dev)
+static struct cxl *cxl_alloc_adapter(void)
 {
 	struct cxl *adapter;
+	int rc;
 
 	if (!(adapter = kzalloc(sizeof(struct cxl), GFP_KERNEL)))
 		return NULL;
 
-	adapter->dev.parent = &dev->dev;
-	adapter->dev.release = cxl_release_adapter;
-	pci_set_drvdata(dev, adapter);
 	spin_lock_init(&adapter->afu_list_lock);
 
+	if ((rc = cxl_alloc_adapter_nr(adapter)))
+		goto err1;
+
+	if ((rc = dev_set_name(&adapter->dev, "card%i", adapter->adapter_num)))
+		goto err2;
+
 	return adapter;
+
+err2:
+	cxl_remove_adapter_nr(adapter);
+err1:
+	kfree(adapter);
+	return NULL;
 }
 
 static int sanitise_adapter_regs(struct cxl *adapter)
@@ -991,57 +1001,95 @@ static int sanitise_adapter_regs(struct cxl *adapter)
 	return cxl_tlb_slb_invalidate(adapter);
 }
 
-static struct cxl *cxl_init_adapter(struct pci_dev *dev)
+/* This should contain *only* operations that can safely be done in
+ * both creation and recovery.
+ */
+static int cxl_configure_adapter(struct cxl *adapter, struct pci_dev *dev)
 {
-	struct cxl *adapter;
-	bool free = true;
 	int rc;
 
+	adapter->dev.parent = &dev->dev;
+	adapter->dev.release = cxl_release_adapter;
+	pci_set_drvdata(dev, adapter);
 
-	if (!(adapter = cxl_alloc_adapter(dev)))
-		return ERR_PTR(-ENOMEM);
+	if ((rc = pci_enable_device(dev))) {
+		dev_err(&dev->dev, "pci_enable_device failed: %i\n", rc);
+		return rc;
+	}
 
 	if ((rc = cxl_read_vsec(adapter, dev)))
-		goto err1;
+		return rc;
 
 	if ((rc = cxl_vsec_looks_ok(adapter, dev)))
-		goto err1;
+		return rc;
 
 	if ((rc = setup_cxl_bars(dev)))
-		goto err1;
+		return rc;
 
 	if ((rc = switch_card_to_cxl(dev)))
-		goto err1;
-
-	if ((rc = cxl_alloc_adapter_nr(adapter)))
-		goto err1;
-
-	if ((rc = dev_set_name(&adapter->dev, "card%i", adapter->adapter_num)))
-		goto err2;
+		return rc;
 
 	if ((rc = cxl_update_image_control(adapter)))
-		goto err2;
+		return rc;
 
 	if ((rc = cxl_map_adapter_regs(adapter, dev)))
-		goto err2;
+		return rc;
 
 	if ((rc = sanitise_adapter_regs(adapter)))
-		goto err2;
+		goto err;
 
 	if ((rc = init_implementation_adapter_regs(adapter, dev)))
-		goto err3;
+		goto err;
 
 	if ((rc = pnv_phb_to_cxl_mode(dev, OPAL_PHB_CAPI_MODE_CAPI)))
-		goto err3;
+		goto err;
 
 	/* If recovery happened, the last step is to turn on snooping.
 	 * In the non-recovery case this has no effect */
-	if ((rc = pnv_phb_to_cxl_mode(dev, OPAL_PHB_CAPI_MODE_SNOOP_ON))) {
-		goto err3;
-	}
+	if ((rc = pnv_phb_to_cxl_mode(dev, OPAL_PHB_CAPI_MODE_SNOOP_ON)))
+		goto err;
 
 	if ((rc = cxl_register_psl_err_irq(adapter)))
-		goto err3;
+		goto err;
+
+	return 0;
+
+err:
+	cxl_unmap_adapter_regs(adapter);
+	return rc;
+
+}
+
+static void cxl_deconfigure_adapter(struct cxl *adapter)
+{
+	struct pci_dev *pdev = to_pci_dev(adapter->dev.parent);
+
+	cxl_release_psl_err_irq(adapter);
+	cxl_unmap_adapter_regs(adapter);
+
+	pci_disable_device(pdev);
+}
+
+static struct cxl *cxl_init_adapter(struct pci_dev *dev)
+{
+	struct cxl *adapter;
+	int rc;
+
+	adapter = cxl_alloc_adapter();
+	if (!adapter)
+		return ERR_PTR(-ENOMEM);
+
+	/* Set defaults for parameters which need to persist over
+	 * configure/reconfigure
+	 */
+	adapter->perst_loads_image = true;
+	adapter->perst_same_image = false;
+
+	if ((rc = cxl_configure_adapter(adapter, dev))) {
+		pci_disable_device(dev);
+		cxl_release_adapter(&adapter->dev);
+		return ERR_PTR(rc);
+	}
 
 	/* Don't care if this one fails: */
 	cxl_debugfs_adapter_add(adapter);
@@ -1059,35 +1107,25 @@ static struct cxl *cxl_init_adapter(struct pci_dev *dev)
 	return adapter;
 
 err_put1:
-	device_unregister(&adapter->dev);
-	free = false;
+	/* This should mirror cxl_remove_adapter, except without the
+	 * sysfs parts
+	 */
 	cxl_debugfs_adapter_remove(adapter);
-	cxl_release_psl_err_irq(adapter);
-err3:
-	cxl_unmap_adapter_regs(adapter);
-err2:
-	cxl_remove_adapter_nr(adapter);
-err1:
-	if (free)
-		kfree(adapter);
+	cxl_deconfigure_adapter(adapter);
+	device_unregister(&adapter->dev);
 	return ERR_PTR(rc);
 }
 
 static void cxl_remove_adapter(struct cxl *adapter)
 {
-	struct pci_dev *pdev = to_pci_dev(adapter->dev.parent);
-
 	pr_devel("cxl_remove_adapter\n");
 
 	cxl_sysfs_adapter_remove(adapter);
 	cxl_debugfs_adapter_remove(adapter);
-	cxl_release_psl_err_irq(adapter);
-	cxl_unmap_adapter_regs(adapter);
-	cxl_remove_adapter_nr(adapter);
 
-	device_unregister(&adapter->dev);
+	cxl_deconfigure_adapter(adapter);
 
-	pci_disable_device(pdev);
+	device_unregister(&adapter->dev);
 }
 
 static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
@@ -1101,15 +1139,9 @@ static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (cxl_verbose)
 		dump_cxl_config_space(dev);
 
-	if ((rc = pci_enable_device(dev))) {
-		dev_err(&dev->dev, "pci_enable_device failed: %i\n", rc);
-		return rc;
-	}
-
 	adapter = cxl_init_adapter(dev);
 	if (IS_ERR(adapter)) {
 		dev_err(&dev->dev, "cxl_init_adapter failed: %li\n", PTR_ERR(adapter));
-		pci_disable_device(dev);
 		return PTR_ERR(adapter);
 	}
 
-- 
2.1.4

* [PATCH 7/8] cxl: Refactor AFU init/teardown
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
                   ` (5 preceding siblings ...)
  2015-07-14  2:29 ` [PATCH 6/8] cxl: Refactor adaptor init/teardown Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  2015-07-14  2:29 ` [PATCH 8/8] cxl: EEH support Daniel Axtens
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

As with an adapter, some aspects of initialisation are done only once
in the lifetime of an AFU: for example, allocating memory, or setting
up sysfs/debugfs files.

However, we may want to be able to do some parts of the initialisation
multiple times: for example, in error recovery we want to be able to
tear down and then re-map IO memory and IRQs.

Therefore, refactor AFU init/teardown as follows.

 - Create two new functions: 'cxl_configure_afu', and its pair
   'cxl_deconfigure_afu'. As with the adapter functions,
   these (de)configure resources that do not need to last the entire
   lifetime of the AFU.

 - Allocating and releasing memory remain the task of 'cxl_alloc_afu'
   and 'cxl_release_afu'.

 - Once-only functions that do not involve allocating/releasing memory
   stay in the overarching 'cxl_init_afu'/'cxl_remove_afu' pair.
   However, the task of picking an AFU mode and activating it has been
   broken out.

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
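Not part of the patch: a minimal userspace sketch of the goto-based
unwind that a configure/deconfigure pair like the one described above
relies on. It assumes three steps (map registers, register the SErr
IRQ, register the PSL IRQ) modelled as boolean flags. fake_afu,
fail_step, step, fake_configure_afu and fake_deconfigure_afu are
hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct cxl_afu. */
struct fake_afu {
	bool regs_mapped;
	bool serr_irq;
	bool psl_irq;
};

/* Which step should fail; FAIL_NONE means every step succeeds. */
enum fail_step { FAIL_NONE, FAIL_MAP, FAIL_SERR_IRQ, FAIL_PSL_IRQ };

/* One setup step: returns 0 and sets the flag, or -1 if told to fail. */
static int step(bool *flag, bool fail)
{
	if (fail)
		return -1;
	*flag = true;
	return 0;
}

static int fake_configure_afu(struct fake_afu *afu, enum fail_step fail)
{
	int rc;

	assert(!afu->regs_mapped);	/* must be deconfigured first */

	if ((rc = step(&afu->regs_mapped, fail == FAIL_MAP)))
		return rc;		/* nothing to undo yet */

	if ((rc = step(&afu->serr_irq, fail == FAIL_SERR_IRQ)))
		goto err1;

	if ((rc = step(&afu->psl_irq, fail == FAIL_PSL_IRQ)))
		goto err2;

	return 0;

err2:
	afu->serr_irq = false;		/* undo the SErr IRQ */
err1:
	afu->regs_mapped = false;	/* undo the register mapping */
	return rc;
}

/* Mirror of configure, in reverse order; safe to call repeatedly. */
static void fake_deconfigure_afu(struct fake_afu *afu)
{
	afu->psl_irq = false;
	afu->serr_irq = false;
	afu->regs_mapped = false;
}
```

Whichever step fails, the sketch leaves the structure exactly as it
found it, so a failed reconfigure during recovery can be followed by
a normal removal.
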
 drivers/misc/cxl/pci.c | 87 +++++++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index faddfad597a2..273e810020ef 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -693,45 +693,67 @@ static int sanitise_afu_regs(struct cxl_afu *afu)
 	return 0;
 }
 
-static int cxl_init_afu(struct cxl *adapter, int slice, struct pci_dev *dev)
+static int cxl_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pci_dev *dev)
 {
-	struct cxl_afu *afu;
-	bool free = true;
 	int rc;
 
-	if (!(afu = cxl_alloc_afu(adapter, slice)))
-		return -ENOMEM;
-
-	if ((rc = dev_set_name(&afu->dev, "afu%i.%i", adapter->adapter_num, slice)))
-		goto err1;
-
 	if ((rc = cxl_map_slice_regs(afu, adapter, dev)))
-		goto err1;
+		return rc;
 
 	if ((rc = sanitise_afu_regs(afu)))
-		goto err2;
+		goto err1;
 
 	/* We need to reset the AFU before we can read the AFU descriptor */
 	if ((rc = __cxl_afu_reset(afu)))
-		goto err2;
+		goto err1;
 
 	if (cxl_verbose)
 		dump_afu_descriptor(afu);
 
 	if ((rc = cxl_read_afu_descriptor(afu)))
-		goto err2;
+		goto err1;
 
 	if ((rc = cxl_afu_descriptor_looks_ok(afu)))
-		goto err2;
+		goto err1;
 
 	if ((rc = init_implementation_afu_regs(afu)))
-		goto err2;
+		goto err1;
 
 	if ((rc = cxl_register_serr_irq(afu)))
-		goto err2;
+		goto err1;
 
 	if ((rc = cxl_register_psl_irq(afu)))
-		goto err3;
+		goto err2;
+
+	return 0;
+
+err2:
+	cxl_release_serr_irq(afu);
+err1:
+	cxl_unmap_slice_regs(afu);
+	return rc;
+}
+
+static void cxl_deconfigure_afu(struct cxl_afu *afu)
+{
+	cxl_release_psl_irq(afu);
+	cxl_release_serr_irq(afu);
+	cxl_unmap_slice_regs(afu);
+}
+
+static int cxl_init_afu(struct cxl *adapter, int slice, struct pci_dev *dev)
+{
+	struct cxl_afu *afu;
+	int rc;
+
+	if (!(afu = cxl_alloc_afu(adapter, slice)))
+		return -ENOMEM;
+
+	if ((rc = dev_set_name(&afu->dev, "afu%i.%i", adapter->adapter_num, slice)))
+		goto err_free;
+
+	if ((rc = cxl_configure_afu(afu, adapter, dev)))
+		goto err_free;
 
 	/* Don't care if this fails */
 	cxl_debugfs_afu_add(afu);
@@ -746,10 +768,6 @@ static int cxl_init_afu(struct cxl *adapter, int slice, struct pci_dev *dev)
 	if ((rc = cxl_sysfs_afu_add(afu)))
 		goto err_put1;
 
-
-	if ((rc = cxl_afu_select_best_mode(afu)))
-		goto err_put2;
-
 	adapter->afu[afu->slice] = afu;
 
 	if ((rc = cxl_pci_vphb_add(afu)))
@@ -757,21 +775,16 @@ static int cxl_init_afu(struct cxl *adapter, int slice, struct pci_dev *dev)
 
 	return 0;
 
-err_put2:
-	cxl_sysfs_afu_remove(afu);
 err_put1:
-	device_unregister(&afu->dev);
-	free = false;
+	cxl_deconfigure_afu(afu);
 	cxl_debugfs_afu_remove(afu);
-	cxl_release_psl_irq(afu);
-err3:
-	cxl_release_serr_irq(afu);
-err2:
-	cxl_unmap_slice_regs(afu);
-err1:
-	if (free)
-		kfree(afu);
+	device_unregister(&afu->dev);
 	return rc;
+
+err_free:
+	kfree(afu);
+	return rc;
+
 }
 
 static void cxl_remove_afu(struct cxl_afu *afu)
@@ -791,10 +804,7 @@ static void cxl_remove_afu(struct cxl_afu *afu)
 	cxl_context_detach_all(afu);
 	cxl_afu_deactivate_mode(afu);
 
-	cxl_release_psl_irq(afu);
-	cxl_release_serr_irq(afu);
-	cxl_unmap_slice_regs(afu);
-
+	cxl_deconfigure_afu(afu);
 	device_unregister(&afu->dev);
 }
 
@@ -1148,6 +1158,9 @@ static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	for (slice = 0; slice < adapter->slices; slice++) {
 		if ((rc = cxl_init_afu(adapter, slice, dev)))
 			dev_err(&dev->dev, "AFU %i failed to initialise: %i\n", slice, rc);
+
+		if (!rc && (rc = cxl_afu_select_best_mode(adapter->afu[slice])))
+			dev_err(&dev->dev, "AFU %i failed to start: %i\n", slice, rc);
 	}
 
 	return 0;
-- 
2.1.4

* [PATCH 8/8] cxl: EEH support
  2015-07-14  2:29 [PATCH 0/8] CXL EEH Handling Daniel Axtens
                   ` (6 preceding siblings ...)
  2015-07-14  2:29 ` [PATCH 7/8] cxl: Refactor AFU init/teardown Daniel Axtens
@ 2015-07-14  2:29 ` Daniel Axtens
  7 siblings, 0 replies; 9+ messages in thread
From: Daniel Axtens @ 2015-07-14  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Matthew R. Ochs, Daniel Axtens

EEH (Enhanced Error Handling) allows a driver to recover from the
temporary failure of an attached PCI card. Enable basic CXL support
for EEH.

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
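Not part of the patch: a minimal userspace sketch of how an AFU driver
might follow the three-callback flow this patch introduces, assuming
the simplest case in which the driver only needs a flag to stop new
work during recovery. The enum values and the names fake_drv,
drv_error_detected, drv_slot_reset, drv_resume, drv_submit and
merge_result are hypothetical; they are not the kernel's
pci_ers_result_t or pci_error_handlers definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result { ERS_NONE, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* Hypothetical AFU driver state: one flag gating new work. */
struct fake_drv {
	bool in_recovery;
	int requests_done;
};

/* error_detected: stop starting new work and ask for a slot reset. */
static enum ers_result drv_error_detected(struct fake_drv *d)
{
	d->in_recovery = true;
	return ERS_NEED_RESET;
}

/* slot_reset: the place to check that the hardware came back. */
static enum ers_result drv_slot_reset(struct fake_drv *d)
{
	(void)d;
	return ERS_RECOVERED;
}

/* resume: no return value, so only restart work here. */
static void drv_resume(struct fake_drv *d)
{
	assert(d->in_recovery);	/* resume only follows a detected error */
	d->in_recovery = false;
}

/* Normal request path: refuses to start while recovery is running. */
static int drv_submit(struct fake_drv *d)
{
	if (d->in_recovery)
		return -1;
	d->requests_done++;
	return 0;
}

/*
 * Fold one driver's answer into the running result: DISCONNECT trumps
 * all, NONE trumps NEED_RESET, anything else leaves the result alone.
 */
static enum ers_result merge_result(enum ers_result so_far,
				    enum ers_result drv)
{
	if (drv == ERS_DISCONNECT)
		return ERS_DISCONNECT;
	if (drv == ERS_NONE && so_far == ERS_NEED_RESET)
		return ERS_NONE;
	return so_far;
}
```

A driver that allocates its resources at startup would, in addition,
free and reallocate them in its slot_reset callback before reporting
success.
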
 drivers/misc/cxl/pci.c | 247 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 247 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 273e810020ef..918426f5cdb8 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -24,6 +24,7 @@
 #include <asm/io.h>
 
 #include "cxl.h"
+#include <misc/cxl.h>
 
 
 #define CXL_PCI_VSEC_ID	0x1280
@@ -1184,10 +1185,256 @@ static void cxl_remove(struct pci_dev *dev)
 	cxl_remove_adapter(adapter);
 }
 
+static pci_ers_result_t cxl_vphb_error_detected(struct cxl_afu *afu,
+						pci_channel_state_t state)
+{
+	struct pci_dev *afu_dev;
+	pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET;
+	pci_ers_result_t afu_result = PCI_ERS_RESULT_NEED_RESET;
+
+	/* There should only be one entry, but go through the list
+	 * anyway
+	 */
+	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
+		if (!afu_dev->driver)
+			continue;
+
+		if (afu_dev->driver->err_handler)
+			afu_result = afu_dev->driver->err_handler->error_detected(afu_dev,
+										  state);
+		/* Disconnect trumps all, NONE trumps NEED_RESET */
+		if (afu_result == PCI_ERS_RESULT_DISCONNECT)
+			result = PCI_ERS_RESULT_DISCONNECT;
+		else if ((afu_result == PCI_ERS_RESULT_NONE) &&
+			 (result == PCI_ERS_RESULT_NEED_RESET))
+			result = PCI_ERS_RESULT_NONE;
+	}
+	return result;
+}
+
+static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+					       pci_channel_state_t state)
+{
+	struct cxl *adapter = pci_get_drvdata(pdev);
+	struct cxl_afu *afu;
+	pci_ers_result_t result = PCI_ERS_RESULT_NEED_RESET;
+	int i;
+
+	/* At this point, we could still have an interrupt pending.
+	 * Let's try to get them out of the way before they do
+	 * anything we don't like.
+	 */
+	schedule();
+
+	/* If we're permanently dead, give up. */
+	if (state == pci_channel_io_perm_failure) {
+		/* Tell the AFU drivers; but we don't care what they
+		 * say, we're going away.
+		 */
+		for (i = 0; i < adapter->slices; i++) {
+			afu = adapter->afu[i];
+			cxl_vphb_error_detected(afu, state);
+		}
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	/* Are we reflashing?
+	 *
+	 * If we reflash, we could come back as something entirely
+	 * different, including a non-CAPI card. As such, by default
+	 * we don't participate in the process. We'll be unbound and
+	 * the slot re-probed. (TODO: check EEH doesn't blindly rebind
+	 * us!)
+	 *
+	 * However, this isn't the entire story: for reliability
+	 * reasons, we usually want to reflash the FPGA on PERST in
+	 * order to get back to a more reliable known-good state.
+	 *
+	 * This causes us a bit of a problem: if we reflash we can't
+	 * trust that we'll come back the same - we could have a new
+	 * image and been PERSTed in order to load that
+	 * image. However, most of the time we actually *will* come
+	 * back the same - for example a regular EEH event.
+	 *
+	 * Therefore, we allow the user to assert that the image is
+	 * indeed the same and that we should continue on into EEH
+	 * anyway.
+	 */
+	if (adapter->perst_loads_image && !adapter->perst_same_image) {
+		/* TODO take the PHB out of CXL mode */
+		dev_info(&pdev->dev, "reflashing, so opting out of EEH!\n");
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	/*
+	 * At this point, we want to try to recover.  We'll always
+	 * need a complete slot reset: we don't trust any other reset.
+	 *
+	 * Now, we go through each AFU:
+	 *  - We send the driver, if bound, an error_detected callback.
+	 *    We expect it to clean up, but it can also tell us to give
+	 *    up and permanently detach the card. To simplify things, if
+	 *    any bound AFU driver doesn't support EEH, we give up on EEH.
+	 *
+	 *  - We detach all contexts associated with the AFU. This
+	 *    does not free them, but puts them into a CLOSED state
+	 *    which causes any associated files to return useful
+	 *    errors to userland. It also unmaps, but does not free,
+	 *    any IRQs.
+	 *
+	 *  - We clean up our side: releasing and unmapping resources we hold
+	 *    so we can wire them up again when the hardware comes back up.
+	 *
+	 * Driver authors should note:
+	 *
+	 *  - Any contexts you create in your kernel driver (except
+	 *    those associated with anonymous file descriptors) are
+	 *    your responsibility to free and recreate. Likewise with
+	 *    any attached resources.
+	 *
+	 *  - We will take responsibility for re-initialising the
+	 *    device context (the one set up for you in
+	 *    cxl_pci_enable_device_hook and accessed through
+	 *    cxl_get_context). If you've attached IRQs or other
+	 *    resources to it, they remain yours to free.
+	 *
+	 * All calls you make into cxl that normally touch the
+	 * hardware will not touch the hardware during recovery. So
+	 * you can call the same functions to release resources as you
+	 * normally would.
+	 *
+	 * Two examples:
+	 *
+	 * 1) If you normally free all your resources at the end of
+	 *    each request, or if you use anonymous FDs, your
+	 *    error_detected callback can simply set a flag to tell
+	 *    your driver not to start any new calls. You can then
+	 *    clear the flag in the resume callback.
+	 *
+	 * 2) If you normally allocate your resources on startup:
+	 *     * Set a flag in error_detected as above.
+	 *     * Let CXL detach your contexts.
+	 *     * In slot_reset, free the old resources and allocate new ones.
+	 *     * In resume, clear the flag to allow things to start.
+	 */
+	for (i = 0; i < adapter->slices; i++) {
+		afu = adapter->afu[i];
+
+		result = cxl_vphb_error_detected(afu, state);
+
+		/* Only continue if everyone agrees on NEED_RESET */
+		if (result != PCI_ERS_RESULT_NEED_RESET)
+			return result;
+
+		cxl_context_detach_all(afu);
+		cxl_afu_deactivate_mode(afu);
+		cxl_deconfigure_afu(afu);
+	}
+	cxl_deconfigure_adapter(adapter);
+
+	return result;
+}
+
+static pci_ers_result_t cxl_pci_slot_reset(struct pci_dev *pdev)
+{
+	struct cxl *adapter = pci_get_drvdata(pdev);
+	struct cxl_afu *afu;
+	struct cxl_context *ctx;
+	struct pci_dev *afu_dev;
+	pci_ers_result_t afu_result = PCI_ERS_RESULT_RECOVERED;
+	pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
+	int i;
+
+	if (cxl_configure_adapter(adapter, pdev))
+		goto err;
+
+	for (i = 0; i < adapter->slices; i++) {
+		afu = adapter->afu[i];
+
+		if (cxl_configure_afu(afu, adapter, pdev))
+			goto err;
+
+		if (cxl_afu_select_best_mode(afu))
+			goto err;
+
+		list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
+			/* Reset the device context.
+			 * TODO: make this less disruptive
+			 */
+			ctx = cxl_get_context(afu_dev);
+
+			if (ctx && cxl_release_context(ctx))
+				goto err;
+
+			ctx = cxl_dev_context_init(afu_dev);
+			if (!ctx)
+				goto err;
+
+			afu_dev->dev.archdata.cxl_ctx = ctx;
+
+			if (cxl_afu_check_and_enable(afu))
+				goto err;
+
+			/* If there's a driver attached, allow it to
+			 * chime in on recovery. Drivers should check
+			 * if everything has come back OK.
+			 */
+			if (!afu_dev->driver)
+				continue;
+
+			if (afu_dev->driver->err_handler &&
+			    afu_dev->driver->err_handler->slot_reset)
+				afu_result = afu_dev->driver->err_handler->slot_reset(afu_dev);
+
+			if (afu_result == PCI_ERS_RESULT_DISCONNECT)
+				result = PCI_ERS_RESULT_DISCONNECT;
+		}
+	}
+	return result;
+
+err:
+	/* All the bits that happen in both error_detected and cxl_remove
+	 * should be idempotent, so we don't need to worry about leaving a mix
+	 * of unconfigured and reconfigured resources.
+	 */
+	dev_err(&pdev->dev, "EEH recovery failed. Asking to be disconnected.\n");
+	return PCI_ERS_RESULT_DISCONNECT;
+}
+
+static void cxl_pci_resume(struct pci_dev *pdev)
+{
+	struct cxl *adapter = pci_get_drvdata(pdev);
+	struct cxl_afu *afu;
+	struct pci_dev *afu_dev;
+	int i;
+
+	/* Everything is back now. Drivers should restart work now.
+	 * This is not the place to be checking if everything came back up
+	 * properly, because there's no return value: do that in slot_reset.
+	 */
+	for (i = 0; i < adapter->slices; i++) {
+		afu = adapter->afu[i];
+
+		list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
+			if (afu_dev->driver && afu_dev->driver->err_handler &&
+			    afu_dev->driver->err_handler->resume)
+				afu_dev->driver->err_handler->resume(afu_dev);
+		}
+	}
+}
+
+static const struct pci_error_handlers cxl_err_handler = {
+	.error_detected = cxl_pci_error_detected,
+	.slot_reset = cxl_pci_slot_reset,
+	.resume = cxl_pci_resume,
+};
+
+
 struct pci_driver cxl_pci_driver = {
 	.name = "cxl-pci",
 	.id_table = cxl_pci_tbl,
 	.probe = cxl_probe,
 	.remove = cxl_remove,
 	.shutdown = cxl_remove,
+	.err_handler = &cxl_err_handler,
 };
-- 
2.1.4
