linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] libsas and drivers: NCQ error handling
@ 2022-08-17 14:52 John Garry
  2022-08-17 14:52 ` [PATCH v2 1/6] scsi: pm8001: Modify task abort handling for SATA task John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

As reported in [0], the pm8001 driver NCQ error handling more or less
duplicates what libata does in link error handling, as follows:
- abort all commands
- do autopsy with read log ext 10 command
- reset the target to recover

Indeed for the hisi_sas driver we want to add similar handling for NCQ
errors.

This series add a new libsas API - sas_ata_device_link_abort() - to handle
host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
mentioned in the pm8001 changeover patch, I would prefer a better place to
locate the SATA ABORT command (rather that nexus reset callback).

Damien kindly tested the v1 series for pm8001, but any further pm8001
testing would be appreciated as I have since tweaked
sas_ata_device_link_abort(). This is because the pm8001 driver hangs on my
arm64 machine read log ext10 command.

Finally with these changes we can make the libsas task alloc/free APIs
private, which they should always have been.

Based on v6.0-rc1

[0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@huawei.com/

Changes since v1:
- Rename sas_ata_link_abort() -> sas_ata_device_link_abort()
- Set EH RESET flag in sas_ata_device_link_abort()
- Add Jack's Ack tags
- Rebase

John Garry (5):
  scsi: pm8001: Modify task abort handling for SATA task
  scsi: libsas: Add sas_ata_device_link_abort()
  scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
  scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
  scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private

Xingui Yang (1):
  scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw

 drivers/scsi/hisi_sas/hisi_sas_main.c  |   5 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  22 ++-
 drivers/scsi/libsas/sas_ata.c          |  11 ++
 drivers/scsi/libsas/sas_init.c         |   3 -
 drivers/scsi/libsas/sas_internal.h     |   4 +
 drivers/scsi/pm8001/pm8001_hwi.c       | 194 +++++++------------------
 drivers/scsi/pm8001/pm8001_sas.c       |  13 ++
 drivers/scsi/pm8001/pm8001_sas.h       |   8 +-
 drivers/scsi/pm8001/pm80xx_hwi.c       | 177 ++--------------------
 include/scsi/libsas.h                  |   4 -
 include/scsi/sas_ata.h                 |   5 +
 11 files changed, 133 insertions(+), 313 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/6] scsi: pm8001: Modify task abort handling for SATA task
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 14:52 ` [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort() John Garry
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

When we try to abort a SATA task, the CCB of the task which we are trying
to avoid may still complete. In this case, we should not touch the task
associated with that CCB as we can race with libsas freeing the last later
in sas_eh_handle_sas_errors() -> sas_eh_finish_cmd() for when
TASK_IS_ABORTED is returned from sas_scsi_find_task()

Signed-off-by: John Garry <john.garry@huawei.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
---
 drivers/scsi/pm8001/pm8001_hwi.c | 15 +++++++++++++--
 drivers/scsi/pm8001/pm8001_sas.c |  8 ++++++++
 drivers/scsi/pm8001/pm80xx_hwi.c | 14 ++++++++++----
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index 91d78d0a38fe..1c8a43bf54ad 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -2295,7 +2295,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
 		if (t->dev && (t->dev->lldd_dev))
 			pm8001_dev = t->dev->lldd_dev;
 	} else {
-		pm8001_dbg(pm8001_ha, FAIL, "task null\n");
+		pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+			   ccb->ccb_tag);
+		pm8001_ccb_free(pm8001_ha, ccb);
 		return;
 	}
 
@@ -2675,8 +2677,17 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha, void *piomb)
 	pm8001_dev = ccb->device;
 	if (event)
 		pm8001_dbg(pm8001_ha, FAIL, "sata IO status 0x%x\n", event);
-	if (unlikely(!t || !t->lldd_task || !t->dev))
+
+	if (unlikely(!t)) {
+		pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+			   ccb->ccb_tag);
+		pm8001_ccb_free(pm8001_ha, ccb);
 		return;
+	}
+
+	if (unlikely(!t->lldd_task || !t->dev))
+		return;
+
 	ts = &t->task_status;
 	pm8001_dbg(pm8001_ha, DEVIO,
 		   "port_id:0x%x, device_id:0x%x, tag:0x%x, event:0x%x\n",
diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index 78cae33aa620..ca29d1454b66 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -984,6 +984,7 @@ int pm8001_query_task(struct sas_task *task)
 /*  mandatory SAM-3, still need free task/ccb info, abort the specified task */
 int pm8001_abort_task(struct sas_task *task)
 {
+	struct pm8001_ccb_info *ccb = task->lldd_task;
 	unsigned long flags;
 	u32 tag;
 	struct domain_device *dev ;
@@ -1114,6 +1115,13 @@ int pm8001_abort_task(struct sas_task *task)
 				pm8001_dev, DS_OPERATIONAL);
 			wait_for_completion(&completion);
 		} else {
+			/*
+			 * Ensure that if we see a completion for the ccb
+			 * associated with the task which we are trying to
+			 * abort then we should not touch the sas_task as it
+			 * may race with libsas freeing it when return here.
+			 */
+			ccb->task = NULL;
 			ret = sas_execute_internal_abort_single(dev, tag, 0, NULL);
 		}
 		rc = TMF_RESP_FUNC_COMPLETE;
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index f8b8624458f7..dd0e06983cd3 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -2396,7 +2396,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
 		if (t->dev && (t->dev->lldd_dev))
 			pm8001_dev = t->dev->lldd_dev;
 	} else {
-		pm8001_dbg(pm8001_ha, FAIL, "task null\n");
+		pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+			   ccb->ccb_tag);
+		pm8001_ccb_free(pm8001_ha, ccb);
 		return;
 	}
 
@@ -2813,12 +2815,16 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha,
 	ccb = &pm8001_ha->ccb_info[tag];
 	t = ccb->task;
 	pm8001_dev = ccb->device;
-
-	if (unlikely(!t || !t->lldd_task || !t->dev)) {
-		pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
+	if (unlikely(!t)) {
+		pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+			   ccb->ccb_tag);
+		pm8001_ccb_free(pm8001_ha, ccb);
 		return;
 	}
 
+	if (unlikely(!t->lldd_task || !t->dev))
+		return;
+
 	ts = &t->task_status;
 	pm8001_dbg(pm8001_ha, IOERR, "port_id:0x%x, tag:0x%x, event:0x%x\n",
 		   port_id, tag, event);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
  2022-08-17 14:52 ` [PATCH v2 1/6] scsi: pm8001: Modify task abort handling for SATA task John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 16:04   ` Damien Le Moal
  2022-08-17 14:52 ` [PATCH v2 3/6] scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors John Garry
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
to call to initiate a link abort.

This will mark all outstanding QCs as failed and kick-off EH.

Note:
The ATA_EH_RESET flag is set for following reasons:
- For hisi_sas, SATA device resources during error handling will only be
  released during reset for ATA EH.
  ATA EH could decide during autopsy that EH would not be required, so
  ensure that it happens (by setting the flag).
- Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
  ensure necessary recovery commands are sent (so again we require flag
  ATA_EH_RESET to be set as an insurance policy).

Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
 include/scsi/sas_ata.h        |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index d35c9296f738..9daae64be37e 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
 	ata_port_wait_eh(ap);
 }
 
+void sas_ata_device_link_abort(struct domain_device *device)
+{
+	struct ata_port *ap = device->sata_dev.ap;
+	struct ata_link *link = &ap->link;
+
+	link->eh_info.err_mask |= AC_ERR_DEV;
+	link->eh_info.action |= ATA_EH_RESET;
+	ata_link_abort(link);
+}
+EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
+
 int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
 {
 	struct sas_tmf_task tmf_task = {};
diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
index a1df4f9d57a3..cad0b33064a5 100644
--- a/include/scsi/sas_ata.h
+++ b/include/scsi/sas_ata.h
@@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
 void sas_suspend_sata(struct asd_sas_port *port);
 void sas_resume_sata(struct asd_sas_port *port);
 void sas_ata_end_eh(struct ata_port *ap);
+void sas_ata_device_link_abort(struct domain_device *dev);
 int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
 			int force_phy_id);
 int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
@@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
 {
 }
 
+static inline void sas_ata_device_link_abort(struct domain_device *dev)
+{
+}
+
 static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
 				      int force_phy_id)
 {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/6] scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
  2022-08-17 14:52 ` [PATCH v2 1/6] scsi: pm8001: Modify task abort handling for SATA task John Garry
  2022-08-17 14:52 ` [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort() John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 14:52 ` [PATCH v2 4/6] scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task() John Garry
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

In commit c6b9ef5779c3 ("[SCSI] pm80xx: NCQ error handling changes") the
driver had support added to handle NCQ errors but much of what is done
in this handling is duplicated from the libata EH.

In that named commit we handle in 2x main steps:
a. Issue read log ext10 to examine and clear the errors
b. Issue SATA_ABORT command

Indeed, in libata EH, we do similar to above:
a. ata_do_eh() -> ata_eh_autopsy() -> ata_eh_link_autopsy() ->#
   ata_eh_analyze_ncq_error() -> ata_eh_read_log_10h()
b. ata_do_eh() -> ata_eh_recover() which will issue a device soft reset
   or hard reset

Since there is so much duplication, use sas_ata_link_abort() which will
abort all pending IOs and kick off ATA EH which will do the steps, above.

The pm8001_send_abort_all() call to issue the SATA_ABORT command is added
in pm8001_I_T_nexus_reset() [which is called from ata_eh_recover() to
issue the device reset] as there is currently no better place to locate it.
In future if we add ata_port_operations.softreset callback implementation
for libsas then that may be better.

Signed-off-by: John Garry <john.garry@huawei.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
---
 drivers/scsi/pm8001/pm8001_hwi.c | 179 ++++++++-----------------------
 drivers/scsi/pm8001/pm8001_sas.c |   5 +
 drivers/scsi/pm8001/pm8001_sas.h |   8 +-
 drivers/scsi/pm8001/pm80xx_hwi.c | 163 ++--------------------------
 4 files changed, 58 insertions(+), 297 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index 1c8a43bf54ad..82de80bb12f3 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -1724,7 +1724,15 @@ void pm8001_work_fn(struct work_struct *work)
 				pm8001_free_dev(pm8001_dev);
 			}
 		}
-	}	break;
+	}
+	break;
+	case IO_XFER_ERROR_ABORTED_NCQ_MODE:
+	{
+		pm8001_dev->id |= NCQ_ERR_FLAG;
+		dev = pm8001_dev->sas_device;
+		sas_ata_device_link_abort(dev);
+	}
+	break;
 	}
 	kfree(pw);
 }
@@ -1748,108 +1756,47 @@ int pm8001_handle_event(struct pm8001_hba_info *pm8001_ha, void *data,
 	return ret;
 }
 
-static void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
-		struct pm8001_device *pm8001_ha_dev)
+void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
+		struct pm8001_device *pm8001_dev)
 {
+	DECLARE_COMPLETION_ONSTACK(completion);
 	struct pm8001_ccb_info *ccb;
-	struct sas_task *task;
 	struct task_abort_req task_abort;
 	u32 opc = OPC_INB_SATA_ABORT;
+	struct domain_device *dev = pm8001_dev->sas_device;
 	int ret;
 
-	pm8001_ha_dev->id |= NCQ_ABORT_ALL_FLAG;
-	pm8001_ha_dev->id &= ~NCQ_READ_LOG_FLAG;
-
-	task = sas_alloc_slow_task(GFP_ATOMIC);
-	if (!task) {
-		pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task\n");
-		return;
-	}
-
-	task->task_done = pm8001_task_done;
-
-	ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
-	if (!ccb) {
-		sas_free_task(task);
+	ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_dev, NULL);
+	if (!ccb)
 		return;
-	}
 
 	memset(&task_abort, 0, sizeof(task_abort));
 	task_abort.abort_all = cpu_to_le32(1);
-	task_abort.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
+	task_abort.device_id = cpu_to_le32(pm8001_dev->device_id);
 	task_abort.tag = cpu_to_le32(ccb->ccb_tag);
-
+	pm8001_dev->sata_abort_all_completion = &completion;
+	pm8001_dev->id |= NCQ_ERR_ABORT_ALL_FLAG;
 	ret = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &task_abort,
 				   sizeof(task_abort), 0);
 	if (ret) {
-		sas_free_task(task);
+		pm8001_dev->sata_abort_all_completion = NULL;
 		pm8001_ccb_free(pm8001_ha, ccb);
 	}
-}
-
-static void pm8001_send_read_log(struct pm8001_hba_info *pm8001_ha,
-		struct pm8001_device *pm8001_ha_dev)
-{
-	struct sata_start_req sata_cmd;
-	int res;
-	struct pm8001_ccb_info *ccb;
-	struct sas_task *task = NULL;
-	struct host_to_dev_fis fis;
-	struct domain_device *dev;
-	u32 opc = OPC_INB_SATA_HOST_OPSTART;
-
-	task = sas_alloc_slow_task(GFP_ATOMIC);
-	if (!task) {
-		pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task !!!\n");
-		return;
-	}
-	task->task_done = pm8001_task_done;
 
 	/*
-	 * Allocate domain device by ourselves as libsas is not going to
-	 * provide any.
+	 * It's always wise to use a timeout when dealing with HW.
+	 * The value is abritary, same as PM8001_TASK_TIMEOUT at time of
+	 * writing.
 	 */
-	dev = kzalloc(sizeof(struct domain_device), GFP_ATOMIC);
-	if (!dev) {
-		sas_free_task(task);
-		pm8001_dbg(pm8001_ha, FAIL,
-			   "Domain device cannot be allocated\n");
-		return;
-	}
-	task->dev = dev;
-	task->dev->lldd_dev = pm8001_ha_dev;
-
-	ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
-	if (!ccb) {
-		sas_free_task(task);
-		kfree(dev);
-		return;
-	}
-
-	pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
-	pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
-
-	/* construct read log FIS */
-	memset(&fis, 0, sizeof(struct host_to_dev_fis));
-	fis.fis_type = 0x27;
-	fis.flags = 0x80;
-	fis.command = ATA_CMD_READ_LOG_EXT;
-	fis.lbal = 0x10;
-	fis.sector_count = 0x1;
-
-	memset(&sata_cmd, 0, sizeof(sata_cmd));
-	sata_cmd.tag = cpu_to_le32(ccb->ccb_tag);
-	sata_cmd.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
-	sata_cmd.ncqtag_atap_dir_m = cpu_to_le32((0x1 << 7) | (0x5 << 9));
-	memcpy(&sata_cmd.sata_fis, &fis, sizeof(struct host_to_dev_fis));
-
-	res = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
-				   sizeof(sata_cmd), 0);
-	if (res) {
-		sas_free_task(task);
-		pm8001_ccb_free(pm8001_ha, ccb);
-		kfree(dev);
-	}
+	ret = wait_for_completion_timeout(&completion, 20 * HZ);
+	if (!ret)
+		pm8001_dbg(pm8001_ha, FAIL, "Failed to send SATA ABORT ALL:%016llx\n",
+			   SAS_ADDR(dev->sas_addr));
+
+	pm8001_dev->sata_abort_all_completion = NULL;
+	/* Clear the flag regardless - not much else which we can do */
+	pm8001_dev->id	&= ~NCQ_ERR_FLAG;
+	pm8001_dev->id	&= ~NCQ_ERR_ABORT_ALL_FLAG;
 }
 
 /**
@@ -2301,8 +2248,7 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
 		return;
 	}
 
-	if ((pm8001_dev && !(pm8001_dev->id & NCQ_READ_LOG_FLAG))
-		&& unlikely(!t || !t->lldd_task || !t->dev)) {
+	if (pm8001_dev && unlikely(!t || !t->lldd_task || !t->dev)) {
 		pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
 		return;
 	}
@@ -2360,15 +2306,6 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
 		if (param == 0) {
 			ts->resp = SAS_TASK_COMPLETE;
 			ts->stat = SAS_SAM_STAT_GOOD;
-			/* check if response is for SEND READ LOG */
-			if (pm8001_dev &&
-			    (pm8001_dev->id & NCQ_READ_LOG_FLAG)) {
-				pm8001_send_abort_all(pm8001_ha, pm8001_dev);
-				/* Free the tag */
-				pm8001_tag_free(pm8001_ha, tag);
-				sas_free_task(t);
-				return;
-			}
 		} else {
 			u8 len;
 			ts->resp = SAS_TASK_COMPLETE;
@@ -2666,9 +2603,10 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha, void *piomb)
 	if (event == IO_XFER_ERROR_ABORTED_NCQ_MODE) {
 		/* find device using device id */
 		pm8001_dev = pm8001_find_dev(pm8001_ha, dev_id);
-		/* send read log extension */
 		if (pm8001_dev)
-			pm8001_send_read_log(pm8001_ha, pm8001_dev);
+			pm8001_handle_event(pm8001_ha,
+				pm8001_dev,
+				IO_XFER_ERROR_ABORTED_NCQ_MODE);
 		return;
 	}
 
@@ -3619,6 +3557,12 @@ int pm8001_mpi_task_abort_resp(struct pm8001_hba_info *pm8001_ha, void *piomb)
 	t = ccb->task;
 	pm8001_dev = ccb->device; /* retrieve device */
 
+	if (pm8001_dev->id & NCQ_ERR_ABORT_ALL_FLAG) {
+		if (pm8001_dev->sata_abort_all_completion)
+			complete(pm8001_dev->sata_abort_all_completion);
+		return 0;
+	}
+
 	if (!t)	{
 		pm8001_dbg(pm8001_ha, FAIL, " TASK NULL. RETURNING !!!\n");
 		return -1;
@@ -3645,12 +3589,7 @@ int pm8001_mpi_task_abort_resp(struct pm8001_hba_info *pm8001_ha, void *piomb)
 	pm8001_ccb_task_free(pm8001_ha, ccb);
 	mb();
 
-	if (pm8001_dev->id & NCQ_ABORT_ALL_FLAG) {
-		sas_free_task(t);
-		pm8001_dev->id &= ~NCQ_ABORT_ALL_FLAG;
-	} else {
-		t->task_done(t);
-	}
+	t->task_done(t);
 
 	return 0;
 }
@@ -4202,7 +4141,6 @@ static int pm8001_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
 	u64 phys_addr;
 	u32 ATAP = 0x0;
 	u32 dir;
-	unsigned long flags;
 	u32  opc = OPC_INB_SATA_HOST_OPSTART;
 
 	memset(&sata_cmd, 0, sizeof(sata_cmd));
@@ -4257,39 +4195,6 @@ static int pm8001_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
 		sata_cmd.esgl = 0;
 	}
 
-	/* Check for read log for failed drive and return */
-	if (sata_cmd.sata_fis.command == 0x2f) {
-		if (((pm8001_ha_dev->id & NCQ_READ_LOG_FLAG) ||
-			(pm8001_ha_dev->id & NCQ_ABORT_ALL_FLAG) ||
-			(pm8001_ha_dev->id & NCQ_2ND_RLE_FLAG))) {
-			struct task_status_struct *ts;
-
-			pm8001_ha_dev->id &= 0xDFFFFFFF;
-			ts = &task->task_status;
-
-			spin_lock_irqsave(&task->task_state_lock, flags);
-			ts->resp = SAS_TASK_COMPLETE;
-			ts->stat = SAS_SAM_STAT_GOOD;
-			task->task_state_flags &= ~SAS_TASK_STATE_PENDING;
-			task->task_state_flags |= SAS_TASK_STATE_DONE;
-			if (unlikely((task->task_state_flags &
-					SAS_TASK_STATE_ABORTED))) {
-				spin_unlock_irqrestore(&task->task_state_lock,
-							flags);
-				pm8001_dbg(pm8001_ha, FAIL,
-					   "task 0x%p resp 0x%x  stat 0x%x but aborted by upper layer\n",
-					   task, ts->resp,
-					   ts->stat);
-				pm8001_ccb_task_free(pm8001_ha, ccb);
-			} else {
-				spin_unlock_irqrestore(&task->task_state_lock,
-							flags);
-				pm8001_ccb_task_free_done(pm8001_ha, ccb);
-				return 0;
-			}
-		}
-	}
-
 	return pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
 				    sizeof(sata_cmd), 0);
 }
diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index ca29d1454b66..895203975e47 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -822,6 +822,11 @@ int pm8001_I_T_nexus_reset(struct domain_device *dev)
 
 	pm8001_dev = dev->lldd_dev;
 	pm8001_ha = pm8001_find_ha_by_dev(dev);
+
+	/* If in NCQ ABORT MODE then we need to issue a SATA_ABORT */
+	if (pm8001_dev->id & NCQ_ERR_FLAG)
+		pm8001_send_abort_all(pm8001_ha, pm8001_dev);
+
 	phy = sas_get_local_phy(dev);
 
 	if (dev_is_sata(dev)) {
diff --git a/drivers/scsi/pm8001/pm8001_sas.h b/drivers/scsi/pm8001/pm8001_sas.h
index c5e3f380a01c..1f602e0dbe92 100644
--- a/drivers/scsi/pm8001/pm8001_sas.h
+++ b/drivers/scsi/pm8001/pm8001_sas.h
@@ -266,6 +266,7 @@ struct pm8001_device {
 	u32			id;
 	struct completion	*dcompletion;
 	struct completion	*setds_completion;
+	struct completion	*sata_abort_all_completion;
 	u32			device_id;
 	atomic_t		running_req;
 };
@@ -579,9 +580,8 @@ struct pm8001_fw_image_header {
 #define FLASH_UPDATE_DNLD_NOT_SUPPORTED		0x10
 #define FLASH_UPDATE_DISABLED			0x11
 
-#define	NCQ_READ_LOG_FLAG			0x80000000
-#define	NCQ_ABORT_ALL_FLAG			0x40000000
-#define	NCQ_2ND_RLE_FLAG			0x20000000
+#define	NCQ_ERR_FLAG			0x20000000
+#define	NCQ_ERR_ABORT_ALL_FLAG	0x40000000
 
 /* Device states */
 #define DS_OPERATIONAL				0x01
@@ -661,6 +661,8 @@ void pm8001_open_reject_retry(
 int pm8001_mem_alloc(struct pci_dev *pdev, void **virt_addr,
 	dma_addr_t *pphys_addr, u32 *pphys_addr_hi, u32 *pphys_addr_lo,
 	u32 mem_size, u32 align);
+void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
+		struct pm8001_device *pm8001_ha_dev);
 
 void pm8001_chip_iounmap(struct pm8001_hba_info *pm8001_ha);
 int pm8001_mpi_build_cmd(struct pm8001_hba_info *pm8001_ha,
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index dd0e06983cd3..4484c498bcb6 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -1778,113 +1778,6 @@ pm80xx_chip_interrupt_disable(struct pm8001_hba_info *pm8001_ha, u8 vec)
 	pm80xx_chip_intx_interrupt_disable(pm8001_ha);
 }
 
-static void pm80xx_send_abort_all(struct pm8001_hba_info *pm8001_ha,
-		struct pm8001_device *pm8001_ha_dev)
-{
-	struct pm8001_ccb_info *ccb;
-	struct sas_task *task;
-	struct task_abort_req task_abort;
-	u32 opc = OPC_INB_SATA_ABORT;
-	int ret;
-
-	pm8001_ha_dev->id |= NCQ_ABORT_ALL_FLAG;
-	pm8001_ha_dev->id &= ~NCQ_READ_LOG_FLAG;
-
-	task = sas_alloc_slow_task(GFP_ATOMIC);
-	if (!task) {
-		pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task\n");
-		return;
-	}
-	task->task_done = pm8001_task_done;
-
-	ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
-	if (!ccb) {
-		sas_free_task(task);
-		return;
-	}
-
-	memset(&task_abort, 0, sizeof(task_abort));
-	task_abort.abort_all = cpu_to_le32(1);
-	task_abort.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
-	task_abort.tag = cpu_to_le32(ccb->ccb_tag);
-
-	ret = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &task_abort,
-				   sizeof(task_abort), 0);
-	pm8001_dbg(pm8001_ha, FAIL, "Executing abort task end\n");
-	if (ret) {
-		sas_free_task(task);
-		pm8001_ccb_free(pm8001_ha, ccb);
-	}
-}
-
-static void pm80xx_send_read_log(struct pm8001_hba_info *pm8001_ha,
-		struct pm8001_device *pm8001_ha_dev)
-{
-	struct sata_start_req sata_cmd;
-	int res;
-	struct pm8001_ccb_info *ccb;
-	struct sas_task *task = NULL;
-	struct host_to_dev_fis fis;
-	struct domain_device *dev;
-	u32 opc = OPC_INB_SATA_HOST_OPSTART;
-
-	task = sas_alloc_slow_task(GFP_ATOMIC);
-	if (!task) {
-		pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task !!!\n");
-		return;
-	}
-	task->task_done = pm8001_task_done;
-
-	/*
-	 * Allocate domain device by ourselves as libsas is not going to
-	 * provide any.
-	 */
-	dev = kzalloc(sizeof(struct domain_device), GFP_ATOMIC);
-	if (!dev) {
-		sas_free_task(task);
-		pm8001_dbg(pm8001_ha, FAIL,
-			   "Domain device cannot be allocated\n");
-		return;
-	}
-
-	task->dev = dev;
-	task->dev->lldd_dev = pm8001_ha_dev;
-
-	ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
-	if (!ccb) {
-		sas_free_task(task);
-		kfree(dev);
-		return;
-	}
-
-	pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
-	pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
-
-	memset(&sata_cmd, 0, sizeof(sata_cmd));
-
-	/* construct read log FIS */
-	memset(&fis, 0, sizeof(struct host_to_dev_fis));
-	fis.fis_type = 0x27;
-	fis.flags = 0x80;
-	fis.command = ATA_CMD_READ_LOG_EXT;
-	fis.lbal = 0x10;
-	fis.sector_count = 0x1;
-
-	sata_cmd.tag = cpu_to_le32(ccb->ccb_tag);
-	sata_cmd.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
-	sata_cmd.ncqtag_atap_dir_m_dad = cpu_to_le32(((0x1 << 7) | (0x5 << 9)));
-	memcpy(&sata_cmd.sata_fis, &fis, sizeof(struct host_to_dev_fis));
-
-	res = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
-				   sizeof(sata_cmd), 0);
-	pm8001_dbg(pm8001_ha, FAIL, "Executing read log end\n");
-	if (res) {
-		sas_free_task(task);
-		pm8001_ccb_free(pm8001_ha, ccb);
-		kfree(dev);
-	}
-}
-
 /**
  * mpi_ssp_completion - process the event that FW response to the SSP request.
  * @pm8001_ha: our hba card information
@@ -2402,11 +2295,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
 		return;
 	}
 
-	if ((pm8001_dev && !(pm8001_dev->id & NCQ_READ_LOG_FLAG))
-		&& unlikely(!t || !t->lldd_task || !t->dev)) {
-		pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
+
+	if (pm8001_dev && unlikely(!t->lldd_task || !t->dev))
 		return;
-	}
 
 	ts = &t->task_status;
 
@@ -2463,15 +2354,6 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
 		if (param == 0) {
 			ts->resp = SAS_TASK_COMPLETE;
 			ts->stat = SAS_SAM_STAT_GOOD;
-			/* check if response is for SEND READ LOG */
-			if (pm8001_dev &&
-			    (pm8001_dev->id & NCQ_READ_LOG_FLAG)) {
-				pm80xx_send_abort_all(pm8001_ha, pm8001_dev);
-				/* Free the tag */
-				pm8001_tag_free(pm8001_ha, tag);
-				sas_free_task(t);
-				return;
-			}
 		} else {
 			u8 len;
 			ts->resp = SAS_TASK_COMPLETE;
@@ -2806,9 +2688,11 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha,
 	if (event == IO_XFER_ERROR_ABORTED_NCQ_MODE) {
 		/* find device using device id */
 		pm8001_dev = pm8001_find_dev(pm8001_ha, dev_id);
-		/* send read log extension */
+		/* send read log extension by aborting the link - libata does what we want */
 		if (pm8001_dev)
-			pm80xx_send_read_log(pm8001_ha, pm8001_dev);
+			pm8001_handle_event(pm8001_ha,
+				pm8001_dev,
+				IO_XFER_ERROR_ABORTED_NCQ_MODE);
 		return;
 	}
 
@@ -4556,7 +4440,6 @@ static int pm80xx_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
 	u32 end_addr_high, end_addr_low;
 	u32 ATAP = 0x0;
 	u32 dir;
-	unsigned long flags;
 	u32 opc = OPC_INB_SATA_HOST_OPSTART;
 	memset(&sata_cmd, 0, sizeof(sata_cmd));
 
@@ -4735,40 +4618,6 @@ static int pm80xx_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
 				     (task->ata_task.atapi_packet[15] << 24)));
 	}
 
-	/* Check for read log for failed drive and return */
-	if (sata_cmd.sata_fis.command == 0x2f) {
-		if (pm8001_ha_dev && ((pm8001_ha_dev->id & NCQ_READ_LOG_FLAG) ||
-			(pm8001_ha_dev->id & NCQ_ABORT_ALL_FLAG) ||
-			(pm8001_ha_dev->id & NCQ_2ND_RLE_FLAG))) {
-			struct task_status_struct *ts;
-
-			pm8001_ha_dev->id &= 0xDFFFFFFF;
-			ts = &task->task_status;
-
-			spin_lock_irqsave(&task->task_state_lock, flags);
-			ts->resp = SAS_TASK_COMPLETE;
-			ts->stat = SAS_SAM_STAT_GOOD;
-			task->task_state_flags &= ~SAS_TASK_STATE_PENDING;
-			task->task_state_flags |= SAS_TASK_STATE_DONE;
-			if (unlikely((task->task_state_flags &
-					SAS_TASK_STATE_ABORTED))) {
-				spin_unlock_irqrestore(&task->task_state_lock,
-							flags);
-				pm8001_dbg(pm8001_ha, FAIL,
-					   "task 0x%p resp 0x%x  stat 0x%x but aborted by upper layer\n",
-					   task, ts->resp,
-					   ts->stat);
-				pm8001_ccb_task_free(pm8001_ha, ccb);
-				return 0;
-			} else {
-				spin_unlock_irqrestore(&task->task_state_lock,
-							flags);
-				pm8001_ccb_task_free_done(pm8001_ha, ccb);
-				atomic_dec(&pm8001_ha_dev->running_req);
-				return 0;
-			}
-		}
-	}
 	trace_pm80xx_request_issue(pm8001_ha->id,
 				ccb->device ? ccb->device->attached_phy : PM8001_MAX_PHYS,
 				ccb->ccb_tag, opc,
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/6] scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
                   ` (2 preceding siblings ...)
  2022-08-17 14:52 ` [PATCH v2 3/6] scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 14:52 ` [PATCH v2 5/6] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw John Garry
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

It is not necessary to issue an ATA softreset in hisi_sas_abort_task().
This is because in the libsas EH scheme we will defer to ATA EH which will
deal with any device reset during recovery.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 33af5b8dede2..2fe5290ac56c 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1598,13 +1598,16 @@ static int hisi_sas_abort_task(struct sas_task *task)
 	} else if (task->task_proto & SAS_PROTOCOL_SATA ||
 		task->task_proto & SAS_PROTOCOL_STP) {
 		if (task->dev->dev_type == SAS_SATA_DEV) {
+			struct hisi_sas_slot *slot = task->lldd_task;
+
+			slot->task = NULL;
 			rc = hisi_sas_internal_task_abort_dev(sas_dev, false);
 			if (rc < 0) {
 				dev_err(dev, "abort task: internal abort failed\n");
 				goto out;
 			}
 			hisi_sas_dereg_device(hisi_hba, device);
-			rc = hisi_sas_softreset_ata_disk(device);
+			rc = TMF_RESP_FUNC_COMPLETE;
 		}
 	} else if (task->lldd_task && task->task_proto & SAS_PROTOCOL_SMP) {
 		/* SMP */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 5/6] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
                   ` (3 preceding siblings ...)
  2022-08-17 14:52 ` [PATCH v2 4/6] scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task() John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 14:52 ` [PATCH v2 6/6] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private John Garry
  2022-08-17 15:52 ` [PATCH v2 0/6] libsas and drivers: NCQ error handling Damien Le Moal
  6 siblings, 0 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

From: Xingui Yang <yangxingui@huawei.com>

When CQ header dw3 SATA_DISK_ERR is set it means this SATA disk is in error
state and the current IPTT is invalid. An invalid IPTT does not correspond
to any slot.

In this scenario, new I/Os that delivered to disk will be rejected by the,
controller and all I/Os remained on the disk should be aborted, which we
add here with the sas_ata_device_link_abort() call.

Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index efe8c5be5870..86db4e19beed 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -423,6 +423,8 @@
 #define CMPLT_HDR_DEV_ID_OFF		16
 #define CMPLT_HDR_DEV_ID_MSK		(0xffff << CMPLT_HDR_DEV_ID_OFF)
 /* dw3 */
+#define CMPLT_HDR_SATA_DISK_ERR_OFF	16
+#define CMPLT_HDR_SATA_DISK_ERR_MSK	(0x1 << CMPLT_HDR_SATA_DISK_ERR_OFF)
 #define CMPLT_HDR_IO_IN_TARGET_OFF	17
 #define CMPLT_HDR_IO_IN_TARGET_MSK	(0x1 << CMPLT_HDR_IO_IN_TARGET_OFF)
 
@@ -2384,14 +2386,30 @@ static irqreturn_t  cq_thread_v3_hw(int irq_no, void *p)
 	while (rd_point != wr_point) {
 		struct hisi_sas_complete_v3_hdr *complete_hdr;
 		struct device *dev = hisi_hba->dev;
-		u32 dw1;
+		u32 dw0, dw1, dw3;
 		int iptt;
 
 		complete_hdr = &complete_queue[rd_point];
+		dw0 = le32_to_cpu(complete_hdr->dw0);
 		dw1 = le32_to_cpu(complete_hdr->dw1);
+		dw3 = le32_to_cpu(complete_hdr->dw3);
 
 		iptt = dw1 & CMPLT_HDR_IPTT_MSK;
-		if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) {
+		if (unlikely((dw0 & CMPLT_HDR_CMPLT_MSK) == 0x3) &&
+			     (dw3 & CMPLT_HDR_SATA_DISK_ERR_MSK)) {
+			int device_id = (dw1 & CMPLT_HDR_DEV_ID_MSK) >>
+					CMPLT_HDR_DEV_ID_OFF;
+			struct hisi_sas_itct *itct =
+				&hisi_hba->itct[device_id];
+			struct hisi_sas_device *sas_dev =
+				&hisi_hba->devices[device_id];
+			struct domain_device *device = sas_dev->sas_device;
+
+			dev_err(dev, "erroneous completion disk err dev id=%d sas_addr=0x%llx CQ hdr: 0x%x 0x%x 0x%x 0x%x\n",
+				device_id, itct->sas_addr, dw0, dw1,
+				complete_hdr->act, dw3);
+			sas_ata_device_link_abort(device);
+		} else if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) {
 			slot = &hisi_hba->slot_info[iptt];
 			slot->cmplt_queue_slot = rd_point;
 			slot->cmplt_queue = queue;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 6/6] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
                   ` (4 preceding siblings ...)
  2022-08-17 14:52 ` [PATCH v2 5/6] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw John Garry
@ 2022-08-17 14:52 ` John Garry
  2022-08-17 15:52 ` [PATCH v2 0/6] libsas and drivers: NCQ error handling Damien Le Moal
  6 siblings, 0 replies; 14+ messages in thread
From: John Garry @ 2022-08-17 14:52 UTC (permalink / raw)
  To: jejb, martin.petersen, jinpu.wang, damien.lemoal, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare, John Garry

We have no users outside libsas any longer, so make sas_alloc_task(),
sas_alloc_slow_task(), and sas_free_task() private.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/libsas/sas_init.c     | 3 ---
 drivers/scsi/libsas/sas_internal.h | 4 ++++
 include/scsi/libsas.h              | 4 ----
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index e4f77072a58d..f2c05ebeb72f 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -35,7 +35,6 @@ struct sas_task *sas_alloc_task(gfp_t flags)
 
 	return task;
 }
-EXPORT_SYMBOL_GPL(sas_alloc_task);
 
 struct sas_task *sas_alloc_slow_task(gfp_t flags)
 {
@@ -56,7 +55,6 @@ struct sas_task *sas_alloc_slow_task(gfp_t flags)
 
 	return task;
 }
-EXPORT_SYMBOL_GPL(sas_alloc_slow_task);
 
 void sas_free_task(struct sas_task *task)
 {
@@ -65,7 +63,6 @@ void sas_free_task(struct sas_task *task)
 		kmem_cache_free(sas_task_cache, task);
 	}
 }
-EXPORT_SYMBOL_GPL(sas_free_task);
 
 /*------------ SAS addr hash -----------*/
 void sas_hash_addr(u8 *hashed, const u8 *sas_addr)
diff --git a/drivers/scsi/libsas/sas_internal.h b/drivers/scsi/libsas/sas_internal.h
index 8d0ad3abc7b5..b54bcf3c9a9d 100644
--- a/drivers/scsi/libsas/sas_internal.h
+++ b/drivers/scsi/libsas/sas_internal.h
@@ -52,6 +52,10 @@ void sas_unregister_phys(struct sas_ha_struct *sas_ha);
 struct asd_sas_event *sas_alloc_event(struct asd_sas_phy *phy, gfp_t gfp_flags);
 void sas_free_event(struct asd_sas_event *event);
 
+struct sas_task *sas_alloc_task(gfp_t flags);
+struct sas_task *sas_alloc_slow_task(gfp_t flags);
+void sas_free_task(struct sas_task *task);
+
 int  sas_register_ports(struct sas_ha_struct *sas_ha);
 void sas_unregister_ports(struct sas_ha_struct *sas_ha);
 
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 2dbead74a2af..f86b56bf7833 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -639,10 +639,6 @@ struct sas_task_slow {
 #define SAS_TASK_STATE_ABORTED      4
 #define SAS_TASK_NEED_DEV_RESET     8
 
-extern struct sas_task *sas_alloc_task(gfp_t flags);
-extern struct sas_task *sas_alloc_slow_task(gfp_t flags);
-extern void sas_free_task(struct sas_task *task);
-
 static inline bool sas_is_internal_abort(struct sas_task *task)
 {
 	return task->task_proto == SAS_PROTOCOL_INTERNAL_ABORT;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/6] libsas and drivers: NCQ error handling
  2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
                   ` (5 preceding siblings ...)
  2022-08-17 14:52 ` [PATCH v2 6/6] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private John Garry
@ 2022-08-17 15:52 ` Damien Le Moal
  6 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2022-08-17 15:52 UTC (permalink / raw)
  To: John Garry, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 2022/08/17 7:52, John Garry wrote:
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover
> 
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
> 
> This series add a new libsas API - sas_ata_device_link_abort() - to handle
> host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
> mentioned in the pm8001 changeover patch, I would prefer a better place to
> locate the SATA ABORT command (rather that nexus reset callback).
> 
> Damien kindly tested the v1 series for pm8001, but any further pm8001
> testing would be appreciated as I have since tweaked
> sas_ata_device_link_abort(). This is because the pm8001 driver hangs on my
> arm64 machine read log ext10 command.

I will run more tests with this series.

> 
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
> 
> Based on v6.0-rc1
> 
> [0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@huawei.com/
> 
> Changes since v1:
> - Rename sas_ata_link_abort() -> sas_ata_device_link_abort()
> - Set EH RESET flag in sas_ata_device_link_abort()
> - Add Jack's Ack tags
> - Rebase
> 
> John Garry (5):
>   scsi: pm8001: Modify task abort handling for SATA task
>   scsi: libsas: Add sas_ata_device_link_abort()
>   scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
>   scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
>   scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
> 
> Xingui Yang (1):
>   scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
> 
>  drivers/scsi/hisi_sas/hisi_sas_main.c  |   5 +-
>  drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  22 ++-
>  drivers/scsi/libsas/sas_ata.c          |  11 ++
>  drivers/scsi/libsas/sas_init.c         |   3 -
>  drivers/scsi/libsas/sas_internal.h     |   4 +
>  drivers/scsi/pm8001/pm8001_hwi.c       | 194 +++++++------------------
>  drivers/scsi/pm8001/pm8001_sas.c       |  13 ++
>  drivers/scsi/pm8001/pm8001_sas.h       |   8 +-
>  drivers/scsi/pm8001/pm80xx_hwi.c       | 177 ++--------------------
>  include/scsi/libsas.h                  |   4 -
>  include/scsi/sas_ata.h                 |   5 +
>  11 files changed, 133 insertions(+), 313 deletions(-)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-17 14:52 ` [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort() John Garry
@ 2022-08-17 16:04   ` Damien Le Moal
  2022-08-17 16:54     ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2022-08-17 16:04 UTC (permalink / raw)
  To: John Garry, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 2022/08/17 7:52, John Garry wrote:
> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
> to call to initiate a link abort.
> 
> This will mark all outstanding QCs as failed and kick-off EH.
> 
> Note:
> The ATA_EH_RESET flag is set for following reasons:
> - For hisi_sas, SATA device resources during error handling will only be
>   released during reset for ATA EH.
>   ATA EH could decide during autopsy that EH would not be required, so
>   ensure that it happens (by setting the flag).
> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
>   ensure necessary recovery commands are sent (so again we require flag
>   ATA_EH_RESET to be set as an insurance policy).
> 
> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
>  include/scsi/sas_ata.h        |  5 +++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
> index d35c9296f738..9daae64be37e 100644
> --- a/drivers/scsi/libsas/sas_ata.c
> +++ b/drivers/scsi/libsas/sas_ata.c
> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
>  	ata_port_wait_eh(ap);
>  }
>  
> +void sas_ata_device_link_abort(struct domain_device *device)
> +{
> +	struct ata_port *ap = device->sata_dev.ap;
> +	struct ata_link *link = &ap->link;
> +
> +	link->eh_info.err_mask |= AC_ERR_DEV;
> +	link->eh_info.action |= ATA_EH_RESET;

I am still not convinced that we should set this here. ata_eh_link_autopsy() and
ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
you reuse the link autopsy function ?

> +	ata_link_abort(link);
> +}
> +EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
> +
>  int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
>  {
>  	struct sas_tmf_task tmf_task = {};
> diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
> index a1df4f9d57a3..cad0b33064a5 100644
> --- a/include/scsi/sas_ata.h
> +++ b/include/scsi/sas_ata.h
> @@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
>  void sas_suspend_sata(struct asd_sas_port *port);
>  void sas_resume_sata(struct asd_sas_port *port);
>  void sas_ata_end_eh(struct ata_port *ap);
> +void sas_ata_device_link_abort(struct domain_device *dev);
>  int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>  			int force_phy_id);
>  int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
> @@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
>  {
>  }
>  
> +static inline void sas_ata_device_link_abort(struct domain_device *dev)
> +{
> +}
> +
>  static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>  				      int force_phy_id)
>  {


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-17 16:04   ` Damien Le Moal
@ 2022-08-17 16:54     ` John Garry
  2022-08-17 17:14       ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2022-08-17 16:54 UTC (permalink / raw)
  To: Damien Le Moal, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 17/08/2022 17:04, Damien Le Moal wrote:
> On 2022/08/17 7:52, John Garry wrote:
>> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
>> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
>> to call to initiate a link abort.
>>
>> This will mark all outstanding QCs as failed and kick-off EH.
>>
>> Note:
>> The ATA_EH_RESET flag is set for following reasons:
>> - For hisi_sas, SATA device resources during error handling will only be
>>    released during reset for ATA EH.
>>    ATA EH could decide during autopsy that EH would not be required, so
>>    ensure that it happens (by setting the flag).
>> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
>>    ensure necessary recovery commands are sent (so again we require flag
>>    ATA_EH_RESET to be set as an insurance policy).
>>
>> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
>>   include/scsi/sas_ata.h        |  5 +++++
>>   2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>> index d35c9296f738..9daae64be37e 100644
>> --- a/drivers/scsi/libsas/sas_ata.c
>> +++ b/drivers/scsi/libsas/sas_ata.c
>> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
>>   	ata_port_wait_eh(ap);
>>   }
>>   
>> +void sas_ata_device_link_abort(struct domain_device *device)
>> +{
>> +	struct ata_port *ap = device->sata_dev.ap;
>> +	struct ata_link *link = &ap->link;
>> +
>> +	link->eh_info.err_mask |= AC_ERR_DEV;
>> +	link->eh_info.action |= ATA_EH_RESET;
> 

Hi Damien,

Warning: I'm going to write a bit of an essay here... :)

> I am still not convinced that we should set this here. ata_eh_link_autopsy() and
> ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
> you reuse the link autopsy function ?

The link autopsy code works ok, but it may not decide to do the reset 
which we may rely on.

For hisi_sas, consider this log from an NCQ error:

174.385322] ata10.00: failed command: READ FPDMA QUEUED
[ 174.385336] ata10.00: cmd 60/08:00:11:27:00/00:00:00:00:00/40 tag 0 
ncq dma 4096 in
res 41/40:08:11:27:00/00:00:00:00:00/00 Emask 0x409 (media error) <F>
[ 174.385339] ata10.00: status: { DRDY ERR }
[ 174.385343] ata10.00: error: { UNC }
[ 174.385641] ata10.00: configured for UDMA/133
[ 174.385697] sd 6:0:1:0: [sdb] tag#701 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
[ 174.385710] sd 6:0:1:0: [sdb] tag#701 Sense Key : Medium Error [current]
[ 174.385726] sd 6:0:1:0: [sdb] tag#701 Add. Sense: Unrecovered read 
error - auto reallocate failed
[ 174.385740] sd 6:0:1:0: [sdb] tag#701 CDB: Read(10) 28 00 00 00 27 11 
00 00 08 00
[ 174.385745] print_req_error: I/O error, dev sdb, sector 10001
[ 174.385827] ata10: EH complete

The ATA autopsy has a found medium error and decided that reset is not 
required - this is similar in that regard to the "unaligned write to a 
sequential write required zone on SMR" error you mentioned from your 
test previously. The problem in this is that for hisi_sas we depend on 
disk reset to release driver resources associated with ATA QCs. That is 
because it is only after reset that we can guarantee that no IO 
associated with the disk will complete in HW and it is safe to release 
the resources.

But pm8001 seems different here with regards releasing resources. I find 
that when EH kicks in from NCQ error and libsas tries to abort missing 
commands, the pm8001_abort_task() -> sas_execute_internal_abort_single() 
causes the original IO to complete as aborted - this is good, as then we 
may release the resources there. hisi_sas has no such feature.

But the pm8001 manual and current driver indicate that the 
OPC_INB_SATA_ABORT command should be sent after read log ext when 
handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT 
in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()

As I mentioned before, I saw nowhere better to call 
pm8001_send_abort_all() for this. I would rather not do it in 
ata_eh_reset() -> pm8001_I_T_nexus_reset()

How about this modified approach:
- Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
- pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when 
the driver is in NCQ error state and not do a hard reset in that case.
	- But I am not sure if that works as the autopsy for NCQ error may have 
decided that a hardreset was really required. Hmmm..

Thanks,
John


> 
>> +	ata_link_abort(link);
>> +}
>> +EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
>> +
>>   int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
>>   {
>>   	struct sas_tmf_task tmf_task = {};
>> diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
>> index a1df4f9d57a3..cad0b33064a5 100644
>> --- a/include/scsi/sas_ata.h
>> +++ b/include/scsi/sas_ata.h
>> @@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
>>   void sas_suspend_sata(struct asd_sas_port *port);
>>   void sas_resume_sata(struct asd_sas_port *port);
>>   void sas_ata_end_eh(struct ata_port *ap);
>> +void sas_ata_device_link_abort(struct domain_device *dev);
>>   int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>>   			int force_phy_id);
>>   int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
>> @@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
>>   {
>>   }
>>   
>> +static inline void sas_ata_device_link_abort(struct domain_device *dev)
>> +{
>> +}
>> +
>>   static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>>   				      int force_phy_id)
>>   {
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-17 16:54     ` John Garry
@ 2022-08-17 17:14       ` Damien Le Moal
  2022-08-18 12:09         ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2022-08-17 17:14 UTC (permalink / raw)
  To: John Garry, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 2022/08/17 9:54, John Garry wrote:
> On 17/08/2022 17:04, Damien Le Moal wrote:
>> On 2022/08/17 7:52, John Garry wrote:
>>> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
>>> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
>>> to call to initiate a link abort.
>>>
>>> This will mark all outstanding QCs as failed and kick-off EH.
>>>
>>> Note:
>>> The ATA_EH_RESET flag is set for following reasons:
>>> - For hisi_sas, SATA device resources during error handling will only be
>>>    released during reset for ATA EH.
>>>    ATA EH could decide during autopsy that EH would not be required, so
>>>    ensure that it happens (by setting the flag).
>>> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
>>>    ensure necessary recovery commands are sent (so again we require flag
>>>    ATA_EH_RESET to be set as an insurance policy).
>>>
>>> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>> Signed-off-by: John Garry <john.garry@huawei.com>
>>> ---
>>>   drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
>>>   include/scsi/sas_ata.h        |  5 +++++
>>>   2 files changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>>> index d35c9296f738..9daae64be37e 100644
>>> --- a/drivers/scsi/libsas/sas_ata.c
>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
>>>   	ata_port_wait_eh(ap);
>>>   }
>>>   
>>> +void sas_ata_device_link_abort(struct domain_device *device)
>>> +{
>>> +	struct ata_port *ap = device->sata_dev.ap;
>>> +	struct ata_link *link = &ap->link;
>>> +
>>> +	link->eh_info.err_mask |= AC_ERR_DEV;
>>> +	link->eh_info.action |= ATA_EH_RESET;
>>
> 
> Hi Damien,
> 
> Warning: I'm going to write a bit of an essay here... :)
> 
>> I am still not convinced that we should set this here. ata_eh_link_autopsy() and
>> ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
>> you reuse the link autopsy function ?
> 
> The link autopsy code works ok, but it may not decide to do the reset 
> which we may rely on.
> 
> For hisi_sas, consider this log from an NCQ error:
> 
> 174.385322] ata10.00: failed command: READ FPDMA QUEUED
> [ 174.385336] ata10.00: cmd 60/08:00:11:27:00/00:00:00:00:00/40 tag 0 
> ncq dma 4096 in
> res 41/40:08:11:27:00/00:00:00:00:00/00 Emask 0x409 (media error) <F>
> [ 174.385339] ata10.00: status: { DRDY ERR }
> [ 174.385343] ata10.00: error: { UNC }
> [ 174.385641] ata10.00: configured for UDMA/133
> [ 174.385697] sd 6:0:1:0: [sdb] tag#701 FAILED Result: hostbyte=DID_OK 
> driverbyte=DRIVER_SENSE
> [ 174.385710] sd 6:0:1:0: [sdb] tag#701 Sense Key : Medium Error [current]
> [ 174.385726] sd 6:0:1:0: [sdb] tag#701 Add. Sense: Unrecovered read 
> error - auto reallocate failed
> [ 174.385740] sd 6:0:1:0: [sdb] tag#701 CDB: Read(10) 28 00 00 00 27 11 
> 00 00 08 00
> [ 174.385745] print_req_error: I/O error, dev sdb, sector 10001
> [ 174.385827] ata10: EH complete
> 
> The ATA autopsy has a found medium error and decided that reset is not 
> required - this is similar in that regard to the "unaligned write to a 
> sequential write required zone on SMR" error you mentioned from your 
> test previously. The problem in this is that for hisi_sas we depend on 
> disk reset to release driver resources associated with ATA QCs. That is 
> because it is only after reset that we can guarantee that no IO 
> associated with the disk will complete in HW and it is safe to release 
> the resources.

If you had an error, then you already are guaranteed that you will not see any
completion at all since the SATA drive is in error mode already. But I see the
point here. The HBA internal qc resources need to be cleared and that seems to
be done only with a device reset big hammer.

> But pm8001 seems different here with regards releasing resources. I find 
> that when EH kicks in from NCQ error and libsas tries to abort missing 
> commands, the pm8001_abort_task() -> sas_execute_internal_abort_single() 
> causes the original IO to complete as aborted - this is good, as then we 
> may release the resources there. hisi_sas has no such feature.
> 
> But the pm8001 manual and current driver indicate that the 
> OPC_INB_SATA_ABORT command should be sent after read log ext when 
> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT 
> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()

You lost me: ata_eh_recover() will call ata_eh_reset() only if the ATA_EH_RESET
action flag is set. So are you saying that even though it is not needed, you
still need to set ATA_EH_RESET for pm8001 ?

> As I mentioned before, I saw nowhere better to call 
> pm8001_send_abort_all() for this. I would rather not do it in 
> ata_eh_reset() -> pm8001_I_T_nexus_reset()

We could add a new op ->eh_link_autopsy which we can call if defined after the
call to ata_eh_analyze_ncq_error() in ata_eh_link_autopsy(). With that, you can
set ATA_EH_RESET for the hisi driver and only do pm8001_send_abort_all() for
pm8001 (that will be done after the read log 10h).

> 
> How about this modified approach:
> - Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
> - pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when 
> the driver is in NCQ error state and not do a hard reset in that case.
> 	- But I am not sure if that works as the autopsy for NCQ error may have 
> decided that a hardreset was really required. Hmmm..

See the above. the new op may decided a reset is needed (hisi case) and even if
the standard autopsy does not make that decision, the flag is set and
ata_eh_recover() will reset the device. For the pm8001 case, no reset set with
the new op, only pm8001_send_abort_all(). And even if ata_eh_link_autopsy()
decides that a reset is needed after calling the new op, that would still be OK
I think.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-17 17:14       ` Damien Le Moal
@ 2022-08-18 12:09         ` John Garry
  2022-09-02 16:19           ` John Garry
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2022-08-18 12:09 UTC (permalink / raw)
  To: Damien Le Moal, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 17/08/2022 18:14, Damien Le Moal wrote:
>> The ATA autopsy has a found medium error and decided that reset is not
>> required - this is similar in that regard to the "unaligned write to a
>> sequential write required zone on SMR" error you mentioned from your
>> test previously. The problem in this is that for hisi_sas we depend on
>> disk reset to release driver resources associated with ATA QCs. That is
>> because it is only after reset that we can guarantee that no IO
>> associated with the disk will complete in HW and it is safe to release
>> the resources.
> If you had an error, then you already are guaranteed that you will not see any
> completion at all since the SATA drive is in error mode already. But I see the
> point here. The HBA internal qc resources need to be cleared and that seems to
> be done only with a device reset big hammer.

Yeah, unfortunately

> 
>> But pm8001 seems different here with regards releasing resources. I find
>> that when EH kicks in from NCQ error and libsas tries to abort missing
>> commands, the pm8001_abort_task() -> sas_execute_internal_abort_single()
>> causes the original IO to complete as aborted - this is good, as then we
>> may release the resources there. hisi_sas has no such feature.
>>
>> But the pm8001 manual and current driver indicate that the
>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
> You lost me: ata_eh_recover() will call ata_eh_reset() only if the ATA_EH_RESET
> action flag is set. So are you saying that even though it is not needed, you
> still need to set ATA_EH_RESET for pm8001 ?

As below, it was the only location I found suitable to call 
pm8001_send_abort_all().

However I am not really sure it is required now. For pm8001 NCQ error 
handling we require 2x steps:
- read log ext
- Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()

pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all" 
mode, meaning any IO in the HBA is aborted for the device. But we are 
also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in 
sas_eh_handle_sas_errors() -> sas_scsi_find_task() -> 
pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... 
send_abort_task()

So I don't think that the pm8001_send_abort_all() call has any effect, 
as we're already aborting any outstanding IO earlier.

Admittedly the order of the 2x steps is different, but 
OPC_INB_SATA_ABORT does not send any protocol message to the disk, so 
would not affect anything subsequently read with read log ext.

Having said all that, it may be wise to still send 
pm8001_send_abort_all()...

> 
>> As I mentioned before, I saw nowhere better to call
>> pm8001_send_abort_all() for this. I would rather not do it in
>> ata_eh_reset() -> pm8001_I_T_nexus_reset()
> We could add a new op ->eh_link_autopsy which we can call if defined after the
> call to ata_eh_analyze_ncq_error() in ata_eh_link_autopsy(). With that, you can
> set ATA_EH_RESET for the hisi driver and only do pm8001_send_abort_all() for
> pm8001 (that will be done after the read log 10h).

hmmmm.... seems unfortunate if we need to add a new op just for this.

If we supported ata_port_operations.softreset CB for libsas, then it 
seems a good location to issue pm8001_send_abort_all(). However, ATA EH 
always prefers hardreset over softreset if both supported - do you see 
any scope to change this so that we could use softreset?

> 
>> How about this modified approach:
>> - Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
>> - pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when
>> the driver is in NCQ error state and not do a hard reset in that case.
>> 	- But I am not sure if that works as the autopsy for NCQ error may have
>> decided that a hardreset was really required. Hmmm..
> See the above. the new op may decided a reset is needed (hisi case) and even if
> the standard autopsy does not make that decision, the flag is set and
> ata_eh_recover() will reset the device. For the pm8001 case, no reset set with
> the new op, only pm8001_send_abort_all(). And even if ata_eh_link_autopsy()
> decides that a reset is needed after calling the new op, that would still be OK
> I think.

I just think that for hisi_sas a reset is always required unfortunately 
- either an ATA softreset or hardreset. For pm8001, as you say, we can 
let autopsy decide whether the big hammer (hard) reset is required.

Thanks,
John

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-08-18 12:09         ` John Garry
@ 2022-09-02 16:19           ` John Garry
  2022-09-05 23:23             ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2022-09-02 16:19 UTC (permalink / raw)
  To: Damien Le Moal, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

Hi Damien,

>>>
>>> But the pm8001 manual and current driver indicate that the
>>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
>> You lost me: ata_eh_recover() will call ata_eh_reset() only if the 
>> ATA_EH_RESET
>> action flag is set. So are you saying that even though it is not 
>> needed, you
>> still need to set ATA_EH_RESET for pm8001 ?
> 
> As below, it was the only location I found suitable to call 
> pm8001_send_abort_all().
> 
> However I am not really sure it is required now. For pm8001 NCQ error 
> handling we require 2x steps:
> - read log ext
> - Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()
> 
> pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all" 
> mode, meaning any IO in the HBA is aborted for the device. But we are 
> also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in 
> sas_eh_handle_sas_errors() -> sas_scsi_find_task() -> 
> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... 
> send_abort_task()
> 
> So I don't think that the pm8001_send_abort_all() call has any effect, 
> as we're already aborting any outstanding IO earlier.
> 
> Admittedly the order of the 2x steps is different, but 
> OPC_INB_SATA_ABORT does not send any protocol message to the disk, so 
> would not affect anything subsequently read with read log ext.
> 
> Having said all that, it may be wise to still send 
> pm8001_send_abort_all()..

Have you had a chance to consider all this yet?

I am now starting to think that it is not necessary to call 
pm8001_send_abort_all(), and instead rely only on 
sas_eh_handle_sas_errors() -> sas_scsi_find_task() -> 
pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... -> 
send_abort_task() to abort each outstanding IO. Then we would not have 
the issue of forcing the reset in $subject (to lead to calling 
pm8001_send_abort_all()).

This idea could simply be tested by stubbing pm8001_send_abort_all()
(and dropping the |= ATA_EH_RESET in $subject).

Thanks,
John


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()
  2022-09-02 16:19           ` John Garry
@ 2022-09-05 23:23             ` Damien Le Moal
  0 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2022-09-05 23:23 UTC (permalink / raw)
  To: John Garry, jejb, martin.petersen, jinpu.wang, yangxingui
  Cc: linux-scsi, linux-kernel, linuxarm, hare

On 9/3/22 01:19, John Garry wrote:
> Hi Damien,
> 
>>>>
>>>> But the pm8001 manual and current driver indicate that the
>>>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>>>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>>>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
>>> You lost me: ata_eh_recover() will call ata_eh_reset() only if the 
>>> ATA_EH_RESET
>>> action flag is set. So are you saying that even though it is not 
>>> needed, you
>>> still need to set ATA_EH_RESET for pm8001 ?
>>
>> As below, it was the only location I found suitable to call 
>> pm8001_send_abort_all().
>>
>> However I am not really sure it is required now. For pm8001 NCQ error 
>> handling we require 2x steps:
>> - read log ext
>> - Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()
>>
>> pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all" 
>> mode, meaning any IO in the HBA is aborted for the device. But we are 
>> also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in 
>> sas_eh_handle_sas_errors() -> sas_scsi_find_task() -> 
>> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... 
>> send_abort_task()
>>
>> So I don't think that the pm8001_send_abort_all() call has any effect, 
>> as we're already aborting any outstanding IO earlier.
>>
>> Admittedly the order of the 2x steps is different, but 
>> OPC_INB_SATA_ABORT does not send any protocol message to the disk, so 
>> would not affect anything subsequently read with read log ext.
>>
>> Having said all that, it may be wise to still send 
>> pm8001_send_abort_all()..
> 
> Have you had a chance to consider all this yet?
> 
> I am now starting to think that it is not necessary to call 
> pm8001_send_abort_all(), and instead rely only on 
> sas_eh_handle_sas_errors() -> sas_scsi_find_task() -> 
> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... -> 
> send_abort_task() to abort each outstanding IO. Then we would not have 
> the issue of forcing the reset in $subject (to lead to calling 
> pm8001_send_abort_all()).
> 
> This idea could simply be tested by stubbing pm8001_send_abort_all()
> (and dropping the |= ATA_EH_RESET in $subject).

Send a re-spinned patch and I will test :)

> 
> Thanks,
> John
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-09-05 23:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-17 14:52 [PATCH v2 0/6] libsas and drivers: NCQ error handling John Garry
2022-08-17 14:52 ` [PATCH v2 1/6] scsi: pm8001: Modify task abort handling for SATA task John Garry
2022-08-17 14:52 ` [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort() John Garry
2022-08-17 16:04   ` Damien Le Moal
2022-08-17 16:54     ` John Garry
2022-08-17 17:14       ` Damien Le Moal
2022-08-18 12:09         ` John Garry
2022-09-02 16:19           ` John Garry
2022-09-05 23:23             ` Damien Le Moal
2022-08-17 14:52 ` [PATCH v2 3/6] scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors John Garry
2022-08-17 14:52 ` [PATCH v2 4/6] scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task() John Garry
2022-08-17 14:52 ` [PATCH v2 5/6] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw John Garry
2022-08-17 14:52 ` [PATCH v2 6/6] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private John Garry
2022-08-17 15:52 ` [PATCH v2 0/6] libsas and drivers: NCQ error handling Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).