All of lore.kernel.org
 help / color / mirror / Atom feed
* scsi: Allow scsi_execute users to control retries
@ 2023-09-05 23:15 Mike Christie
  2023-09-05 23:15 ` [PATCH v11 01/34] scsi: Add helper to prep sense during error handling Mike Christie
                   ` (33 more replies)
  0 siblings, 34 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley

The following patches were made over Martin's 6.6 staging branch.
They allow scsi_execute_cmd users to control exactly which errors are
retried, so we can reduce the sense/sshd handling they have to do and
have it one place.

The patches allow scsi_execute_cmd users to pass in an array of failures
which they want retried/failed and also specify how many times they want
them retried. If we hit an error that the user did not specify then we drop
down to the default behavior. This allows us to remove almost all the
retry logic from scsi_execute_cmd users. We just have the special cases
where we want to retry with a difference size command or sleep between
retries.

v11:
- Document scsi_failure.result special values
- Fix sshdr fix git commit message where there was a missing word
- Use designated initializers for cdb setup
- Fix up various coding style comments from John like redoing if/else
error/success checks.
- Add patch to fix rdac issue where stale SCSH_DH values were returned
- Remove old comment from:
"[PATCH v10 16/33] scsi: spi: Have scsi-ml retry spi_execute errors"
- Drop EOPNOTSUPP use from:
"[PATCH v10 17/33] scsi: sd: Fix sshdr use in sd_suspend_common"
and instead have sd_sync_cache handle ILLEGAL_REQUEST
- Init errno to 0 when declared in:
"[PATCH v10 20/33] scsi: ch: Have scsi-ml retry ch_do_scsi errors"
- Add diffstat below

v10:
- Drop "," after {}.
- Hopefully fix outlook issues.

v9:
- Drop spi_execute changes from [PATCH] scsi: spi: Fix sshdr use
- Change git commit message for sshdr use fixes.

v8:
- Rebase.
- Saw the discussion about the possible bug where callers are
accessing the sshdr when it's not setup, so I added some patches
for that since I was going over the same code.

v7:
- Rebase against scsi_execute_cmd patchset.

v6:
- Fix kunit build when it's built as a module but scsi is not.
- Drop [PATCH 17/35] scsi: ufshcd: Convert to scsi_exec_req because
  that driver no longer uses scsi_execute.
- Convert ufshcd to use the scsi_failures struct because it now just does
  direct retries and does not do it's own deadline timing.
- Add back memset in read_capacity_16.
- Remove memset in get_sectorsize and replace with { } to init buf.

v5:
- Fix spelling (made sure I ran checkpatch strict)
- Drop SCMD_FAILURE_NONE
- Rename SCMD_FAILURE_ANY
- Fix media_not_present handling where it was being retried instead of
failed.
- Fix ILLEGAL_REQUEST handling in read_capacity_16 so it was not retried.
- Fix coding style, spelling and and naming convention in kunit and added
  more tests to handle cases like the media_not_present one where we want
  to force failures instead of retries.
- Drop cxlflash patch because it actually checked it's internal state before
  performing a retry which we currently do not support.

v4:
- Redefine cmd definitions if the cmd is touched.
- Fix up coding style issues.
- Use sam_status enum.
- Move failures initialization to scsi_initialize_rq
(also fixes KASAN error).
- Add kunit test.
- Add function comments.

v3:
- Use a for loop in scsi_check_passthrough
- Fix result handling/testing.
- Fix scsi_status_is_good handling.
- make __scsi_exec_req take a const arg
- Fix formatting in patch 24

v2:
- Rename scsi_prep_sense
- Change scsi_check_passthrough's loop and added some fixes
- Modified scsi_execute* so it uses a struct to pass in args


 drivers/scsi/Kconfig                        |   9 +
 drivers/scsi/ch.c                           |  24 +-
 drivers/scsi/device_handler/scsi_dh_hp_sw.c | 106 ++++----
 drivers/scsi/device_handler/scsi_dh_rdac.c  | 100 ++++----
 drivers/scsi/scsi.c                         |   2 +-
 drivers/scsi/scsi_error.c                   | 112 ++++++++-
 drivers/scsi/scsi_error_test.c              | 170 +++++++++++++
 drivers/scsi/scsi_lib.c                     |  34 ++-
 drivers/scsi/scsi_scan.c                    | 102 ++++----
 drivers/scsi/scsi_transport_spi.c           |  36 ++-
 drivers/scsi/sd.c                           | 366 +++++++++++++++++-----------
 drivers/scsi/ses.c                          |  60 +++--
 drivers/scsi/sr.c                           |  38 +--
 drivers/ufs/core/ufshcd.c                   |  19 +-
 include/scsi/scsi_cmnd.h                    |  35 +++
 include/scsi/scsi_device.h                  |   2 +
 16 files changed, 842 insertions(+), 373 deletions(-)




^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v11 01/34] scsi: Add helper to prep sense during error handling
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-05 23:15 ` [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry Mike Christie
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This breaks out the sense prep so it can be used in helper that will be
added in this patchset for passthrough commands.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_error.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..7c3eccbdd39f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -523,6 +523,23 @@ static inline void set_scsi_ml_byte(struct scsi_cmnd *cmd, u8 status)
 	cmd->result = (cmd->result & 0xffff00ff) | (status << 8);
 }
 
+static enum scsi_disposition
+scsi_start_sense_processing(struct scsi_cmnd *scmd,
+			    struct scsi_sense_hdr *sshdr)
+{
+	struct scsi_device *sdev = scmd->device;
+
+	if (!scsi_command_normalize_sense(scmd, sshdr))
+		return FAILED;  /* no valid sense data */
+
+	scsi_report_sense(sdev, sshdr);
+
+	if (scsi_sense_is_deferred(sshdr))
+		return NEEDS_RETRY;
+
+	return SUCCESS;
+}
+
 /**
  * scsi_check_sense - Examine scsi cmd sense
  * @scmd:	Cmd to have sense checked.
@@ -539,14 +556,11 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 	struct request *req = scsi_cmd_to_rq(scmd);
 	struct scsi_device *sdev = scmd->device;
 	struct scsi_sense_hdr sshdr;
+	enum scsi_disposition ret;
 
-	if (! scsi_command_normalize_sense(scmd, &sshdr))
-		return FAILED;	/* no valid sense data */
-
-	scsi_report_sense(sdev, &sshdr);
-
-	if (scsi_sense_is_deferred(&sshdr))
-		return NEEDS_RETRY;
+	ret = scsi_start_sense_processing(scmd, &sshdr);
+	if (ret != SUCCESS)
+		return ret;
 
 	if (sdev->handler && sdev->handler->check_sense) {
 		enum scsi_disposition rc;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
  2023-09-05 23:15 ` [PATCH v11 01/34] scsi: Add helper to prep sense during error handling Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:08   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 03/34] scsi: Add scsi_failure field to scsi_exec_args Mike Christie
                   ` (31 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

For passthrough, we don't retry any error we get a check condition for.
This results in a lot of callers driving their own retries for those types
of errors and retrying all errors, and there has been a request to retry
specific host byte errors.

This adds the core code to allow passthrough users to specify what errors
they want scsi-ml to retry for them. We can then convert users to drop
their sense parsing and retry handling.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_error.c | 80 +++++++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c   | 10 +++++
 include/scsi/scsi_cmnd.h  | 35 +++++++++++++++++
 3 files changed, 125 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 7c3eccbdd39f..d2fb28212880 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1872,6 +1872,80 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
 	return false;
 }
 
+/**
+ * scsi_check_passthrough - Determine if passthrough scsi_cmnd needs a retry.
+ * @scmd: scsi_cmnd to check.
+ *
+ * Return value:
+ *	SCSI_RETURN_NOT_HANDLED - if the caller should examine the command
+ *	status because the passthrough user wanted the default error processing.
+ *	SUCCESS, FAILED or NEEDS_RETRY - if this function has determined the
+ *	command should be completed, go through the error handler due to
+ *	missing sense or should be retried.
+ */
+static enum scsi_disposition scsi_check_passthrough(struct scsi_cmnd *scmd)
+{
+	struct scsi_failure *failure;
+	struct scsi_sense_hdr sshdr;
+	enum scsi_disposition ret;
+	enum sam_status status;
+
+	if (!scmd->failures)
+		return SCSI_RETURN_NOT_HANDLED;
+
+	for (failure = scmd->failures; failure->result; failure++) {
+		if (failure->result == SCMD_FAILURE_RESULT_ANY)
+			goto maybe_retry;
+
+		if (host_byte(scmd->result) &&
+		    host_byte(scmd->result) == host_byte(failure->result))
+			goto maybe_retry;
+
+		status = status_byte(scmd->result);
+		if (!status)
+			continue;
+
+		if (failure->result == SCMD_FAILURE_STAT_ANY &&
+		    !scsi_status_is_good(scmd->result))
+			goto maybe_retry;
+
+		if (status != status_byte(failure->result))
+			continue;
+
+		if (status_byte(failure->result) != SAM_STAT_CHECK_CONDITION ||
+		    failure->sense == SCMD_FAILURE_SENSE_ANY)
+			goto maybe_retry;
+
+		ret = scsi_start_sense_processing(scmd, &sshdr);
+		if (ret == NEEDS_RETRY)
+			goto maybe_retry;
+		else if (ret != SUCCESS)
+			return ret;
+
+		if (failure->sense != sshdr.sense_key)
+			continue;
+
+		if (failure->asc == SCMD_FAILURE_ASC_ANY)
+			goto maybe_retry;
+
+		if (failure->asc != sshdr.asc)
+			continue;
+
+		if (failure->ascq == SCMD_FAILURE_ASCQ_ANY ||
+		    failure->ascq == sshdr.ascq)
+			goto maybe_retry;
+	}
+
+	return SCSI_RETURN_NOT_HANDLED;
+
+maybe_retry:
+	if (failure->allowed == SCMD_FAILURE_NO_LIMIT ||
+	    ++failure->retries <= failure->allowed)
+		return NEEDS_RETRY;
+
+	return SUCCESS;
+}
+
 /**
  * scsi_decide_disposition - Disposition a cmd on return from LLD.
  * @scmd:	SCSI cmd to examine.
@@ -1900,6 +1974,12 @@ enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *scmd)
 		return SUCCESS;
 	}
 
+	if (scmd->result && blk_rq_is_passthrough(scsi_cmd_to_rq(scmd))) {
+		rtn = scsi_check_passthrough(scmd);
+		if (rtn != SCSI_RETURN_NOT_HANDLED)
+			return rtn;
+	}
+
 	/*
 	 * first check the host byte, to see if there is anything in there
 	 * that would indicate what we need to do.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ca5eb058d5c7..7c3e18663c64 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -184,6 +184,15 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
 	__scsi_queue_insert(cmd, reason, true);
 }
 
+void scsi_reset_failures(struct scsi_failure *failures)
+{
+	struct scsi_failure *failure;
+
+	for (failure = failures; failure->result; failure++)
+		failure->retries = 0;
+}
+EXPORT_SYMBOL_GPL(scsi_reset_failures);
+
 /**
  * scsi_execute_cmd - insert request and wait for the result
  * @sdev:	scsi_device
@@ -1129,6 +1138,7 @@ static void scsi_initialize_rq(struct request *rq)
 	init_rcu_head(&cmd->rcu);
 	cmd->jiffies_at_alloc = jiffies;
 	cmd->retries = 0;
+	cmd->failures = NULL;
 }
 
 struct request *scsi_alloc_request(struct request_queue *q, blk_opf_t opf,
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index 526def14e7fb..0dc937511f2b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -71,6 +71,38 @@ enum scsi_cmnd_submitter {
 	SUBMITTED_BY_SCSI_RESET_IOCTL = 2,
 } __packed;
 
+/*
+ * scsi_execute_cmd users can set scsi_failure.result to have
+ * scsi_check_passthrough fail/retry a command. scsi_failure.result can be a
+ * specific host byte or message code, or SCMD_FAILURE_RESULT_ANY can be used
+ * to match any host or message code.
+ */
+#define SCMD_FAILURE_RESULT_ANY	0x7fffffff
+/*
+ * Set scsi_failure.result to SCMD_FAILURE_STAT_ANY to fail/retry any failure
+ * scsi_status_is_good returns false for.
+ */
+#define SCMD_FAILURE_STAT_ANY	0xff
+/*
+ * The following can be set to the scsi_failure sense, asc and ascq fields to
+ * match on any sense, ASC, or ASCQ value.
+ */
+#define SCMD_FAILURE_SENSE_ANY	0xff
+#define SCMD_FAILURE_ASC_ANY	0xff
+#define SCMD_FAILURE_ASCQ_ANY	0xff
+/* Always retry a matching failure. */
+#define SCMD_FAILURE_NO_LIMIT	-1
+
+struct scsi_failure {
+	int result;
+	u8 sense;
+	u8 asc;
+	u8 ascq;
+
+	s8 allowed;
+	s8 retries;
+};
+
 struct scsi_cmnd {
 	struct scsi_device *device;
 	struct list_head eh_entry; /* entry for the host eh_abort_list/eh_cmd_q */
@@ -91,6 +123,8 @@ struct scsi_cmnd {
 
 	int retries;
 	int allowed;
+	/* optional array of failures that passthrough users want retried */
+	struct scsi_failure *failures;
 
 	unsigned char prot_op;
 	unsigned char prot_type;
@@ -394,5 +428,6 @@ extern void scsi_build_sense(struct scsi_cmnd *scmd, int desc,
 
 struct request *scsi_alloc_request(struct request_queue *q, blk_opf_t opf,
 				   blk_mq_req_flags_t flags);
+void scsi_reset_failures(struct scsi_failure *failures);
 
 #endif /* _SCSI_SCSI_CMND_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 03/34] scsi: Add scsi_failure field to scsi_exec_args
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
  2023-09-05 23:15 ` [PATCH v11 01/34] scsi: Add helper to prep sense during error handling Mike Christie
  2023-09-05 23:15 ` [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-05 23:15 ` [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors Mike Christie
                   ` (30 subsequent siblings)
  33 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

Allow SCSI execution callers to pass in a list of failures they want
retried.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_lib.c    | 1 +
 include/scsi/scsi_device.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7c3e18663c64..d9432bbb64fd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -237,6 +237,7 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
 	scmd->cmd_len = COMMAND_SIZE(cmd[0]);
 	memcpy(scmd->cmnd, cmd, scmd->cmd_len);
 	scmd->allowed = retries;
+	scmd->failures = args->failures;
 	scmd->flags |= args->scmd_flags;
 	req->timeout = timeout;
 	req->rq_flags |= RQF_QUIET;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 75b2235b99e2..accf6c80591b 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -14,6 +14,7 @@ struct bsg_device;
 struct device;
 struct request_queue;
 struct scsi_cmnd;
+struct scsi_failure;
 struct scsi_lun;
 struct scsi_sense_hdr;
 
@@ -472,6 +473,7 @@ struct scsi_exec_args {
 	blk_mq_req_flags_t req_flags;	/* BLK_MQ_REQ flags */
 	int scmd_flags;			/* SCMD flags */
 	int *resid;			/* residual length */
+	struct scsi_failure *failures;	/* failures to retry */
 };
 
 int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (2 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 03/34] scsi: Add scsi_failure field to scsi_exec_args Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:11   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 05/34] scsi: retry INQUIRY after timeout Mike Christie
                   ` (29 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has scsi_probe_lun ask scsi-ml to retry UAs instead of driving them
itself.

There is one behavior change with this patch. We used to get a total of
3 retries for both UAs we were checking for. We now get 3 retries for
each.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_scan.c | 42 +++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 52014b2d39e1..0accd2f0f295 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -647,10 +647,29 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 	int first_inquiry_len, try_inquiry_len, next_inquiry_len;
 	int response_len = 0;
 	int pass, count, result, resid;
-	struct scsi_sense_hdr sshdr;
+	/*
+	 * not-ready to ready transition [asc/ascq=0x28/0x0] or power-on,
+	 * reset [asc/ascq=0x29/0x0], continue. INQUIRY should not yield
+	 * UNIT_ATTENTION but many buggy devices do so anyway.
+	 */
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x28,
+			.allowed = 3,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x29,
+			.allowed = 3,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
-		.sshdr = &sshdr,
 		.resid = &resid,
+		.failures = failures,
 	};
 
 	*bflags = 0;
@@ -668,6 +687,8 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 				pass, try_inquiry_len));
 
 	/* Each pass gets up to three chances to ignore Unit Attention */
+	scsi_reset_failures(failures);
+
 	for (count = 0; count < 3; ++count) {
 		memset(scsi_cmd, 0, 6);
 		scsi_cmd[0] = INQUIRY;
@@ -684,22 +705,7 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 				"scsi scan: INQUIRY %s with code 0x%x\n",
 				result ? "failed" : "successful", result));
 
-		if (result > 0) {
-			/*
-			 * not-ready to ready transition [asc/ascq=0x28/0x0]
-			 * or power-on, reset [asc/ascq=0x29/0x0], continue.
-			 * INQUIRY should not yield UNIT_ATTENTION
-			 * but many buggy devices do so anyway. 
-			 */
-			if (scsi_status_is_check_condition(result) &&
-			    scsi_sense_valid(&sshdr)) {
-				if ((sshdr.sense_key == UNIT_ATTENTION) &&
-				    ((sshdr.asc == 0x28) ||
-				     (sshdr.asc == 0x29)) &&
-				    (sshdr.ascq == 0))
-					continue;
-			}
-		} else if (result == 0) {
+		if (result == 0) {
 			/*
 			 * if nothing was transferred, we try
 			 * again. It's a workaround for some USB
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 05/34] scsi: retry INQUIRY after timeout
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (3 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:11   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16 Mike Christie
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

Description from: Martin Wilck <mwilck@suse.com>:

The SCSI mid layer doesn't retry commands after DID_TIME_OUT (see
scsi_noretry_cmd()). Packet loss in the fabric can cause spurious timeouts
during SCSI device probing, causing device probing to fail. This has been
observed in FCoE uplink failover tests, for example.

This patch fixes the issue by retrying the INQUIRY.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_scan.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 0accd2f0f295..c27b64a1b239 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -665,6 +665,10 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 			.allowed = 3,
 			.result = SAM_STAT_CHECK_CONDITION,
 		},
+		{
+			.allowed = 1,
+			.result = DID_TIME_OUT << 16,
+		},
 		{}
 	};
 	const struct scsi_exec_args exec_args = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (4 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 05/34] scsi: retry INQUIRY after timeout Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:13   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors Mike Christie
                   ` (27 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4cd281368826..70178c1f3f8e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2388,11 +2388,10 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 		the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN,
 					      buffer, RC16_LEN, SD_TIMEOUT,
 					      sdkp->max_retries, &exec_args);
-
-		if (media_not_present(sdkp, &sshdr))
-			return -ENODEV;
-
 		if (the_result > 0) {
+			if (media_not_present(sdkp, &sshdr))
+				return -ENODEV;
+
 			sense_valid = scsi_sense_valid(&sshdr);
 			if (sense_valid &&
 			    sshdr.sense_key == ILLEGAL_REQUEST &&
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (5 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16 Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:21   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk Mike Christie
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has read_capacity_16 have scsi-ml retry errors instead of driving
them itself.

There are 2 behavior changes with this patch:
1. There is one behavior change where we no longer retry when
scsi_execute_cmd returns < 0, but we should be ok. We don't need to retry
for failures like the queue being removed, and for the case where there
are no tags/reqs since the block layer waits/retries for us. For possible
memory allocation failures from blk_rq_map_kern we use GFP_NOIO, so
retrying will probably not help.
2. For the specific UAs we checked for and retried, we would get
READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries were left
from the loop's retries. Each UA now gets READ_CAPACITY_RETRIES_ON_RESET
reties, and the other errors (not including medium not present) get up
to 3 retries.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 110 ++++++++++++++++++++++++++++++----------------
 1 file changed, 73 insertions(+), 37 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 70178c1f3f8e..69188b13f5e2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2363,55 +2363,91 @@ static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp,
 static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 						unsigned char *buffer)
 {
-	unsigned char cmd[16];
-	struct scsi_sense_hdr sshdr;
-	const struct scsi_exec_args exec_args = {
-		.sshdr = &sshdr,
+	static const u8 cmd[16] = {
+		[0] = SERVICE_ACTION_IN_16,
+		[1] = SAI_READ_CAPACITY_16,
+		[13] = RC16_LEN,
 	};
+	struct scsi_sense_hdr sshdr;
 	int sense_valid = 0;
 	int the_result;
-	int retries = 3, reset_retries = READ_CAPACITY_RETRIES_ON_RESET;
 	unsigned int alignment;
 	unsigned long long lba;
 	unsigned sector_size;
+	struct scsi_failure failures[] = {
+		/*
+		 * Fail immediately for Invalid Command Operation Code or
+		 * Invalid Field in CDB.
+		 */
+		{
+			.sense = ILLEGAL_REQUEST,
+			.asc = 0x20,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = ILLEGAL_REQUEST,
+			.asc = 0x24,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		/* Fail immediately for Medium Not Present */
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x3A,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = 0x3A,
+			.ascq = 0x0,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x29,
+			/* Device reset might occur several times */
+			.allowed = READ_CAPACITY_RETRIES_ON_RESET,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		/* Any other error not listed above retry */
+		{
+			.result = SCMD_FAILURE_RESULT_ANY,
+			.allowed = 3,
+		},
+		{}
+	};
+	const struct scsi_exec_args exec_args = {
+		.sshdr = &sshdr,
+		.failures = failures,
+	};
 
 	if (sdp->no_read_capacity_16)
 		return -EINVAL;
 
-	do {
-		memset(cmd, 0, 16);
-		cmd[0] = SERVICE_ACTION_IN_16;
-		cmd[1] = SAI_READ_CAPACITY_16;
-		cmd[13] = RC16_LEN;
-		memset(buffer, 0, RC16_LEN);
-
-		the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN,
-					      buffer, RC16_LEN, SD_TIMEOUT,
-					      sdkp->max_retries, &exec_args);
-		if (the_result > 0) {
-			if (media_not_present(sdkp, &sshdr))
-				return -ENODEV;
+	memset(buffer, 0, RC16_LEN);
 
-			sense_valid = scsi_sense_valid(&sshdr);
-			if (sense_valid &&
-			    sshdr.sense_key == ILLEGAL_REQUEST &&
-			    (sshdr.asc == 0x20 || sshdr.asc == 0x24) &&
-			    sshdr.ascq == 0x00)
-				/* Invalid Command Operation Code or
-				 * Invalid Field in CDB, just retry
-				 * silently with RC10 */
-				return -EINVAL;
-			if (sense_valid &&
-			    sshdr.sense_key == UNIT_ATTENTION &&
-			    sshdr.asc == 0x29 && sshdr.ascq == 0x00)
-				/* Device reset might occur several times,
-				 * give it one more chance */
-				if (--reset_retries > 0)
-					continue;
-		}
-		retries--;
+	the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, buffer,
+				      RC16_LEN, SD_TIMEOUT, sdkp->max_retries,
+				      &exec_args);
 
-	} while (the_result && retries);
+	if (the_result > 0) {
+		if (media_not_present(sdkp, &sshdr))
+			return -ENODEV;
+
+		sense_valid = scsi_sense_valid(&sshdr);
+		if (sense_valid && sshdr.sense_key == ILLEGAL_REQUEST &&
+		    (sshdr.asc == 0x20 || sshdr.asc == 0x24) &&
+		     sshdr.ascq == 0x00) {
+			/*
+			 * Invalid Command Operation Code or Invalid Field in
+			 * CDB, just retry silently with RC10
+			 */
+			return -EINVAL;
+		}
+	}
 
 	if (the_result) {
 		sd_print_result(sdkp, "Read Capacity(16) failed", the_result);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (6 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:26   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 09/34] scsi: sd: Fix sshdr use " Mike Christie
                   ` (25 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

We currently re-use the cmd buffer for the TUR and START_STOP commands
which requires us to reset the buffer when retrying. This has us use
separate buffers for the 2 commands so we can make them const and I think
it makes it easier to handle for retries but does not add too much extra
to the stack use.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 69188b13f5e2..3aac3041c83f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2228,14 +2228,16 @@ sd_spinup_disk(struct scsi_disk *sdkp)
 			 * Issue command to spin up drive when not ready
 			 */
 			if (!spintime) {
+				/* Return immediately and start spin cycle */
+				const u8 start_cmd[10] = {
+					[0] = START_STOP,
+					[1] = 1,
+					[4] = sdkp->device->start_stop_pwr_cond ?
+						0x11 : 1,
+				};
+
 				sd_printk(KERN_NOTICE, sdkp, "Spinning up disk...");
-				cmd[0] = START_STOP;
-				cmd[1] = 1;	/* Return immediately */
-				memset((void *) &cmd[2], 0, 8);
-				cmd[4] = 1;	/* Start spin cycle */
-				if (sdkp->device->start_stop_pwr_cond)
-					cmd[4] |= 1 << 4;
-				scsi_execute_cmd(sdkp->device, cmd,
+				scsi_execute_cmd(sdkp->device, start_cmd,
 						 REQ_OP_DRV_IN, NULL, 0,
 						 SD_TIMEOUT, sdkp->max_retries,
 						 &exec_args);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 09/34] scsi: sd: Fix sshdr use in sd_spinup_disk
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (7 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:27   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors Mike Christie
                   ` (24 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3aac3041c83f..74967e1b19da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2180,19 +2180,21 @@ sd_spinup_disk(struct scsi_disk *sdkp)
 						      sdkp->max_retries,
 						      &exec_args);
 
-			/*
-			 * If the drive has indicated to us that it
-			 * doesn't have any media in it, don't bother
-			 * with any more polling.
-			 */
-			if (media_not_present(sdkp, &sshdr)) {
-				if (media_was_present)
-					sd_printk(KERN_NOTICE, sdkp, "Media removed, stopped polling\n");
-				return;
-			}
+			if (the_result > 0) {
+				/*
+				 * If the drive has indicated to us that it
+				 * doesn't have any media in it, don't bother
+				 * with any more polling.
+				 */
+				if (media_not_present(sdkp, &sshdr)) {
+					if (media_was_present)
+						sd_printk(KERN_NOTICE, sdkp,
+							  "Media removed, stopped polling\n");
+					return;
+				}
 
-			if (the_result)
 				sense_valid = scsi_sense_valid(&sshdr);
+			}
 			retries++;
 		} while (retries < 3 &&
 			 (!scsi_status_is_good(the_result) ||
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (8 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 09/34] scsi: sd: Fix sshdr use " Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:46   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0 Mike Christie
                   ` (23 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This simplifies sd_spinup_disk so scsi-ml retries errors for it. Note that
we retried specifically on a UA and also if scsi_status_is_good returned
failed which would happen for all check conditions. In this patch we use
SCMD_FAILURE_STAT_ANY which will trigger for the same conditions as
when scsi_status_is_good returns false. This will cover all CCs including
UAs so there is no explicit failures arrary entry for UAs.

We do not handle the outside loop's retries because we want to sleep
between tries and we don't support that yet.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 73 ++++++++++++++++++++++++++---------------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 74967e1b19da..f5e6b5cc762f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2151,55 +2151,64 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 static void
 sd_spinup_disk(struct scsi_disk *sdkp)
 {
-	unsigned char cmd[10];
+	static const u8 cmd[10] = { TEST_UNIT_READY };
 	unsigned long spintime_expire = 0;
-	int retries, spintime;
+	int spintime, sense_valid = 0;
 	unsigned int the_result;
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		/* Fail immediately for Medium Not Present */
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x3A,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = 0x3A,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.result = SCMD_FAILURE_STAT_ANY,
+			.allowed = 3,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
-	int sense_valid = 0;
 
 	spintime = 0;
 
 	/* Spin up drives, as required.  Only do this at boot time */
 	/* Spinup needs to be done for module loads too. */
 	do {
-		retries = 0;
-
-		do {
-			bool media_was_present = sdkp->media_present;
+		bool media_was_present = sdkp->media_present;
 
-			cmd[0] = TEST_UNIT_READY;
-			memset((void *) &cmd[1], 0, 9);
+		scsi_reset_failures(failures);
 
-			the_result = scsi_execute_cmd(sdkp->device, cmd,
-						      REQ_OP_DRV_IN, NULL, 0,
-						      SD_TIMEOUT,
-						      sdkp->max_retries,
-						      &exec_args);
+		the_result = scsi_execute_cmd(sdkp->device, cmd, REQ_OP_DRV_IN,
+					      NULL, 0, SD_TIMEOUT,
+					      sdkp->max_retries, &exec_args);
 
-			if (the_result > 0) {
-				/*
-				 * If the drive has indicated to us that it
-				 * doesn't have any media in it, don't bother
-				 * with any more polling.
-				 */
-				if (media_not_present(sdkp, &sshdr)) {
-					if (media_was_present)
-						sd_printk(KERN_NOTICE, sdkp,
-							  "Media removed, stopped polling\n");
-					return;
-				}
 
-				sense_valid = scsi_sense_valid(&sshdr);
+		if (the_result > 0) {
+			/*
+			 * If the drive has indicated to us that it doesn't
+			 * have any media in it, don't bother with any more
+			 * polling.
+			 */
+			if (media_not_present(sdkp, &sshdr)) {
+				if (media_was_present)
+					sd_printk(KERN_NOTICE, sdkp,
+						  "Media removed, stopped polling\n");
+				return;
 			}
-			retries++;
-		} while (retries < 3 &&
-			 (!scsi_status_is_good(the_result) ||
-			  (scsi_status_is_check_condition(the_result) &&
-			  sense_valid && sshdr.sense_key == UNIT_ATTENTION)));
+			sense_valid = scsi_sense_valid(&sshdr);
+		}
 
 		if (!scsi_status_is_check_condition(the_result)) {
 			/* no sense, TUR either succeeded or failed
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (9 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:48   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors Mike Christie
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/device_handler/scsi_dh_hp_sw.c | 79 +++++++++++----------
 1 file changed, 40 insertions(+), 39 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_hp_sw.c b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
index 5f2f943d926c..944ea4e0cc45 100644
--- a/drivers/scsi/device_handler/scsi_dh_hp_sw.c
+++ b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
@@ -82,7 +82,7 @@ static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h)
 {
 	unsigned char cmd[6] = { TEST_UNIT_READY };
 	struct scsi_sense_hdr sshdr;
-	int ret = SCSI_DH_OK, res;
+	int ret, res;
 	blk_opf_t opf = REQ_OP_DRV_IN | REQ_FAILFAST_DEV |
 				REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER;
 	const struct scsi_exec_args exec_args = {
@@ -92,19 +92,18 @@ static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h)
 retry:
 	res = scsi_execute_cmd(sdev, cmd, opf, NULL, 0, HP_SW_TIMEOUT,
 			       HP_SW_RETRIES, &exec_args);
-	if (res) {
-		if (scsi_sense_valid(&sshdr))
-			ret = tur_done(sdev, h, &sshdr);
-		else {
-			sdev_printk(KERN_WARNING, sdev,
-				    "%s: sending tur failed with %x\n",
-				    HP_SW_NAME, res);
-			ret = SCSI_DH_IO;
-		}
-	} else {
+	if (res > 0 && scsi_sense_valid(&sshdr)) {
+		ret = tur_done(sdev, h, &sshdr);
+	} else if (res == 0) {
 		h->path_state = HP_SW_PATH_ACTIVE;
 		ret = SCSI_DH_OK;
+	} else {
+		sdev_printk(KERN_WARNING, sdev,
+			    "%s: sending tur failed with %x\n",
+			    HP_SW_NAME, res);
+		ret = SCSI_DH_IO;
 	}
+
 	if (ret == SCSI_DH_IMM_RETRY)
 		goto retry;
 
@@ -122,7 +121,7 @@ static int hp_sw_start_stop(struct hp_sw_dh_data *h)
 	unsigned char cmd[6] = { START_STOP, 0, 0, 0, 1, 0 };
 	struct scsi_sense_hdr sshdr;
 	struct scsi_device *sdev = h->sdev;
-	int res, rc = SCSI_DH_OK;
+	int res, rc;
 	int retry_cnt = HP_SW_RETRIES;
 	blk_opf_t opf = REQ_OP_DRV_IN | REQ_FAILFAST_DEV |
 				REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER;
@@ -133,35 +132,37 @@ static int hp_sw_start_stop(struct hp_sw_dh_data *h)
 retry:
 	res = scsi_execute_cmd(sdev, cmd, opf, NULL, 0, HP_SW_TIMEOUT,
 			       HP_SW_RETRIES, &exec_args);
-	if (res) {
-		if (!scsi_sense_valid(&sshdr)) {
-			sdev_printk(KERN_WARNING, sdev,
-				    "%s: sending start_stop_unit failed, "
-				    "no sense available\n", HP_SW_NAME);
-			return SCSI_DH_IO;
-		}
-		switch (sshdr.sense_key) {
-		case NOT_READY:
-			if (sshdr.asc == 0x04 && sshdr.ascq == 3) {
-				/*
-				 * LUN not ready - manual intervention required
-				 *
-				 * Switch-over in progress, retry.
-				 */
-				if (--retry_cnt)
-					goto retry;
-				rc = SCSI_DH_RETRY;
-				break;
-			}
-			fallthrough;
-		default:
-			sdev_printk(KERN_WARNING, sdev,
-				    "%s: sending start_stop_unit failed, "
-				    "sense %x/%x/%x\n", HP_SW_NAME,
-				    sshdr.sense_key, sshdr.asc, sshdr.ascq);
-			rc = SCSI_DH_IO;
+	if (!res) {
+		return SCSI_DH_OK;
+	} else if (res < 0 || !scsi_sense_valid(&sshdr)) {
+		sdev_printk(KERN_WARNING, sdev,
+			    "%s: sending start_stop_unit failed, "
+			    "no sense available\n", HP_SW_NAME);
+		return SCSI_DH_IO;
+	}
+
+	switch (sshdr.sense_key) {
+	case NOT_READY:
+		if (sshdr.asc == 0x04 && sshdr.ascq == 3) {
+			/*
+			 * LUN not ready - manual intervention required
+			 *
+			 * Switch-over in progress, retry.
+			 */
+			if (--retry_cnt)
+				goto retry;
+			rc = SCSI_DH_RETRY;
+			break;
 		}
+		fallthrough;
+	default:
+		sdev_printk(KERN_WARNING, sdev,
+			    "%s: sending start_stop_unit failed, "
+			    "sense %x/%x/%x\n", HP_SW_NAME,
+			    sshdr.sense_key, sshdr.asc, sshdr.ascq);
+		rc = SCSI_DH_IO;
 	}
+
 	return rc;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (10 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0 Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:51   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling Mike Christie
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has hp_sw have scsi-ml retry scsi_exec_req errors instead of driving
them itself.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/device_handler/scsi_dh_hp_sw.c | 43 +++++++++++++--------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_hp_sw.c b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
index 944ea4e0cc45..d31db178875c 100644
--- a/drivers/scsi/device_handler/scsi_dh_hp_sw.c
+++ b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
@@ -46,9 +46,6 @@ static int tur_done(struct scsi_device *sdev, struct hp_sw_dh_data *h,
 	int ret = SCSI_DH_IO;
 
 	switch (sshdr->sense_key) {
-	case UNIT_ATTENTION:
-		ret = SCSI_DH_IMM_RETRY;
-		break;
 	case NOT_READY:
 		if (sshdr->asc == 0x04 && sshdr->ascq == 2) {
 			/*
@@ -85,11 +82,21 @@ static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h)
 	int ret, res;
 	blk_opf_t opf = REQ_OP_DRV_IN | REQ_FAILFAST_DEV |
 				REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SCMD_FAILURE_NO_LIMIT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
-retry:
 	res = scsi_execute_cmd(sdev, cmd, opf, NULL, 0, HP_SW_TIMEOUT,
 			       HP_SW_RETRIES, &exec_args);
 	if (res > 0 && scsi_sense_valid(&sshdr)) {
@@ -104,9 +111,6 @@ static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h)
 		ret = SCSI_DH_IO;
 	}
 
-	if (ret == SCSI_DH_IMM_RETRY)
-		goto retry;
-
 	return ret;
 }
 
@@ -122,14 +126,28 @@ static int hp_sw_start_stop(struct hp_sw_dh_data *h)
 	struct scsi_sense_hdr sshdr;
 	struct scsi_device *sdev = h->sdev;
 	int res, rc;
-	int retry_cnt = HP_SW_RETRIES;
 	blk_opf_t opf = REQ_OP_DRV_IN | REQ_FAILFAST_DEV |
 				REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER;
+	struct scsi_failure failures[] = {
+		{
+			/*
+			 * LUN not ready - manual intervention required
+			 *
+			 * Switch-over in progress, retry.
+			 */
+			.sense = NOT_READY,
+			.asc = 0x04,
+			.ascq = 0x03,
+			.allowed = HP_SW_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
-retry:
 	res = scsi_execute_cmd(sdev, cmd, opf, NULL, 0, HP_SW_TIMEOUT,
 			       HP_SW_RETRIES, &exec_args);
 	if (!res) {
@@ -144,13 +162,6 @@ static int hp_sw_start_stop(struct hp_sw_dh_data *h)
 	switch (sshdr.sense_key) {
 	case NOT_READY:
 		if (sshdr.asc == 0x04 && sshdr.ascq == 3) {
-			/*
-			 * LUN not ready - manual intervention required
-			 *
-			 * Switch-over in progress, retry.
-			 */
-			if (--retry_cnt)
-				goto retry;
 			rc = SCSI_DH_RETRY;
 			break;
 		}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (11 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:54   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 14/34] scsi: rdac: Fix sshdr use Mike Christie
                   ` (20 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If send_mode_select retries scsi_execute_cmd it will leave err set to
SCSI_DH_RETRY/SCSI_DH_IMM_RETRY. If on the retry, the command is
successful, then SCSI_DH_RETRY/SCSI_DH_IMM_RETRY will be returned to
the scsi_dh activation caller. On the retry, we will then detect the
previous MODE SELECT had worked, and so we will return success.

This patch has us return the correct return value, so we can avoid the
extra scsi_dh activation call and to avoid failures if the caller had
hit its activation retry limit and does not end up retrying.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/scsi/device_handler/scsi_dh_rdac.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c
index c5538645057a..b65586d6649c 100644
--- a/drivers/scsi/device_handler/scsi_dh_rdac.c
+++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
@@ -530,7 +530,7 @@ static void send_mode_select(struct work_struct *work)
 		container_of(work, struct rdac_controller, ms_work);
 	struct scsi_device *sdev = ctlr->ms_sdev;
 	struct rdac_dh_data *h = sdev->handler_data;
-	int err = SCSI_DH_OK, retry_cnt = RDAC_RETRY_COUNT;
+	int err, retry_cnt = RDAC_RETRY_COUNT;
 	struct rdac_queue_data *tmp, *qdata;
 	LIST_HEAD(list);
 	unsigned char cdb[MAX_COMMAND_SIZE];
@@ -558,20 +558,20 @@ static void send_mode_select(struct work_struct *work)
 		(char *) h->ctlr->array_name, h->ctlr->index,
 		(retry_cnt == RDAC_RETRY_COUNT) ? "queueing" : "retrying");
 
-	if (scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select, data_size,
-			     RDAC_TIMEOUT * HZ, RDAC_RETRIES, &exec_args)) {
+	if (!scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select, data_size,
+			      RDAC_TIMEOUT * HZ, RDAC_RETRIES, &exec_args)) {
+		h->state = RDAC_STATE_ACTIVE;
+		RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
+				"MODE_SELECT completed",
+				(char *) h->ctlr->array_name, h->ctlr->index);
+		err = SCSI_DH_OK;
+	} else {
 		err = mode_select_handle_sense(sdev, &sshdr);
 		if (err == SCSI_DH_RETRY && retry_cnt--)
 			goto retry;
 		if (err == SCSI_DH_IMM_RETRY)
 			goto retry;
 	}
-	if (err == SCSI_DH_OK) {
-		h->state = RDAC_STATE_ACTIVE;
-		RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
-				"MODE_SELECT completed",
-				(char *) h->ctlr->array_name, h->ctlr->index);
-	}
 
 	list_for_each_entry_safe(qdata, tmp, &list, entry) {
 		list_del(&qdata->entry);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 14/34] scsi: rdac: Fix sshdr use
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (12 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:55   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors Mike Christie
                   ` (19 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/device_handler/scsi_dh_rdac.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c
index b65586d6649c..1ac2ae17e8be 100644
--- a/drivers/scsi/device_handler/scsi_dh_rdac.c
+++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
@@ -530,7 +530,7 @@ static void send_mode_select(struct work_struct *work)
 		container_of(work, struct rdac_controller, ms_work);
 	struct scsi_device *sdev = ctlr->ms_sdev;
 	struct rdac_dh_data *h = sdev->handler_data;
-	int err, retry_cnt = RDAC_RETRY_COUNT;
+	int rc, err, retry_cnt = RDAC_RETRY_COUNT;
 	struct rdac_queue_data *tmp, *qdata;
 	LIST_HEAD(list);
 	unsigned char cdb[MAX_COMMAND_SIZE];
@@ -558,13 +558,16 @@ static void send_mode_select(struct work_struct *work)
 		(char *) h->ctlr->array_name, h->ctlr->index,
 		(retry_cnt == RDAC_RETRY_COUNT) ? "queueing" : "retrying");
 
-	if (!scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select, data_size,
-			      RDAC_TIMEOUT * HZ, RDAC_RETRIES, &exec_args)) {
+	rc = scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select, data_size,
+			      RDAC_TIMEOUT * HZ, RDAC_RETRIES, &exec_args);
+	if (!rc) {
 		h->state = RDAC_STATE_ACTIVE;
 		RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
 				"MODE_SELECT completed",
 				(char *) h->ctlr->array_name, h->ctlr->index);
 		err = SCSI_DH_OK;
+	} else if (rc < 0) {
+		err = SCSI_DH_IO;
 	} else {
 		err = mode_select_handle_sense(sdev, &sshdr);
 		if (err == SCSI_DH_RETRY && retry_cnt--)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (13 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 14/34] scsi: rdac: Fix sshdr use Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:58   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 16/34] scsi: spi: Fix sshdr use Mike Christie
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has rdac have scsi-ml retry errors instead of driving them itself.

There is one behavior change with this patch. We used to get a total of
5 retries for errors mode_select_handle_sense returned SCSI_DH_RETRY. We
now get 5 retries for each failure.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/device_handler/scsi_dh_rdac.c | 87 ++++++++++++----------
 1 file changed, 49 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c
index 1ac2ae17e8be..771108a332cb 100644
--- a/drivers/scsi/device_handler/scsi_dh_rdac.c
+++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
@@ -485,43 +485,17 @@ static int set_mode_select(struct scsi_device *sdev, struct rdac_dh_data *h)
 static int mode_select_handle_sense(struct scsi_device *sdev,
 				    struct scsi_sense_hdr *sense_hdr)
 {
-	int err = SCSI_DH_IO;
 	struct rdac_dh_data *h = sdev->handler_data;
 
 	if (!scsi_sense_valid(sense_hdr))
-		goto done;
-
-	switch (sense_hdr->sense_key) {
-	case NO_SENSE:
-	case ABORTED_COMMAND:
-	case UNIT_ATTENTION:
-		err = SCSI_DH_RETRY;
-		break;
-	case NOT_READY:
-		if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x01)
-			/* LUN Not Ready and is in the Process of Becoming
-			 * Ready
-			 */
-			err = SCSI_DH_RETRY;
-		break;
-	case ILLEGAL_REQUEST:
-		if (sense_hdr->asc == 0x91 && sense_hdr->ascq == 0x36)
-			/*
-			 * Command Lock contention
-			 */
-			err = SCSI_DH_IMM_RETRY;
-		break;
-	default:
-		break;
-	}
+		return SCSI_DH_IO;
 
 	RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
 		"MODE_SELECT returned with sense %02x/%02x/%02x",
 		(char *) h->ctlr->array_name, h->ctlr->index,
 		sense_hdr->sense_key, sense_hdr->asc, sense_hdr->ascq);
 
-done:
-	return err;
+	return SCSI_DH_IO;
 }
 
 static void send_mode_select(struct work_struct *work)
@@ -530,7 +504,7 @@ static void send_mode_select(struct work_struct *work)
 		container_of(work, struct rdac_controller, ms_work);
 	struct scsi_device *sdev = ctlr->ms_sdev;
 	struct rdac_dh_data *h = sdev->handler_data;
-	int rc, err, retry_cnt = RDAC_RETRY_COUNT;
+	int rc, err;
 	struct rdac_queue_data *tmp, *qdata;
 	LIST_HEAD(list);
 	unsigned char cdb[MAX_COMMAND_SIZE];
@@ -538,8 +512,52 @@ static void send_mode_select(struct work_struct *work)
 	unsigned int data_size;
 	blk_opf_t opf = REQ_OP_DRV_OUT | REQ_FAILFAST_DEV |
 				REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER;
+	struct scsi_failure failures[] = {
+		{
+			.sense = NO_SENSE,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = RDAC_RETRY_COUNT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = ABORTED_COMMAND,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = RDAC_RETRY_COUNT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = RDAC_RETRY_COUNT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			/*
+			 * LUN Not Ready and is in the Process of Becoming
+			 * Ready
+			 */
+			.sense = NOT_READY,
+			.asc = 0x04,
+			.ascq = 0x01,
+			.allowed = RDAC_RETRY_COUNT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			/* Command Lock contention */
+			.sense = ILLEGAL_REQUEST,
+			.asc = 0x91,
+			.ascq = 0x36,
+			.allowed = SCMD_FAILURE_NO_LIMIT,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
 	spin_lock(&ctlr->ms_lock);
@@ -548,15 +566,12 @@ static void send_mode_select(struct work_struct *work)
 	ctlr->ms_sdev = NULL;
 	spin_unlock(&ctlr->ms_lock);
 
- retry:
 	memset(cdb, 0, sizeof(cdb));
 
 	data_size = rdac_failover_get(ctlr, &list, cdb);
 
-	RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
-		"%s MODE_SELECT command",
-		(char *) h->ctlr->array_name, h->ctlr->index,
-		(retry_cnt == RDAC_RETRY_COUNT) ? "queueing" : "retrying");
+	RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, queueingMODE_SELECT command",
+		(char *) h->ctlr->array_name, h->ctlr->index);
 
 	rc = scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select, data_size,
 			      RDAC_TIMEOUT * HZ, RDAC_RETRIES, &exec_args);
@@ -570,10 +585,6 @@ static void send_mode_select(struct work_struct *work)
 		err = SCSI_DH_IO;
 	} else {
 		err = mode_select_handle_sense(sdev, &sshdr);
-		if (err == SCSI_DH_RETRY && retry_cnt--)
-			goto retry;
-		if (err == SCSI_DH_IMM_RETRY)
-			goto retry;
 	}
 
 	list_for_each_entry_safe(qdata, tmp, &list, entry) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 16/34] scsi: spi: Fix sshdr use
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (14 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 20:59   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors Mike Christie
                   ` (17 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_transport_spi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c
index 2442d4d2e3f3..f668c1c0a98f 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -676,10 +676,10 @@ spi_dv_device_echo_buffer(struct scsi_device *sdev, u8 *buffer,
 	for (r = 0; r < retries; r++) {
 		result = spi_execute(sdev, spi_write_buffer, REQ_OP_DRV_OUT,
 				     buffer, len, &sshdr);
-		if(result || !scsi_device_online(sdev)) {
+		if (result || !scsi_device_online(sdev)) {
 
 			scsi_device_set_state(sdev, SDEV_QUIESCE);
-			if (scsi_sense_valid(&sshdr)
+			if (result > 0 && scsi_sense_valid(&sshdr)
 			    && sshdr.sense_key == ILLEGAL_REQUEST
 			    /* INVALID FIELD IN CDB */
 			    && sshdr.asc == 0x24 && sshdr.ascq == 0x00)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (15 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 16/34] scsi: spi: Fix sshdr use Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:00   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common Mike Christie
                   ` (16 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has spi_execute have scsi-ml retry errors instead of driving them.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_transport_spi.c | 32 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c
index f668c1c0a98f..e04b38a009d0 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -108,29 +108,27 @@ static int spi_execute(struct scsi_device *sdev, const void *cmd,
 		       enum req_op op, void *buffer, unsigned int bufflen,
 		       struct scsi_sense_hdr *sshdr)
 {
-	int i, result;
-	struct scsi_sense_hdr sshdr_tmp;
 	blk_opf_t opf = op | REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |
 			REQ_FAILFAST_DRIVER;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = DV_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
+		/* bypass the SDEV_QUIESCE state with BLK_MQ_REQ_PM */
 		.req_flags = BLK_MQ_REQ_PM,
-		.sshdr = sshdr ? : &sshdr_tmp,
+		.sshdr = sshdr,
+		.failures = failures,
 	};
 
-	sshdr = exec_args.sshdr;
-
-	for(i = 0; i < DV_RETRIES; i++) {
-		/*
-		 * The purpose of the RQF_PM flag below is to bypass the
-		 * SDEV_QUIESCE state.
-		 */
-		result = scsi_execute_cmd(sdev, cmd, opf, buffer, bufflen,
-					  DV_TIMEOUT, 1, &exec_args);
-		if (result < 0 || !scsi_sense_valid(sshdr) ||
-		    sshdr->sense_key != UNIT_ATTENTION)
-			break;
-	}
-	return result;
+	return scsi_execute_cmd(sdev, cmd, opf, buffer, bufflen, DV_TIMEOUT, 1,
+				&exec_args);
 }
 
 static struct {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (16 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:04   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors Mike Christie
                   ` (15 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. sd_sync_cache will
only access the sshdr if it's been setup because it calls
scsi_status_is_check_condition before accessing it. However, the
sd_sync_cache caller, sd_suspend_common, does not check.

sd_suspend_common is only checking for ILLEGAL_REQUEST which it's using
to determine if the command is supported. If it's not it just ignores
the error. So to fix its sshdr use this patch just moves that check to
sd_sync_cache where it converts ILLEGAL_REQUEST to success/0.
sd_suspend_common was ignoring that error and sd_shutdown doesn't check
for errors so there will be no behavior changes.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 53 ++++++++++++++++++++---------------------------
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f5e6b5cc762f..1f6cc24d633b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1564,24 +1564,21 @@ static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
 	return disk_changed ? DISK_EVENT_MEDIA_CHANGE : 0;
 }
 
-static int sd_sync_cache(struct scsi_disk *sdkp, struct scsi_sense_hdr *sshdr)
+static int sd_sync_cache(struct scsi_disk *sdkp)
 {
 	int retries, res;
 	struct scsi_device *sdp = sdkp->device;
 	const int timeout = sdp->request_queue->rq_timeout
 		* SD_FLUSH_TIMEOUT_MULTIPLIER;
-	struct scsi_sense_hdr my_sshdr;
+	struct scsi_sense_hdr sshdr;
 	const struct scsi_exec_args exec_args = {
 		.req_flags = BLK_MQ_REQ_PM,
-		/* caller might not be interested in sense, but we need it */
-		.sshdr = sshdr ? : &my_sshdr,
+		.sshdr = &sshdr,
 	};
 
 	if (!scsi_device_online(sdp))
 		return -ENODEV;
 
-	sshdr = exec_args.sshdr;
-
 	for (retries = 3; retries > 0; --retries) {
 		unsigned char cmd[16] = { 0 };
 
@@ -1606,15 +1603,23 @@ static int sd_sync_cache(struct scsi_disk *sdkp, struct scsi_sense_hdr *sshdr)
 			return res;
 
 		if (scsi_status_is_check_condition(res) &&
-		    scsi_sense_valid(sshdr)) {
-			sd_print_sense_hdr(sdkp, sshdr);
+		    scsi_sense_valid(&sshdr)) {
+			sd_print_sense_hdr(sdkp, &sshdr);
 
 			/* we need to evaluate the error return  */
-			if (sshdr->asc == 0x3a ||	/* medium not present */
-			    sshdr->asc == 0x20 ||	/* invalid command */
-			    (sshdr->asc == 0x74 && sshdr->ascq == 0x71))	/* drive is password locked */
+			if (sshdr.asc == 0x3a ||	/* medium not present */
+			    sshdr.asc == 0x20 ||	/* invalid command */
+			    (sshdr.asc == 0x74 && sshdr.ascq == 0x71))	/* drive is password locked */
 				/* this is no error here */
 				return 0;
+			/*
+			 * This drive doesn't support sync and there's not much
+			 * we can do because this is called during shutdown
+			 * or suspend so just return success so those operations
+			 * can proceed.
+			 */
+			if (sshdr.sense_key == ILLEGAL_REQUEST)
+				return 0;
 		}
 
 		switch (host_byte(res)) {
@@ -3816,7 +3821,7 @@ static void sd_shutdown(struct device *dev)
 
 	if (sdkp->WCE && sdkp->media_present) {
 		sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
-		sd_sync_cache(sdkp, NULL);
+		sd_sync_cache(sdkp);
 	}
 
 	if (system_state != SYSTEM_RESTART && sdkp->device->manage_start_stop) {
@@ -3828,7 +3833,6 @@ static void sd_shutdown(struct device *dev)
 static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
-	struct scsi_sense_hdr sshdr;
 	int ret = 0;
 
 	if (!sdkp)	/* E.g.: runtime suspend following sd_remove() */
@@ -3837,24 +3841,13 @@ static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 	if (sdkp->WCE && sdkp->media_present) {
 		if (!sdkp->device->silence_suspend)
 			sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
-		ret = sd_sync_cache(sdkp, &sshdr);
-
-		if (ret) {
-			/* ignore OFFLINE device */
-			if (ret == -ENODEV)
-				return 0;
-
-			if (!scsi_sense_valid(&sshdr) ||
-			    sshdr.sense_key != ILLEGAL_REQUEST)
-				return ret;
+		ret = sd_sync_cache(sdkp);
+		/* ignore OFFLINE device */
+		if (ret == -ENODEV)
+			return 0;
 
-			/*
-			 * sshdr.sense_key == ILLEGAL_REQUEST means this drive
-			 * doesn't support sync. There's not much to do and
-			 * suspend shouldn't fail.
-			 */
-			ret = 0;
-		}
+		if (ret)
+			return ret;
 	}
 
 	if (sdkp->device->manage_start_stop) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (17 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:10   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 20/34] scsi: ch: Remove unit_attention Mike Christie
                   ` (14 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has sd_sync_cache have scsi-ml retry errors instead of driving them
itself.

There is one behavior change where we no longer retry when
scsi_execute_cmd returns < 0, but we should be ok. We don't need to retry
for failures like the queue being removed, and for the case where there
are no tags/reqs the block layer waits/retries for us. For possible memory
allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
will probably not help.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1f6cc24d633b..8efd5d8e46e8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1566,36 +1566,32 @@ static unsigned int sd_check_events(struct gendisk *disk, unsigned int clearing)
 
 static int sd_sync_cache(struct scsi_disk *sdkp)
 {
-	int retries, res;
+	int res;
 	struct scsi_device *sdp = sdkp->device;
 	const int timeout = sdp->request_queue->rq_timeout
 		* SD_FLUSH_TIMEOUT_MULTIPLIER;
+	/* Leave the rest of the command zero to indicate flush everything. */
+	const unsigned char cmd[16] = { sdp->use_16_for_sync ?
+				SYNCHRONIZE_CACHE_16 : SYNCHRONIZE_CACHE };
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		{
+			.allowed = 3,
+			.result = SCMD_FAILURE_RESULT_ANY,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.req_flags = BLK_MQ_REQ_PM,
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
 	if (!scsi_device_online(sdp))
 		return -ENODEV;
 
-	for (retries = 3; retries > 0; --retries) {
-		unsigned char cmd[16] = { 0 };
-
-		if (sdp->use_16_for_sync)
-			cmd[0] = SYNCHRONIZE_CACHE_16;
-		else
-			cmd[0] = SYNCHRONIZE_CACHE;
-		/*
-		 * Leave the rest of the command zero to indicate
-		 * flush everything.
-		 */
-		res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0,
-				       timeout, sdkp->max_retries, &exec_args);
-		if (res == 0)
-			break;
-	}
-
+	res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, timeout,
+			       sdkp->max_retries, &exec_args);
 	if (res) {
 		sd_print_result(sdkp, "Synchronize Cache(10) failed", res);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 20/34] scsi: ch: Remove unit_attention
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (18 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:08   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors Mike Christie
                   ` (13 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

unit_attention is not used so remove it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/ch.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/scsi/ch.c b/drivers/scsi/ch.c
index cb0a399be1cc..1a998e45978e 100644
--- a/drivers/scsi/ch.c
+++ b/drivers/scsi/ch.c
@@ -113,7 +113,6 @@ typedef struct {
 	struct scsi_device  **dt;        /* ptrs to data transfer elements */
 	u_int               firsts[CH_TYPES];
 	u_int               counts[CH_TYPES];
-	u_int               unit_attention;
 	u_int		    voltags;
 	struct mutex	    lock;
 } scsi_changer;
@@ -208,7 +207,6 @@ ch_do_scsi(scsi_changer *ch, unsigned char *cmd, int cmd_len,
 
 		switch(sshdr.sense_key) {
 		case UNIT_ATTENTION:
-			ch->unit_attention = 1;
 			if (retries++ < 3)
 				goto retry;
 			break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (19 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 20/34] scsi: ch: Remove unit_attention Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:11   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs Mike Christie
                   ` (12 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has ch_do_scsi have scsi-ml retry errors instead of driving them
itself.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/ch.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/ch.c b/drivers/scsi/ch.c
index 1a998e45978e..4412562a3df8 100644
--- a/drivers/scsi/ch.c
+++ b/drivers/scsi/ch.c
@@ -185,17 +185,26 @@ static int
 ch_do_scsi(scsi_changer *ch, unsigned char *cmd, int cmd_len,
 	   void *buffer, unsigned int buflength, enum req_op op)
 {
-	int errno, retries = 0, timeout, result;
+	int errno = 0, timeout, result;
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = 3,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
 	timeout = (cmd[0] == INITIALIZE_ELEMENT_STATUS)
 		? timeout_init : timeout_move;
 
- retry:
-	errno = 0;
 	result = scsi_execute_cmd(ch->device, cmd, op, buffer, buflength,
 				  timeout * HZ, MAX_RETRIES, &exec_args);
 	if (result < 0)
@@ -204,13 +213,6 @@ ch_do_scsi(scsi_changer *ch, unsigned char *cmd, int cmd_len,
 		if (debug)
 			scsi_print_sense_hdr(ch->device, ch->name, &sshdr);
 		errno = ch_find_errno(&sshdr);
-
-		switch(sshdr.sense_key) {
-		case UNIT_ATTENTION:
-			if (retries++ < 3)
-				goto retry;
-			break;
-		}
 	}
 	return errno;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (20 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:11   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use Mike Christie
                   ` (11 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has scsi_mode_sense have scsi-ml retry UAs instead of driving them
itself.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_lib.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9432bbb64fd..67d74f175c4c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2186,11 +2186,22 @@ scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, int subpage,
 	unsigned char cmd[12];
 	int use_10_for_ms;
 	int header_length;
-	int result, retry_count = retries;
+	int result;
 	struct scsi_sense_hdr my_sshdr;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = retries,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		/* caller might not be interested in sense, but we need it */
 		.sshdr = sshdr ? : &my_sshdr,
+		.failures = failures,
 	};
 
 	memset(data, 0, sizeof(*data));
@@ -2252,12 +2263,6 @@ scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, int subpage,
 					goto retry;
 				}
 			}
-			if (scsi_status_is_check_condition(result) &&
-			    sshdr->sense_key == UNIT_ATTENTION &&
-			    retry_count) {
-				retry_count--;
-				goto retry;
-			}
 		}
 		return -EIO;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (21 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:13   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors Mike Christie
                   ` (10 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

The sshdr passed into scsi_execute_cmd is only initialized if
scsi_execute_cmd returns >= 0, and scsi_mode_sense will convert all non
good statuses like check conditions to -EIO. This has scsi_mode_sense
callers that were possibly accessing an uninitialized sshdrs to only
access it if we got -EIO.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8efd5d8e46e8..6dd2dde75354 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2938,7 +2938,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 	}
 
 bad_sense:
-	if (scsi_sense_valid(&sshdr) &&
+	if (res == -EIO && scsi_sense_valid(&sshdr) &&
 	    sshdr.sense_key == ILLEGAL_REQUEST &&
 	    sshdr.asc == 0x24 && sshdr.ascq == 0x0)
 		/* Invalid field in CDB */
@@ -2986,7 +2986,7 @@ static void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
 		sd_first_printk(KERN_WARNING, sdkp,
 			  "getting Control mode page failed, assume no ATO\n");
 
-		if (scsi_sense_valid(&sshdr))
+		if (res == -EIO && scsi_sense_valid(&sshdr))
 			sd_print_sense_hdr(sdkp, &sshdr);
 
 		return;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (22 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:17   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs Mike Christie
                   ` (9 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has scsi_report_lun_scan have scsi-ml retry errors instead of driving
them itself.

There is one behavior change where we no longer retry when
scsi_execute_cmd returns < 0, but we should be ok. We don't need to retry
for failures like the queue being removed, and for the case where there
are no tags/reqs the block layer waits/retries for us. For possible memory
allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
will probably not help.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_scan.c | 56 +++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index c27b64a1b239..b29c47c2a5c6 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1412,14 +1412,33 @@ static int scsi_report_lun_scan(struct scsi_target *starget, blist_flags_t bflag
 	unsigned int length;
 	u64 lun;
 	unsigned int num_luns;
-	unsigned int retries;
 	int result;
 	struct scsi_lun *lunp, *lun_data;
-	struct scsi_sense_hdr sshdr;
 	struct scsi_device *sdev;
 	struct Scsi_Host *shost = dev_to_shost(&starget->dev);
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = 3,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		/* Fail all CCs except the UA above */
+		{
+			.sense = SCMD_FAILURE_SENSE_ANY,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		/* Retry any oher errors not listed above */
+		{
+			.result = SCMD_FAILURE_RESULT_ANY,
+			.allowed = 3,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
-		.sshdr = &sshdr,
+		.failures = failures,
 	};
 	int ret = 0;
 
@@ -1490,29 +1509,18 @@ static int scsi_report_lun_scan(struct scsi_target *starget, blist_flags_t bflag
 	 * should come through as a check condition, and will not generate
 	 * a retry.
 	 */
-	for (retries = 0; retries < 3; retries++) {
-		SCSI_LOG_SCAN_BUS(3, sdev_printk (KERN_INFO, sdev,
-				"scsi scan: Sending REPORT LUNS to (try %d)\n",
-				retries));
-
-		result = scsi_execute_cmd(sdev, scsi_cmd, REQ_OP_DRV_IN,
-					  lun_data, length,
-					  SCSI_REPORT_LUNS_TIMEOUT, 3,
-					  &exec_args);
+	scsi_reset_failures(failures);
 
-		SCSI_LOG_SCAN_BUS(3, sdev_printk (KERN_INFO, sdev,
-				"scsi scan: REPORT LUNS"
-				" %s (try %d) result 0x%x\n",
-				result ?  "failed" : "successful",
-				retries, result));
-		if (result == 0)
-			break;
-		else if (scsi_sense_valid(&sshdr)) {
-			if (sshdr.sense_key != UNIT_ATTENTION)
-				break;
-		}
-	}
+	SCSI_LOG_SCAN_BUS(3, sdev_printk (KERN_INFO, sdev,
+			  "scsi scan: Sending REPORT LUNS\n"));
+
+	result = scsi_execute_cmd(sdev, scsi_cmd, REQ_OP_DRV_IN, lun_data,
+				  length, SCSI_REPORT_LUNS_TIMEOUT, 3,
+				  &exec_args);
 
+	SCSI_LOG_SCAN_BUS(3, sdev_printk (KERN_INFO, sdev,
+			  "scsi scan: REPORT LUNS  %s result 0x%x\n",
+			  result ?  "failed" : "successful", result));
 	if (result) {
 		/*
 		 * The device probably does not support a REPORT LUN command
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (23 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:18   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors Mike Christie
                   ` (8 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

It's common to get a UA when doing PR commands. It could be due to a
target restarting, transport level relogin or other PR commands like a
release causing it. The upper layers don't get the sense and in some cases
have no idea if it's a SCSI device, so this has the sd layer retry.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 6dd2dde75354..5b80f1df4cd9 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1718,8 +1718,19 @@ static int sd_pr_in_command(struct block_device *bdev, u8 sa,
 	struct scsi_device *sdev = sdkp->device;
 	struct scsi_sense_hdr sshdr;
 	u8 cmd[10] = { PERSISTENT_RESERVE_IN, sa };
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = 5,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 	int result;
 
@@ -1806,8 +1817,19 @@ static int sd_pr_out_command(struct block_device *bdev, u8 sa, u64 key,
 	struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk);
 	struct scsi_device *sdev = sdkp->device;
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = 5,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 	int result;
 	u8 cmd[16] = { 0, };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (24 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:25   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors Mike Christie
                   ` (7 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has read_capacity_10 have scsi-ml retry errors instead of driving
them itself.

There are two behavior changes:
1. We no longer retry when scsi_execute_cmd returns < 0, but we should be
ok. We don't need to retry for failures like the queue being removed, and
for the case where there are no tags/reqs the block layer waits/retries
for us. For possible memory allocation failures from blk_rq_map_kern we
use GFP_NOIO, so retrying will probably not help.
2. For device reset UAs, we would retry READ_CAPACITY_RETRIES_ON_RESET
times, then once those are used up we would hit the main do loops retry
counter and get 3 more retries. We now only get
READ_CAPACITY_RETRIES_ON_RESET retries.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 64 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5b80f1df4cd9..781df1fefa84 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2528,42 +2528,60 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp,
 static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp,
 						unsigned char *buffer)
 {
-	unsigned char cmd[16];
+	static const u8 cmd[10] = { READ_CAPACITY };
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		/* Fail immediately for Medium Not Present */
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x3A,
+			.ascq = 0x0,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = 0x3A,
+			.ascq = 0x0,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x29,
+			.ascq = 0,
+			/* Device reset might occur several times */
+			.allowed = READ_CAPACITY_RETRIES_ON_RESET,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		/* Any other error not listed above retry */
+		{
+			.result = SCMD_FAILURE_RESULT_ANY,
+			.allowed = 3,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
+		.failures = failures,
 	};
 	int sense_valid = 0;
 	int the_result;
-	int retries = 3, reset_retries = READ_CAPACITY_RETRIES_ON_RESET;
 	sector_t lba;
 	unsigned sector_size;
 
-	do {
-		cmd[0] = READ_CAPACITY;
-		memset(&cmd[1], 0, 9);
-		memset(buffer, 0, 8);
+	memset(buffer, 0, 8);
 
-		the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, buffer,
-					      8, SD_TIMEOUT, sdkp->max_retries,
-					      &exec_args);
+	the_result = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, buffer,
+				      8, SD_TIMEOUT, sdkp->max_retries,
+				      &exec_args);
+
+	if (the_result > 0) {
+		sense_valid = scsi_sense_valid(&sshdr);
 
 		if (media_not_present(sdkp, &sshdr))
 			return -ENODEV;
-
-		if (the_result > 0) {
-			sense_valid = scsi_sense_valid(&sshdr);
-			if (sense_valid &&
-			    sshdr.sense_key == UNIT_ATTENTION &&
-			    sshdr.asc == 0x29 && sshdr.ascq == 0x00)
-				/* Device reset might occur several times,
-				 * give it one more chance */
-				if (--reset_retries > 0)
-					continue;
-		}
-		retries--;
-
-	} while (the_result && retries);
+	}
 
 	if (the_result) {
 		sd_print_result(sdkp, "Read Capacity(10) failed", the_result);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (25 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:34   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors Mike Christie
                   ` (6 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has ses have scsi-ml retry scsi_exec_req errors instead of driving
them itself.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/ses.c | 60 ++++++++++++++++++++++++++++++----------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index d7d0c35c58b8..f3d497366af1 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -87,19 +87,29 @@ static int ses_recv_diag(struct scsi_device *sdev, int page_code,
 		0
 	};
 	unsigned char recv_page_code;
-	unsigned int retries = SES_RETRIES;
-	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x29,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SES_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SES_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
-		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
-	do {
-		ret = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_IN, buf, bufflen,
-				       SES_TIMEOUT, 1, &exec_args);
-	} while (ret > 0 && --retries && scsi_sense_valid(&sshdr) &&
-		 (sshdr.sense_key == NOT_READY ||
-		  (sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29)));
-
+	ret = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_IN, buf, bufflen,
+			       SES_TIMEOUT, 1, &exec_args);
 	if (unlikely(ret))
 		return ret;
 
@@ -131,19 +141,29 @@ static int ses_send_diag(struct scsi_device *sdev, int page_code,
 		bufflen & 0xff,
 		0
 	};
-	struct scsi_sense_hdr sshdr;
-	unsigned int retries = SES_RETRIES;
+	struct scsi_failure failures[] = {
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x29,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SES_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SES_RETRIES,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
 	const struct scsi_exec_args exec_args = {
-		.sshdr = &sshdr,
+		.failures = failures,
 	};
 
-	do {
-		result = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_OUT, buf,
-					  bufflen, SES_TIMEOUT, 1, &exec_args);
-	} while (result > 0 && --retries && scsi_sense_valid(&sshdr) &&
-		 (sshdr.sense_key == NOT_READY ||
-		  (sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29)));
-
+	result = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_OUT, buf, bufflen,
+				  SES_TIMEOUT, 1, &exec_args);
 	if (result)
 		sdev_printk(KERN_ERR, sdev, "SEND DIAGNOSTIC result: %8x\n",
 			    result);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (26 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:36   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors Mike Christie
                   ` (5 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has get_sectorsize have scsi-ml retry errors instead of driving them
itself.

There is one behavior change where we no longer retry when
scsi_execute_cmd returns < 0, but we should be ok. We don't need to retry
for failures like the queue being removed, and for the case where there
are no tags/reqs the block layer waits/retries for us. For possible memory
allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
will probably not help.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sr.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 07ef3db3d1a1..100480f5bc2c 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -716,27 +716,26 @@ static int sr_probe(struct device *dev)
 
 static void get_sectorsize(struct scsi_cd *cd)
 {
-	unsigned char cmd[10];
-	unsigned char buffer[8];
-	int the_result, retries = 3;
+	static const u8 cmd[10] = { READ_CAPACITY };
+	unsigned char buffer[8] = { };
+	int the_result;
 	int sector_size;
 	struct request_queue *queue;
+	struct scsi_failure failures[] = {
+		{
+			.result = SCMD_FAILURE_RESULT_ANY,
+			.allowed = 3,
+		},
+		{}
+	};
+	const struct scsi_exec_args exec_args = {
+		.failures = failures,
+	};
 
-	do {
-		cmd[0] = READ_CAPACITY;
-		memset((void *) &cmd[1], 0, 9);
-		memset(buffer, 0, sizeof(buffer));
-
-		/* Do the command and wait.. */
-		the_result = scsi_execute_cmd(cd->device, cmd, REQ_OP_DRV_IN,
-					      buffer, sizeof(buffer),
-					      SR_TIMEOUT, MAX_RETRIES, NULL);
-
-		retries--;
-
-	} while (the_result && retries);
-
-
+	/* Do the command and wait.. */
+	the_result = scsi_execute_cmd(cd->device, cmd, REQ_OP_DRV_IN, buffer,
+				      sizeof(buffer), SR_TIMEOUT, MAX_RETRIES,
+				      &exec_args);
 	if (the_result) {
 		cd->capacity = 0x1fffff;
 		sector_size = 2048;	/* A guess, just in case */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (27 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:37   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready Mike Christie
                   ` (4 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

This has scsi-ml retry errors instead of driving them itself.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/ufs/core/ufshcd.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index c2df07545f96..a65b62d11739 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -9266,7 +9266,14 @@ static int ufshcd_execute_start_stop(struct scsi_device *sdev,
 				     struct scsi_sense_hdr *sshdr)
 {
 	const unsigned char cdb[6] = { START_STOP, 0, 0, 0, pwr_mode << 4, 0 };
+	struct scsi_failure failures[] = {
+		{
+			.allowed = 2,
+			.result = SCMD_FAILURE_RESULT_ANY,
+		},
+	};
 	const struct scsi_exec_args args = {
+		.failures = failures,
 		.sshdr = sshdr,
 		.req_flags = BLK_MQ_REQ_PM,
 		.scmd_flags = SCMD_FAIL_IF_RECOVERING,
@@ -9292,7 +9299,7 @@ static int ufshcd_set_dev_pwr_mode(struct ufs_hba *hba,
 	struct scsi_sense_hdr sshdr;
 	struct scsi_device *sdp;
 	unsigned long flags;
-	int ret, retries;
+	int ret;
 
 	spin_lock_irqsave(hba->host->host_lock, flags);
 	sdp = hba->ufs_device_wlun;
@@ -9318,15 +9325,7 @@ static int ufshcd_set_dev_pwr_mode(struct ufs_hba *hba,
 	 * callbacks hence set the RQF_PM flag so that it doesn't resume the
 	 * already suspended childs.
 	 */
-	for (retries = 3; retries > 0; --retries) {
-		ret = ufshcd_execute_start_stop(sdp, pwr_mode, &sshdr);
-		/*
-		 * scsi_execute() only returns a negative value if the request
-		 * queue is dying.
-		 */
-		if (ret <= 0)
-			break;
-	}
+	ret = ufshcd_execute_start_stop(sdp, pwr_mode, &sshdr);
 	if (ret) {
 		sdev_printk(KERN_WARNING, sdp,
 			    "START_STOP failed for power mode: %d, result %x\n",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (28 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:38   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable Mike Christie
                   ` (3 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_lib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 67d74f175c4c..756e13637f15 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2319,10 +2319,10 @@ scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries,
 	do {
 		result = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_IN, NULL, 0,
 					  timeout, 1, &exec_args);
-		if (sdev->removable && scsi_sense_valid(sshdr) &&
+		if (sdev->removable && result > 0 && scsi_sense_valid(sshdr) &&
 		    sshdr->sense_key == UNIT_ATTENTION)
 			sdev->changed = 1;
-	} while (scsi_sense_valid(sshdr) &&
+	} while (result > 0 && scsi_sense_valid(sshdr) &&
 		 sshdr->sense_key == UNIT_ATTENTION && --retries);
 
 	return result;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (29 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:39   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store Mike Christie
                   ` (2 subsequent siblings)
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d0911bc28663..d1c0ba3ef1f5 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -692,7 +692,7 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
 		ret = scsi_mode_select(sdev, 1, 0, buf_data, len, 5 * HZ, 3,
 				       &data, &sshdr);
 		if (ret) {
-			if (scsi_sense_valid(&sshdr))
+			if (ret > 0 && scsi_sense_valid(&sshdr))
 				scsi_print_sense_hdr(sdev,
 					dev_name(&sdev->sdev_gendev), &sshdr);
 			return ret;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (30 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:44   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events Mike Christie
  2023-09-05 23:15 ` [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough Mike Christie
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 781df1fefa84..a5c67129b0c3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -143,7 +143,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 	struct scsi_mode_data data;
 	struct scsi_sense_hdr sshdr;
 	static const char temp[] = "temporary ";
-	int len;
+	int len, ret;
 
 	if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
 		/* no cache control on RBC devices; theoretically they
@@ -190,9 +190,10 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 	 */
 	data.device_specific = 0;
 
-	if (scsi_mode_select(sdp, 1, sp, buffer_data, len, SD_TIMEOUT,
-			     sdkp->max_retries, &data, &sshdr)) {
-		if (scsi_sense_valid(&sshdr))
+	ret = scsi_mode_select(sdp, 1, sp, buffer_data, len, SD_TIMEOUT,
+			       sdkp->max_retries, &data, &sshdr);
+	if (ret) {
+		if (ret > 0 && scsi_sense_valid(&sshdr))
 			sd_print_sense_hdr(sdkp, &sshdr);
 		return -EINVAL;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (31 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:44   ` Martin Wilck
  2023-09-05 23:15 ` [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough Mike Christie
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so we
shouldn't access the sshdr. If it returns 0, then the cmd executed
successfully, so there is no need to check the sshdr. This has us access
the sshdr when we get a return value > 0.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 100480f5bc2c..a1889709c84c 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -177,7 +177,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
 
 	result = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_IN, buf, sizeof(buf),
 				  SR_TIMEOUT, MAX_RETRIES, &exec_args);
-	if (scsi_sense_valid(&sshdr) && sshdr.sense_key == UNIT_ATTENTION)
+	if (result > 0 && scsi_sense_valid(&sshdr) &&
+	    sshdr.sense_key == UNIT_ATTENTION)
 		return DISK_EVENT_MEDIA_CHANGE;
 
 	if (result || be16_to_cpu(eh->data_len) < sizeof(*med))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough
  2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
                   ` (32 preceding siblings ...)
  2023-09-05 23:15 ` [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events Mike Christie
@ 2023-09-05 23:15 ` Mike Christie
  2023-09-15 21:52   ` Martin Wilck
  33 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-05 23:15 UTC (permalink / raw)
  To: john.g.garry, bvanassche, mwilck, hch, martin.petersen,
	linux-scsi, james.bottomley
  Cc: Mike Christie

Add some kunit tests for scsi_check_passthrough so we can easily make sure
we are hitting the cases it's difficult to replicate in hardware or even
scsi_debug.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/Kconfig           |   9 ++
 drivers/scsi/scsi_error.c      |   4 +
 drivers/scsi/scsi_error_test.c | 170 +++++++++++++++++++++++++++++++++
 3 files changed, 183 insertions(+)
 create mode 100644 drivers/scsi/scsi_error_test.c

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 695a57d894cd..b7ffb435afb5 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -67,6 +67,15 @@ config SCSI_PROC_FS
 
 	  If unsure say Y.
 
+config SCSI_KUNIT_TEST
+	tristate "KUnit tests for SCSI Mid Layer" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Run SCSI Mid Layer's KUnit tests.
+
+	  If unsure say N.
+
 comment "SCSI support type (disk, tape, CD-ROM)"
 	depends on SCSI
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index d2fb28212880..f12ab199a03f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2663,3 +2663,7 @@ bool scsi_get_sense_info_fld(const u8 *sense_buffer, int sb_len,
 	}
 }
 EXPORT_SYMBOL(scsi_get_sense_info_fld);
+
+#ifdef CONFIG_SCSI_KUNIT_TEST
+#include "scsi_error_test.c"
+#endif
diff --git a/drivers/scsi/scsi_error_test.c b/drivers/scsi/scsi_error_test.c
new file mode 100644
index 000000000000..c01201ad8702
--- /dev/null
+++ b/drivers/scsi/scsi_error_test.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for scsi_error.c.
+ *
+ * Copyright (C) 2022, Oracle Corporation
+ */
+#include <kunit/test.h>
+
+#include <scsi/scsi_proto.h>
+#include <scsi/scsi_cmnd.h>
+
+#define SCSI_TEST_ERROR_MAX_ALLOWED 3
+
+static void scsi_test_error_check_passthough(struct kunit *test)
+{
+	struct scsi_failure multiple_sense_failures[] = {
+		{
+			.sense = DATA_PROTECT,
+			.asc = 0x1,
+			.ascq = 0x1,
+			.allowed = 0,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = UNIT_ATTENTION,
+			.asc = 0x11,
+			.ascq = 0x0,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = NOT_READY,
+			.asc = 0x11,
+			.ascq = 0x22,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = ABORTED_COMMAND,
+			.asc = 0x11,
+			.ascq = SCMD_FAILURE_ASCQ_ANY,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = HARDWARE_ERROR,
+			.asc = SCMD_FAILURE_ASC_ANY,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{
+			.sense = ILLEGAL_REQUEST,
+			.asc = 0x91,
+			.ascq = 0x36,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+			.result = SAM_STAT_CHECK_CONDITION,
+		},
+		{}
+	};
+	struct scsi_failure retryable_host_failures[] = {
+		{
+			.result = DID_TRANSPORT_DISRUPTED << 16,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+		},
+		{
+			.result = DID_TIME_OUT << 16,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+		},
+		{}
+	};
+	struct scsi_failure any_status_failures[] = {
+		{
+			.result = SCMD_FAILURE_STAT_ANY,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+		},
+		{}
+	};
+	struct scsi_failure any_sense_failures[] = {
+		{
+			.result = SCMD_FAILURE_SENSE_ANY,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+		},
+		{}
+	};
+	struct scsi_failure any_failures[] = {
+		{
+			.result = SCMD_FAILURE_RESULT_ANY,
+			.allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
+		},
+		{}
+	};
+	u8 sense[SCSI_SENSE_BUFFERSIZE] = {};
+	struct scsi_cmnd sc = {
+		.sense_buffer = sense,
+		.failures = multiple_sense_failures,
+	};
+	int i;
+
+	/* Match end of array */
+	scsi_build_sense(&sc, 0, ILLEGAL_REQUEST, 0x91, 0x36);
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+	/* Basic match in array */
+	scsi_build_sense(&sc, 0, UNIT_ATTENTION, 0x11, 0x0);
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+	/* No matching sense entry */
+	scsi_build_sense(&sc, 0, MISCOMPARE, 0x11, 0x11);
+	KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
+			scsi_check_passthrough(&sc));
+	/* Match using SCMD_FAILURE_ASCQ_ANY */
+	scsi_build_sense(&sc, 0, ABORTED_COMMAND, 0x22, 0x22);
+	KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
+			scsi_check_passthrough(&sc));
+	/* Match using SCMD_FAILURE_ASC_ANY */
+	scsi_build_sense(&sc, 0, HARDWARE_ERROR, 0x11, 0x22);
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+	/* No matching status entry */
+	sc.result = SAM_STAT_RESERVATION_CONFLICT;
+	KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
+			scsi_check_passthrough(&sc));
+
+	/* Test hitting allowed limit */
+	scsi_build_sense(&sc, 0, NOT_READY, 0x11, 0x22);
+	for (i = 0; i < SCSI_TEST_ERROR_MAX_ALLOWED; i++)
+		KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+	KUNIT_EXPECT_EQ(test, SUCCESS, scsi_check_passthrough(&sc));
+
+	/* Match using SCMD_FAILURE_SENSE_ANY */
+	sc.failures = any_sense_failures;
+	scsi_build_sense(&sc, 0, MEDIUM_ERROR, 0x11, 0x22);
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+
+	/* reset retries so we can retest */
+	scsi_reset_failures(multiple_sense_failures);
+
+	/* Test no retries allowed */
+	sc.failures = multiple_sense_failures;
+	scsi_build_sense(&sc, 0, DATA_PROTECT, 0x1, 0x1);
+	KUNIT_EXPECT_EQ(test, SUCCESS, scsi_check_passthrough(&sc));
+
+	/* No matching host byte entry */
+	sc.failures = retryable_host_failures;
+	sc.result = DID_NO_CONNECT << 16;
+	KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
+			scsi_check_passthrough(&sc));
+	/* Matching host byte entry */
+	sc.result = DID_TIME_OUT << 16;
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+
+	/* Match SCMD_FAILURE_RESULT_ANY */
+	sc.failures = any_failures;
+	sc.result = DID_TRANSPORT_FAILFAST << 16;
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+
+	/* Test any status handling */
+	sc.failures = any_status_failures;
+	sc.result = SAM_STAT_RESERVATION_CONFLICT;
+	KUNIT_EXPECT_EQ(test, NEEDS_RETRY, scsi_check_passthrough(&sc));
+}
+
+static struct kunit_case scsi_test_error_cases[] = {
+	KUNIT_CASE(scsi_test_error_check_passthough),
+	{}
+};
+
+static struct kunit_suite scsi_test_error_suite = {
+	.name = "scsi_error",
+	.test_cases = scsi_test_error_cases,
+};
+
+kunit_test_suite(scsi_test_error_suite);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry
  2023-09-05 23:15 ` [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry Mike Christie
@ 2023-09-15 20:08   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:08 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> For passthrough, we don't retry any error we get a check condition
> for.
> This results in a lot of callers driving their own retries for those
> types
> of errors and retrying all errors, and there has been a request to
> retry
> specific host byte errors.
> 
> This adds the core code to allow passthrough users to specify what
> errors
> they want scsi-ml to retry for them. We can then convert users to
> drop
> their sense parsing and retry handling.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors
  2023-09-05 23:15 ` [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors Mike Christie
@ 2023-09-15 20:11   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:11 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has scsi_probe_lun ask scsi-ml to retry UAs instead of driving
> them
> itself.
> 
> There is one behavior change with this patch. We used to get a total
> of
> 3 retries for both UAs we were checking for. We now get 3 retries for
> each.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Reviewed-by: Martin Wilck <mwilck@suse.com>



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 05/34] scsi: retry INQUIRY after timeout
  2023-09-05 23:15 ` [PATCH v11 05/34] scsi: retry INQUIRY after timeout Mike Christie
@ 2023-09-15 20:11   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:11 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> Description from: Martin Wilck <mwilck@suse.com>:
> 
> The SCSI mid layer doesn't retry commands after DID_TIME_OUT (see
> scsi_noretry_cmd()). Packet loss in the fabric can cause spurious
> timeouts
> during SCSI device probing, causing device probing to fail. This has
> been
> observed in FCoE uplink failover tests, for example.
> 
> This patch fixes the issue by retrying the INQUIRY.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16
  2023-09-05 23:15 ` [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16 Mike Christie
@ 2023-09-15 20:13   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:13 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Reviewed-by: Martin Wilck <mwilck@suse.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-05 23:15 ` [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors Mike Christie
@ 2023-09-15 20:21   ` Martin Wilck
  2023-09-15 21:34     ` Martin Wilck
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:21 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has read_capacity_16 have scsi-ml retry errors instead of
> driving
> them itself.
> 
> There are 2 behavior changes with this patch:
> 1. There is one behavior change where we no longer retry when
> scsi_execute_cmd returns < 0, but we should be ok. We don't need to
> retry
> for failures like the queue being removed, and for the case where
> there
> are no tags/reqs since the block layer waits/retries for us. For
> possible
> memory allocation failures from blk_rq_map_kern we use GFP_NOIO, so
> retrying will probably not help.
> 2. For the specific UAs we checked for and retried, we would get
> READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries were
> left
> from the loop's retries. Each UA now gets
> READ_CAPACITY_RETRIES_ON_RESET
> reties, and the other errors (not including medium not present) get
> up
> to 3 retries.

This is ok, but - just a thought - would it make sense to add a field
for maximum total retry count (summed over all failures)? That would
allow us to minimize behavior changes also in other cases.

> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk
  2023-09-05 23:15 ` [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk Mike Christie
@ 2023-09-15 20:26   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:26 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> We currently re-use the cmd buffer for the TUR and START_STOP
> commands
> which requires us to reset the buffer when retrying. This has us use
> separate buffers for the 2 commands so we can make them const and I
> think
> it makes it easier to handle for retries but does not add too much
> extra
> to the stack use.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 09/34] scsi: sd: Fix sshdr use in sd_spinup_disk
  2023-09-05 23:15 ` [PATCH v11 09/34] scsi: sd: Fix sshdr use " Mike Christie
@ 2023-09-15 20:27   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:27 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors
  2023-09-05 23:15 ` [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors Mike Christie
@ 2023-09-15 20:46   ` Martin Wilck
  2023-09-15 20:58     ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:46 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This simplifies sd_spinup_disk so scsi-ml retries errors for it. Note
> that
> we retried specifically on a UA and also if scsi_status_is_good
> returned
> failed which would happen for all check conditions. In this patch we
> use
> SCMD_FAILURE_STAT_ANY which will trigger for the same conditions as
> when scsi_status_is_good returns false. This will cover all CCs
> including
> UAs so there is no explicit failures arrary entry for UAs.
> 
> We do not handle the outside loop's retries because we want to sleep
> between tries and we don't support that yet.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/sd.c | 73 ++++++++++++++++++++++++++-------------------
> --
>  1 file changed, 41 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 74967e1b19da..f5e6b5cc762f 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2151,55 +2151,64 @@ static int sd_done(struct scsi_cmnd *SCpnt)
>  static void
>  sd_spinup_disk(struct scsi_disk *sdkp)
>  {
> -       unsigned char cmd[10];
> +       static const u8 cmd[10] = { TEST_UNIT_READY };
>         unsigned long spintime_expire = 0;
> -       int retries, spintime;
> +       int spintime, sense_valid = 0;
>         unsigned int the_result;
>         struct scsi_sense_hdr sshdr;
> +       struct scsi_failure failures[] = {
> +               /* Fail immediately for Medium Not Present */
> +               {
> +                       .sense = UNIT_ATTENTION,
> +                       .asc = 0x3A,

Shouldn't you set .ascq = SCMD_FAILURE_ASCQ_ANY here, and below as
well?

> +                       .allowed = 0,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = NOT_READY,
> +                       .asc = 0x3A,
> +                       .allowed = 0,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .result = SCMD_FAILURE_STAT_ANY,
> +                       .allowed = 3,
> +               },
> +               {}
> +       };
>         const struct scsi_exec_args exec_args = {
>                 .sshdr = &sshdr,
> +               .failures = failures,
>         };
> -       int sense_valid = 0;
>  
>         spintime = 0;
>  
>         /* Spin up drives, as required.  Only do this at boot time */
>         /* Spinup needs to be done for module loads too. */
>         do {
> -               retries = 0;
> -
> -               do {
> -                       bool media_was_present = sdkp->media_present;
> +               bool media_was_present = sdkp->media_present;
>  
> -                       cmd[0] = TEST_UNIT_READY;
> -                       memset((void *) &cmd[1], 0, 9);
> +               scsi_reset_failures(failures);
>  
> -                       the_result = scsi_execute_cmd(sdkp->device,
> cmd,
> -                                                     REQ_OP_DRV_IN,
> NULL, 0,
> -                                                     SD_TIMEOUT,
> -                                                     sdkp-
> >max_retries,
> -                                                     &exec_args);
> +               the_result = scsi_execute_cmd(sdkp->device, cmd,
> REQ_OP_DRV_IN,
> +                                             NULL, 0, SD_TIMEOUT,
> +                                             sdkp->max_retries,
> &exec_args);
>  
> -                       if (the_result > 0) {
> -                               /*
> -                                * If the drive has indicated to us
> that it
> -                                * doesn't have any media in it,
> don't bother
> -                                * with any more polling.
> -                                */
> -                               if (media_not_present(sdkp, &sshdr))
> {
> -                                       if (media_was_present)
> -
>                                                sd_printk(KERN_NOTICE,
> sdkp,
> -                                                         "Media
> removed, stopped polling\n");
> -                                       return;
> -                               }
>  
> -                               sense_valid =
> scsi_sense_valid(&sshdr);
> +               if (the_result > 0) {
> +                       /*
> +                        * If the drive has indicated to us that it
> doesn't
> +                        * have any media in it, don't bother with
> any more
> +                        * polling.
> +                        */
> +                       if (media_not_present(sdkp, &sshdr)) {
> +                               if (media_was_present)
> +                                       sd_printk(KERN_NOTICE, sdkp,
> +                                                 "Media removed,
> stopped polling\n");
> +                               return;
>                         }
> -                       retries++;
> -               } while (retries < 3 &&
> -                        (!scsi_status_is_good(the_result) ||
> -                         (scsi_status_is_check_condition(the_result)
> &&
> -                         sense_valid && sshdr.sense_key ==
> UNIT_ATTENTION)));
> +                       sense_valid = scsi_sense_valid(&sshdr);
> +               }
>  
>                 if (!scsi_status_is_check_condition(the_result)) {
>                         /* no sense, TUR either succeeded or failed


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0
  2023-09-05 23:15 ` [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0 Mike Christie
@ 2023-09-15 20:48   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:48 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors
  2023-09-05 23:15 ` [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors Mike Christie
@ 2023-09-15 20:51   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:51 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has hp_sw have scsi-ml retry scsi_exec_req errors instead of
> driving
> them itself.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling
  2023-09-05 23:15 ` [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling Mike Christie
@ 2023-09-15 20:54   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:54 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If send_mode_select retries scsi_execute_cmd it will leave err set to
> SCSI_DH_RETRY/SCSI_DH_IMM_RETRY. If on the retry, the command is
> successful, then SCSI_DH_RETRY/SCSI_DH_IMM_RETRY will be returned to
> the scsi_dh activation caller. On the retry, we will then detect the
> previous MODE SELECT had worked, and so we will return success.
> 
> This patch has us return the correct return value, so we can avoid
> the
> extra scsi_dh activation call and to avoid failures if the caller had
> hit its activation retry limit and does not end up retrying.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 14/34] scsi: rdac: Fix sshdr use
  2023-09-05 23:15 ` [PATCH v11 14/34] scsi: rdac: Fix sshdr use Mike Christie
@ 2023-09-15 20:55   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:55 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors
  2023-09-15 20:46   ` Martin Wilck
@ 2023-09-15 20:58     ` Mike Christie
  2023-09-15 21:23       ` Martin Wilck
  0 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-15 20:58 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/15/23 3:46 PM, Martin Wilck wrote:
>>  sd_spinup_disk(struct scsi_disk *sdkp)
>>  {
>> -       unsigned char cmd[10];
>> +       static const u8 cmd[10] = { TEST_UNIT_READY };
>>         unsigned long spintime_expire = 0;
>> -       int retries, spintime;
>> +       int spintime, sense_valid = 0;
>>         unsigned int the_result;
>>         struct scsi_sense_hdr sshdr;
>> +       struct scsi_failure failures[] = {
>> +               /* Fail immediately for Medium Not Present */
>> +               {
>> +                       .sense = UNIT_ATTENTION,
>> +                       .asc = 0x3A,
> Shouldn't you set .ascq = SCMD_FAILURE_ASCQ_ANY here, and below as
> well?

You're right. Will fix all those cases.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors
  2023-09-05 23:15 ` [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors Mike Christie
@ 2023-09-15 20:58   ` Martin Wilck
  2023-09-15 21:52     ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:58 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has rdac have scsi-ml retry errors instead of driving them
> itself.
> 
> There is one behavior change with this patch. We used to get a total
> of
> 5 retries for errors mode_select_handle_sense returned SCSI_DH_RETRY.
> We
> now get 5 retries for each failure.

... making me think of the total retry count again.

> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/device_handler/scsi_dh_rdac.c | 87 ++++++++++++--------
> --
>  1 file changed, 49 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c
> b/drivers/scsi/device_handler/scsi_dh_rdac.c
> index 1ac2ae17e8be..771108a332cb 100644
> --- a/drivers/scsi/device_handler/scsi_dh_rdac.c
> +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
> @@ -485,43 +485,17 @@ static int set_mode_select(struct scsi_device
> *sdev, struct rdac_dh_data *h)
>  static int mode_select_handle_sense(struct scsi_device *sdev,
>                                     struct scsi_sense_hdr *sense_hdr)
>  {
> -       int err = SCSI_DH_IO;
>         struct rdac_dh_data *h = sdev->handler_data;
>  
>         if (!scsi_sense_valid(sense_hdr))
> -               goto done;
> -
> -       switch (sense_hdr->sense_key) {
> -       case NO_SENSE:
> -       case ABORTED_COMMAND:
> -       case UNIT_ATTENTION:
> -               err = SCSI_DH_RETRY;
> -               break;
> -       case NOT_READY:
> -               if (sense_hdr->asc == 0x04 && sense_hdr->ascq ==
> 0x01)
> -                       /* LUN Not Ready and is in the Process of
> Becoming
> -                        * Ready
> -                        */
> -                       err = SCSI_DH_RETRY;
> -               break;
> -       case ILLEGAL_REQUEST:
> -               if (sense_hdr->asc == 0x91 && sense_hdr->ascq ==
> 0x36)
> -                       /*
> -                        * Command Lock contention
> -                        */
> -                       err = SCSI_DH_IMM_RETRY;
> -               break;
> -       default:
> -               break;
> -       }
> +               return SCSI_DH_IO;
>  
>         RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
>                 "MODE_SELECT returned with sense %02x/%02x/%02x",
>                 (char *) h->ctlr->array_name, h->ctlr->index,
>                 sense_hdr->sense_key, sense_hdr->asc, sense_hdr-
> >ascq);
>  
> -done:
> -       return err;
> +       return SCSI_DH_IO;
>  }
>  
>  static void send_mode_select(struct work_struct *work)
> @@ -530,7 +504,7 @@ static void send_mode_select(struct work_struct
> *work)
>                 container_of(work, struct rdac_controller, ms_work);
>         struct scsi_device *sdev = ctlr->ms_sdev;
>         struct rdac_dh_data *h = sdev->handler_data;
> -       int rc, err, retry_cnt = RDAC_RETRY_COUNT;
> +       int rc, err;
>         struct rdac_queue_data *tmp, *qdata;
>         LIST_HEAD(list);
>         unsigned char cdb[MAX_COMMAND_SIZE];
> @@ -538,8 +512,52 @@ static void send_mode_select(struct work_struct
> *work)
>         unsigned int data_size;
>         blk_opf_t opf = REQ_OP_DRV_OUT | REQ_FAILFAST_DEV |
>                                 REQ_FAILFAST_TRANSPORT |
> REQ_FAILFAST_DRIVER;
> +       struct scsi_failure failures[] = {
> +               {
> +                       .sense = NO_SENSE,
> +                       .asc = SCMD_FAILURE_ASC_ANY,
> +                       .ascq = SCMD_FAILURE_ASCQ_ANY,
> +                       .allowed = RDAC_RETRY_COUNT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = ABORTED_COMMAND,
> +                       .asc = SCMD_FAILURE_ASC_ANY,
> +                       .ascq = SCMD_FAILURE_ASCQ_ANY,
> +                       .allowed = RDAC_RETRY_COUNT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = UNIT_ATTENTION,
> +                       .asc = SCMD_FAILURE_ASC_ANY,
> +                       .ascq = SCMD_FAILURE_ASCQ_ANY,
> +                       .allowed = RDAC_RETRY_COUNT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       /*
> +                        * LUN Not Ready and is in the Process of
> Becoming
> +                        * Ready
> +                        */
> +                       .sense = NOT_READY,
> +                       .asc = 0x04,
> +                       .ascq = 0x01,
> +                       .allowed = RDAC_RETRY_COUNT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       /* Command Lock contention */
> +                       .sense = ILLEGAL_REQUEST,
> +                       .asc = 0x91,
> +                       .ascq = 0x36,
> +                       .allowed = SCMD_FAILURE_NO_LIMIT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {}
> +       };
>         const struct scsi_exec_args exec_args = {
>                 .sshdr = &sshdr,
> +               .failures = failures,
>         };
>  
>         spin_lock(&ctlr->ms_lock);
> @@ -548,15 +566,12 @@ static void send_mode_select(struct work_struct
> *work)
>         ctlr->ms_sdev = NULL;
>         spin_unlock(&ctlr->ms_lock);
>  
> - retry:
>         memset(cdb, 0, sizeof(cdb));
>  
>         data_size = rdac_failover_get(ctlr, &list, cdb);
>  
> -       RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
> -               "%s MODE_SELECT command",
> -               (char *) h->ctlr->array_name, h->ctlr->index,
> -               (retry_cnt == RDAC_RETRY_COUNT) ? "queueing" :
> "retrying");
> +       RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d,
> queueingMODE_SELECT command",

missing space?


> +               (char *) h->ctlr->array_name, h->ctlr->index);
>  
>         rc = scsi_execute_cmd(sdev, cdb, opf, &h->ctlr->mode_select,
> data_size,
>                               RDAC_TIMEOUT * HZ, RDAC_RETRIES,
> &exec_args);
> @@ -570,10 +585,6 @@ static void send_mode_select(struct work_struct
> *work)
>                 err = SCSI_DH_IO;
>         } else {
>                 err = mode_select_handle_sense(sdev, &sshdr);
> -               if (err == SCSI_DH_RETRY && retry_cnt--)
> -                       goto retry;
> -               if (err == SCSI_DH_IMM_RETRY)
> -                       goto retry;
>         }
>  
>         list_for_each_entry_safe(qdata, tmp, &list, entry) {


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 16/34] scsi: spi: Fix sshdr use
  2023-09-05 23:15 ` [PATCH v11 16/34] scsi: spi: Fix sshdr use Mike Christie
@ 2023-09-15 20:59   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 20:59 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors
  2023-09-05 23:15 ` [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors Mike Christie
@ 2023-09-15 21:00   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:00 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has spi_execute have scsi-ml retry errors instead of driving
> them.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common
  2023-09-05 23:15 ` [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common Mike Christie
@ 2023-09-15 21:04   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:04 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. sd_sync_cache
> will
> only access the sshdr if it's been setup because it calls
> scsi_status_is_check_condition before accessing it. However, the
> sd_sync_cache caller, sd_suspend_common, does not check.
> 
> sd_suspend_common is only checking for ILLEGAL_REQUEST which it's
> using
> to determine if the command is supported. If it's not it just ignores
> the error. So to fix its sshdr use this patch just moves that check
> to
> sd_sync_cache where it converts ILLEGAL_REQUEST to success/0.
> sd_suspend_common was ignoring that error and sd_shutdown doesn't
> check
> for errors so there will be no behavior changes.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 20/34] scsi: ch: Remove unit_attention
  2023-09-05 23:15 ` [PATCH v11 20/34] scsi: ch: Remove unit_attention Mike Christie
@ 2023-09-15 21:08   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:08 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> unit_attention is not used so remove it.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors
  2023-09-05 23:15 ` [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors Mike Christie
@ 2023-09-15 21:10   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:10 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has sd_sync_cache have scsi-ml retry errors instead of driving
> them
> itself.
> 
> There is one behavior change where we no longer retry when
> scsi_execute_cmd returns < 0, but we should be ok. We don't need to
> retry
> for failures like the queue being removed, and for the case where
> there
> are no tags/reqs the block layer waits/retries for us. For possible
> memory
> allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
> will probably not help.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors
  2023-09-05 23:15 ` [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors Mike Christie
@ 2023-09-15 21:11   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:11 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has ch_do_scsi have scsi-ml retry errors instead of driving them
> itself.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs
  2023-09-05 23:15 ` [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs Mike Christie
@ 2023-09-15 21:11   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:11 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has scsi_mode_sense have scsi-ml retry UAs instead of driving
> them
> itself.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use
  2023-09-05 23:15 ` [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use Mike Christie
@ 2023-09-15 21:13   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:13 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> The sshdr passed into scsi_execute_cmd is only initialized if
> scsi_execute_cmd returns >= 0, and scsi_mode_sense will convert all
> non
> good statuses like check conditions to -EIO. This has scsi_mode_sense
> callers that were possibly accessing an uninitialized sshdrs to only
> access it if we got -EIO.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors
  2023-09-05 23:15 ` [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors Mike Christie
@ 2023-09-15 21:17   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:17 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has scsi_report_lun_scan have scsi-ml retry errors instead of
> driving
> them itself.
> 
> There is one behavior change where we no longer retry when
> scsi_execute_cmd returns < 0, but we should be ok. We don't need to
> retry
> for failures like the queue being removed, and for the case where
> there
> are no tags/reqs the block layer waits/retries for us. For possible
> memory
> allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
> will probably not help.

Another minor behavior change is that we may now get up to 3 retries
for UAs plus up to 3 retries for non-CC errors. Anyway,

> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs
  2023-09-05 23:15 ` [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs Mike Christie
@ 2023-09-15 21:18   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:18 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> It's common to get a UA when doing PR commands. It could be due to a
> target restarting, transport level relogin or other PR commands like
> a
> release causing it. The upper layers don't get the sense and in some
> cases
> have no idea if it's a SCSI device, so this has the sd layer retry.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors
  2023-09-15 20:58     ` Mike Christie
@ 2023-09-15 21:23       ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:23 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Fri, 2023-09-15 at 15:58 -0500, Mike Christie wrote:
> On 9/15/23 3:46 PM, Martin Wilck wrote:
> > >  sd_spinup_disk(struct scsi_disk *sdkp)
> > >  {
> > > -       unsigned char cmd[10];
> > > +       static const u8 cmd[10] = { TEST_UNIT_READY };
> > >         unsigned long spintime_expire = 0;
> > > -       int retries, spintime;
> > > +       int spintime, sense_valid = 0;
> > >         unsigned int the_result;
> > >         struct scsi_sense_hdr sshdr;
> > > +       struct scsi_failure failures[] = {
> > > +               /* Fail immediately for Medium Not Present */
> > > +               {
> > > +                       .sense = UNIT_ATTENTION,
> > > +                       .asc = 0x3A,
> > Shouldn't you set .ascq = SCMD_FAILURE_ASCQ_ANY here, and below as
> > well?
> 
> You're right. Will fix all those cases.

I also noted that you don't treat .ascq = 0 consistently, e.g. in
07/34, where you set it for the NOT_READY case but not for others. It's
not wrong to omit it, but for code clarity it might be good to set it
explicitly.

Thanks,
Martin


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors
  2023-09-05 23:15 ` [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors Mike Christie
@ 2023-09-15 21:25   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:25 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has read_capacity_10 have scsi-ml retry errors instead of
> driving
> them itself.
> 
> There are two behavior changes:
> 1. We no longer retry when scsi_execute_cmd returns < 0, but we
> should be
> ok. We don't need to retry for failures like the queue being removed,
> and
> for the case where there are no tags/reqs the block layer
> waits/retries
> for us. For possible memory allocation failures from blk_rq_map_kern
> we
> use GFP_NOIO, so retrying will probably not help.
> 2. For device reset UAs, we would retry
> READ_CAPACITY_RETRIES_ON_RESET
> times, then once those are used up we would hit the main do loops
> retry
> counter and get 3 more retries. We now only get
> READ_CAPACITY_RETRIES_ON_RESET retries.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-15 20:21   ` Martin Wilck
@ 2023-09-15 21:34     ` Martin Wilck
  2023-09-18  0:35       ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:34 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
> On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> > This has read_capacity_16 have scsi-ml retry errors instead of
> > driving
> > them itself.
> > 
> > There are 2 behavior changes with this patch:
> > 1. There is one behavior change where we no longer retry when
> > scsi_execute_cmd returns < 0, but we should be ok. We don't need to
> > retry
> > for failures like the queue being removed, and for the case where
> > there
> > are no tags/reqs since the block layer waits/retries for us. For
> > possible
> > memory allocation failures from blk_rq_map_kern we use GFP_NOIO, so
> > retrying will probably not help.
> > 2. For the specific UAs we checked for and retried, we would get
> > READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries were
> > left
> > from the loop's retries. Each UA now gets
> > READ_CAPACITY_RETRIES_ON_RESET
> > reties, and the other errors (not including medium not present) get
> > up
> > to 3 retries.
> 
> This is ok, but - just a thought - would it make sense to add a field
> for maximum total retry count (summed over all failures)? That would
> allow us to minimize behavior changes also in other cases.

Could we perhaps use scmd->allowed for this purpose?

I noted that several callers of scsi_execute_cmd() in your patch set
set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
But allowed doesn't seem to be used at all in the passthrough case,
so we might as well use it as an upper limit for the total number of
retries.



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors
  2023-09-05 23:15 ` [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors Mike Christie
@ 2023-09-15 21:34   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:34 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has ses have scsi-ml retry scsi_exec_req errors instead of
> driving
> them itself.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors
  2023-09-05 23:15 ` [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors Mike Christie
@ 2023-09-15 21:36   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:36 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has get_sectorsize have scsi-ml retry errors instead of driving
> them
> itself.
> 
> There is one behavior change where we no longer retry when
> scsi_execute_cmd returns < 0, but we should be ok. We don't need to
> retry
> for failures like the queue being removed, and for the case where
> there
> are no tags/reqs the block layer waits/retries for us. For possible
> memory
> allocation failures from blk_rq_map_kern we use GFP_NOIO, so retrying
> will probably not help.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Martin Wilck <mwilck@suse.com>



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors
  2023-09-05 23:15 ` [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors Mike Christie
@ 2023-09-15 21:37   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:37 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> This has scsi-ml retry errors instead of driving them itself.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Reviewed-by: Martin Wilck <mwilck@suse.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready
  2023-09-05 23:15 ` [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready Mike Christie
@ 2023-09-15 21:38   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:38 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable
  2023-09-05 23:15 ` [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable Mike Christie
@ 2023-09-15 21:39   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:39 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store
  2023-09-05 23:15 ` [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store Mike Christie
@ 2023-09-15 21:44   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:44 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

While all these "fix sshdr use" patches are correct, I start wondering
whether we shouldn't simply execute "

        if (sshdr) 
                sshdr->response_code = 0;

early in scsi_execute_cmd(). That way we could skip all those (ret > 0)
clauses.

Anyway,

Reviewed-by: Martin Wilck <mwilck@suse.com>



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events
  2023-09-05 23:15 ` [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events Mike Christie
@ 2023-09-15 21:44   ` Martin Wilck
  0 siblings, 0 replies; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:44 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> If scsi_execute_cmd returns < 0, it doesn't initialize the sshdr, so
> we
> shouldn't access the sshdr. If it returns 0, then the cmd executed
> successfully, so there is no need to check the sshdr. This has us
> access
> the sshdr when we get a return value > 0.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough
  2023-09-05 23:15 ` [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough Mike Christie
@ 2023-09-15 21:52   ` Martin Wilck
  2023-09-15 22:07     ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-15 21:52 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> Add some kunit tests for scsi_check_passthrough so we can easily make
> sure
> we are hitting the cases it's difficult to replicate in hardware or
> even
> scsi_debug.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/Kconfig           |   9 ++
>  drivers/scsi/scsi_error.c      |   4 +
>  drivers/scsi/scsi_error_test.c | 170
> +++++++++++++++++++++++++++++++++
>  3 files changed, 183 insertions(+)
>  create mode 100644 drivers/scsi/scsi_error_test.c
> 
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index 695a57d894cd..b7ffb435afb5 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -67,6 +67,15 @@ config SCSI_PROC_FS
>  
>           If unsure say Y.
>  
> +config SCSI_KUNIT_TEST
> +       tristate "KUnit tests for SCSI Mid Layer" if !KUNIT_ALL_TESTS
> +       depends on KUNIT
> +       default KUNIT_ALL_TESTS
> +       help
> +         Run SCSI Mid Layer's KUnit tests.
> +
> +         If unsure say N.
> +
>  comment "SCSI support type (disk, tape, CD-ROM)"
>         depends on SCSI
>  
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index d2fb28212880..f12ab199a03f 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -2663,3 +2663,7 @@ bool scsi_get_sense_info_fld(const u8
> *sense_buffer, int sb_len,
>         }
>  }
>  EXPORT_SYMBOL(scsi_get_sense_info_fld);
> +
> +#ifdef CONFIG_SCSI_KUNIT_TEST
> +#include "scsi_error_test.c"
> +#endif
> diff --git a/drivers/scsi/scsi_error_test.c
> b/drivers/scsi/scsi_error_test.c
> new file mode 100644
> index 000000000000..c01201ad8702
> --- /dev/null
> +++ b/drivers/scsi/scsi_error_test.c
> @@ -0,0 +1,170 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for scsi_error.c.
> + *
> + * Copyright (C) 2022, Oracle Corporation
> + */
> +#include <kunit/test.h>
> +
> +#include <scsi/scsi_proto.h>
> +#include <scsi/scsi_cmnd.h>
> +
> +#define SCSI_TEST_ERROR_MAX_ALLOWED 3
> +
> +static void scsi_test_error_check_passthough(struct kunit *test)
> +{
> +       struct scsi_failure multiple_sense_failures[] = {
> +               {
> +                       .sense = DATA_PROTECT,
> +                       .asc = 0x1,
> +                       .ascq = 0x1,
> +                       .allowed = 0,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = UNIT_ATTENTION,
> +                       .asc = 0x11,
> +                       .ascq = 0x0,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = NOT_READY,
> +                       .asc = 0x11,
> +                       .ascq = 0x22,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = ABORTED_COMMAND,
> +                       .asc = 0x11,
> +                       .ascq = SCMD_FAILURE_ASCQ_ANY,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = HARDWARE_ERROR,
> +                       .asc = SCMD_FAILURE_ASC_ANY,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {
> +                       .sense = ILLEGAL_REQUEST,
> +                       .asc = 0x91,
> +                       .ascq = 0x36,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {}
> +       };
> +       struct scsi_failure retryable_host_failures[] = {
> +               {
> +                       .result = DID_TRANSPORT_DISRUPTED << 16,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +               },
> +               {
> +                       .result = DID_TIME_OUT << 16,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +               },
> +               {}
> +       };
> +       struct scsi_failure any_status_failures[] = {
> +               {
> +                       .result = SCMD_FAILURE_STAT_ANY,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +               },
> +               {}
> +       };
> +       struct scsi_failure any_sense_failures[] = {
> +               {
> +                       .result = SCMD_FAILURE_SENSE_ANY,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +               },
> +               {}
> +       };
> +       struct scsi_failure any_failures[] = {
> +               {
> +                       .result = SCMD_FAILURE_RESULT_ANY,
> +                       .allowed = SCSI_TEST_ERROR_MAX_ALLOWED,
> +               },
> +               {}
> +       };
> +       u8 sense[SCSI_SENSE_BUFFERSIZE] = {};
> +       struct scsi_cmnd sc = {
> +               .sense_buffer = sense,
> +               .failures = multiple_sense_failures,
> +       };
> +       int i;
> +
> +       /* Match end of array */
> +       scsi_build_sense(&sc, 0, ILLEGAL_REQUEST, 0x91, 0x36);
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +       /* Basic match in array */
> +       scsi_build_sense(&sc, 0, UNIT_ATTENTION, 0x11, 0x0);
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +       /* No matching sense entry */
> +       scsi_build_sense(&sc, 0, MISCOMPARE, 0x11, 0x11);
> +       KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
> +                       scsi_check_passthrough(&sc));
> +       /* Match using SCMD_FAILURE_ASCQ_ANY */

This comment looks wrong to me. Are you missing an ABORTED_COMMAND,
0x11 case here?

> +       scsi_build_sense(&sc, 0, ABORTED_COMMAND, 0x22, 0x22);
> +       KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
> +                       scsi_check_passthrough(&sc));
> +       /* Match using SCMD_FAILURE_ASC_ANY */
> +       scsi_build_sense(&sc, 0, HARDWARE_ERROR, 0x11, 0x22);
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +       /* No matching status entry */
> +       sc.result = SAM_STAT_RESERVATION_CONFLICT;
> +       KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
> +                       scsi_check_passthrough(&sc));
> +
> +       /* Test hitting allowed limit */
> +       scsi_build_sense(&sc, 0, NOT_READY, 0x11, 0x22);
> +       for (i = 0; i < SCSI_TEST_ERROR_MAX_ALLOWED; i++)
> +               KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +       KUNIT_EXPECT_EQ(test, SUCCESS, scsi_check_passthrough(&sc));
> +
> +       /* Match using SCMD_FAILURE_SENSE_ANY */
> +       sc.failures = any_sense_failures;
> +       scsi_build_sense(&sc, 0, MEDIUM_ERROR, 0x11, 0x22);
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +
> +       /* reset retries so we can retest */
> +       scsi_reset_failures(multiple_sense_failures);
> +
> +       /* Test no retries allowed */
> +       sc.failures = multiple_sense_failures;
> +       scsi_build_sense(&sc, 0, DATA_PROTECT, 0x1, 0x1);
> +       KUNIT_EXPECT_EQ(test, SUCCESS, scsi_check_passthrough(&sc));
> +
> +       /* No matching host byte entry */
> +       sc.failures = retryable_host_failures;
> +       sc.result = DID_NO_CONNECT << 16;
> +       KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
> +                       scsi_check_passthrough(&sc));
> +       /* Matching host byte entry */
> +       sc.result = DID_TIME_OUT << 16;
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +
> +       /* Match SCMD_FAILURE_RESULT_ANY */
> +       sc.failures = any_failures;
> +       sc.result = DID_TRANSPORT_FAILFAST << 16;
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +
> +       /* Test any status handling */
> +       sc.failures = any_status_failures;
> +       sc.result = SAM_STAT_RESERVATION_CONFLICT;
> +       KUNIT_EXPECT_EQ(test, NEEDS_RETRY,
> scsi_check_passthrough(&sc));
> +}
> +
> +static struct kunit_case scsi_test_error_cases[] = {
> +       KUNIT_CASE(scsi_test_error_check_passthough),
> +       {}
> +};
> +
> +static struct kunit_suite scsi_test_error_suite = {
> +       .name = "scsi_error",
> +       .test_cases = scsi_test_error_cases,
> +};
> +
> +kunit_test_suite(scsi_test_error_suite);


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors
  2023-09-15 20:58   ` Martin Wilck
@ 2023-09-15 21:52     ` Mike Christie
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-15 21:52 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/15/23 3:58 PM, Martin Wilck wrote:
>> -       RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d, "
>> -               "%s MODE_SELECT command",
>> -               (char *) h->ctlr->array_name, h->ctlr->index,
>> -               (retry_cnt == RDAC_RETRY_COUNT) ? "queueing" :
>> "retrying");
>> +       RDAC_LOG(RDAC_LOG_FAILOVER, sdev, "array %s, ctlr %d,
>> queueingMODE_SELECT command",
> missing space?

Yeah, I see it. I had just kept the existing tab/spacing. Will fix.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough
  2023-09-15 21:52   ` Martin Wilck
@ 2023-09-15 22:07     ` Mike Christie
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-15 22:07 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/15/23 4:52 PM, Martin Wilck wrote:
>> +       /* Match using SCMD_FAILURE_ASCQ_ANY */
> This comment looks wrong to me. Are you missing an ABORTED_COMMAND,
> 0x11 case here?

Both are wrong :)

I'm actually testing that we don't match (get SCSI_RETURN_NOT_HANDLED)
even though the caller set SCMD_FAILURE_ASCQ_ANY. So the comment is wrong.

But yeah, I also wanted to test that SCMD_FAILURE_ASCQ_ANY works so I
need to add a test for that. I had tested this but then changed the test
for the above case while testing, but didn't go back and edit the tests
again to have them both.

> 
>> +       scsi_build_sense(&sc, 0, ABORTED_COMMAND, 0x22, 0x22);
>> +       KUNIT_EXPECT_EQ(test, SCSI_RETURN_NOT_HANDLED,
>> +                       scsi_check_passthrough(&sc));


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-15 21:34     ` Martin Wilck
@ 2023-09-18  0:35       ` Mike Christie
  2023-09-18 16:48         ` Martin Wilck
  0 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-18  0:35 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/15/23 4:34 PM, Martin Wilck wrote:
> On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
>> On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
>>> This has read_capacity_16 have scsi-ml retry errors instead of
>>> driving
>>> them itself.
>>>
>>> There are 2 behavior changes with this patch:
>>> 1. There is one behavior change where we no longer retry when
>>> scsi_execute_cmd returns < 0, but we should be ok. We don't need to
>>> retry
>>> for failures like the queue being removed, and for the case where
>>> there
>>> are no tags/reqs since the block layer waits/retries for us. For
>>> possible
>>> memory allocation failures from blk_rq_map_kern we use GFP_NOIO, so
>>> retrying will probably not help.
>>> 2. For the specific UAs we checked for and retried, we would get
>>> READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries were
>>> left
>>> from the loop's retries. Each UA now gets
>>> READ_CAPACITY_RETRIES_ON_RESET
>>> reties, and the other errors (not including medium not present) get
>>> up
>>> to 3 retries.
>>
>> This is ok, but - just a thought - would it make sense to add a field
>> for maximum total retry count (summed over all failures)? That would
>> allow us to minimize behavior changes also in other cases.
> 
> Could we perhaps use scmd->allowed for this purpose?
> 
> I noted that several callers of scsi_execute_cmd() in your patch set
> set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
> But allowed doesn't seem to be used at all in the passthrough case,

I think scmd->allowed is still used for errors that don't hit the
check_type goto in scsi_no_retry_cmd.

If the user sets up scsi failures for only UAs, and we got a
DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed which
checks scmd->allowed.

In scsi_noretry_cmd we then hit the:

        if (!scsi_status_is_check_condition(scmd->result))
                return false;

and will retry.

> so we might as well use it as an upper limit for the total number of
> retries.
> 

If you really want an overall retry count let me know. I can just add
a total limit to the scsi_failures struct. You have been the only person
to ask about it and it didn't sound like you were sure if you wanted it
or not, so I haven't done it.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-18  0:35       ` Mike Christie
@ 2023-09-18 16:48         ` Martin Wilck
  2023-09-18 18:45           ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-18 16:48 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Sun, 2023-09-17 at 19:35 -0500, Mike Christie wrote:
> On 9/15/23 4:34 PM, Martin Wilck wrote:
> > On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
> > > On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> > > > This has read_capacity_16 have scsi-ml retry errors instead of
> > > > driving
> > > > them itself.
> > > > 
> > > > There are 2 behavior changes with this patch:
> > > > 1. There is one behavior change where we no longer retry when
> > > > scsi_execute_cmd returns < 0, but we should be ok. We don't
> > > > need to
> > > > retry
> > > > for failures like the queue being removed, and for the case
> > > > where
> > > > there
> > > > are no tags/reqs since the block layer waits/retries for us.
> > > > For
> > > > possible
> > > > memory allocation failures from blk_rq_map_kern we use
> > > > GFP_NOIO, so
> > > > retrying will probably not help.
> > > > 2. For the specific UAs we checked for and retried, we would
> > > > get
> > > > READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries
> > > > were
> > > > left
> > > > from the loop's retries. Each UA now gets
> > > > READ_CAPACITY_RETRIES_ON_RESET
> > > > reties, and the other errors (not including medium not present)
> > > > get
> > > > up
> > > > to 3 retries.
> > > 
> > > This is ok, but - just a thought - would it make sense to add a
> > > field
> > > for maximum total retry count (summed over all failures)? That
> > > would
> > > allow us to minimize behavior changes also in other cases.
> > 
> > Could we perhaps use scmd->allowed for this purpose?
> > 
> > I noted that several callers of scsi_execute_cmd() in your patch
> > set
> > set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
> > But allowed doesn't seem to be used at all in the passthrough case,
> 
> I think scmd->allowed is still used for errors that don't hit the
> check_type goto in scsi_no_retry_cmd.
> 
> If the user sets up scsi failures for only UAs, and we got a
> DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed which
> checks scmd->allowed.
> 
> In scsi_noretry_cmd we then hit the:
> 
>         if (!scsi_status_is_check_condition(scmd->result))
>                 return false;
> 
> and will retry.

Right. But that's pretty confusing. Actually, I am confused about some
other things as well. 

I apologize for taking a step back and asking some more questions and
presenting some more thoughts about your patch set as a whole.

For passthrough commands, the simplified logic is now:

scsi_decide_disposition() 
{
         if (!scsi_device_online)
                return SUCCESS;
         if ((rtn = scsi_check_passthrough(scmd)) != SCSI_RETURN_NOT_HANDLED)
                return rtn;

         /* handle host byte for regular commands, 
            may return SUCESS, NEEDS_RETRY/ADD_TO_MLQUEUE, 
            FAILED, fall through, or jump to maybe_retry */

         /* handle status byte for regular commands, likewise */
          
 maybe_retry: /* [2] */
         if (scsi_cmd_retry_allowed(scmd) && !scsi_noretry_cmd(scmd))
                return NEEDS_RETRY;
         else
                return SUCCESS;
}

where scsi_noretry_cmd() has special treatment for passthrough commands
in certain cases (DID_TIME_OUT and CHECK_CONDITION).

Because you put the call to scsi_check_passthrough() before the
standard checks, some retries that the mid layer used to do will now
not be done if scsi_check_passthrough() returns SUCCESS. Also,
depending on the list of failures passed with the command, we might
miss the clauses in scsi_decide_disposition() where we previously
returned FAILED (I haven't reviewed the current callers, but at least
in theory it can happen). This will obviously change the runtime
behavior, in ways I wouldn't bet can't be detrimental.

Before your patch set, the checks we now do in scsi_check_passthrough()
were only performed in the case where the "regular command checks"
returned SUCCESS. If we want as little behavior change as possible, the
SUCCESS case is where we should plug in the call to
scsi_check_passthrough(). Or am I missing something? [1]

This way we'd preserve the current semantics of "retries" and "allowed"
which used to relate only to what were the "mid-layer retries" before
your patch set.

> > so we might as well use it as an upper limit for the total number of
> > retries.
> > 
> 
> If you really want an overall retry count let me know. I can just add
> a total limit to the scsi_failures struct. You have been the only person
> to ask about it and it didn't sound like you were sure if you wanted it
> or not, so I haven't done it.

The question whether we want an additional limit for the total "failure
retry" count is orthogonal to the above. My personal opinion is that we
do, as almost all call sites that I've seen in your set currently use
just a single retry count for all failures they handle.

I'm not sure whether the separate total retry count would have
practical relevance. I think that in most practical cases we won't have
a mix of standard "ML" retries and multiple different failure cases. I
suppose that in any given situation, it's more likely that there's a
single error condition which repeats.

I'm also unsure whether all this theoretical reasoning of mine warrants
you putting even more effort into this patch set, which has seen 11
revisions already. In general, you have much more experience with SCSI
error handling than I do.

Martin


[1] If we did that, we'd could also take into account that previously
the caller would have called scsi_execute_cmd() again, which would have
reset cmd->retries; so we could reset this field if
scsi_check_passthrough() decides to retry the command.

[2] It's weird, although unrelated to your patch set, that in the
maybe_retry clause we increment cmd->retries before checking
scsi_noretry_cmd().





^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-18 16:48         ` Martin Wilck
@ 2023-09-18 18:45           ` Mike Christie
  2023-09-19  9:07             ` Martin Wilck
  0 siblings, 1 reply; 77+ messages in thread
From: Mike Christie @ 2023-09-18 18:45 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/18/23 11:48 AM, Martin Wilck wrote:
> On Sun, 2023-09-17 at 19:35 -0500, Mike Christie wrote:
>> On 9/15/23 4:34 PM, Martin Wilck wrote:
>>> On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
>>>> On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
>>>>> This has read_capacity_16 have scsi-ml retry errors instead of
>>>>> driving
>>>>> them itself.
>>>>>
>>>>> There are 2 behavior changes with this patch:
>>>>> 1. There is one behavior change where we no longer retry when
>>>>> scsi_execute_cmd returns < 0, but we should be ok. We don't
>>>>> need to
>>>>> retry
>>>>> for failures like the queue being removed, and for the case
>>>>> where
>>>>> there
>>>>> are no tags/reqs since the block layer waits/retries for us.
>>>>> For
>>>>> possible
>>>>> memory allocation failures from blk_rq_map_kern we use
>>>>> GFP_NOIO, so
>>>>> retrying will probably not help.
>>>>> 2. For the specific UAs we checked for and retried, we would
>>>>> get
>>>>> READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries
>>>>> were
>>>>> left
>>>>> from the loop's retries. Each UA now gets
>>>>> READ_CAPACITY_RETRIES_ON_RESET
>>>>> reties, and the other errors (not including medium not present)
>>>>> get
>>>>> up
>>>>> to 3 retries.
>>>>
>>>> This is ok, but - just a thought - would it make sense to add a
>>>> field
>>>> for maximum total retry count (summed over all failures)? That
>>>> would
>>>> allow us to minimize behavior changes also in other cases.
>>>
>>> Could we perhaps use scmd->allowed for this purpose?
>>>
>>> I noted that several callers of scsi_execute_cmd() in your patch
>>> set
>>> set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
>>> But allowed doesn't seem to be used at all in the passthrough case,
>>
>> I think scmd->allowed is still used for errors that don't hit the
>> check_type goto in scsi_no_retry_cmd.
>>
>> If the user sets up scsi failures for only UAs, and we got a
>> DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed which
>> checks scmd->allowed.
>>
>> In scsi_noretry_cmd we then hit the:
>>
>>         if (!scsi_status_is_check_condition(scmd->result))
>>                 return false;
>>
>> and will retry.
> 
> Right. But that's pretty confusing. Actually, I am confused about some
> other things as well. 
> 
> I apologize for taking a step back and asking some more questions and
> presenting some more thoughts about your patch set as a whole.
> 
> For passthrough commands, the simplified logic is now:
> 
> scsi_decide_disposition() 
> {
>          if (!scsi_device_online)
>                 return SUCCESS;
>          if ((rtn = scsi_check_passthrough(scmd)) != SCSI_RETURN_NOT_HANDLED)
>                 return rtn;
> 
>          /* handle host byte for regular commands, 
>             may return SUCESS, NEEDS_RETRY/ADD_TO_MLQUEUE, 
>             FAILED, fall through, or jump to maybe_retry */
> 
>          /* handle status byte for regular commands, likewise */
>           
>  maybe_retry: /* [2] */
>          if (scsi_cmd_retry_allowed(scmd) && !scsi_noretry_cmd(scmd))
>                 return NEEDS_RETRY;
>          else
>                 return SUCCESS;
> }
> 
> where scsi_noretry_cmd() has special treatment for passthrough commands
> in certain cases (DID_TIME_OUT and CHECK_CONDITION).
> 
> Because you put the call to scsi_check_passthrough() before the
> standard checks, some retries that the mid layer used to do will now
> not be done if scsi_check_passthrough() returns SUCCESS. Also,

Yeah, I did that on purpose to give scsi_execute_cmd more control over
whether to retry or not. I think you are looking at this more like
we want to be able to retry when scsi-ml decided not to.

For example, I was thinking multipath related code like the scsi_dh handlers
would want to fail for cases scsi-ml was currently retrying. Right now they
are setting REQ_FAILFAST_TRANSPORT but for most drivers that's not doing what
they think it does. Only DID_BUS_BUSY fast fails and I think the scsi_dh
failover code wants other errors like DID_TRANSPORT_DISRUPTED to fail so the
caller can decide based on something like the dm-multipath pg retries.


> depending on the list of failures passed with the command, we might
> miss the clauses in scsi_decide_disposition() where we previously
> returned FAILED (I haven't reviewed the current callers, but at least
> in theory it can happen). This will obviously change the runtime
> behavior, in ways I wouldn't bet can't be detrimental.


I think it's reverse of what you are thinking for the FAILED case
but I agree it's wrong (wrong but for different reasons).

If scsi_decide_disposition returns FAILED then runtime is bad, because
the scsi eh will start and then we have to wait for that.

The scsi_execute_cmd user can now actually bypass the EH for FAILED
so runtime will be shorter. However, I agree that it's wrong we can
bypass the EH because in some cases we need to kick the device or
cleanup cmds. So that should be fixed for sure and we should always
start the EH and go through that code path.


> 
> Before your patch set, the checks we now do in scsi_check_passthrough()
> were only performed in the case where the "regular command checks"
> returned SUCCESS. If we want as little behavior change as possible, the
> SUCCESS case is where we should plug in the call to
> scsi_check_passthrough(). Or am I missing something? [1]
> 
> This way we'd preserve the current semantics of "retries" and "allowed"
> which used to relate only to what were the "mid-layer retries" before
> your patch set.

It looks like we had different priorities. I was trying to allow
scsi_execute_cmd to be able to override scsi-ml retries, and not just allow
us to retry if scsi-ml decided not to.

Given I messed up on the FAILED case above, I agree doing the less
risky approach is better now. We can change it for multipath some other
day.


> 
>>> so we might as well use it as an upper limit for the total number of
>>> retries.
>>>
>>
>> If you really want an overall retry count let me know. I can just add
>> a total limit to the scsi_failures struct. You have been the only person
>> to ask about it and it didn't sound like you were sure if you wanted it
>> or not, so I haven't done it.
> 
> The question whether we want an additional limit for the total "failure
> retry" count is orthogonal to the above. My personal opinion is that we
> do, as almost all call sites that I've seen in your set currently use
> just a single retry count for all failures they handle.
> 
> I'm not sure whether the separate total retry count would have
> practical relevance. I think that in most practical cases we won't have
> a mix of standard "ML" retries and multiple different failure cases. I
> suppose that in any given situation, it's more likely that there's a
> single error condition which repeats.

I'm not sure what you are saying in the 2 paragraphs above.

We have:

1. scsi_execute_cmd users like read_capacity_10 which will retry the
device reset UA 10 times. Then it can retry that error 3 more time
(this was probably not intentional so can be ignored) and it can retry
any error other error except medium not present 3 times.

And then it calls scsi_execute_cmd with sdkp->max_retries so the scsi-ml
retried are whatever the user requested and 5 by default.

I think we wanted to keep this behavior.

2.  Then, for the initial device setup/discovery tests where the transport is
flakey during device discovery/setup, we can hit the DID_TRANSPORT_DISRUPTED
that multiple times, then we will get UAs after we relogin/reset the
transport/device. So I think we want to keep the behavior where a user
does

+       struct scsi_failure failures[] = {
+               {
+                       .sense = UNIT_ATTENTION,
+                       .asc = SCMD_FAILURE_ASC_ANY,
+                       .ascq = SCMD_FAILURE_ASCQ_ANY,
+                       .allowed = UA_RETRY_COUNT,
+                       .result = SAM_STAT_CHECK_CONDITION,
+               },
+               {}
+       };


scsi_execute_cmd(.... $NORMAL_CMD_RETRIES)

then we can retry transport errors NORMAL_CMD_RETRIES then once those
settle still retry UAs UA_RETRY_COUNT times.

3. Then we have the cases where the scsi_execute_cmd user retried
multiple errors and in this patchset we used to retry a total of N
times, but now can retry each error up to N times. For this it sounds
like you want to code it so we only do a total of N times so the
behavior is the same as before.

To handle all these I think I need the extra allowed field on the
scsi_failures.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-18 18:45           ` Mike Christie
@ 2023-09-19  9:07             ` Martin Wilck
  2023-09-19 18:02               ` Mike Christie
  0 siblings, 1 reply; 77+ messages in thread
From: Martin Wilck @ 2023-09-19  9:07 UTC (permalink / raw)
  To: Mike Christie, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On Mon, 2023-09-18 at 13:45 -0500, Mike Christie wrote:
> On 9/18/23 11:48 AM, Martin Wilck wrote:
> > On Sun, 2023-09-17 at 19:35 -0500, Mike Christie wrote:
> > > On 9/15/23 4:34 PM, Martin Wilck wrote:
> > > > On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
> > > > > On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
> > > > > > This has read_capacity_16 have scsi-ml retry errors instead
> > > > > > of
> > > > > > driving
> > > > > > them itself.
> > > > > > 
> > > > > > There are 2 behavior changes with this patch:
> > > > > > 1. There is one behavior change where we no longer retry
> > > > > > when
> > > > > > scsi_execute_cmd returns < 0, but we should be ok. We don't
> > > > > > need to
> > > > > > retry
> > > > > > for failures like the queue being removed, and for the case
> > > > > > where
> > > > > > there
> > > > > > are no tags/reqs since the block layer waits/retries for
> > > > > > us.
> > > > > > For
> > > > > > possible
> > > > > > memory allocation failures from blk_rq_map_kern we use
> > > > > > GFP_NOIO, so
> > > > > > retrying will probably not help.
> > > > > > 2. For the specific UAs we checked for and retried, we
> > > > > > would
> > > > > > get
> > > > > > READ_CAPACITY_RETRIES_ON_RESET retries plus whatever
> > > > > > retries
> > > > > > were
> > > > > > left
> > > > > > from the loop's retries. Each UA now gets
> > > > > > READ_CAPACITY_RETRIES_ON_RESET
> > > > > > reties, and the other errors (not including medium not
> > > > > > present)
> > > > > > get
> > > > > > up
> > > > > > to 3 retries.
> > > > > 
> > > > > This is ok, but - just a thought - would it make sense to add
> > > > > a
> > > > > field
> > > > > for maximum total retry count (summed over all failures)?
> > > > > That
> > > > > would
> > > > > allow us to minimize behavior changes also in other cases.
> > > > 
> > > > Could we perhaps use scmd->allowed for this purpose?
> > > > 
> > > > I noted that several callers of scsi_execute_cmd() in your
> > > > patch
> > > > set
> > > > set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
> > > > But allowed doesn't seem to be used at all in the passthrough
> > > > case,
> > > 
> > > I think scmd->allowed is still used for errors that don't hit the
> > > check_type goto in scsi_no_retry_cmd.
> > > 
> > > If the user sets up scsi failures for only UAs, and we got a
> > > DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed
> > > which
> > > checks scmd->allowed.
> > > 
> > > In scsi_noretry_cmd we then hit the:
> > > 
> > >         if (!scsi_status_is_check_condition(scmd->result))
> > >                 return false;
> > > 
> > > and will retry.
> > 
> > Right. But that's pretty confusing. Actually, I am confused about
> > some
> > other things as well. 
> > 
> > I apologize for taking a step back and asking some more questions
> > and
> > presenting some more thoughts about your patch set as a whole.
> > 
> > For passthrough commands, the simplified logic is now:
> > 
> > scsi_decide_disposition() 
> > {
> >          if (!scsi_device_online)
> >                 return SUCCESS;
> >          if ((rtn = scsi_check_passthrough(scmd)) !=
> > SCSI_RETURN_NOT_HANDLED)
> >                 return rtn;
> > 
> >          /* handle host byte for regular commands, 
> >             may return SUCESS, NEEDS_RETRY/ADD_TO_MLQUEUE, 
> >             FAILED, fall through, or jump to maybe_retry */
> > 
> >          /* handle status byte for regular commands, likewise */
> >           
> >  maybe_retry: /* [2] */
> >          if (scsi_cmd_retry_allowed(scmd) &&
> > !scsi_noretry_cmd(scmd))
> >                 return NEEDS_RETRY;
> >          else
> >                 return SUCCESS;
> > }
> > 
> > where scsi_noretry_cmd() has special treatment for passthrough
> > commands
> > in certain cases (DID_TIME_OUT and CHECK_CONDITION).
> > 
> > Because you put the call to scsi_check_passthrough() before the
> > standard checks, some retries that the mid layer used to do will
> > now
> > not be done if scsi_check_passthrough() returns SUCCESS. Also,
> 
> Yeah, I did that on purpose to give scsi_execute_cmd more control
> over
> whether to retry or not. I think you are looking at this more like
> we want to be able to retry when scsi-ml decided not to.

I am not saying that giving the failure checks more control is wrong; I
lack experience to judge that. I was just pointing out a change in
behavior that wasn't obvious to me from the outset. IOW, I'd appreciate
a clear separation between the  formal change that moves failure
treatment for passthrough commands into the mid layer, and the semantic
changes regarding the ordering and precedence of the various cases in
scsi_decide_disposition().

> For example, I was thinking multipath related code like the scsi_dh
> handlers
> would want to fail for cases scsi-ml was currently retrying. Right
> now they
> are setting REQ_FAILFAST_TRANSPORT but for most drivers that's not
> doing what
> they think it does. Only DID_BUS_BUSY fast fails and I think the
> scsi_dh
> failover code wants other errors like DID_TRANSPORT_DISRUPTED to fail
> so the
> caller can decide based on something like the dm-multipath pg
> retries.

This makes sense for device handlers, but not so much for other
callers. Do we need to discuss a separate approach for commands sent by
device handlers?

Current device handlers have been written with the standard ML retry
handling in mind. The device handler changes in your patch set apply to
special CHECK CONDITION situations. These checks are executed before
the host byte checks in scsi_decide_disposition(). It's not obvious
from the code that none of these conditions can trigger if the host
byte is set [1], which would seem wrong. Perhaps we need separate
scsi_check_passthrough() handlers for host byte and status byte?

> > depending on the list of failures passed with the command, we might
> > miss the clauses in scsi_decide_disposition() where we previously
> > returned FAILED (I haven't reviewed the current callers, but at
> > least
> > in theory it can happen). This will obviously change the runtime
> > behavior, in ways I wouldn't bet can't be detrimental.
> 
> 
> I think it's reverse of what you are thinking for the FAILED case
> but I agree it's wrong (wrong but for different reasons).
> 
> If scsi_decide_disposition returns FAILED then runtime is bad,
> because
> the scsi eh will start and then we have to wait for that.
> 
> The scsi_execute_cmd user can now actually bypass the EH for FAILED
> so runtime will be shorter. However, I agree that it's wrong we can
> bypass the EH because in some cases we need to kick the device or
> cleanup cmds. So that should be fixed for sure and we should always
> start the EH and go through that code path.

Thanks for clarifying. I think I actually meant this, but my mind was
somewhat shady :-)

> > 
> > Before your patch set, the checks we now do in
> > scsi_check_passthrough()
> > were only performed in the case where the "regular command checks"
> > returned SUCCESS. If we want as little behavior change as possible,
> > the
> > SUCCESS case is where we should plug in the call to
> > scsi_check_passthrough(). Or am I missing something? [1]
> > 
> > This way we'd preserve the current semantics of "retries" and
> > "allowed"
> > which used to relate only to what were the "mid-layer retries"
> > before
> > your patch set.
> 
> It looks like we had different priorities. I was trying to allow
> scsi_execute_cmd to be able to override scsi-ml retries, and not just
> allow
> us to retry if scsi-ml decided not to.

As I said above, I have nothing in general against this approach.
I would just like to separate it logically from the part of the patch
set that simply moves code from HL to ML. And obviously, changes in the
logic of scsi_decide_disposition() need in-depth review, if possible by
better qualified people than myself. I've stared at this code a lot of
times, but I wouldn't say I understand the entire decision tree.

> Given I messed up on the FAILED case above, I agree doing the less
> risky approach is better now. We can change it for multipath some
> other
> day.

AFAICT, dm-multipath + scsi device handlers are currently doing a
decent job. If we want improvements in this area, we should cover ALUA,
too, which would probably require a different approach, as the ALUA
handler's logic in alua_rtpg() can't be easily mapped to a "struct
failure" array.

> 
> > 
> > > > so we might as well use it as an upper limit for the total
> > > > number of
> > > > retries.
> > > > 
> > > 
> > > If you really want an overall retry count let me know. I can just
> > > add
> > > a total limit to the scsi_failures struct. You have been the only
> > > person
> > > to ask about it and it didn't sound like you were sure if you
> > > wanted it
> > > or not, so I haven't done it.
> > 
> > The question whether we want an additional limit for the total
> > "failure
> > retry" count is orthogonal to the above. My personal opinion is
> > that we
> > do, as almost all call sites that I've seen in your set currently
> > use
> > just a single retry count for all failures they handle.
> > 
> > I'm not sure whether the separate total retry count would have
> > practical relevance. I think that in most practical cases we won't
> > have
> > a mix of standard "ML" retries and multiple different failure
> > cases. I
> > suppose that in any given situation, it's more likely that there's
> > a
> > single error condition which repeats.
> 
> I'm not sure what you are saying in the 2 paragraphs above.
> 
> We have:
> 
> 1. scsi_execute_cmd users like read_capacity_10 which will retry the
> device reset UA 10 times. Then it can retry that error 3 more time
> (this was probably not intentional so can be ignored) and it can
> retry
> any error other error except medium not present 3 times.
> 
> And then it calls scsi_execute_cmd with sdkp->max_retries so the
> scsi-ml
> retried are whatever the user requested and 5 by default.
> 
> I think we wanted to keep this behavior.

I don't object. In this case, if we had a "total failure retries"
field, it would be set to infinity (or 13).

When I talked about a 'mix of standard "ML" retries and multiple
different failure cases' and its practical relevance, I had some weird
hypothetical devices in mind. Ignore it if you wish, or look at [2].

> 2.  Then, for the initial device setup/discovery tests where the
> transport is
> flakey during device discovery/setup, we can hit the
> DID_TRANSPORT_DISRUPTED
> that multiple times, then we will get UAs after we relogin/reset the
> transport/device. So I think we want to keep the behavior where a
> user
> does
> 
> +       struct scsi_failure failures[] = {
> +               {
> +                       .sense = UNIT_ATTENTION,
> +                       .asc = SCMD_FAILURE_ASC_ANY,
> +                       .ascq = SCMD_FAILURE_ASCQ_ANY,
> +                       .allowed = UA_RETRY_COUNT,
> +                       .result = SAM_STAT_CHECK_CONDITION,
> +               },
> +               {}
> +       };
> 
> 
> scsi_execute_cmd(.... $NORMAL_CMD_RETRIES)
> 
> then we can retry transport errors NORMAL_CMD_RETRIES then once those
> settle still retry UAs UA_RETRY_COUNT times.

Yes, sure. I already agreed that we still need the standard "allowed"
field in its current form, and that we can't re-use it for failures.
In this case, with just one element in the failures array, a total
failure count wouldn't matter at all.
> 
> 3. Then we have the cases where the scsi_execute_cmd user retried
> multiple errors and in this patchset we used to retry a total of N
> times, but now can retry each error up to N times. For this it sounds
> like you want to code it so we only do a total of N times so the
> behavior is the same as before.
> 
> To handle all these I think I need the extra allowed field on the
> scsi_failures.
> 

Right. My impression was that some callers wanted to limit the overall
number of retries. But it's difficult to say what the original
intention of the authors was, and how much thought they put into it in
the first place.

Martin

[1] I think that if the host byte is set we never have a CC status and
valid sense data. But the host byte is ultimately up to drivers, so
this is not obvious from the code, and hard to prove.

[2] My thinking was as follows: the e.g. for read_capacity_10(), the
SCSI result can be a) CC/UA/3a/00 (device reset, which read_capacity_10
will retry 10x), or b) some code that ML retries (e.g. CC/UA/04/01), or
c) anything else, which s_r_10 will retry 3x. Case b) used to be
retried up to 3*sdkp->max_retries times, while with your patch it will
only be retried sdkp->max_retries 1times. 

It gets more confusing if you consider a device that would return these
error codes in an alternating fashion. Consider a device that returns
CC/UA/04/01, CC/UA/3a/00, CC/UA/04/01, CC/UA/3a/00,...
Previously, the CC/UA/04/01 would be hidden in the ML, because the ML
retry count would be reset with every call to scsi_execute_cmd(), and
CC/UA/3a/00 would be repeated 10x. Now, CC/UA/04/01 will use up retries
(first those from your SCMD_FAILURE_RESULT_ANY case, then the ML
retries) and eventually fail before the 10 retries for the device reset
case are exhausted. You can imagine out even more tricky scenarios.

This kind of weirdness is what I meant with 'a mix of standard "ML"
retries and multiple different failure cases'. Such cases are treated
differently now than they used to be. I wanted to point that out, and
at the same time I wanted to say that I doubt this has practical
relevance. *Real* devices will return CC/UA/3a/00 a couple of times in
a row, and then some other error, like you described.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
  2023-09-19  9:07             ` Martin Wilck
@ 2023-09-19 18:02               ` Mike Christie
  0 siblings, 0 replies; 77+ messages in thread
From: Mike Christie @ 2023-09-19 18:02 UTC (permalink / raw)
  To: Martin Wilck, john.g.garry, bvanassche, hch, martin.petersen,
	linux-scsi, james.bottomley

On 9/19/23 4:07 AM, Martin Wilck wrote:
> On Mon, 2023-09-18 at 13:45 -0500, Mike Christie wrote:
>> On 9/18/23 11:48 AM, Martin Wilck wrote:
>>> On Sun, 2023-09-17 at 19:35 -0500, Mike Christie wrote:
>>>> On 9/15/23 4:34 PM, Martin Wilck wrote:
>>>>> On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
>>>>>> On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
>>>>>>> This has read_capacity_16 have scsi-ml retry errors instead
>>>>>>> of
>>>>>>> driving
>>>>>>> them itself.
>>>>>>>
>>>>>>> There are 2 behavior changes with this patch:
>>>>>>> 1. There is one behavior change where we no longer retry
>>>>>>> when
>>>>>>> scsi_execute_cmd returns < 0, but we should be ok. We don't
>>>>>>> need to
>>>>>>> retry
>>>>>>> for failures like the queue being removed, and for the case
>>>>>>> where
>>>>>>> there
>>>>>>> are no tags/reqs since the block layer waits/retries for
>>>>>>> us.
>>>>>>> For
>>>>>>> possible
>>>>>>> memory allocation failures from blk_rq_map_kern we use
>>>>>>> GFP_NOIO, so
>>>>>>> retrying will probably not help.
>>>>>>> 2. For the specific UAs we checked for and retried, we
>>>>>>> would
>>>>>>> get
>>>>>>> READ_CAPACITY_RETRIES_ON_RESET retries plus whatever
>>>>>>> retries
>>>>>>> were
>>>>>>> left
>>>>>>> from the loop's retries. Each UA now gets
>>>>>>> READ_CAPACITY_RETRIES_ON_RESET
>>>>>>> reties, and the other errors (not including medium not
>>>>>>> present)
>>>>>>> get
>>>>>>> up
>>>>>>> to 3 retries.
>>>>>>
>>>>>> This is ok, but - just a thought - would it make sense to add
>>>>>> a
>>>>>> field
>>>>>> for maximum total retry count (summed over all failures)?
>>>>>> That
>>>>>> would
>>>>>> allow us to minimize behavior changes also in other cases.
>>>>>
>>>>> Could we perhaps use scmd->allowed for this purpose?
>>>>>
>>>>> I noted that several callers of scsi_execute_cmd() in your
>>>>> patch
>>>>> set
>>>>> set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
>>>>> But allowed doesn't seem to be used at all in the passthrough
>>>>> case,
>>>>
>>>> I think scmd->allowed is still used for errors that don't hit the
>>>> check_type goto in scsi_no_retry_cmd.
>>>>
>>>> If the user sets up scsi failures for only UAs, and we got a
>>>> DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed
>>>> which
>>>> checks scmd->allowed.
>>>>
>>>> In scsi_noretry_cmd we then hit the:
>>>>
>>>>         if (!scsi_status_is_check_condition(scmd->result))
>>>>                 return false;
>>>>
>>>> and will retry.
>>>
>>> Right. But that's pretty confusing. Actually, I am confused about
>>> some
>>> other things as well. 
>>>
>>> I apologize for taking a step back and asking some more questions
>>> and
>>> presenting some more thoughts about your patch set as a whole.
>>>
>>> For passthrough commands, the simplified logic is now:
>>>
>>> scsi_decide_disposition() 
>>> {
>>>          if (!scsi_device_online)
>>>                 return SUCCESS;
>>>          if ((rtn = scsi_check_passthrough(scmd)) !=
>>> SCSI_RETURN_NOT_HANDLED)
>>>                 return rtn;
>>>
>>>          /* handle host byte for regular commands, 
>>>             may return SUCESS, NEEDS_RETRY/ADD_TO_MLQUEUE, 
>>>             FAILED, fall through, or jump to maybe_retry */
>>>
>>>          /* handle status byte for regular commands, likewise */
>>>           
>>>  maybe_retry: /* [2] */
>>>          if (scsi_cmd_retry_allowed(scmd) &&
>>> !scsi_noretry_cmd(scmd))
>>>                 return NEEDS_RETRY;
>>>          else
>>>                 return SUCCESS;
>>> }
>>>
>>> where scsi_noretry_cmd() has special treatment for passthrough
>>> commands
>>> in certain cases (DID_TIME_OUT and CHECK_CONDITION).
>>>
>>> Because you put the call to scsi_check_passthrough() before the
>>> standard checks, some retries that the mid layer used to do will
>>> now
>>> not be done if scsi_check_passthrough() returns SUCCESS. Also,
>>
>> Yeah, I did that on purpose to give scsi_execute_cmd more control
>> over
>> whether to retry or not. I think you are looking at this more like
>> we want to be able to retry when scsi-ml decided not to.
> 
> I am not saying that giving the failure checks more control is wrong; I
> lack experience to judge that. I was just pointing out a change in
> behavior that wasn't obvious to me from the outset. IOW, I'd appreciate
> a clear separation between the  formal change that moves failure
> treatment for passthrough commands into the mid layer, and the semantic
> changes regarding the ordering and precedence of the various cases in
> scsi_decide_disposition().

Yeah, agree now. The email is a mess because I wrote that below. I
should have written that here to save you time :)


> 
>> For example, I was thinking multipath related code like the scsi_dh
>> handlers
>> would want to fail for cases scsi-ml was currently retrying. Right
>> now they
>> are setting REQ_FAILFAST_TRANSPORT but for most drivers that's not
>> doing what
>> they think it does. Only DID_BUS_BUSY fast fails and I think the
>> scsi_dh
>> failover code wants other errors like DID_TRANSPORT_DISRUPTED to fail
>> so the
>> caller can decide based on something like the dm-multipath pg
>> retries.
> 
> This makes sense for device handlers, but not so much for other
> callers. Do we need to discuss a separate approach for commands sent by
> device handlers?
> 
> Current device handlers have been written with the standard ML retry
> handling in mind. The device handler changes in your patch set apply to
> special CHECK CONDITION situations. These checks are executed before
> the host byte checks in scsi_decide_disposition(). It's not obvious
> from the code that none of these conditions can trigger if the host
> byte is set [1], which would seem wrong. Perhaps we need separate
> scsi_check_passthrough() handlers for host byte and status byte?

I think we are talking about different things. I can do what you want
first then go for the behavior change later. We can discuss that later.
The second/other patchset will be smaller so it will be easier to
discuss.


Snip the chunk below because I agree with you.

And for the total retry issue, since I'm going to try and keep the
existing behavior as much as possible, I'll add code for that. It
will be easier to discuss code then.

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2023-09-19 18:02 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
2023-09-05 23:15 ` [PATCH v11 01/34] scsi: Add helper to prep sense during error handling Mike Christie
2023-09-05 23:15 ` [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry Mike Christie
2023-09-15 20:08   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 03/34] scsi: Add scsi_failure field to scsi_exec_args Mike Christie
2023-09-05 23:15 ` [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors Mike Christie
2023-09-15 20:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 05/34] scsi: retry INQUIRY after timeout Mike Christie
2023-09-15 20:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16 Mike Christie
2023-09-15 20:13   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors Mike Christie
2023-09-15 20:21   ` Martin Wilck
2023-09-15 21:34     ` Martin Wilck
2023-09-18  0:35       ` Mike Christie
2023-09-18 16:48         ` Martin Wilck
2023-09-18 18:45           ` Mike Christie
2023-09-19  9:07             ` Martin Wilck
2023-09-19 18:02               ` Mike Christie
2023-09-05 23:15 ` [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk Mike Christie
2023-09-15 20:26   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 09/34] scsi: sd: Fix sshdr use " Mike Christie
2023-09-15 20:27   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors Mike Christie
2023-09-15 20:46   ` Martin Wilck
2023-09-15 20:58     ` Mike Christie
2023-09-15 21:23       ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0 Mike Christie
2023-09-15 20:48   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors Mike Christie
2023-09-15 20:51   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling Mike Christie
2023-09-15 20:54   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 14/34] scsi: rdac: Fix sshdr use Mike Christie
2023-09-15 20:55   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors Mike Christie
2023-09-15 20:58   ` Martin Wilck
2023-09-15 21:52     ` Mike Christie
2023-09-05 23:15 ` [PATCH v11 16/34] scsi: spi: Fix sshdr use Mike Christie
2023-09-15 20:59   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors Mike Christie
2023-09-15 21:00   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common Mike Christie
2023-09-15 21:04   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors Mike Christie
2023-09-15 21:10   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 20/34] scsi: ch: Remove unit_attention Mike Christie
2023-09-15 21:08   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors Mike Christie
2023-09-15 21:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs Mike Christie
2023-09-15 21:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use Mike Christie
2023-09-15 21:13   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors Mike Christie
2023-09-15 21:17   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs Mike Christie
2023-09-15 21:18   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors Mike Christie
2023-09-15 21:25   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors Mike Christie
2023-09-15 21:34   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors Mike Christie
2023-09-15 21:36   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors Mike Christie
2023-09-15 21:37   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready Mike Christie
2023-09-15 21:38   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable Mike Christie
2023-09-15 21:39   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store Mike Christie
2023-09-15 21:44   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events Mike Christie
2023-09-15 21:44   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough Mike Christie
2023-09-15 21:52   ` Martin Wilck
2023-09-15 22:07     ` Mike Christie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.