All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1"
@ 2021-09-15 13:04 wenxiong
  2021-09-15 14:43 ` James Bottomley
  0 siblings, 1 reply; 4+ messages in thread
From: wenxiong @ 2021-09-15 13:04 UTC (permalink / raw)
  To: jejb; +Cc: linux-scsi, brking1, martin.petersen, wenxiong, Wen Xiong

From: Wen Xiong <wenxiong@linux.vnet.ibm.com>

Setting scsi logging level with error=3, we saw some errors from enclosues:

[108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
[108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 00 20 00
[108017.427778] ses 0:0:9:0: Power-on or device reset occurred
[108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 00 20 00
[108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention [current]
[108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset function occurred
[108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1
[108017.427804] ses 0:0:9:0: Failed to bind enclosure -19
[108017.427895] ses 0:0:10:0: Attached Enclosure device
[108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13

As Martin's suggestion, the patch checks to retry on NOT_READY as well as
UNIT_ATTENTION with ASC 0x29.

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/scsi/ses.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index c2afba2a5414..7b6efe4ccc13 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -87,9 +87,16 @@ static int ses_recv_diag(struct scsi_device *sdev, int page_code,
 		0
 	};
 	unsigned char recv_page_code;
+	struct scsi_sense_hdr sshdr;
+	int retries = SES_RETRIES;
+
+	do {
+		ret =  scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen,
+			&sshdr, SES_TIMEOUT, 1, NULL);
+	} while (scsi_sense_valid(&sshdr) && --retries &&
+		(sshdr.sense_key == NOT_READY ||
+		(sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29)));
 
-	ret =  scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen,
-				NULL, SES_TIMEOUT, SES_RETRIES, NULL);
 	if (unlikely(ret))
 		return ret;
 
@@ -121,9 +128,16 @@ static int ses_send_diag(struct scsi_device *sdev, int page_code,
 		bufflen & 0xff,
 		0
 	};
+	struct scsi_sense_hdr sshdr;
+	int retries = SES_RETRIES;
+
+	do {
+		result =  scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen,
+			&sshdr, SES_TIMEOUT, 1, NULL);
+	} while (scsi_sense_valid(&sshdr) && --retries &&
+		(sshdr.sense_key == NOT_READY ||
+		(sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29)));
 
-	result = scsi_execute_req(sdev, cmd, DMA_TO_DEVICE, buf, bufflen,
-				  NULL, SES_TIMEOUT, SES_RETRIES, NULL);
 	if (result)
 		sdev_printk(KERN_ERR, sdev, "SEND DIAGNOSTIC result: %8x\n",
 			    result);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1"
  2021-09-15 13:04 [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1" wenxiong
@ 2021-09-15 14:43 ` James Bottomley
  2021-09-15 17:55   ` Douglas Gilbert
  0 siblings, 1 reply; 4+ messages in thread
From: James Bottomley @ 2021-09-15 14:43 UTC (permalink / raw)
  To: wenxiong; +Cc: linux-scsi, brking1, martin.petersen, wenxiong

On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com wrote:
> From: Wen Xiong <wenxiong@linux.vnet.ibm.com>
> 
> Setting scsi logging level with error=3, we saw some errors from
> enclosues:
> 
> [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result:
> hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
> [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01
> 00 20 00
> [108017.427778] ses 0:0:9:0: Power-on or device reset occurred
> [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result:
> hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01
> 00 20 00
> [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention
> [current]
> [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset
> function occurred
> [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1
> [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19
> [108017.427895] ses 0:0:10:0: Attached Enclosure device
> [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13
> 
> As Martin's suggestion, the patch checks to retry on NOT_READY as
> well as
> UNIT_ATTENTION with ASC 0x29.

This looks fine to me.  I think the reason expecting_cc_ua doesn't work
for you is that you're getting > 1 reset per command.  expecting_cc_ua
automatically resets after eating the first unit attention.

Reviewed-by: James Bottomley <jejb@linux.ibm.com>

James



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1"
  2021-09-15 14:43 ` James Bottomley
@ 2021-09-15 17:55   ` Douglas Gilbert
  2021-09-16 13:01     ` James Bottomley
  0 siblings, 1 reply; 4+ messages in thread
From: Douglas Gilbert @ 2021-09-15 17:55 UTC (permalink / raw)
  To: jejb, wenxiong; +Cc: linux-scsi, brking1, martin.petersen, wenxiong

On 2021-09-15 10:43 a.m., James Bottomley wrote:
> On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com wrote:
>> From: Wen Xiong <wenxiong@linux.vnet.ibm.com>
>>
>> Setting scsi logging level with error=3, we saw some errors from
>> enclosues:
>>
>> [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result:
>> hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
>> [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01
>> 00 20 00
>> [108017.427778] ses 0:0:9:0: Power-on or device reset occurred
>> [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result:
>> hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>> [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01
>> 00 20 00
>> [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention
>> [current]
>> [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset
>> function occurred
>> [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1
>> [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19
>> [108017.427895] ses 0:0:10:0: Attached Enclosure device
>> [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13
>>
>> As Martin's suggestion, the patch checks to retry on NOT_READY as
>> well as
>> UNIT_ATTENTION with ASC 0x29.
> 
> This looks fine to me.  I think the reason expecting_cc_ua doesn't work
> for you is that you're getting > 1 reset per command.  expecting_cc_ua
> automatically resets after eating the first unit attention.

Rather that simply consuming UAs, what do you think of a fixed length
FIFO, say 8 entries, that holds the asc,ascq of the last 8 UAs together
with a timestamp of when it was received (with a boot time epoch).
Then allow the user space to see that FIFO via sysfs (e.g. under
/sys/class/scsi_device/<hctl>). Remembering the previous UA may also be
useful for the mid-level UA processing. For example after a firmware
upgrade, there may be UAs for both INQUIRY data change and device reset.

Also the first device reset after a reboot (power cycle) is expected,
having one later, for example when part of a disk is mounted, is a bit
more suspicious.

Doug Gilbert



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1"
  2021-09-15 17:55   ` Douglas Gilbert
@ 2021-09-16 13:01     ` James Bottomley
  0 siblings, 0 replies; 4+ messages in thread
From: James Bottomley @ 2021-09-16 13:01 UTC (permalink / raw)
  To: dgilbert, wenxiong; +Cc: linux-scsi, brking1, martin.petersen, wenxiong

On Wed, 2021-09-15 at 13:55 -0400, Douglas Gilbert wrote:
> On 2021-09-15 10:43 a.m., James Bottomley wrote:
> > On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com
> > wrote:
> > > From: Wen Xiong <wenxiong@linux.vnet.ibm.com>
> > > 
> > > Setting scsi logging level with error=3, we saw some errors from
> > > enclosues:
> > > 
> > > [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result:
> > > hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
> > > [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c
> > > 01 01
> > > 00 20 00
> > > [108017.427778] ses 0:0:9:0: Power-on or device reset occurred
> > > [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result:
> > > hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> > > [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c
> > > 01 01
> > > 00 20 00
> > > [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention
> > > [current]
> > > [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset
> > > function occurred
> > > [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1
> > > [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19
> > > [108017.427895] ses 0:0:10:0: Attached Enclosure device
> > > [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13
> > > 
> > > As Martin's suggestion, the patch checks to retry on NOT_READY as
> > > well as UNIT_ATTENTION with ASC 0x29.
> > 
> > This looks fine to me.  I think the reason expecting_cc_ua doesn't
> > work for you is that you're getting > 1 reset per
> > command.  expecting_cc_ua automatically resets after eating the
> > first unit attention.
> 
> Rather that simply consuming UAs, what do you think of a fixed length
> FIFO, say 8 entries, that holds the asc,ascq of the last 8 UAs
> together with a timestamp of when it was received (with a boot time
> epoch).  Then allow the user space to see that FIFO via sysfs (e.g.
> under /sys/class/scsi_device/<hctl>). Remembering the previous UA may
> also be useful for the mid-level UA processing. For example after a
> firmware upgrade, there may be UAs for both INQUIRY data change and
> device reset.

I don't really see how it would be useful.  On the available debugging
information (which is a bit scant) it looks like a single device
problem (as usual) ... possibly not clearing the condition after the
REQUEST_SENSE because I can't believe another reset went over the bus
immediately after the first UA/CC was received.

> Also the first device reset after a reboot (power cycle) is expected,
> having one later, for example when part of a disk is mounted, is a
> bit more suspicious.

It's definitely hugely suspicious, but tying it to ASC/ASQC 0x29/0x00
should limit the fallout considerably.

James



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-09-16 13:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-15 13:04 [PATCH V3 1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1" wenxiong
2021-09-15 14:43 ` James Bottomley
2021-09-15 17:55   ` Douglas Gilbert
2021-09-16 13:01     ` James Bottomley

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.