linux-kernel.vger.kernel.org archive mirror
* PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
@ 2013-10-13 17:23 Vaughan Cao
  2013-10-14 11:13 ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Vaughan Cao @ 2013-10-13 17:23 UTC (permalink / raw)
  To: JBottomley; +Cc: linux-scsi, linux-kernel, vaughan.cao

Hi James,

[1.] One line summary of the problem:
special sense code asc,ascq=04h,0Ch aborts the scsi scan in the middle

[2.] Full description of the problem/report:
For instance, the storage presents 8 iSCSI LUNs, but LUN No.7 is
misconfigured or otherwise faulty.
The kernel then logs:
kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
which leaves LUN No.8 unavailable.
We have confirmed that Windows and Solaris systems continue the scan
and make LUN No.1,2,3,4,5,6 and 8 available.

Log snippet is as below:
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY pass 1 length 36
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done: 0xffff8801e9bd4280 SUCCESS
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready [current]
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical unit not accessible, target port in unavailable state
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code 0x8000002
Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
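
The code 0x8000002 in the "INQUIRY failed" line packs four bytes into one word. The following is a minimal user-space sketch of how it splits up, assuming the usual midlayer layout (status in the low byte, then message, host and driver bytes). The struct and function names are made up for illustration and are not kernel API.

```c
#include <stdint.h>

/* Hypothetical holder for the four fields of a SCSI midlayer result word. */
struct scsi_result_fields {
	uint8_t status;  /* bits 0-7:   raw SCSI status (0x02 = CHECK CONDITION) */
	uint8_t msg;     /* bits 8-15:  message byte */
	uint8_t host;    /* bits 16-23: host byte (0x00 = DID_OK) */
	uint8_t driver;  /* bits 24-31: driver byte (0x08 = DRIVER_SENSE) */
};

/* Split a result word into its four bytes; no interpretation is done here. */
static struct scsi_result_fields decode_scsi_result(uint32_t result)
{
	struct scsi_result_fields f;

	f.status = (uint8_t)(result & 0xff);
	f.msg    = (uint8_t)((result >> 8) & 0xff);
	f.host   = (uint8_t)((result >> 16) & 0xff);
	f.driver = (uint8_t)((result >> 24) & 0xff);
	return f;
}
```

For 0x8000002 this yields status 0x02 and driver byte 0x08, i.e. a CHECK CONDITION with sense data attached, which matches the "Not Ready" sense lines in the log.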

Reading scsi_report_lun_scan(), I found:
Linux sends an INQUIRY command to probe each LUN listed in the response
to the REPORT LUNS command.
It assumes every probe command gets a valid response. Otherwise, it
treats the whole peripheral as nonexistent or dead.
If the INQUIRY response passes the validity checks and indicates 'LUN
not present', the loop does not break but continues with the scan.
In the log, the INQUIRY to LUN 7 returns sense asc,ascq=04h,0Ch (Logical
unit not accessible, target port in unavailable state).
This sense code is not handled, so scsi_probe_lun() returns -EIO and the
scan is aborted.
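
To make the sense code concrete: a minimal user-space sketch of recognizing NOT READY with asc,ascq=04h,0Ch in a sense buffer, assuming the standard fixed (70h/71h) and descriptor (72h/73h) sense formats. The function name is hypothetical; this is not how scsi_probe_lun() is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SENSE_KEY_NOT_READY 0x02

/*
 * Returns true if the sense data reports NOT READY with ASC/ASCQ 04h/0Ch
 * (logical unit not accessible, target port in unavailable state).
 */
static bool sense_is_target_port_unavailable(const uint8_t *sense, size_t len)
{
	uint8_t key, asc, ascq;

	assert(sense != NULL);
	if (len < 1)
		return false;

	switch (sense[0] & 0x7f) {      /* response code, VALID bit masked off */
	case 0x70:
	case 0x71:                      /* fixed format */
		if (len < 14)
			return false;
		key  = sense[2] & 0x0f;
		asc  = sense[12];
		ascq = sense[13];
		break;
	case 0x72:
	case 0x73:                      /* descriptor format */
		if (len < 4)
			return false;
		key  = sense[1] & 0x0f;
		asc  = sense[2];
		ascq = sense[3];
		break;
	default:
		return false;
	}
	return key == SENSE_KEY_NOT_READY && asc == 0x04 && ascq == 0x0c;
}
```

A scan loop could use such a check to tell "this one LUN is unavailable" apart from "the target stopped answering"; whether that is the right place for a workaround is exactly the open question here.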

I have two questions:
1. Is it correct for hardware to return sense 04h,0Ch to an INQUIRY
for a LUN it has just presented in its response to the REPORT LUNS command?
2. Since Windows and Solaris continue the scan, is it reasonable for
Linux to do the same, if only for fault tolerance?

Below is information about our storage setup:
The storage array is configured in cluster mode. A "default" target group
and a "default" initiator group exist on the storage; they contain the
target node names of both nodes in the cluster and all initiator names,
respectively.
On the partner node, a LUN with ID 7 was mapped to the default target
group/initiator group.
Since that LUN is owned by the partner node, the SCSI INQUIRY fails on it
and as a result the initiator aborts the scan.

Thanks,
Vaughan


* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-13 17:23 PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle Vaughan Cao
@ 2013-10-14 11:13 ` Hannes Reinecke
  2013-10-14 12:51   ` Steffen Maier
  2013-10-15  3:32   ` vaughan
  0 siblings, 2 replies; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-14 11:13 UTC (permalink / raw)
  To: Vaughan Cao; +Cc: JBottomley, linux-scsi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3560 bytes --]

On 10/13/2013 07:23 PM, Vaughan Cao wrote:
> I have two questions:
> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
> again, even after presenting this lun in responce to REPORT_LUN
> command?
Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.

> 2. Since windows and solaris can continue scan, is it reasonable for
> linux to do the same, even for a fault-tolerance purpose?
> 
Hmm. Yes, and no.

_Actually_ this is an issue with the target, as it looks as if it
returns the above sense code when an 'INQUIRY' is sent to the
device.
SPC explicitly states that the INQUIRY command should _not_ fail
for unavailable devices.
But yeah, we probably should work around this issue.
Nevertheless, please raise this issue with your array vendor.

Please try the attached patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

[-- Attachment #2: scsi_scan-continue-after-error.patch --]
[-- Type: text/x-patch, Size: 1116 bytes --]

From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Mon, 14 Oct 2013 13:11:22 +0200
Subject: [PATCH] scsi_scan: continue report_lun_scan after error

When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
does _not_ indicate that the entire target is done for.
So continue scanning for the remaining devices.

Signed-off-by: Hannes Reinecke <hare@suse.de>

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 307a811..973a121 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
 				lun, NULL, NULL, rescan, NULL);
 			if (res == SCSI_SCAN_NO_RESPONSE) {
 				/*
-				 * Got some results, but now none, abort.
+				 * Got some results, but now none, ignore.
 				 */
 				sdev_printk(KERN_ERR, sdev,
 					"Unexpected response"
-				        " from lun %d while scanning, scan"
-				        " aborted\n", lun);
-				break;
+					" from lun %d while scanning,"
+					" ignoring device\n", lun);
 			}
 		}
 	}
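
The effect of dropping the break can be illustrated with a small user-space model. This is only a sketch: the enum and function below are stand-ins invented for illustration, not the midlayer's own types, and the model ignores everything except the abort-or-continue decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the per-LUN probe outcome. */
enum probe_result { PROBE_NO_RESPONSE, PROBE_LUN_PRESENT };

/*
 * Walk the per-LUN probe outcomes in REPORT LUNS order and count how many
 * LUNs end up being added. With abort_on_no_response set, the loop stops at
 * the first failure (behaviour before the patch); otherwise the failing LUN
 * is skipped and the scan goes on (behaviour with the patch).
 */
static size_t model_report_lun_scan(const enum probe_result *res, size_t n,
				    bool abort_on_no_response)
{
	size_t added = 0;

	assert(res != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (res[i] == PROBE_NO_RESPONSE) {
			if (abort_on_no_response)
				break;
			continue;
		}
		added++;
	}
	return added;
}
```

With eight reported LUNs of which the seventh fails, the model adds six LUNs when aborting and seven when continuing, which is the difference reported against Windows and Solaris.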


* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 11:13 ` Hannes Reinecke
@ 2013-10-14 12:51   ` Steffen Maier
  2013-10-14 13:18     ` Hannes Reinecke
  2013-10-15  3:32   ` vaughan
  1 sibling, 1 reply; 16+ messages in thread
From: Steffen Maier @ 2013-10-14 12:51 UTC (permalink / raw)
  To: Hannes Reinecke, Vaughan Cao
  Cc: JBottomley, linux-scsi, linux-kernel, Krishna Gudipati

Hi Hannes,

On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
> From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
> From: Hannes Reinecke <hare@suse.de>
> Date: Mon, 14 Oct 2013 13:11:22 +0200
> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
> 
> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
> does _not_ indicate that the entire target is done for.
> So continue scanning for the remaining devices.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> 
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 307a811..973a121 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
>  				lun, NULL, NULL, rescan, NULL);
>  			if (res == SCSI_SCAN_NO_RESPONSE) {
>  				/*
> -				 * Got some results, but now none, abort.
> +				 * Got some results, but now none, ignore.
>  				 */
>  				sdev_printk(KERN_ERR, sdev,
>  					"Unexpected response"
> -				        " from lun %d while scanning, scan"
> -				        " aborted\n", lun);
> -				break;
> +					" from lun %d while scanning,"
> +					" ignoring device\n", lun);
>  			}
>  		}
>  	}

LLDDs that do their own initiator-based LUN masking (the midlayer does not offer this functionality; LLDDs use it to enable hardware virtualization without NPIV, or to work around suboptimal LUN masking on the target) are likely to return -ENXIO from slave_alloc(). This makes scsi_alloc_sdev() return NULL, which scsi_probe_and_add_lun() converts to SCSI_SCAN_NO_RESPONSE, so these LLDDs go through the same code path as above.

E.g. zfcp returns -ENXIO if the particular LUN has not been added to its unit whitelist (via the zfcp sysfs attribute unit_add).
If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the SCSI LUN (e.g. on remote port recovery), we see exactly the above error message for the first LUN in the REPORT LUNS response that is not explicitly attached to zfcp.
IIRC, other LLDDs such as bfa do similar things [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].

For those cases, I think it makes sense to abort scsi_report_lun_scan(). Otherwise we would force the LLDD to return -ENXIO for every single LUN that REPORT LUNS lists but that was not explicitly added to the LLDD's LUN whitelist, and this would likely *flood kernel messages*.

Maybe Vaughan's case needs to be distinguished in a patch.

Some more details (because I happened to have written this up already):

MESSAGE
=======

kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted

SUMMARY
=======

requirements for reproduction

1. zfcp with auto lun scan support but disabled
   (i.e. kernel >=2.6.37 , and no NPIV or zfcp.allow_lun_scan=0)
2. opened target port which supports the report lun SCSI command (SCSI-3)
3. attach lun 0 to that target port by means of zfcp's unit_add sysfs attribute
4. perform scsi target scan for that target port

=> message appears for first lun in list of report lun response
   which is not attached to zfcp by means of the unit_add sysfs attribute

Hence, this only occurs if requirement [3] above is met and
the storage target uses non-optimal LUN masking.
The message is harmless: either ignore it or fix the LUN masking.

Trigger [4] can occur in various situations;
see the examples below, sorted by increasing impact.

EXAMPLES
========

Kernel >= v2.6.37

While below uses a V7000 as target, the target type does not matter;
it's just the same with DS8000 or other storage.

[root@host:~](0)# scsi_logging_level -g
Current scsi logging level:
dev.scsi.logging_level = 4605

[root@host:~](0)# systool -m zfcp -v
Module = "zfcp"
  Parameters:
    allow_lun_scan      = "N"
    dbfsize             = "4"
    device              = "(null)"
    dif                 = "N"
    no_auto_port_rescan = "N"
    queue_depth         = "32"

[root@host:~](0)# chccwdev -e 3c40

[root@host:~](0)# ziorep_config -A
Host:    host0
CHPID:   60
Adapter: 0.0.3c40
Sub-Ch.: 0.0.001b
Name:    0xc05076ffe4801a51
P-Name:  0xc05076ffe4801a51
Version: 0x0006
LIC:     0x00000410
Type:    NPort (fabric via point-to-point)
Speed:   8 Gbit
State:   Online

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
 echo 0x0000000000000000 >| unit_add
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
 echo 0x0002000000000000 >| unit_add
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# lszfcp -D
0.0.3c40/0x5005076802100c1a/0x0000000000000000 0:0:17:0
0.0.3c40/0x5005076802100c1a/0x0002000000000000 0:0:17:2
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# lsscsi -g
[0:0:17:0]   disk    IBM      2145             0000  /dev/sda   /dev/sg0
[0:0:17:2]   disk    IBM      2145             0000  /dev/sdb   /dev/sg1
[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# sg_luns -v /dev/sg0
    report luns cdb: a0 00 00 00 00 00 00 00 20 00 00 00
    report luns: requested 8192 bytes but got 2376 bytes
Lun list length = 2368 which imples 296 lun entries
Report luns [select_report=0]:
    0000000000000000
    0001000000000000
    0002000000000000
    0003000000000000
    ...
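
The numbers in the sg_luns output follow from the REPORT LUNS response layout: a 4-byte big-endian list length, 4 reserved bytes, then one 8-byte entry per LUN. So 2376 bytes received = 8 header bytes + 2368 list bytes, and 2368 / 8 = 296 entries. A minimal user-space sketch of that arithmetic follows; the function names are made up, and the LUN decoding is a simplification that only looks at the first addressing level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPORT_LUNS_HEADER_LEN 8
#define REPORT_LUNS_ENTRY_LEN  8

/*
 * Number of LUN entries announced by the response header. This is what the
 * target claims to have, which may exceed what fit into the buffer.
 */
static size_t report_luns_entry_count(const uint8_t *resp, size_t resp_len)
{
	uint32_t list_len;

	assert(resp != NULL);
	if (resp_len < REPORT_LUNS_HEADER_LEN)
		return 0;
	list_len = ((uint32_t)resp[0] << 24) | ((uint32_t)resp[1] << 16) |
		   ((uint32_t)resp[2] << 8)  |  (uint32_t)resp[3];
	return list_len / REPORT_LUNS_ENTRY_LEN;
}

/*
 * First-level LUN number of one 8-byte entry. Simplification: the address
 * method bits (top two bits of byte 0) are masked off and the lower
 * addressing levels (bytes 2-7) are ignored.
 */
static uint16_t report_luns_first_level(const uint8_t entry[REPORT_LUNS_ENTRY_LEN])
{
	return (uint16_t)(((entry[0] & 0x3f) << 8) | entry[1]);
}
```

Under this simplification the entry 0002000000000000 decodes to LUN 2, matching the 0:0:17:2 device shown by lszfcp above.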

Example 1: SCSI HOST SCAN

this has negligible impact on currently running workload and can
safely be executed for individual reproduction

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)#
 echo "- - -" >| host0/scsi_host/host0/scan

kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted

Example 2: PORT RECOVERY

this causes a short interruption of I/O to all LUNs at that target port

includes a scsi target (re)scan of rport-0:0-17 / 0x5005076802100c1a

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40/0x5005076802100c1a](0)#
 echo 0 >| failed

kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: sd 0:0:17:0: Done: RETRY
kernel: sd 0:0:17:0:  Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK
kernel: sd 0:0:17:0: CDB: Report luns: a0 00 00 00 00 00 00 00 10 00 00 00
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: device exists on 0:0:17:2

The two trailing "device exists" messages come from zfcp's unit recovery
for each LUN at the recovered remote port. It causes additional individual
scsi_scan_target() calls, each for a specific LUN rather than a wildcard.

Example 3: ADAPTER RECOVERY

this causes a short interruption of I/O over all paths through this FCP device

includes recovery of rport-0:0-17 / 0x5005076802100c1a

[root@host:/sys/bus/ccw/drivers/zfcp/0.0.3c40](0)# echo 0 >| failed

kernel: qdio: 0.0.3c40 ZFCP on SC 1b using AI:1 QEBSM:1 PCI:1 TDD:1 SIGA: W A
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: Sending REPORT LUNS to host 0 channel 0 id 17 (try 0)
kernel: scsi scan: REPORT LUNS successful (try 0) result 0x0
kernel: sd 0:0:17:0: scsi scan: REPORT LUN scan
kernel: scsi scan: device exists on 0:0:17:0
kernel: sd 0:0:17:0: Unexpected response from lun 1 while scanning, scan aborted
kernel: scsi scan: device exists on 0:0:17:0
kernel: scsi scan: device exists on 0:0:17:2

DETAILS
=======

Square brackets indicate where above requirements come into play.

[4]
scsi_scan_target(prnt, 0/*channel*/, id/*target*/, SCAN_WILD_CARD/*lun*/, rscan)
__scsi_scan_target()
scsi_probe_and_add_lun(starget, 0, &bflags, NULL, rescan, NULL); [3]
scsi_report_lun_scan(starget, bflags, rescan) [2] {
 foreach lun in report lun response {
  scsi_probe_and_add_lun() {
   if exists => "kernel: scsi scan: device exists on <HCTL>"
   else {
    scsi_alloc_sdev() {
     ret = shost->hostt->slave_alloc() => zfcp_scsi_slave_alloc() {
  	if (!unit && !(allow_lun_scan && npiv)) {
  		put_device(&port->dev);
  		return -ENXIO;						[1]
  	}
     }
     if (ret) {
     	/*
     	 * if LLDD reports slave not present, don't clutter
     	 * console with alloc failure messages
     	 */
     	if (ret == -ENXIO)
     		display_failure_msg = 0;
     	goto out_device_destroy;
     }
    }
    if allocation failed, return early with SCSI_SCAN_NO_RESPONSE
    else continue lun probing
   }
  }
  if (res == SCSI_SCAN_NO_RESPONSE) {
  	/*
  	 * Got some results, but now none, abort.
  	 */
  	sdev_printk(KERN_ERR, sdev,
  		"Unexpected response"
  	" from lun %d while scanning, scan"
  	" aborted\n", lun);
  	break;
  }
 }
}
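
The chain above can be condensed into a small user-space model. It is a sketch under stated assumptions: the whitelist struct and all model_* names are invented for illustration, and only the -ENXIO to "no response" conversion and the message count are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum scan_result { SCAN_NO_RESPONSE, SCAN_LUN_PRESENT };

/* Hypothetical LLDD whitelist of explicitly attached LUNs. */
struct lun_whitelist {
	const uint64_t *luns;
	size_t count;
};

/* Models an LLDD slave_alloc(): 0 for a whitelisted LUN, -ENXIO otherwise. */
static int model_slave_alloc(const struct lun_whitelist *wl, uint64_t lun)
{
	assert(wl != NULL);
	for (size_t i = 0; i < wl->count; i++)
		if (wl->luns[i] == lun)
			return 0;
	return -ENXIO;
}

/* Models the probe step: a failed allocation surfaces as "no response". */
static enum scan_result model_probe_and_add_lun(const struct lun_whitelist *wl,
						uint64_t lun)
{
	return model_slave_alloc(wl, lun) ? SCAN_NO_RESPONSE : SCAN_LUN_PRESENT;
}

/* Counts how many "Unexpected response" messages one target scan would log. */
static size_t model_scan_messages(const struct lun_whitelist *wl,
				  const uint64_t *reported, size_t n,
				  bool abort_on_no_response)
{
	size_t messages = 0;

	for (size_t i = 0; i < n; i++) {
		if (model_probe_and_add_lun(wl, reported[i]) == SCAN_NO_RESPONSE) {
			messages++;
			if (abort_on_no_response)
				break;
		}
	}
	return messages;
}
```

With LUNs 0 and 2 attached and LUNs 0 to 3 reported, the model logs one message when the scan aborts and two when it continues. Scaled to the 296 entries reported by the target above, continuing would log one message per unattached LUN on every scan.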


-- 
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on System z Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 12:51   ` Steffen Maier
@ 2013-10-14 13:18     ` Hannes Reinecke
  2013-10-14 13:32       ` Hannes Reinecke
  2013-10-14 15:18       ` Vaughan Cao
  0 siblings, 2 replies; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-14 13:18 UTC (permalink / raw)
  To: Steffen Maier
  Cc: Vaughan Cao, JBottomley, linux-scsi, linux-kernel, Krishna Gudipati

On 10/14/2013 02:51 PM, Steffen Maier wrote:
> In LLDDs that do their own initiator based LUN masking (because the
> midlayer does not have this functionality to enable hardware
> virtualization without NPIV, or to work around suboptimal LUN
> masking on the target), they are likely to return -ENXIO from
> slave_alloc(), making scsi_alloc_sdev() return NULL, being converted
> to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
> through the same code path above.
> 
Ah. Hmm. Yes, they would.

However, I personally would question this approach, as SPC states that

> The REPORT LUNS command (see table 284) requests the device
> server to return the peripheral device logical unit inventory
> accessible to the I_T nexus.

So by plain reading this would mean that you should either modify
'REPORT LUNS' to not show the masked LUNs, or set the peripheral
qualifier (pqual) field to 001b or 011b for those LUNs.
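
For reference, the peripheral qualifier is a 3-bit field in the top bits of byte 0 of the standard INQUIRY data; the low five bits hold the peripheral device type. SPC defines 000b as "device connected", 001b as "not connected, but the device server could support one" and 011b as "device server not capable of supporting a device on this LUN". A minimal sketch of the decoding, with hypothetical helper names:

```c
#include <stdint.h>

/* Peripheral qualifier: bits 7-5 of INQUIRY data byte 0. */
static uint8_t inquiry_peripheral_qualifier(uint8_t byte0)
{
	return byte0 >> 5;
}

/* Peripheral device type: bits 4-0 of INQUIRY data byte 0. */
static uint8_t inquiry_device_type(uint8_t byte0)
{
	return byte0 & 0x1f;
}
```

A byte 0 of 7Fh therefore means qualifier 011b with device type 1Fh, the usual "nothing here" answer, while 20h means qualifier 001b with device type 00h (disk).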

> E.g. zfcp does return -ENXIO if the particular LUN was not made known
> to the unit whitelist (via zfcp sysfs attribute unit_add).
> If we attach LUN 0 (via unit_add) and trigger a target scan with
> SCAN_WILD_CARD for the scsi lun (e.g. on remote port recovery), we see
> exactly above error message for the first LUN in the response of
> report lun which is not explicitly attached to zfcp.
> IIRC, other LLDDs such as bfa also do similar stuff
> [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
> 
> For those cases, I think it makes sense to abort scsi_report_lun_scan().
> Otherwise we would force the LLDD to return -ENXIO for every single
> LUN reported by report lun but not explicitly added to the LLDD LUN
> whitelist; and this would likely *flood kernel messages*.
> 
> Maybe Vaughan's case needs to be distinguished in a patch.
> 
Well, as mentioned initially, the real issue is that the target
fails an INQUIRY while in the 'Unavailable' state, which, according to
SPC-3 (or later), is a violation of the spec.

So we _could_ just tell them to go away, but admittedly that's bad
style. Which means we'll have to implement a workaround; the above
was just a simple way of implementing it. If that doesn't work, of
course, we'll have to do something else.

Cheers,

Hannes


* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 13:18     ` Hannes Reinecke
@ 2013-10-14 13:32       ` Hannes Reinecke
  2013-10-14 15:24         ` Steffen Maier
  2013-10-14 15:18       ` Vaughan Cao
  1 sibling, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-14 13:32 UTC (permalink / raw)
  To: Steffen Maier; +Cc: Vaughan Cao, JBottomley, linux-scsi, linux-kernel

On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>>> From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
>>> From: Hannes Reinecke <hare@suse.de>
>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>
>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>> does _not_ indicate that the entire target is done for.
>>> So continue scanning for the remaining devices.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>
>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>> index 307a811..973a121 100644
>>> --- a/drivers/scsi/scsi_scan.c
>>> +++ b/drivers/scsi/scsi_scan.c
>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
>>>  				lun, NULL, NULL, rescan, NULL);
>>>  			if (res == SCSI_SCAN_NO_RESPONSE) {
>>>  				/*
>>> -				 * Got some results, but now none, abort.
>>> +				 * Got some results, but now none, ignore.
>>>  				 */
>>>  				sdev_printk(KERN_ERR, sdev,
>>>  					"Unexpected response"
>>> -				        " from lun %d while scanning, scan"
>>> -				        " aborted\n", lun);
>>> -				break;
>>> +					" from lun %d while scanning,"
>>> +					" ignoring device\n", lun);
>>>  			}
>>>  		}
>>>  	}
>>
>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>> functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN
>> masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev()
>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
>> through the same code path above.
>>
> Ah. Hmm. Yes, they would.
> 
> However, I personally would question this approach, as SPC states that
> 
>> The REPORT LUNS command (see table 284) requests the device
>> server to return the peripheral device logical unit inventory
>> accessible to the I_T nexus.
> 
> So by plain reading this would meant that you either should modify
> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
> '0x10' or '0x11' for those LUNs.
> 
>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>> (via zfcp sysfs attribute unit_add).
>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>> lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in
>> the response of report lun which is not explicitly attached to zfcp.
>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>
>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>> Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not
>> explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.
>>
>> Maybe Vaughan's case needs to be distinguished in a patch.
>>
> Well, as mentioned initially, the real issue is that the target
> aborts an INQUIRY while being in 'Unavailable'. Which, according to
> SPC-3 (or later), is a violation of the spec.
> 
> So we _could_ just tell them to go away, but admittedly that's bad
> style. Which means we'll have to implement a workaround; the above
> was just a simple way of implementing it. If that's not working of
> course we'll have to do something else.
> 
What about this patch:

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 973a121..01a7d69 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
                                     (sshdr.asc == 0x29)) &&
                                    (sshdr.ascq == 0))
                                        continue;
+                               /*
+                                * Some buggy implementations return
+                                * 'target port in unavailable state'
+                                * even on INQUIRY.
+                                * Set peripheral qualifier 3
+                                * for these devices.
+                                */
+                               if ((sshdr.sense_key == NOT_READY) &&
+                                   ((sshdr.asc == 0x04) &&
+                                    (sshdr.ascq == 0x0C))) {
+                                   inq_result[0] = 3 << 5;
+                                   return 0;
+                               }
                        }
                } else {
                        /*

(watchout, linebreaks mangled and all that).
Should be working for this particular case without interrupting
normal workflow, now should it not?
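
A minimal standalone sketch of the check in the hunk above, for trying the
logic outside the kernel. It assumes a simplified stand-in for the kernel's
sense header (`struct sense_hdr`) and two illustrative helpers
(`port_unavailable()`, `probe_result()`); these names are hypothetical and
not part of scsi_scan.c. The -1 failure value stands in for the -EIO that
the real probe path returns.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sense key value for NOT READY as defined by SPC. */
#define SENSE_KEY_NOT_READY 0x02

/* Hypothetical, reduced stand-in for the kernel's struct scsi_sense_hdr. */
struct sense_hdr {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/* True for "logical unit not accessible, target port in unavailable state". */
static bool port_unavailable(const struct sense_hdr *s)
{
	return s->sense_key == SENSE_KEY_NOT_READY &&
	       s->asc == 0x04 && s->ascq == 0x0c;
}

/*
 * Mimics the proposed workaround: on a match, fake byte 0 of the INQUIRY
 * data with peripheral qualifier 3 and report success (0). Any other
 * sense data is treated as a failed probe (-1 in this sketch).
 */
static int probe_result(const struct sense_hdr *s, uint8_t *inq_result)
{
	if (port_unavailable(s)) {
		inq_result[0] = 3 << 5;
		return 0;
	}
	return -1;
}
```

With sense 02h/04h/0Ch the helper stores 0x60 in byte 0 and returns 0; the
neighbouring ASCQ 0Bh (target port in standby state) is left on the failure
path, so only this one sense code is affected.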

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 13:18     ` Hannes Reinecke
  2013-10-14 13:32       ` Hannes Reinecke
@ 2013-10-14 15:18       ` Vaughan Cao
  1 sibling, 0 replies; 16+ messages in thread
From: Vaughan Cao @ 2013-10-14 15:18 UTC (permalink / raw)
  To: Hannes Reinecke, Steffen Maier
  Cc: JBottomley, linux-scsi, linux-kernel, Krishna Gudipati


On 2013-10-14 21:18, Hannes Reinecke wrote:
> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>> Hi Hannes,
>>
>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>> Hi James,
>>>>
>>>> [1.] One line summary of the problem:
>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>
>>>> [2.] Full description of the problem/report:
>>>> For instance, storage represents 8 iscsi LUNs, however the LUN No.7
>>>> is not well configured or has something wrong.
>>>> Then messages received:
>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>> Which will make LUN No.8 unavailable.
>>>> It's confirmed that Windows and Solaris systems will continue the
>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>
>>>> Log snippet is as below:
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY pass 1 length 36
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done: 0xffff8801e9bd4280 SUCCESS
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready [current]
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical unit not accessible, target port in unavailable state
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code 0x8000002
>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>
>>>> According to scsi_report_lun_scan(), I found:
>>>> Linux use an inquiry command to probe a lun according to the result
>>>> of report_lun command.
>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>> regards the whole peripheral not exist or dead.
>>>> If the return of inquiry passes its legal checking and indicates
>>>> 'LUN not present', it won't break but also continue with the scan
>>>> process.
>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>> (Logical unit not accessible, target port in unavailable state).
>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>>>> process is aborted.
>>>>
>>>> I have two questions:
>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>>>> again, even after presenting this lun in responce to REPORT_LUN
>>>> command?
>>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>>
>>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>
>>> Hmm. Yes, and no.
>>>
>>> _Actually_ this is an issue with the target, as it looks as if it
>>> will return the above sense code while sending an 'INQUIRY' to the
>>> device.
>>> SPC explicitely states that the INQUIRY command should _not_ fail
>>> for unavailable devices.
>>> But yeah, we probably should work around this issues.
>>> Nevertheless, please raise this issue with your array vendor.
>>>
>>> Please try the attached patch.
>>>
>>> Cheers,
>>>
>>> Hannes
>>>
>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>> functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN
>> masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev()
>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
>> through the same code path above.
>>
> Ah. Hmm. Yes, they would.
>
> However, I personally would question this approach, as SPC states that
>
>> The REPORT LUNS command (see table 284) requests the device
>> server to return the peripheral device logical unit inventory
>> accessible to the I_T nexus.
> So by plain reading this would meant that you either should modify
> 'REPORT LUNS' to not show the masked LUNs,
I have the same question. If you don't want us to use them, why do you
still present them in response to REPORT_LUN?
Since you report it in REPORT_LUN, I suppose the target server at least
holds some information about this LUN, so it shouldn't give an error when
I check it? It should give me something to suggest that the LUN does exist,
though we are not allowed to do more with it at this time.
Or does 'accessible' not mean accessible at this time, but that we have
the right to address this LUN in this session? Whether it's online or not
depends on the result of INQUIRY and TEST_UNIT_READY?

>   or set the pqual field to
> '0x10' or '0x11' for those LUNs.
Do you mean 001b?
After reading spc4r36g again, I'm confused about the difference between
pqual=000b and 001b.
It seems 000b doesn't guarantee a LUN is connected, while 001b indicates a
LUN is surely not connected?
Could anyone explain these two questions a bit more clearly?

###snippet from spc4
In response to an INQUIRY command received by an incorrect logical unit, 
the SCSI target device shall return
the INQUIRY data with the peripheral qualifier set to the value defined 
in 6.6.2. The INQUIRY command shall
return CHECK CONDITION status only if the device server is unable to 
return the requested INQUIRY data.

Table 175 — PERIPHERAL QUALIFIER field
Qualifier  Description
000b       A peripheral device having the specified peripheral device type is
           connected to this logical unit. *If the device server is unable to
           determine whether or not a peripheral device is connected, it also
           shall use this peripheral qualifier. This peripheral qualifier does
           not mean that the peripheral device connected to the logical unit
           is ready for access.*
001b       A peripheral device having the specified peripheral device type is
           not connected to this logical unit. However, the device server is
           capable of supporting the specified peripheral device type on this
           logical unit. (spc4r36g)
>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>> (via zfcp sysfs attribute unit_add).
>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>> lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in
>> the response of report lun which is not explicitly attached to zfcp.
>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>
>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>> Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not
>> explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.
To Steffen,
It acts like scsi_sequential_lun_scan().
* Generally, scan from LUN 1 (LUN 0 is assumed to already have been
* scanned) to some maximum lun until a LUN is found with no device
* attached.
But is there a case where a LUN in the middle is indeed broken while the
ones following it are fine, which would be worth tolerating?
Does that never happen?


Vaughan
>> Maybe Vaughan's case needs to be distinguished in a patch.
>>
> Well, as mentioned initially, the real issue is that the target
> aborts an INQUIRY while being in 'Unavailable'. Which, according to
> SPC-3 (or later), is a violation of the spec.
>
> So we _could_ just tell them to go away, but admittedly that's bad
> style. Which means we'll have to implement a workaround; the above
> was just a simple way of implementing it. If that's not working of
> course we'll have to do something else.
>
> Cheers,
>
> Hannes



* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 13:32       ` Hannes Reinecke
@ 2013-10-14 15:24         ` Steffen Maier
  2013-10-16  6:52           ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Steffen Maier @ 2013-10-14 15:24 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Vaughan Cao, JBottomley, linux-scsi, linux-kernel

On 10/14/2013 03:32 PM, Hannes Reinecke wrote:
> On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>>> [1.] One line summary of the problem:
>>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>>
>>>>> [2.] Full description of the problem/report:
>>>>> For instance, storage represents 8 iscsi LUNs, however the LUN No.7
>>>>> is not well configured or has something wrong.
>>>>> Then messages received:
>>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>> Which will make LUN No.8 unavailable.
>>>>> It's confirmed that Windows and Solaris systems will continue the
>>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>>
>>>>> Log snippet is as below:
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan: INQUIRY pass 1 length 36
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send: 0xffff8801e9bd4280
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer = 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done: 0xffff8801e9bd4280 SUCCESS
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB: Inquiry: 12 00 00 00 24 00
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key : Not Ready [current]
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense: Logical unit not accessible, target port in unavailable state
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host busy 1 failed 0
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes done.
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed with code 0x8000002
>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected response from lun 7 while scanning, scan aborted
>>>>>
>>>>> According to scsi_report_lun_scan(), I found:
>>>>> Linux use an inquiry command to probe a lun according to the result
>>>>> of report_lun command.
>>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>>> regards the whole peripheral not exist or dead.
>>>>> If the return of inquiry passes its legal checking and indicates
>>>>> 'LUN not present', it won't break but also continue with the scan
>>>>> process.
>>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>>> (Logical unit not accessible, target port in unavailable state).
>>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>>>>> process is aborted.
>>>>>
>>>>> I have two questions:
>>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>>>>> again, even after presenting this lun in responce to REPORT_LUN
>>>>> command?
>>>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>>>
>>>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>>
>>>> Hmm. Yes, and no.
>>>>
>>>> _Actually_ this is an issue with the target, as it looks as if it
>>>> will return the above sense code while sending an 'INQUIRY' to the
>>>> device.
>>>> SPC explicitely states that the INQUIRY command should _not_ fail
>>>> for unavailable devices.
>>>> But yeah, we probably should work around this issues.
>>>> Nevertheless, please raise this issue with your array vendor.
>>>>
>>>> Please try the attached patch.
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>>
>>>
>>>>  From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17 00:00:00 2001
>>>> From: Hannes Reinecke <hare@suse.de>
>>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>>
>>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>>> does _not_ indicate that the entire target is done for.
>>>> So continue scanning for the remaining devices.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>>
>>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>>> index 307a811..973a121 100644
>>>> --- a/drivers/scsi/scsi_scan.c
>>>> +++ b/drivers/scsi/scsi_scan.c
>>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct scsi_target *starget, int bflags,
>>>>   				lun, NULL, NULL, rescan, NULL);
>>>>   			if (res == SCSI_SCAN_NO_RESPONSE) {
>>>>   				/*
>>>> -				 * Got some results, but now none, abort.
>>>> +				 * Got some results, but now none, ignore.
>>>>   				 */
>>>>   				sdev_printk(KERN_ERR, sdev,
>>>>   					"Unexpected response"
>>>> -				        " from lun %d while scanning, scan"
>>>> -				        " aborted\n", lun);
>>>> -				break;
>>>> +					" from lun %d while scanning,"
>>>> +					" ignoring device\n", lun);
>>>>   			}
>>>>   		}
>>>>   	}
>>>
>>> In LLDDs that do their own initiator based LUN masking (because the midlayer does not have this
>>> functionality to enable hardware virtualization without NPIV, or to work around suboptimal LUN
>>> masking on the target), they are likely to return -ENXIO from slave_alloc(), making scsi_alloc_sdev()
>>> return NULL, being converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun() and thus going
>>> through the same code path above.
>>>
>> Ah. Hmm. Yes, they would.
>>
>> However, I personally would question this approach, as SPC states that
>>
>>> The REPORT LUNS command (see table 284) requests the device
>>> server to return the peripheral device logical unit inventory
>>> accessible to the I_T nexus.
>>
>> So by plain reading this would meant that you either should modify
>> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
>> '0x10' or '0x11' for those LUNs.

We need to distinguish two cases:
1) suboptimal lun masking on the target
2) hardware virtualization without NPIV

Regarding 1, one could require fixing lun masking on the target. 
However, some users cannot or do not want to configure it at a very fine granularity.
That's why s390 also does deferred device probing ("set online" in 
sysfs) or even limits bus sensing (cio_ignore).

Regarding 2, fixing lun masking on the target does not help because 
without NPIV, the target cannot distinguish the different virtual 
initiators since they are all behind one shared WWPN (and N-Port_ID).
This forces zfcp to implement initiator based lun masking, because only 
the user can tell which lun to attach to which of the virtual initiators 
sharing the same physical port. Without that, Linux would attach all 
luns to all virtual initiators, i.e. share inadvertently.
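
To make the trade-off between aborting and continuing the scan concrete,
here is a small userspace sketch of the loop over the REPORT LUNS result.
It is an illustration under stated assumptions, not the midlayer code: the
names `probe_lun()`, `scan_reported_luns()` and the plain array used as a
LUN whitelist are hypothetical, and the masked-LUN case is modelled simply
as "not in the whitelist gives no response".

```c
#include <stdbool.h>
#include <stddef.h>

enum scan_res { SCAN_LUN_PRESENT, SCAN_NO_RESPONSE };

/* Hypothetical LLDD-style masking: only whitelisted LUNs get a device. */
static enum scan_res probe_lun(unsigned lun, const unsigned *allowed,
			       size_t n_allowed)
{
	for (size_t i = 0; i < n_allowed; i++)
		if (allowed[i] == lun)
			return SCAN_LUN_PRESENT;
	return SCAN_NO_RESPONSE;
}

/*
 * Walk the reported LUN list. Returns how many "unexpected response"
 * messages would be logged; *attached receives the number of LUNs added.
 */
static size_t scan_reported_luns(const unsigned *reported, size_t n_reported,
				 const unsigned *allowed, size_t n_allowed,
				 bool abort_on_no_response, size_t *attached)
{
	size_t messages = 0;

	*attached = 0;
	for (size_t i = 0; i < n_reported; i++) {
		if (probe_lun(reported[i], allowed, n_allowed) ==
		    SCAN_LUN_PRESENT) {
			(*attached)++;
			continue;
		}
		messages++;		/* one log line per failed LUN */
		if (abort_on_no_response)
			break;		/* old behaviour: stop the scan */
	}
	return messages;
}
```

With eight reported LUNs and only LUN 0 whitelisted, the aborting variant
logs one message while the continuing variant logs seven, which is the
message flood mentioned for masked setups. The same loop also shows the
other side: with abort enabled, a single failing LUN in the middle hides
every LUN after it.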

>>> E.g. zfcp does return -ENXIO if the particular LUN was not made known to the unit whitelist
>>> (via zfcp sysfs attribute unit_add).
>>> If we attach LUN 0 (via unit_add) and trigger a target scan with SCAN_WILD_CARD for the scsi
>>> lun (e.g. on remote port recovery), we see exactly above error message for the first LUN in
>>> the response of report lun which is not explicitly attached to zfcp.
>>> IIRC, other LLDDs such as bfa also do similar stuff [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>>
>>> For those cases, I think it makes sense to abort scsi_report_lun_scan().
>>> Otherwise we would force the LLDD to return -ENXIO for every single LUN reported by report lun but not
>>> explicitly added to the LLDD LUN whitelist; and this would likely *flood kernel messages*.
>>>
>>> Maybe Vaughan's case needs to be distinguished in a patch.
>>>
>> Well, as mentioned initially, the real issue is that the target
>> aborts an INQUIRY while being in 'Unavailable'. Which, according to
>> SPC-3 (or later), is a violation of the spec.
>>
>> So we _could_ just tell them to go away, but admittedly that's bad
>> style. Which means we'll have to implement a workaround; the above
>> was just a simple way of implementing it. If that's not working of
>> course we'll have to do something else.
>>
> What about this patch:
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 973a121..01a7d69 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
>                                       (sshdr.asc == 0x29)) &&
>                                      (sshdr.ascq == 0))
>                                          continue;
> +                               /*
> +                                * Some buggy implementations return
> +                                * 'target port in unavailable state'
> +                                * even on INQUIRY.
> +                                * Set peripheral qualifier 3
> +                                * for these devices.
> +                                */
> +                               if ((sshdr.sense_key == NOT_READY) &&
> +                                   ((sshdr.asc == 0x04) &&
> +                                    (sshdr.ascq == 0x0C))) {

style question: lower case hex digits? 0x0c

Any reason why you put the conjunction of asc and ascq inside its own 
brackets instead of having all three (including sense_key) on the same 
level of one larger conjunction (as the code above does for UA asc 
0x28/0x29 ascq 0x00)? Should be semantically equivalent, isn't it? But 
then again, ascq always goes with asc, so they form a kind of pair.

> +                                   inq_result[0] = 3 << 5;
> +                                   return 0;
> +                               }
>                          }
>                  } else {
>                          /*
>
> (watchout, linebreaks mangled and all that).
> Should be working for this particular case without interrupting
> normal workflow, now should it not?

The approach of distinguishing the workaround close to the response of 
the inquiry sounds good to me. I suppose it won't break zfcp which is 
good. Unfortunately, I don't know what the ramifications of PQ==3 are 
(the SPC-4 description sounds good, though), nor enough details about 
this common code to tell if e.g. the early return is OK (skipping 
setting sdev->scsi_level near the end of scsi_probe_lun()). But then 
again, without inquiry reply we cannot get the level from the response. 
So I think the early return is OK after all.
I guess we want to get around "if (result) return -EIO;" but also do not 
want to execute the parts depending on result==0.

SPC-4 says that for PQ==3 the PDT should be set to 0x1f. Do we need to 
fake this here as well? (I assume the target did not fill in a PDT on 
its own when replying with sense data.)
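
For reference, byte 0 of the standard INQUIRY data packs the peripheral
qualifier into bits 7..5 and the peripheral device type into bits 4..0.
The sketch below only illustrates that layout; the helper names
(`inq_byte0()`, `inq_pq()`, `inq_pdt()`) are made up for this example and
are not kernel functions.

```c
#include <stdint.h>

/* Compose INQUIRY byte 0 from peripheral qualifier and device type. */
static uint8_t inq_byte0(uint8_t pq, uint8_t pdt)
{
	return (uint8_t)(((pq & 0x07) << 5) | (pdt & 0x1f));
}

/* Extract the peripheral qualifier (bits 7..5). */
static uint8_t inq_pq(uint8_t byte0)
{
	return byte0 >> 5;
}

/* Extract the peripheral device type (bits 4..0). */
static uint8_t inq_pdt(uint8_t byte0)
{
	return byte0 & 0x1f;
}
```

Under this layout, the `3 << 5` written by the patch decodes as PQ 3 with
PDT 00h, whereas PQ 3 together with PDT 1Fh would be 0x7f.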

The clarification on the T10 reflector seems to say that Linux would 
then accept LUNs with PQ 3, but the target shall not have put LUs with 
PQ 3 into the LU inventory in the first place?
Anyway, I'm not opposed to the workaround.

-- 
Mit freundlichen Grüßen / Kind regards
Steffen Maier

Linux on System z Development

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 11:13 ` Hannes Reinecke
  2013-10-14 12:51   ` Steffen Maier
@ 2013-10-15  3:32   ` vaughan
  2013-10-15  5:51     ` Hannes Reinecke
  1 sibling, 1 reply; 16+ messages in thread
From: vaughan @ 2013-10-15  3:32 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: JBottomley, linux-scsi, linux-kernel

On 10/14/2013 07:13 PM, Hannes Reinecke wrote:
> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
> (Logical unit not accessible, target port in unavailable state).
> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
> process is aborted.
>
> I have two questions:
> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
> again, even after presenting this lun in responce to REPORT_LUN
> command?
> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>
>> 2. Since windows and solaris can continue scan, is it reasonable for
>> linux to do the same, even for a fault-tolerance purpose?
>>
> Hmm. Yes, and no.
>
> _Actually_ this is an issue with the target, as it looks as if it
> will return the above sense code while sending an 'INQUIRY' to the
> device.
> SPC explicitely states that the INQUIRY command should _not_ fail
> for unavailable devices.
Hi all,

I found this below in spc4.
>>>
5.15.2.4.4 Target port group asymmetric access states - Standby state
While in the unavailable primary target port asymmetric access state,
the device server shall support those of
the following commands that it supports while in the active/optimized state:
a) INQUIRY (the peripheral qualifier (see 6.6.2) shall be set to 001b);
....
For those commands that are not supported, the device server shall
terminate the command with CHECK
CONDITION status, with the sense key set to NOT READY, and the
additional sense code set to LOGICAL
UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE.
<<<
From the above, I suppose the hardware may works very compliant with
spc. The case could be:
Storage is a alua supported target. Initiator sent REPORT_LUN to target,
target return all pqual=000b to it.
Then Initiator INQUIRY lun 7 which is in standby state where pqual=000b
not 001b. So this INQUIRY is regarded as
'not supported', and get terminated with CHECK_CONDITION,  sense key=NOT
READY, asc,ascq=04h,0ch.

Could you confirm if my understanding is right or wrong?

Thanks,
Vaughan
> But yeah, we probably should work around this issues.
> Nevertheless, please raise this issue with your array vendor.
>
> Please try the attached patch.
>
> Cheers,
>
> Hannes



* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-15  3:32   ` vaughan
@ 2013-10-15  5:51     ` Hannes Reinecke
  2013-10-15 11:46       ` Vaughan Cao
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-15  5:51 UTC (permalink / raw)
  To: vaughan; +Cc: JBottomley, linux-scsi, linux-kernel

On 10/15/2013 05:32 AM, vaughan wrote:
> On 10/14/2013 07:13 PM, Hannes Reinecke wrote:
>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>> (Logical unit not accessible, target port in unavailable state).
>> And this is ignored, so scsi_probe_lun() returns -EIO and the scan
>> process is aborted.
>>
>> I have two questions:
>> 1. Is it correct for hardware to return a sense 04h,0Ch to inquiry
>> again, even after presenting this lun in responce to REPORT_LUN
>> command?
>> Yes, this is correct. 'REPORT LUNS' is supported in 'Unavailable' state.
>>
>>> 2. Since windows and solaris can continue scan, is it reasonable for
>>> linux to do the same, even for a fault-tolerance purpose?
>>>
>> Hmm. Yes, and no.
>>
>> _Actually_ this is an issue with the target, as it looks as if it
>> will return the above sense code while sending an 'INQUIRY' to the
>> device.
>> SPC explicitely states that the INQUIRY command should _not_ fail
>> for unavailable devices.
> Hi all,
> 
> I found this below in spc4.
>>>>
> 5.15.2.4.4 Target port group asymmetric access states - Standby state
> While in the unavailable primary target port asymmetric access state,
> the device server shall support those of
> the following commands that it supports while in the active/optimized state:
> a) INQUIRY (the peripheral qualifier (see 6.6.2) shall be set to 001b);
> ....
> For those commands that are not supported, the device server shall
> terminate the command with CHECK
> CONDITION status, with the sense key set to NOT READY, and the
> additional sense code set to LOGICAL
> UNIT NOT ACCESSIBLE, TARGET PORT IN UNAVAILABLE STATE.
> <<<
> From the above, I suppose the hardware may in fact be compliant with
> SPC. The case could be:
> The storage is an ALUA-capable target. The initiator sent REPORT_LUN to
> the target, and the target returned all LUNs with pqual=000b.
> The initiator then sent INQUIRY to LUN 7, which is in standby state with
> pqual=000b, not 001b. So this INQUIRY is regarded as 'not supported' and
> gets terminated with CHECK_CONDITION, sense key=NOT READY,
> asc,ascq=04h,0Ch.
> 
> Could you confirm if my understanding is right or wrong?
> 
Wrong.

The sentence states that the device server _shall_ support those
commands, and the results should be identical to those returned if the
port were in active/optimized state.

So INQUIRY always has to be supported, regardless of which primary
ALUA state the port happens to be in.

(Otherwise we'd be hard-pressed to figure out whether the port is in
'unavailable' ALUA state in the first place, as without the INQUIRY
data we couldn't even _tell_ if ALUA is supported.)
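
For illustration, the sense data under discussion could be recognized
roughly as follows. This is only a standalone sketch, assuming
fixed-format sense data (response code 70h or 71h); the macro and
function names are made up for this example and are not the kernel's
own helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; not the kernel's definitions. */
#define SK_NOT_READY           0x02
#define ASC_LU_NOT_ACCESSIBLE  0x04
#define ASCQ_TPG_UNAVAILABLE   0x0c

/*
 * Return true if a fixed-format sense buffer carries
 * NOT READY with asc,ascq = 04h,0Ch.
 */
static bool sense_is_port_unavailable(const uint8_t *sense, size_t len)
{
	uint8_t response_code;

	/* ASC and ASCQ live in bytes 12 and 13 of fixed-format sense. */
	if (sense == NULL || len < 14)
		return false;

	response_code = sense[0] & 0x7f;
	if (response_code != 0x70 && response_code != 0x71)
		return false;	/* descriptor format is not handled here */

	return (sense[2] & 0x0f) == SK_NOT_READY &&
	       sense[12] == ASC_LU_NOT_ACCESSIBLE &&
	       sense[13] == ASCQ_TPG_UNAVAILABLE;
}
```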

So yeah, it really looks like a firmware issue here.

But that notwithstanding, did you get a chance to test my patch?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-15  5:51     ` Hannes Reinecke
@ 2013-10-15 11:46       ` Vaughan Cao
  0 siblings, 0 replies; 16+ messages in thread
From: Vaughan Cao @ 2013-10-15 11:46 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: JBottomley, linux-scsi, linux-kernel


On 2013-10-15 13:51, Hannes Reinecke wrote:
> But that notwithstanding, did you get a chance to test my patch?
>
> Cheers,
>
> Hannes
Hi Hannes,

The kernel is patched and I am waiting for feedback from the lab.

Thanks,
Vaughan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-14 15:24         ` Steffen Maier
@ 2013-10-16  6:52           ` Hannes Reinecke
  2013-10-16  7:26             ` vaughan
                               ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-16  6:52 UTC (permalink / raw)
  To: Steffen Maier; +Cc: Vaughan Cao, JBottomley, linux-scsi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 12936 bytes --]

On 10/14/2013 05:24 PM, Steffen Maier wrote:
> On 10/14/2013 03:32 PM, Hannes Reinecke wrote:
>> On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>>> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>>>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>>>> [1.] One line summary of the problem:
>>>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>>>
>>>>>> [2.] Full description of the problem/report:
>>>>>> For instance, storage represents 8 iscsi LUNs, however the LUN
>>>>>> No.7
>>>>>> is not well configured or has something wrong.
>>>>>> Then messages received:
>>>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while
>>>>>> scanning, scan aborted
>>>>>> Which will make LUN No.8 unavailable.
>>>>>> It's confirmed that Windows and Solaris systems will continue the
>>>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>>>
>>>>>> Log snippet is as below:
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan:
>>>>>> INQUIRY pass 1 length 36
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send:
>>>>>> 0xffff8801e9bd4280
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer =
>>>>>> 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done:
>>>>>> 0xffff8801e9bd4280 SUCCESS
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result:
>>>>>> hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key :
>>>>>> Not Ready [current]
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense:
>>>>>> Logical unit not accessible, target port in unavailable state
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host
>>>>>> busy 1 failed 0
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes
>>>>>> done.
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed
>>>>>> with code 0x8000002
>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected
>>>>>> response from lun 7 while scanning, scan aborted
>>>>>>
>>>>>> According to scsi_report_lun_scan(), I found:
>>>>>> Linux use an inquiry command to probe a lun according to the
>>>>>> result
>>>>>> of report_lun command.
>>>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>>>> regards the whole peripheral not exist or dead.
>>>>>> If the return of inquiry passes its legal checking and indicates
>>>>>> 'LUN not present', it won't break but also continue with the scan
>>>>>> process.
>>>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>>>> (Logical unit not accessible, target port in unavailable state).
>>>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the
>>>>>> scan
>>>>>> process is aborted.
>>>>>>
>>>>>> I have two questions:
>>>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to
>>>>>> inquiry
>>>>>> again, even after presenting this lun in response to REPORT_LUN
>>>>>> command?
>>>>> Yes, this is correct. 'REPORT LUNS' is supported in
>>>>> 'Unavailable' state.
>>>>>
>>>>>> 2. Since windows and solaris can continue scan, is it
>>>>>> reasonable for
>>>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>>>
>>>>> Hmm. Yes, and no.
>>>>>
>>>>> _Actually_ this is an issue with the target, as it looks as if it
>>>>> will return the above sense code while sending an 'INQUIRY' to the
>>>>> device.
>>>>> SPC explicitly states that the INQUIRY command should _not_ fail
>>>>> for unavailable devices.
>>>>> But yeah, we probably should work around this issue.
>>>>> Nevertheless, please raise this issue with your array vendor.
>>>>>
>>>>> Please try the attached patch.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Hannes
>>>>>
>>>>
>>>>>  From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17
>>>>> 00:00:00 2001
>>>>> From: Hannes Reinecke <hare@suse.de>
>>>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>>>
>>>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>>>> does _not_ indicate that the entire target is done for.
>>>>> So continue scanning for the remaining devices.
>>>>>
>>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>>>
>>>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>>>> index 307a811..973a121 100644
>>>>> --- a/drivers/scsi/scsi_scan.c
>>>>> +++ b/drivers/scsi/scsi_scan.c
>>>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct
>>>>> scsi_target *starget, int bflags,
>>>>>                   lun, NULL, NULL, rescan, NULL);
>>>>>               if (res == SCSI_SCAN_NO_RESPONSE) {
>>>>>                   /*
>>>>> -                 * Got some results, but now none, abort.
>>>>> +                 * Got some results, but now none, ignore.
>>>>>                    */
>>>>>                   sdev_printk(KERN_ERR, sdev,
>>>>>                       "Unexpected response"
>>>>> -                        " from lun %d while scanning, scan"
>>>>> -                        " aborted\n", lun);
>>>>> -                break;
>>>>> +                    " from lun %d while scanning,"
>>>>> +                    " ignoring device\n", lun);
>>>>>               }
>>>>>           }
>>>>>       }
>>>>
>>>> In LLDDs that do their own initiator based LUN masking (because
>>>> the midlayer does not have this functionality to enable hardware
>>>> virtualization without NPIV, or to work around suboptimal LUN
>>>> masking on the target), they are likely to return -ENXIO from
>>>> slave_alloc(), making scsi_alloc_sdev() return NULL, being
>>>> converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun()
>>>> and thus going through the same code path above.
>>>>
>>> Ah. Hmm. Yes, they would.
>>>
>>> However, I personally would question this approach, as SPC states
>>> that
>>>
>>>> The REPORT LUNS command (see table 284) requests the device
>>>> server to return the peripheral device logical unit inventory
>>>> accessible to the I_T nexus.
>>>
>>> So by plain reading this would mean that you either should modify
>>> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
>>> '0x10' or '0x11' for those LUNs.
> 
> We need to distinguish two cases:
> 1) suboptimal lun masking on the target
> 2) hardware virtualization without NPIV
> 
> Regarding 1, one could require fixing lun masking on the target.
> However, some users cannot or do not want to do it very fine
> granular. That's why s390 also does deferred device probing ("set
> online" in sysfs) or even limits bus sensing (cio_ignore).
> 
> Regarding 2, fixing lun masking on the target does not help because
> without NPIV, the target cannot distinguish the different virtual
> initators since they are all behind one shared WWPN (and N-Port_ID).
> This forces zfcp to implement initiator based lun masking, because
> only the user can tell which lun to attach to which of the virtual
> initiators sharing the same physical port. Without that, Linux would
> attach all luns to all virtual initiators, i.e. share inadvertently.
> 
>>>> E.g. zfcp does return -ENXIO if the particular LUN was not made
>>>> known to the unit whitelist (via zfcp sysfs attribute unit_add).
>>>> If we attach LUN 0 (via unit_add) and trigger a target scan with
>>>> SCAN_WILD_CARD for the scsi lun (e.g. on remote port recovery),
>>>> we see exactly above error message for the first LUN in the
>>>> response of report lun which is not explicitly attached to zfcp.
>>>> IIRC, other LLDDs such as bfa also do similar stuff
>>>> [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>>>
>>>> For those cases, I think it makes sense to abort
>>>> scsi_report_lun_scan().
>>>> Otherwise we would force the LLDD to return -ENXIO for every
>>>> single LUN reported by report lun but not explicitly added to the
>>>> LLDD LUN whitelist; and this would likely *flood kernel messages*.
>>>>
>>>> Maybe Vaughan's case needs to be distinguished in a patch.
>>>>
>>> Well, as mentioned initially, the real issue is that the target
>>> aborts an INQUIRY while being in 'Unavailable'. Which, according to
>>> SPC-3 (or later), is a violation of the spec.
>>>
>>> So we _could_ just tell them to go away, but admittedly that's bad
>>> style. Which means we'll have to implement a workaround; the above
>>> was just a simple way of implementing it. If that's not working of
>>> course we'll have to do something else.
>>>
>> What about this patch:
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 973a121..01a7d69 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
>>                                       (sshdr.asc == 0x29)) &&
>>                                      (sshdr.ascq == 0))
>>                                          continue;
>> +                               /*
>> +                                * Some buggy implementations return
>> +                                * 'target port in unavailable state'
>> +                                * even on INQUIRY.
>> +                                * Set peripheral qualifier 3
>> +                                * for these devices.
>> +                                */
>> +                               if ((sshdr.sense_key == NOT_READY) &&
>> +                                   ((sshdr.asc == 0x04) &&
>> +                                    (sshdr.ascq == 0x0C))) {
> 
> style question: lower case hex digits? 0x0c
> 
Yeah. This is a test, after all ...

> Any reason why you put the conjunction of asc and ascq inside its
> own brackets instead of having all three (including sense_key) on
> the same level of one larger conjunction (as the code above does for
> UA asc 0x28/0x29 ascq 0x00)? Should be semantically equivalent,
> isn't it? But then again, ascq always goes with asc, so they form a
> kind of pair.
> 
No reason, just a copy&paste error from the above statement ...

>> +                                   inq_result[0] = 3 << 5;
>> +                                   return 0;
>> +                               }
>>                          }
>>                  } else {
>>                          /*
>>
>> (watchout, linebreaks mangled and all that).
>> Should be working for this particular case without interrupting
>> normal workflow, now should it not?
> 
> The approach of distinguishing the workaround close to the response
> of the inquiry sounds good to me. I suppose it won't break zfcp
> which is good. Unfortunately, I don't know what the ramifications of
> PQ==3 are (the SPC-4 description sounds good, though), nor enough
> details about this common code to tell if e.g. the early return is
> OK (skipping setting sdev->scsi_level near the end of
> scsi_probe_lun()). But then again, without inquiry reply we cannot
> get the level from the response. So I think the early return is OK
> after all.
> I guess we want to get around "if (result) return -EIO;" but also do
> not want to execute the parts depending on result==0.
> 
> SPC-4 says that for PQ==3 the PDT should be set to 0x1f. Do we need
> to fake this here as well? (I assume the target did not fill in a
> PDT on its own when replying with sense data.)
> 
> The clarification on the T10 reflector seems to say that Linux would
> then accept LUNs with PQ 3, but the target shall not have put LUs
> with PQ 3 into the LU inventory in the first place?
> Anyway, I'm not opposed to the workaround.
> 
Well, first and foremost this is a workaround for buggy array
firmware. Even if a port is in 'unavailable' state, the target port is
still required to respond to an INQUIRY.
_Not_ doing so leaves us with no indication of what's going on here.

The main reason why I chose PQ=3 here is that we'll end up ignoring
this device in scsi_probe_and_add_lun() later on,
saving me from coding higher up the stack.
And, seeing that the device is never actually allocated, the
modifications we made to the inquiry data will be deleted anyway.

So using PQ=3 here is just a vehicle for telling the system to not
create a SCSI device at this LUN, _not_ something which has some
relevance to SPC.
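
To make the 'inq_result[0] = 3 << 5' trick concrete, here is a small
standalone sketch of how byte 0 of the standard INQUIRY data splits
into peripheral qualifier (bits 7..5) and peripheral device type
(bits 4..0). The helper names are hypothetical and exist only for this
example; they are not taken from the kernel.

```c
#include <stdint.h>

/* Peripheral qualifier: upper three bits of INQUIRY byte 0. */
static inline unsigned int inq_pq(uint8_t byte0)
{
	return (byte0 >> 5) & 0x07;
}

/* Peripheral device type: lower five bits of INQUIRY byte 0. */
static inline unsigned int inq_pdt(uint8_t byte0)
{
	return byte0 & 0x1f;
}

/* Compose byte 0 from a qualifier and a device type. */
static inline uint8_t inq_byte0(unsigned int pq, unsigned int pdt)
{
	return (uint8_t)(((pq & 0x07) << 5) | (pdt & 0x1f));
}
```

With these, '3 << 5' is 0x60, i.e. PQ=3 with PDT=0, whereas PQ=3
together with PDT=0x1f (as Steffen asked about) would be 0x7f.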

But seeing that this approach raises quite some issues I've attached
a different patch.
Vaughan, could you test with that, too? Should be functionally
equivalent to the previous one.
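
The intended effect on the scan loop can be sketched in isolation like
this. It is a simplified userspace model, not the code from
scsi_scan.c: the enum, the callback type and the two example probe
functions are all invented for illustration. The point is only that a
'target present' result skips one LUN, while 'no response' still aborts
the scan.

```c
#include <stddef.h>

/* Hypothetical three-way probe result, modelled on the SCSI_SCAN_* idea. */
enum probe_result {
	PROBE_NO_RESPONSE,	/* nothing usable came back */
	PROBE_TARGET_PRESENT,	/* target answered, but no device at this LUN */
	PROBE_LUN_PRESENT	/* a device was found and attached */
};

typedef enum probe_result (*probe_fn)(unsigned int lun);

/* Walk the reported LUN list; return how many LUNs were attached. */
static size_t scan_reported_luns(const unsigned int *luns, size_t n,
				 probe_fn probe)
{
	size_t attached = 0;

	for (size_t i = 0; i < n; i++) {
		enum probe_result res = probe(luns[i]);

		if (res == PROBE_NO_RESPONSE)
			break;		/* give up on the rest of the list */
		if (res == PROBE_LUN_PRESENT)
			attached++;
		/* PROBE_TARGET_PRESENT: skip this LUN and keep going */
	}
	return attached;
}

/* Example probe: LUN 7 reports 'unavailable', everything else attaches. */
static enum probe_result probe_lun7_unavailable(unsigned int lun)
{
	return lun == 7 ? PROBE_TARGET_PRESENT : PROBE_LUN_PRESENT;
}

/* Example probe: LUN 7 does not respond at all. */
static enum probe_result probe_lun7_silent(unsigned int lun)
{
	return lun == 7 ? PROBE_NO_RESPONSE : PROBE_LUN_PRESENT;
}
```

For a list of LUNs 1 to 8, the first example probe leaves seven LUNs
attached (8 is still reached), the second only six.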

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

[-- Attachment #2: scsi_scan-continue-scan-for-LUNs-in.patch --]
[-- Type: text/x-patch, Size: 2156 bytes --]

From 12a949d293e698960984225cdd69cd68d1bdecbc Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Wed, 16 Oct 2013 08:49:15 +0200
Subject: [PATCH] scsi_scan: Continue scan for LUNs in 'unavailable' ALUA
 state

Some buggy array firmware will terminate the INQUIRY command
when in 'unavailable' ALUA state. This will cause the scan
to be aborted, so devices beyond that LUN will never be
scanned.

While this behaviour is a violation of SPC, we should
nevertheless behave nicely and allow scanning to continue.

Signed-off-by: Hannes Reinecke <hare@suse.de>

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 973a121..ec02f97 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -594,6 +594,16 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 				     (sshdr.asc == 0x29)) &&
 				    (sshdr.ascq == 0))
 					continue;
+				/*
+				 * Some buggy implementations return
+				 * 'target port in unavailable state'
+				 * even on INQUIRY.
+				 */
+				if ((sshdr.sense_key == NOT_READY) &&
+				    (sshdr.asc == 0x04) &&
+				    (sshdr.ascq == 0x0c)) {
+					return SCSI_SCAN_TARGET_PRESENT;
+				}
 			}
 		} else {
 			/*
@@ -661,7 +671,7 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 	/* If the last transfer attempt got an error, assume the
 	 * peripheral doesn't exist or is dead. */
 	if (result)
-		return -EIO;
+		return SCSI_SCAN_NO_RESPONSE;
 
 	/* Don't report any more data than the device says is valid */
 	sdev->inquiry_len = min(try_inquiry_len, response_len);
@@ -711,7 +721,7 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
 		sdev->scsi_level++;
 	sdev->sdev_target->scsi_level = sdev->scsi_level;
 
-	return 0;
+	return SCSI_SCAN_LUN_PRESENT;
 }
 
 /**
@@ -1046,7 +1056,8 @@ static int scsi_probe_and_add_lun(struct scsi_target *starget,
 	if (!result)
 		goto out_free_sdev;
 
-	if (scsi_probe_lun(sdev, result, result_len, &bflags))
+	res = scsi_probe_lun(sdev, result, result_len, &bflags);
+	if (res != SCSI_SCAN_LUN_PRESENT)
 		goto out_free_result;
 
 	if (bflagsp)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-16  6:52           ` Hannes Reinecke
@ 2013-10-16  7:26             ` vaughan
  2013-10-21  6:07             ` vaughan
  2014-02-19  8:29             ` vaughan
  2 siblings, 0 replies; 16+ messages in thread
From: vaughan @ 2013-10-16  7:26 UTC (permalink / raw)
  To: Hannes Reinecke, Steffen Maier; +Cc: JBottomley, linux-scsi, linux-kernel

On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
> But seeing that this approach raises quite some issues I've attached a
> different patch. Vaughan, could you test with that, too? Should be
> functionally equivalent to the previous one. Cheers, Hannes 
Of course. This one expresses our intention more clearly than setting
PQ 3 to break out.

Vaughan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-16  6:52           ` Hannes Reinecke
  2013-10-16  7:26             ` vaughan
@ 2013-10-21  6:07             ` vaughan
  2013-10-22 17:05               ` Hannes Reinecke
  2013-12-18 13:51               ` Vaughan Cao
  2014-02-19  8:29             ` vaughan
  2 siblings, 2 replies; 16+ messages in thread
From: vaughan @ 2013-10-21  6:07 UTC (permalink / raw)
  To: Hannes Reinecke, Steffen Maier; +Cc: JBottomley, linux-scsi, linux-kernel

On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
> But seeing that this approach raises quite some issues I've attached a
> different patch. Vaughan, could you test with that, too? Should be
> functionally equivalent to the previous one. Cheers, Hannes 
Hi Hannes,

We only tested the later patch, which returns _TARGET_PRESENT after
parsing the sense data; it works as expected.

About the cause of this issue: the admin said he is configuring
active-active cluster mode storage. Each node has its own LUN pool and a
set of rules to control which node can access the pool.
LUN7 is owned by, and can only be manipulated by, the other node, but
it is visible to this node because of a misconfiguration. So it presents
itself in REPORT_LUN but returns NOT_READY when accessed through this node.
Do you still regard this as misbehaviour in response to INQUIRY?

Thanks,
Vaughan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-21  6:07             ` vaughan
@ 2013-10-22 17:05               ` Hannes Reinecke
  2013-12-18 13:51               ` Vaughan Cao
  1 sibling, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2013-10-22 17:05 UTC (permalink / raw)
  To: vaughan, Steffen Maier; +Cc: JBottomley, linux-scsi, linux-kernel

On 10/21/2013 08:07 AM, vaughan wrote:
> On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
>> But seeing that this approach raises quite some issues I've attached a
>> different patch. Vaughan, could you test with that, too? Should be
>> functionally equivalent to the previous one. Cheers, Hannes
> Hi Hannes,
>
> We only tested the later patch, which returns _TARGET_PRESENT after
> parsing the sense data; it works as expected.
>
> About the cause of this issue: the admin said he is configuring
> active-active cluster mode storage. Each node has its own LUN pool and a
> set of rules to control which node can access the pool.
> LUN7 is owned by, and can only be manipulated by, the other node, but
> it is visible to this node because of a misconfiguration. So it presents
> itself in REPORT_LUN but returns NOT_READY when accessed through this node.
> Do you still regard this as misbehaviour in response to INQUIRY?
>
Yes. INQUIRY _has_ to succeed. The only exceptions here would be devices
in 'Offline' state.
But other than that, yes, INQUIRY should never abort with an error,
especially for ALUA.
ALUA relies on 'report target port groups' and INQUIRY EVPD page 0x83 to 
identify the target port group state.
So if INQUIRY does _not_ work you cannot figure out the ALUA state,
and by rights you would need to disable ALUA there.
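
As an illustration of the two INQUIRY variants involved, here is a
standalone sketch that builds the 6-byte CDB for both the standard
INQUIRY seen in the log (12 00 00 00 24 00) and an EVPD request for
page 0x83. The function name and the allocation length of 252 are
assumptions for this example only, not what the kernel uses.

```c
#include <stdint.h>
#include <string.h>

#define INQUIRY_OPCODE 0x12

/*
 * Fill in a 6-byte INQUIRY CDB.
 * evpd != 0 requests the vital product data page given in 'page';
 * otherwise the standard INQUIRY data is requested and 'page' is ignored.
 */
static void build_inquiry_cdb(uint8_t cdb[6], int evpd, uint8_t page,
			      uint16_t alloc_len)
{
	memset(cdb, 0, 6);
	cdb[0] = INQUIRY_OPCODE;
	if (evpd) {
		cdb[1] = 0x01;		/* EVPD bit */
		cdb[2] = page;		/* e.g. 0x83, device identification */
	}
	cdb[3] = (uint8_t)(alloc_len >> 8);	/* allocation length, MSB */
	cdb[4] = (uint8_t)(alloc_len & 0xff);	/* allocation length, LSB */
	/* cdb[5] is the CONTROL byte and stays zero */
}
```

Calling it with (0, 0, 36) reproduces the CDB from the log; calling it
with (1, 0x83, 252) gives 12 01 83 00 fc 00.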


Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-21  6:07             ` vaughan
  2013-10-22 17:05               ` Hannes Reinecke
@ 2013-12-18 13:51               ` Vaughan Cao
  1 sibling, 0 replies; 16+ messages in thread
From: Vaughan Cao @ 2013-12-18 13:51 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Steffen Maier, JBottomley, linux-scsi, linux-kernel


On 2013-10-21 14:07, vaughan wrote:
> On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
>> But seeing that this approach raises quite some issues I've attached a
>> different patch. Vaughan, could you test with that, too? Should be
>> functionally equivalent to the previous one. Cheers, Hannes
> Hi Hannes,
>
> We only tested the later patch, which returns _TARGET_PRESENT after
> parsing the sense data; it works as expected.
>
Hi Hannes,

Will your second patch be included in mainline?

Thanks,
Vaughan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
  2013-10-16  6:52           ` Hannes Reinecke
  2013-10-16  7:26             ` vaughan
  2013-10-21  6:07             ` vaughan
@ 2014-02-19  8:29             ` vaughan
  2 siblings, 0 replies; 16+ messages in thread
From: vaughan @ 2014-02-19  8:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-kernel, SCSI development list, JBottomley, Steffen Maier

Hi Hannes,

Sorry to bother you.
Months ago, you made a patch to fix this scsi_scan abort error found on
zfssa storage. Though it was triggered by one specific storage array, the
logic (not aborting the SCSI scan because an INQUIRY to one LU in the
middle fails) is helpful as a way to make our scanning more resilient.
I'd prefer to keep our device scanning behavior in sync with other OSes
like Solaris and Windows.
Will you merge your patch into mainline? You can find the patch here:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg521753.html

Regards,
Vaughan

On 10/16/2013 02:52 PM, Hannes Reinecke wrote:
> On 10/14/2013 05:24 PM, Steffen Maier wrote:
>> On 10/14/2013 03:32 PM, Hannes Reinecke wrote:
>>> On 10/14/2013 03:18 PM, Hannes Reinecke wrote:
>>>> On 10/14/2013 02:51 PM, Steffen Maier wrote:
>>>>> On 10/14/2013 01:13 PM, Hannes Reinecke wrote:
>>>>>> On 10/13/2013 07:23 PM, Vaughan Cao wrote:
>>>>>>> [1.] One line summary of the problem:
>>>>>>> special sense code asc,ascq=04h,0Ch abort scsi scan in the middle
>>>>>>>
>>>>>>> [2.] Full description of the problem/report:
>>>>>>> For instance, storage represents 8 iscsi LUNs, however the LUN
>>>>>>> No.7
>>>>>>> is not well configured or has something wrong.
>>>>>>> Then messages received:
>>>>>>> kernel: scsi 5:0:0:0: Unexpected response from lun 7 while
>>>>>>> scanning, scan aborted
>>>>>>> Which will make LUN No.8 unavailable.
>>>>>>> It's confirmed that Windows and Solaris systems will continue the
>>>>>>> scan and make LUN No.1,2,3,4,5,6 and 8 available.
>>>>>>>
>>>>>>> Log snippet is as below:
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi scan:
>>>>>>> INQUIRY pass 1 length 36
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Send:
>>>>>>> 0xffff8801e9bd4280
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: buffer =
>>>>>>> 0xffff8801f71fc180, bufflen = 36, queuecommand 0xffffffffa00b99e7
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: leaving scsi_dispatch_cmnd()
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Done:
>>>>>>> 0xffff8801e9bd4280 SUCCESS
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Result:
>>>>>>> hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: CDB:
>>>>>>> Inquiry: 12 00 00 00 24 00
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Sense Key :
>>>>>>> Not Ready [current]
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: Add. Sense:
>>>>>>> Logical unit not accessible, target port in unavailable state
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:7: scsi host
>>>>>>> busy 1 failed 0
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: 0 sectors total, 36 bytes
>>>>>>> done.
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi scan: INQUIRY failed
>>>>>>> with code 0x8000002
>>>>>>> Aug 24 00:32:49 vmhodtest019 kernel: scsi 5:0:0:0: Unexpected
>>>>>>> response from lun 7 while scanning, scan aborted
>>>>>>>
>>>>>>> According to scsi_report_lun_scan(), I found:
>>>>>>> Linux use an inquiry command to probe a lun according to the
>>>>>>> result
>>>>>>> of report_lun command.
>>>>>>> It assumes every probe cmd will get a legal result. Otherwise, it
>>>>>>> regards the whole peripheral not exist or dead.
>>>>>>> If the return of inquiry passes its legal checking and indicates
>>>>>>> 'LUN not present', it won't break but also continue with the scan
>>>>>>> process.
>>>>>>> In the log, inquiry to LUN7 return a sense - asc,ascq=04h,0Ch
>>>>>>> (Logical unit not accessible, target port in unavailable state).
>>>>>>> And this is ignored, so scsi_probe_lun() returns -EIO and the
>>>>>>> scan
>>>>>>> process is aborted.
>>>>>>>
>>>>>>> I have two questions:
>>>>>>> 1. Is it correct for hardware to return a sense 04h,0Ch to
>>>>>>> inquiry
>>>>>>> again, even after presenting this lun in response to REPORT_LUN
>>>>>>> command?
>>>>>> Yes, this is correct. 'REPORT LUNS' is supported in
>>>>>> 'Unavailable' state.
>>>>>>
>>>>>>> 2. Since windows and solaris can continue scan, is it
>>>>>>> reasonable for
>>>>>>> linux to do the same, even for a fault-tolerance purpose?
>>>>>>>
>>>>>> Hmm. Yes, and no.
>>>>>>
>>>>>> _Actually_ this is an issue with the target, as it looks as if it
>>>>>> will return the above sense code while sending an 'INQUIRY' to the
>>>>>> device.
>>>>>> SPC explicitly states that the INQUIRY command should _not_ fail
>>>>>> for unavailable devices.
>>>>>> But yeah, we probably should work around this issue.
>>>>>> Nevertheless, please raise this issue with your array vendor.
>>>>>>
>>>>>> Please try the attached patch.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>>  From b0e90778f012010c881f8bdc03bce63a36921b77 Mon Sep 17
>>>>>> 00:00:00 2001
>>>>>> From: Hannes Reinecke <hare@suse.de>
>>>>>> Date: Mon, 14 Oct 2013 13:11:22 +0200
>>>>>> Subject: [PATCH] scsi_scan: continue report_lun_scan after error
>>>>>>
>>>>>> When scsi_probe_and_add_lun() fails in scsi_report_lun_scan() this
>>>>>> does _not_ indicate that the entire target is done for.
>>>>>> So continue scanning for the remaining devices.
>>>>>>
>>>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>>>>
>>>>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>>>>> index 307a811..973a121 100644
>>>>>> --- a/drivers/scsi/scsi_scan.c
>>>>>> +++ b/drivers/scsi/scsi_scan.c
>>>>>> @@ -1484,13 +1484,12 @@ static int scsi_report_lun_scan(struct
>>>>>> scsi_target *starget, int bflags,
>>>>>>                   lun, NULL, NULL, rescan, NULL);
>>>>>>               if (res == SCSI_SCAN_NO_RESPONSE) {
>>>>>>                   /*
>>>>>> -                 * Got some results, but now none, abort.
>>>>>> +                 * Got some results, but now none, ignore.
>>>>>>                    */
>>>>>>                   sdev_printk(KERN_ERR, sdev,
>>>>>>                       "Unexpected response"
>>>>>> -                        " from lun %d while scanning, scan"
>>>>>> -                        " aborted\n", lun);
>>>>>> -                break;
>>>>>> +                    " from lun %d while scanning,"
>>>>>> +                    " ignoring device\n", lun);
>>>>>>               }
>>>>>>           }
>>>>>>       }
>>>>> In LLDDs that do their own initiator based LUN masking (because
>>>>> the midlayer does not have this functionality to enable hardware
>>>>> virtualization without NPIV, or to work around suboptimal LUN
>>>>> masking on the target), they are likely to return -ENXIO from
>>>>> slave_alloc(), making scsi_alloc_sdev() return NULL, being
>>>>> converted to SCSI_SCAN_NO_RESPONSE by scsi_probe_and_add_lun()
>>>>> and thus going through the same code path above.
>>>>>
>>>> Ah. Hmm. Yes, they would.
>>>>
>>>> However, I personally would question this approach, as SPC states
>>>> that
>>>>
>>>>> The REPORT LUNS command (see table 284) requests the device
>>>>> server to return the peripheral device logical unit inventory
>>>>> accessible to the I_T nexus.
>>>> So by plain reading this would mean that you either should modify
>>>> 'REPORT LUNS' to not show the masked LUNs, or set the pqual field to
>>>> '0x10' or '0x11' for those LUNs.
>> We need to distinguish two cases:
>> 1) suboptimal lun masking on the target
>> 2) hardware virtualization without NPIV
>>
>> Regarding 1, one could require fixing lun masking on the target.
>> However, some users cannot or do not want to do it very fine
>> granular. That's why s390 also does deferred device probing ("set
>> online" in sysfs) or even limits bus sensing (cio_ignore).
>>
>> Regarding 2, fixing lun masking on the target does not help because
>> without NPIV, the target cannot distinguish the different virtual
>> initators since they are all behind one shared WWPN (and N-Port_ID).
>> This forces zfcp to implement initiator based lun masking, because
>> only the user can tell which lun to attach to which of the virtual
>> initiators sharing the same physical port. Without that, Linux would
>> attach all luns to all virtual initiators, i.e. share inadvertently.
>>
>>>>> E.g. zfcp does return -ENXIO if the particular LUN was not made
>>>>> known to the unit whitelist (via zfcp sysfs attribute unit_add).
>>>>> If we attach LUN 0 (via unit_add) and trigger a target scan with
>>>>> SCAN_WILD_CARD for the scsi lun (e.g. on remote port recovery), we
>>>>> see exactly above error message for the first LUN in the response
>>>>> of report lun which is not explicitly attached to zfcp.
>>>>> IIRC, other LLDDs such as bfa also do similar stuff
>>>>> [http://marc.info/?l=linux-scsi&m=134489842105383&w=2].
>>>>>
>>>>> For those cases, I think it makes sense to abort
>>>>> scsi_report_lun_scan().
>>>>> Otherwise we would force the LLDD to return -ENXIO for every single
>>>>> LUN reported by report lun but not explicitly added to the LLDD LUN
>>>>> whitelist; and this would likely *flood kernel messages*.
>>>>> Maybe Vaughan's case needs to be distinguished in a patch.
>>>>>
>>>> Well, as mentioned initially, the real issue is that the target
>>>> aborts an INQUIRY while being in 'Unavailable'. Which, according to
>>>> SPC-3 (or later), is a violation of the spec.
>>>>
>>>> So we _could_ just tell them to go away, but admittedly that's bad
>>>> style. Which means we'll have to implement a workaround; the above
>>>> was just a simple way of implementing it. If that's not working of
>>>> course we'll have to do something else.
>>>>
>>> What about this patch:
>>>
>>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>>> index 973a121..01a7d69 100644
>>> --- a/drivers/scsi/scsi_scan.c
>>> +++ b/drivers/scsi/scsi_scan.c
>>> @@ -594,6 +594,19 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
>>>                                       (sshdr.asc == 0x29)) &&
>>>                                      (sshdr.ascq == 0))
>>>                                          continue;
>>> +                               /*
>>> +                                * Some buggy implementations return
>>> +                                * 'target port in unavailable state'
>>> +                                * even on INQUIRY.
>>> +                                * Set peripheral qualifier 3
>>> +                                * for these devices.
>>> +                                */
>>> +                               if ((sshdr.sense_key == NOT_READY) &&
>>> +                                   ((sshdr.asc == 0x04) &&
>>> +                                    (sshdr.ascq == 0x0C))) {
>> style question: lower case hex digits? 0x0c
>>
> Yeah. This is a test, after all ...
>
>> Any reason why you put the conjunction of asc and ascq inside its
>> own brackets instead of having all three (including sense_key) on
>> the same level of one larger conjunction (as the code above does for
>> UA asc 0x28/0x29 ascq 0x00)? Should be semantically equivalent,
>> shouldn't it? But then again, ascq always goes with asc, so they form a
>> kind of pair.
>>
> No reason, just a copy&paste error from the above statement ...
>
>>> +                                   inq_result[0] = 3 << 5;
>>> +                                   return 0;
>>> +                               }
>>>                          }
>>>                  } else {
>>>                          /*
>>>
>>> (watch out, linebreaks mangled and all that).
>>> Should be working for this particular case without interrupting
>>> normal workflow, now should it not?
>> The approach of distinguishing the workaround close to the response
>> of the inquiry sounds good to me. I suppose it won't break zfcp
>> which is good. Unfortunately, I don't know what the ramifications of
>> PQ==3 are (the SPC-4 description sounds good, though), nor enough
>> details about this common code to tell if e.g. the early return is
>> OK (skipping setting sdev->scsi_level near the end of
>> scsi_probe_lun()). But then again, without inquiry reply we cannot
>> get the level from the response. So I think the early return is OK
>> after all.
>> I guess we want to get around "if (result) return -EIO;" but also do
>> not want to execute the parts depending on result==0.
>>
>> SPC-4 says that for PQ==3 the PDT should be set to 0x1f. Do we need
>> to fake this here as well? (I assume the target did not fill in a
>> PDT on its own when replying with sense data.)
>>
>> The clarification on the T10 reflector seems to say that Linux would
>> then accept LUNs with PQ 3, but the target shall not have put LUs
>> with PQ 3 into the LU inventory in the first place?
>> Anyway, I'm not opposed to the workaround.
>>
> Well, first and foremost this is a workaround for buggy array
> firmware. Even if a port is in the 'unavailable' state, the target
> port is still required to respond to an INQUIRY.
> _Not_ doing so leaves us with no indication what's going on here.
>
> The main reason why I chose PQ=3 here is that we'll end up ignoring
> this device in scsi_probe_and_add_lun() later on, which saves me
> coding higher up the stack.
> And, seeing that the device is never actually allocated, the
> modifications we did for the inquiry data will be deleted anyway.
>
> So using PQ=3 here is just a vehicle for telling the system to not
> create a SCSI device at this LUN, _not_ something which has some
> relevance to SPC.
>
> But seeing that this approach raises quite a few issues I've attached
> a different patch.
> Vaughan, could you test with that, too? Should be functionally
> equivalent to the previous one.
>
> Cheers,
>
> Hannes
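
For readers following the PQ/PDT discussion above, here is a minimal
userspace sketch (not the kernel code) of the two things the quoted patch
relies on: matching the sense triple NOT READY / asc 0x04 / ascq 0x0c as one
flat conjunction, and how byte 0 of standard INQUIRY data packs the
peripheral qualifier (bits 7..5) and the peripheral device type (bits 4..0).
All names prefixed with sketch_ or SKETCH_ are hypothetical and exist only
for this illustration; the numeric values are the ones mentioned in the
thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Numeric values as discussed in the thread; the names are local to this sketch. */
#define SKETCH_NOT_READY      0x02  /* sense key NOT READY */
#define SKETCH_PQ_NOT_CAPABLE 0x03  /* peripheral qualifier 011b ("PQ=3") */
#define SKETCH_PDT_UNKNOWN    0x1f  /* peripheral device type: unknown / no device */

/* Hypothetical stand-in for the three sense fields the patch inspects. */
struct sketch_sense {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/*
 * True for NOT READY with "logical unit not accessible, target port in
 * unavailable state" (asc/ascq 04h/0Ch), written as one flat conjunction.
 */
bool sketch_port_unavailable(const struct sketch_sense *s)
{
	return s->sense_key == SKETCH_NOT_READY &&
	       s->asc == 0x04 &&
	       s->ascq == 0x0c;
}

/* Byte 0 of standard INQUIRY data: PQ in bits 7..5, PDT in bits 4..0. */
uint8_t sketch_inq_byte0(uint8_t pq, uint8_t pdt)
{
	return (uint8_t)(((pq & 0x07) << 5) | (pdt & 0x1f));
}
```

With these helpers, sketch_inq_byte0(SKETCH_PQ_NOT_CAPABLE, 0) gives the
same 0x60 that "3 << 5" writes in the patch, while also filling in
SKETCH_PDT_UNKNOWN, as asked about above, would give 0x7f.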



end of thread, other threads:[~2014-02-19  8:28 UTC | newest]

Thread overview: 16+ messages
2013-10-13 17:23 PROBLEM: special sense code asc,ascq=04h,0Ch abort scsi scan in the middle Vaughan Cao
2013-10-14 11:13 ` Hannes Reinecke
2013-10-14 12:51   ` Steffen Maier
2013-10-14 13:18     ` Hannes Reinecke
2013-10-14 13:32       ` Hannes Reinecke
2013-10-14 15:24         ` Steffen Maier
2013-10-16  6:52           ` Hannes Reinecke
2013-10-16  7:26             ` vaughan
2013-10-21  6:07             ` vaughan
2013-10-22 17:05               ` Hannes Reinecke
2013-12-18 13:51               ` Vaughan Cao
2014-02-19  8:29             ` vaughan
2013-10-14 15:18       ` Vaughan Cao
2013-10-15  3:32   ` vaughan
2013-10-15  5:51     ` Hannes Reinecke
2013-10-15 11:46       ` Vaughan Cao
