Re: [PATCH] scsi: Do not break scan luns loop if add single lun failed

From: Steffen Maier <maier@linux.ibm.com>
To: Wenchao Hao <haowenchao@huawei.com>,
	"James E . J . Bottomley" <jejb@linux.ibm.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	James Smart <james.smart@broadcom.com>,
	Dick Kennedy <dick.kennedy@broadcom.com>,
	Nilesh Javali <njavali@marvell.com>,
	GR-QLogic-Storage-Upstream@marvell.com,
	Hannes Reinecke <hare@suse.de>,
	Martin Wilck <martin.wilck@suse.com>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Benjamin Block <bblock@linux.ibm.com>,
	Ming Lei <ming.lei@redhat.com>
Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com>,
	Feilong Lin <linfeilong@huawei.com>, Wu Bo <wubo40@huawei.com>
Subject: Re: [PATCH] scsi: Do not break scan luns loop if add single lun failed
Date: Thu, 30 Dec 2021 18:55:40 +0100	[thread overview]
Message-ID: <aa72bd76-2af5-202d-8a2c-afb5a700b6c0@linux.ibm.com> (raw)
In-Reply-To: <20211225232911.1117843-1-haowenchao@huawei.com>

On 12/26/21 00:29, Wenchao Hao wrote:
> Failed to add a single lun does not mean all luns are unaccessible,
> if we break the scan luns loop, the other luns reported by REPORT LUNS
> command would not be probed any more.
> 
> In this case, we might loss some luns which are accessible.

Could you please add more details about the specific use case, where this 
actually was a problem, for my understanding?

> 
> Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
> ---
>   drivers/scsi/scsi_scan.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 23e1c0acdeae..fee7ce082103 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1476,13 +1476,13 @@ static int scsi_report_lun_scan(struct scsi_target *starget, blist_flags_t bflag
>   				lun, NULL, NULL, rescan, NULL);
>   			if (res == SCSI_SCAN_NO_RESPONSE) {
>   				/*
> -				 * Got some results, but now none, abort.
> +				 * Got some results, but now none, abort this lun

abort => skip ?

>   				 */
>   				sdev_printk(KERN_ERR, sdev,
>   					"Unexpected response"
>   					" from lun %llu while scanning, scan"
>   					" aborted\n", (unsigned long long)lun);

That message would no longer be correct with your change, as it would not abort 
the scan any more.

> -				break;
> +				continue;
>   			}
>   		}
>   	}

Wouldn't this change existing semantics for LLDDs intentionally returning 
-ENXIO from their slave_alloc() callback in certain cases?:

> static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
...
> 	if (shost->hostt->slave_alloc) {
> 		ret = shost->hostt->slave_alloc(sdev);
> 		if (ret) {
> 			/*
> 			 * if LLDD reports slave not present, don't clutter
> 			 * console with alloc failure messages
> 			 */
> 			if (ret == -ENXIO)
> 				display_failure_msg = 0;
> 			goto out_device_destroy;
...
> out_device_destroy:
> 	__scsi_remove_device(sdev);
> out:
> 	if (display_failure_msg)
> 		printk(ALLOC_FAILURE_MSG, __func__);
> 	return NULL;

scsi_probe_and_add_lun() [such as called by scsi_report_lun_scan() for the case 
at hand] converts this case into a SCSI_SCAN_NO_RESPONSE return value.

> static int scsi_probe_and_add_lun(struct scsi_target *starget,
...
> 	int res = SCSI_SCAN_NO_RESPONSE, result_len = 256;
...
> 		sdev = scsi_alloc_sdev(starget, lun, hostdata);
> 	if (!sdev)
> 		goto out;
...
>  out:
> 	return res;

Such as being used by zfcp:

> static int zfcp_scsi_slave_alloc(struct scsi_device *sdev)
> {
...
> 	unit = zfcp_unit_find(port, zfcp_scsi_dev_lun(sdev));
> 	if (unit)
> 		put_device(&unit->dev);
> 
> 	if (!unit && !(allow_lun_scan && npiv)) {
> 		put_device(&port->dev);
> 		return -ENXIO;
                        ^^^^^^

which implements an initiator-based LUN masking that is necessary for shared 
HBAs virtualized without NPIV.
https://www.ibm.com/docs/en/linux-on-systems?topic=devices-manually-configured-fcp-luns

While things might still work, as zfcp now "just" gets (much) more callbacks to 
slave_alloc() it has to end with -ENXIO, the user may get flooded with the 
error(!) sdev_printk on "Unexpected response from LUN ..." in 
scsi_report_lun_scan().
In the worst case, we could get this message now 64k - 1 times in a zfcp 
scenario connected to IBM DS8000 storage being able to map (all) 64k volumes to 
a single initiator (HBA), where the user via zfcp sysfs decided to use only the 
first lun reported (for the vHBA).

Other LLLDs also seem to intentionally return -ENXIO from slave_alloc() 
callbacks, such as but not limited to lpfc or qla2xxx:

> int fc_slave_alloc(struct scsi_device *sdev)
> {
> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
> 
> 	if (!rport || fc_remote_port_chkready(rport))
> 		return -ENXIO;

> static int
> qla2xxx_slave_alloc(struct scsi_device *sdev)
> {
> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
> 
> 	if (!rport || fc_remote_port_chkready(rport))
> 		return -ENXIO;

-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294