linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] usb: uas: fix a plug & unplug racing
@ 2020-01-13  3:30 EJ Hsu
  2020-01-13  9:44 ` Oliver Neukum
  0 siblings, 1 reply; 12+ messages in thread
From: EJ Hsu @ 2020-01-13  3:30 UTC (permalink / raw)
  To: oneukum; +Cc: linux-usb, ejh

When a uas disk is plugged into an external hub, uas_probe()
will be called by the hub thread to do the probe. It will
first create a SCSI host and then do the scan for this host.
During the scan, it will probe the LUN using SCSI INQUERY command
which will be packed in the URB and submitted to uas disk.

There might be a chance that this external hub with uas disk
attached is unplugged during the scan. In this case, uas driver
will fail to submit the URB (due to the NOTATTACHED state of uas
device) and try to put this SCSI command back to request queue
waiting for next chance to run.

In normal case, this cycle will terminate when hub thread gets
disconnection event and calls into uas_disconnect() accordingly.
But in this case, uas_disconnect() will not be called because
hub thread of external hub gets stuck waiting for the completion
of this SCSI command. A deadlock happened.

In this fix, uas will call scsi_scan_host() asynchronously to
avoid the blocking of hub thread.

Signed-off-by: EJ Hsu <ejh@nvidia.com>
---
 drivers/usb/storage/uas.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 95bba3ba6ac6..d367718fef45 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -42,9 +42,11 @@ struct uas_dev_info {
 	unsigned cmd_pipe, status_pipe, data_in_pipe, data_out_pipe;
 	unsigned use_streams:1;
 	unsigned shutdown:1;
+	unsigned scan_pending:1;
 	struct scsi_cmnd *cmnd[MAX_CMNDS];
 	spinlock_t lock;
 	struct work_struct work;
+	struct work_struct scan_work;      /* for async scanning */
 };
 
 enum {
@@ -114,6 +116,20 @@ static void uas_do_work(struct work_struct *work)
 	spin_unlock_irqrestore(&devinfo->lock, flags);
 }
 
+static void uas_scan_work(struct work_struct *work)
+{
+	struct uas_dev_info *devinfo =
+		container_of(work, struct uas_dev_info, scan_work);
+	struct Scsi_Host *shost = usb_get_intfdata(devinfo->intf);
+
+	dev_dbg(&devinfo->intf->dev, "starting scan\n");
+	scsi_scan_host(shost);
+	dev_dbg(&devinfo->intf->dev, "scan complete\n");
+
+	usb_autopm_put_interface(devinfo->intf);
+	devinfo->scan_pending = 0;
+}
+
 static void uas_add_work(struct uas_cmd_info *cmdinfo)
 {
 	struct scsi_pointer *scp = (void *)cmdinfo;
@@ -982,6 +998,7 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
 	init_usb_anchor(&devinfo->data_urbs);
 	spin_lock_init(&devinfo->lock);
 	INIT_WORK(&devinfo->work, uas_do_work);
+	INIT_WORK(&devinfo->scan_work, uas_scan_work);
 
 	result = uas_configure_endpoints(devinfo);
 	if (result)
@@ -998,7 +1015,11 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
 	if (result)
 		goto free_streams;
 
-	scsi_scan_host(shost);
+	/* Submit the delayed_work for SCSI-device scanning */
+	usb_autopm_get_interface_no_resume(intf);
+	devinfo->scan_pending = 1;
+	schedule_work(&devinfo->scan_work);
+
 	return result;
 
 free_streams:
@@ -1166,6 +1187,14 @@ static void uas_disconnect(struct usb_interface *intf)
 	usb_kill_anchored_urbs(&devinfo->data_urbs);
 	uas_zap_pending(devinfo, DID_NO_CONNECT);
 
+	/*
+	 * Prevent SCSI scanning (if it hasn't started yet)
+	 * or wait for the SCSI-scanning routine to stop.
+	 */
+	cancel_work_sync(&devinfo->scan_work);
+	if (devinfo->scan_pending)
+		usb_autopm_put_interface_no_suspend(intf);
+
 	scsi_remove_host(shost);
 	uas_free_streams(devinfo);
 	scsi_host_put(shost);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-13  3:30 [PATCH] usb: uas: fix a plug & unplug racing EJ Hsu
@ 2020-01-13  9:44 ` Oliver Neukum
  2020-01-13 15:21   ` Alan Stern
  2020-01-14  3:28   ` EJ Hsu
  0 siblings, 2 replies; 12+ messages in thread
From: Oliver Neukum @ 2020-01-13  9:44 UTC (permalink / raw)
  To: EJ Hsu; +Cc: linux-usb

Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:

Hi,

first, thank you for diagnosing this unusual bug.

> When a uas disk is plugged into an external hub, uas_probe()
> will be called by the hub thread to do the probe. It will
> first create a SCSI host and then do the scan for this host.
> During the scan, it will probe the LUN using SCSI INQUERY command
> which will be packed in the URB and submitted to uas disk.
> 
> There might be a chance that this external hub with uas disk
> attached is unplugged during the scan. In this case, uas driver
> will fail to submit the URB (due to the NOTATTACHED state of uas
> device) and try to put this SCSI command back to request queue
> waiting for next chance to run.

Isn't that the bug? A command to a detached device should fail.
Could you please elaborate? This issue would not be limited to uas.

> ---
>  drivers/usb/storage/uas.c | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
> index 95bba3ba6ac6..d367718fef45 100644
> --- a/drivers/usb/storage/uas.c
> +++ b/drivers/usb/storage/uas.c
> @@ -42,9 +42,11 @@ struct uas_dev_info {
>  	unsigned cmd_pipe, status_pipe, data_in_pipe, data_out_pipe;
>  	unsigned use_streams:1;
>  	unsigned shutdown:1;
> +	unsigned scan_pending:1;
>  	struct scsi_cmnd *cmnd[MAX_CMNDS];
>  	spinlock_t lock;
>  	struct work_struct work;
> +	struct work_struct scan_work;      /* for async scanning */
>  };
>  
>  enum {
> @@ -114,6 +116,20 @@ static void uas_do_work(struct work_struct *work)
>  	spin_unlock_irqrestore(&devinfo->lock, flags);
>  }
>  
> +static void uas_scan_work(struct work_struct *work)
> +{
> +	struct uas_dev_info *devinfo =
> +		container_of(work, struct uas_dev_info, scan_work);
> +	struct Scsi_Host *shost = usb_get_intfdata(devinfo->intf);
> +
> +	dev_dbg(&devinfo->intf->dev, "starting scan\n");
> +	scsi_scan_host(shost);
> +	dev_dbg(&devinfo->intf->dev, "scan complete\n");
> +
> +	usb_autopm_put_interface(devinfo->intf);

scsi_scan_host() does runtime PM on the SCSI level. There is
no need for us to duplicate that.

> +	devinfo->scan_pending = 0;
> +}
> +
>  static void uas_add_work(struct uas_cmd_info *cmdinfo)
>  {
>  	struct scsi_pointer *scp = (void *)cmdinfo;
> @@ -982,6 +998,7 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
>  	init_usb_anchor(&devinfo->data_urbs);
>  	spin_lock_init(&devinfo->lock);
>  	INIT_WORK(&devinfo->work, uas_do_work);
> +	INIT_WORK(&devinfo->scan_work, uas_scan_work);
>  
>  	result = uas_configure_endpoints(devinfo);
>  	if (result)
> @@ -998,7 +1015,11 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
>  	if (result)
>  		goto free_streams;
>  
> -	scsi_scan_host(shost);
> +	/* Submit the delayed_work for SCSI-device scanning */
> +	usb_autopm_get_interface_no_resume(intf);
> +	devinfo->scan_pending = 1;
> +	schedule_work(&devinfo->scan_work);
> +
>  	return result;
>  Is this approach really necessary
>  free_streams:
> @@ -1166,6 +1187,14 @@ static void uas_disconnect(struct usb_interface *intf)
>  	usb_kill_anchored_urbs(&devinfo->data_urbs);
>  	uas_zap_pending(devinfo, DID_NO_CONNECT);
>  
> +	/*
> +	 * Prevent SCSI scanning (if it hasn't started yet)
> +	 * or wait for the SCSI-scanning routine to stop.
> +	 */
> +	cancel_work_sync(&devinfo->scan_work);
> +	if (devinfo->scan_pending)
> +		usb_autopm_put_interface_no_suspend(intf);

It is not enough to do this in disconnect()
We are guarded against runtime PM, but not against system sleep.
You'd need to handle this in suspend() and resume(), too.
And, unfortunately, the device could be reset from another interface.

Is this approach really necessary? It solves the problem, but
if you want to get it right, the corner cases are ugly.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-13  9:44 ` Oliver Neukum
@ 2020-01-13 15:21   ` Alan Stern
  2020-01-14  3:28   ` EJ Hsu
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Stern @ 2020-01-13 15:21 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: EJ Hsu, linux-usb

On Mon, 13 Jan 2020, Oliver Neukum wrote:

> Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> 
> Hi,
> 
> first, thank you for diagnosing this unusual bug.
> 
> > When a uas disk is plugged into an external hub, uas_probe()
> > will be called by the hub thread to do the probe. It will
> > first create a SCSI host and then do the scan for this host.
> > During the scan, it will probe the LUN using SCSI INQUERY command
> > which will be packed in the URB and submitted to uas disk.
> > 
> > There might be a chance that this external hub with uas disk
> > attached is unplugged during the scan. In this case, uas driver
> > will fail to submit the URB (due to the NOTATTACHED state of uas
> > device) and try to put this SCSI command back to request queue
> > waiting for next chance to run.
> 
> Isn't that the bug? A command to a detached device should fail.
> Could you please elaborate? This issue would not be limited to uas.

> It is not enough to do this in disconnect()
> We are guarded against runtime PM, but not against system sleep.
> You'd need to handle this in suspend() and resume(), too.
> And, unfortunately, the device could be reset from another interface.

This is more or less a copy of the way usb-storage works.  That driver
doesn't have any protection here against suspend/resume or reset, and
it's not clear that any such protection is needed.

> Is this approach really necessary? It solves the problem, but
> if you want to get it right, the corner cases are ugly.

Minimizing the amount of time spent running in the context of the hub 
thread is generally a good thing to do.

Alan Stern


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-13  9:44 ` Oliver Neukum
  2020-01-13 15:21   ` Alan Stern
@ 2020-01-14  3:28   ` EJ Hsu
  2020-01-14 14:41     ` Oliver Neukum
  1 sibling, 1 reply; 12+ messages in thread
From: EJ Hsu @ 2020-01-14  3:28 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: linux-usb, Alan Stern

Oliver Neukum wrote:

> Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> 
> Isn't that the bug? A command to a detached device should fail.
> Could you please elaborate? This issue would not be limited to uas.
> 
In the case I mentioned, the hub thread of external hub running 
uas_probe() will get stuck waiting for the completion of scsi scan. 

The scsi scan will try to probe a single LUN using a SCSI INQUIRY.
If the external hub has been unplugged before LUN probe, the device 
state of uas device will be set to USB_STATE_NOTATTACHED by the 
root hub thread. So, all the following calls to usb_submit_urb() in 
uas driver will return -NODEV, and accordingly uas_queuecommand_lck() 
will return SCSI_MLQUEUE_DEVICE_BUSY to scsi_request_fn().

scsi_request_fn() then puts this scsi command back into request queue.
Because this scsi device is just created and during LUN probe process, 
this scsi command is the only one in the request queue. So, it will be picked
up soon and dispatched to uas driver again. This cycle will continue until
uas_disconnect() is called and its "resetting" flag is set. However, the 
hub thread of external hub still got stuck waiting for the completion of
this scsi command, and may not be able to run uas_disconnect(). 
A deadlock happened.

> > +static void uas_scan_work(struct work_struct *work) {
> > +     struct uas_dev_info *devinfo =
> > +             container_of(work, struct uas_dev_info, scan_work);
> > +     struct Scsi_Host *shost = usb_get_intfdata(devinfo->intf);
> > +
> > +     dev_dbg(&devinfo->intf->dev, "starting scan\n");
> > +     scsi_scan_host(shost);
> > +     dev_dbg(&devinfo->intf->dev, "scan complete\n");
> > +
> > +     usb_autopm_put_interface(devinfo->intf);
> 
> scsi_scan_host() does runtime PM on the SCSI level. There is no need for us to
> duplicate that.
>

In my opinion, if scsi_scan_host() will be run asynchronously, this interface 
needs to be guarded against runtime PM between uas_probe() & uas_scan_work().
 
> 
> It is not enough to do this in disconnect() We are guarded against runtime PM,
> but not against system sleep.
> You'd need to handle this in suspend() and resume(), too.
> And, unfortunately, the device could be reset from another interface.
> 

As Allen said, this is a copy of the way usb-storage works and I did not see any
related protection in usb-storage. But I will do more check on it. 

Thanks,
EJ
--nvpublic

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-14  3:28   ` EJ Hsu
@ 2020-01-14 14:41     ` Oliver Neukum
  2020-01-14 15:04       ` Alan Stern
  0 siblings, 1 reply; 12+ messages in thread
From: Oliver Neukum @ 2020-01-14 14:41 UTC (permalink / raw)
  To: EJ Hsu; +Cc: linux-usb, Alan Stern

Am Dienstag, den 14.01.2020, 03:28 +0000 schrieb EJ Hsu:
> Oliver Neukum wrote:
> 
> > Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> > 
> > Isn't that the bug? A command to a detached device should fail.
> > Could you please elaborate? This issue would not be limited to uas.
> > 
> 
> In the case I mentioned, the hub thread of external hub running 
> uas_probe() will get stuck waiting for the completion of scsi scan. 
> 
> The scsi scan will try to probe a single LUN using a SCSI INQUIRY.
> If the external hub has been unplugged before LUN probe, the device 
> state of uas device will be set to USB_STATE_NOTATTACHED by the 
> root hub thread. So, all the following calls to usb_submit_urb() in 
> uas driver will return -NODEV, and accordingly uas_queuecommand_lck() 
> will return SCSI_MLQUEUE_DEVICE_BUSY to scsi_request_fn().

And that looks like the root cause. The queue isn't busy.
It is dead.

> scsi_request_fn() then puts this scsi command back into request queue.
> Because this scsi device is just created and during LUN probe process, 
> this scsi command is the only one in the request queue. So, it will be picked
> up soon and dispatched to uas driver again. This cycle will continue until
> uas_disconnect() is called and its "resetting" flag is set. However, the 
> hub thread of external hub still got stuck waiting for the completion of
> this scsi command, and may not be able to run uas_disconnect(). 
> A deadlock happened.

I see. But we are working around insufficient error reporting in the
SCSI midlayer.

> > > +static void uas_scan_work(struct work_struct *work) {
> > > +     struct uas_dev_info *devinfo =
> > > +             container_of(work, struct uas_dev_info, scan_work);
> > > +     struct Scsi_Host *shost = usb_get_intfdata(devinfo->intf);
> > > +
> > > +     dev_dbg(&devinfo->intf->dev, "starting scan\n");
> > > +     scsi_scan_host(shost);
> > > +     dev_dbg(&devinfo->intf->dev, "scan complete\n");
> > > +
> > > +     usb_autopm_put_interface(devinfo->intf);
> > 
> > scsi_scan_host() does runtime PM on the SCSI level. There is no need for us to
> > duplicate that.
> > 
> 
> In my opinion, if scsi_scan_host() will be run asynchronously, this interface 
> needs to be guarded against runtime PM between uas_probe() & uas_scan_work().

Yes it does. But it has a child, the SCSI host, which has an elevated
count. It is already guarded.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-14 14:41     ` Oliver Neukum
@ 2020-01-14 15:04       ` Alan Stern
  2020-01-15  9:31         ` Oliver Neukum
  0 siblings, 1 reply; 12+ messages in thread
From: Alan Stern @ 2020-01-14 15:04 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: EJ Hsu, linux-usb

On Tue, 14 Jan 2020, Oliver Neukum wrote:

> Am Dienstag, den 14.01.2020, 03:28 +0000 schrieb EJ Hsu:
> > Oliver Neukum wrote:
> > 
> > > Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> > > 
> > > Isn't that the bug? A command to a detached device should fail.
> > > Could you please elaborate? This issue would not be limited to uas.
> > > 
> > 
> > In the case I mentioned, the hub thread of external hub running 
> > uas_probe() will get stuck waiting for the completion of scsi scan. 
> > 
> > The scsi scan will try to probe a single LUN using a SCSI INQUIRY.
> > If the external hub has been unplugged before LUN probe, the device 
> > state of uas device will be set to USB_STATE_NOTATTACHED by the 
> > root hub thread. So, all the following calls to usb_submit_urb() in 
> > uas driver will return -NODEV, and accordingly uas_queuecommand_lck() 
> > will return SCSI_MLQUEUE_DEVICE_BUSY to scsi_request_fn().
> 
> And that looks like the root cause. The queue isn't busy.
> It is dead.

No.  The discussion has gotten a little confused.  EJ's point is that
if SCSI scanning takes place in the context of the hub worker thread,
then that thread won't be available to process a disconnect
notification.  The device will be unplugged, but the kernel won't 
realize it until the SCSI scanning is finished.

> > scsi_request_fn() then puts this scsi command back into request queue.
> > Because this scsi device is just created and during LUN probe process, 
> > this scsi command is the only one in the request queue. So, it will be picked
> > up soon and dispatched to uas driver again. This cycle will continue until
> > uas_disconnect() is called and its "resetting" flag is set. However, the 
> > hub thread of external hub still got stuck waiting for the completion of
> > this scsi command, and may not be able to run uas_disconnect(). 
> > A deadlock happened.
> 
> I see. But we are working around insufficient error reporting in the
> SCSI midlayer.

No, the error reporting there is correct.  URBs will complete with
errors like -EPROTO but no other indication that the device is gone, so
the midlayer believes that a retry is appropriate.

Perhaps uas should treat -EPROTO, -EILSEQ, and -ETIME as fatal errors.

Alan Stern


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-14 15:04       ` Alan Stern
@ 2020-01-15  9:31         ` Oliver Neukum
  2020-01-15 15:17           ` Alan Stern
  2020-01-15 15:54           ` EJ Hsu
  0 siblings, 2 replies; 12+ messages in thread
From: Oliver Neukum @ 2020-01-15  9:31 UTC (permalink / raw)
  To: Alan Stern; +Cc: EJ Hsu, linux-usb

Am Dienstag, den 14.01.2020, 10:04 -0500 schrieb Alan Stern:
> On Tue, 14 Jan 2020, Oliver Neukum wrote:
> 
> > Am Dienstag, den 14.01.2020, 03:28 +0000 schrieb EJ Hsu:
> > > Oliver Neukum wrote:
> > > 
> > > > Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> > > > 
> > > > Isn't that the bug? A command to a detached device should fail.
> > > > Could you please elaborate? This issue would not be limited to uas.
> > > > 
> > > 
> > > In the case I mentioned, the hub thread of external hub running 
> > > uas_probe() will get stuck waiting for the completion of scsi scan. 
> > > 
> > > The scsi scan will try to probe a single LUN using a SCSI INQUIRY.
> > > If the external hub has been unplugged before LUN probe, the device 
> > > state of uas device will be set to USB_STATE_NOTATTACHED by the 
> > > root hub thread. So, all the following calls to usb_submit_urb() in 
> > > uas driver will return -NODEV, and accordingly uas_queuecommand_lck() 
> > > will return SCSI_MLQUEUE_DEVICE_BUSY to scsi_request_fn().
> > 
> > And that looks like the root cause. The queue isn't busy.
> > It is dead.
> 
> No.  The discussion has gotten a little confused.  EJ's point is that
> if SCSI scanning takes place in the context of the hub worker thread,
> then that thread won't be available to process a disconnect
> notification.  The device will be unplugged, but the kernel won't 
> realize it until the SCSI scanning is finished.

OK, I think we have two possible code paths at least

First:
uas_queuecommand_lck() -> uas_submit_urbs() -> -ENODEV -> return
SCSI_MLQUEUE_DEVICE_BUSY

That looks wrong to me.

Second:
The submission actually works and we eventually terminate the URB
with an error. In that case nothing happens and eventually SCSI core
retries, times out and does error handling. 

> > > scsi_request_fn() then puts this scsi command back into request queue.
> > > Because this scsi device is just created and during LUN probe process, 
> > > this scsi command is the only one in the request queue. So, it will be picked
> > > up soon and dispatched to uas driver again. This cycle will continue until
> > > uas_disconnect() is called and its "resetting" flag is set. However, the 
> > > hub thread of external hub still got stuck waiting for the completion of
> > > this scsi command, and may not be able to run uas_disconnect(). 
> > > A deadlock happened.
> > 
> > I see. But we are working around insufficient error reporting in the
> > SCSI midlayer.
> 
> No, the error reporting there is correct.  URBs will complete with
> errors like -EPROTO but no other indication that the device is gone, so
> the midlayer believes that a retry is appropriate.
> 
> Perhaps uas should treat -EPROTO, -EILSEQ, and -ETIME as fatal errors.

They could happen due to bad cables.

Now, another work queue would solve the second error case, but I think
the first error case still exists and we would paper over it.

	Regards
		Oliver
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-15  9:31         ` Oliver Neukum
@ 2020-01-15 15:17           ` Alan Stern
  2020-01-15 15:54           ` EJ Hsu
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Stern @ 2020-01-15 15:17 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: EJ Hsu, linux-usb

On Wed, 15 Jan 2020, Oliver Neukum wrote:

> Am Dienstag, den 14.01.2020, 10:04 -0500 schrieb Alan Stern:
> > On Tue, 14 Jan 2020, Oliver Neukum wrote:
> > 
> > > Am Dienstag, den 14.01.2020, 03:28 +0000 schrieb EJ Hsu:
> > > > Oliver Neukum wrote:
> > > > 
> > > > > Am Sonntag, den 12.01.2020, 19:30 -0800 schrieb EJ Hsu:
> > > > > 
> > > > > Isn't that the bug? A command to a detached device should fail.
> > > > > Could you please elaborate? This issue would not be limited to uas.
> > > > > 
> > > > 
> > > > In the case I mentioned, the hub thread of external hub running 
> > > > uas_probe() will get stuck waiting for the completion of scsi scan. 
> > > > 
> > > > The scsi scan will try to probe a single LUN using a SCSI INQUIRY.
> > > > If the external hub has been unplugged before LUN probe, the device 
> > > > state of uas device will be set to USB_STATE_NOTATTACHED by the 
> > > > root hub thread. So, all the following calls to usb_submit_urb() in 
> > > > uas driver will return -NODEV, and accordingly uas_queuecommand_lck() 
> > > > will return SCSI_MLQUEUE_DEVICE_BUSY to scsi_request_fn().
> > > 
> > > And that looks like the root cause. The queue isn't busy.
> > > It is dead.
> > 
> > No.  The discussion has gotten a little confused.  EJ's point is that
> > if SCSI scanning takes place in the context of the hub worker thread,
> > then that thread won't be available to process a disconnect
> > notification.  The device will be unplugged, but the kernel won't 
> > realize it until the SCSI scanning is finished.
> 
> OK, I think we have two possible code paths at least
> 
> First:
> uas_queuecommand_lck() -> uas_submit_urbs() -> -ENODEV -> return
> SCSI_MLQUEUE_DEVICE_BUSY
> 
> That looks wrong to me.

Agreed.  Although it isn't the subject of this patch.

> Second:
> The submission actually works and we eventually terminate the URB
> with an error. In that case nothing happens and eventually SCSI core
> retries, times out and does error handling. 
> 
> > > > scsi_request_fn() then puts this scsi command back into request queue.
> > > > Because this scsi device is just created and during LUN probe process, 
> > > > this scsi command is the only one in the request queue. So, it will be picked
> > > > up soon and dispatched to uas driver again. This cycle will continue until
> > > > uas_disconnect() is called and its "resetting" flag is set. However, the 
> > > > hub thread of external hub still got stuck waiting for the completion of
> > > > this scsi command, and may not be able to run uas_disconnect(). 
> > > > A deadlock happened.
> > > 
> > > I see. But we are working around insufficient error reporting in the
> > > SCSI midlayer.
> > 
> > No, the error reporting there is correct.  URBs will complete with
> > errors like -EPROTO but no other indication that the device is gone, so
> > the midlayer believes that a retry is appropriate.
> > 
> > Perhaps uas should treat -EPROTO, -EILSEQ, and -ETIME as fatal errors.
> 
> They could happen due to bad cables.
> 
> Now, another work queue would solve the second error case, but I think
> the first error case still exists and we would paper over it.

Then you do agree with this patch, and you or EJ ought to create 
another patch to handle the first error case.

Alan Stern


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-15  9:31         ` Oliver Neukum
  2020-01-15 15:17           ` Alan Stern
@ 2020-01-15 15:54           ` EJ Hsu
  2020-01-20  9:38             ` Oliver Neukum
  1 sibling, 1 reply; 12+ messages in thread
From: EJ Hsu @ 2020-01-15 15:54 UTC (permalink / raw)
  To: Oliver Neukum, Alan Stern; +Cc: linux-usb

Oliver Neukum wrote:

> Am Dienstag, den 14.01.2020, 10:04 -0500 schrieb Alan Stern:
> >
> > No.  The discussion has gotten a little confused.  EJ's point is that
> > if SCSI scanning takes place in the context of the hub worker thread,
> > then that thread won't be available to process a disconnect
> > notification.  The device will be unplugged, but the kernel won't
> > realize it until the SCSI scanning is finished.
> 
> OK, I think we have two possible code paths at least
> 
> First:
> uas_queuecommand_lck() -> uas_submit_urbs() -> -ENODEV -> return
> SCSI_MLQUEUE_DEVICE_BUSY
> 
> That looks wrong to me.
> 
> Second:
> The submission actually works and we eventually terminate the URB with an
> error. In that case nothing happens and eventually SCSI core retries, times out
> and does error handling.
> 
Actually, the second case might happen when we connect uas disk directly to 
root hub. In this case, the root hub thread got stuck waiting for the completion
of the scsi command. It might not be able to handle the disconnection event
and set uas disk state to USB_STATE_NOTATTACHED accordingly.

So, the URB submission in uas driver works but nothing happens, and scsi layer
will time out this operation and retry it several times. Eventually the root hub 
thread will continue to run when it got the result from scsi layer. But the whole
process might take about 60 seconds.

That's another reason why I would like to run scsi_scan_host() asynchronously
and keep the hub thread free.

> > In my opinion, if scsi_scan_host() will be run asynchronously, this 
> > interface needs to be guarded against runtime PM between uas_probe() & uas_scan_work().
> Yes it does. But it has a child, the SCSI host, which has an elevated count. It is already guarded.

I just checked the code, and the reference count will be incremented in scsi_scan_host(), 
precisely speaking, in scsi_autopm_get_host().
So, I still think we need to manually add reference count of interface here.
Please correct me if there is anything wrong.

Thanks,
EJ
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information.  Any unauthorized review, use, disclosure or distribution
is prohibited.  If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-15 15:54           ` EJ Hsu
@ 2020-01-20  9:38             ` Oliver Neukum
  2020-01-21 11:29               ` EJ Hsu
  0 siblings, 1 reply; 12+ messages in thread
From: Oliver Neukum @ 2020-01-20  9:38 UTC (permalink / raw)
  To: EJ Hsu, Alan Stern; +Cc: linux-usb

Am Mittwoch, den 15.01.2020, 15:54 +0000 schrieb EJ Hsu:

Hi,

> That's another reason why I would like to run scsi_scan_host() asynchronously
> and keep the hub thread free.

Very well. I must say I don't like it, but I have no good alternative.

> > > In my opinion, if scsi_scan_host() will be run asynchronously, this 
> > > interface needs to be guarded against runtime PM between uas_probe() & uas_scan_work().
> > 
> > Yes it does. But it has a child, the SCSI host, which has an elevated count. It is already guarded.
> 
> I just checked the code, and the reference count will be incremented in scsi_scan_host(), 
> precisely speaking, in scsi_autopm_get_host().
> So, I still think we need to manually add reference count of interface here.
> Please correct me if there is anything wrong.

AFAICT the generic power model will not suspend a parent while a child
is active. The includes SCSI children of USB parents. So I think we are
safe. Could you resubmit your patch without the PM counter manipulation
and we can fix any issue that may exist in theory later?

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-20  9:38             ` Oliver Neukum
@ 2020-01-21 11:29               ` EJ Hsu
  2020-01-22  9:52                 ` EJ Hsu
  0 siblings, 1 reply; 12+ messages in thread
From: EJ Hsu @ 2020-01-21 11:29 UTC (permalink / raw)
  To: Oliver Neukum, Alan Stern; +Cc: linux-usb

Oliver Neukum wrote:

> > I just checked the code, and the reference count will be incremented
> > in scsi_scan_host(), precisely speaking, in scsi_autopm_get_host().
> > So, I still think we need to manually add reference count of interface here.
> > Please correct me if there is anything wrong.
> 
> AFAICT the generic power model will not suspend a parent while a child is
> active. The includes SCSI children of USB parents. So I think we are safe. Could
> you resubmit your patch without the PM counter manipulation and we can fix
> any issue that may exist in theory later?
> 

The child_count of usb device should be incremented in the scsi_scan_host().
I can simulate this situation by forcibly enable the autosuspend of uas driver
and add a delay before scsi_scan_host() is called in asynchronous way.

But for now, as uas driver does not support autosuspend, it is indeed safe to
remove the PM counter manipulation. If you have concern about it, it's OK 
to me to re-submit it.

Thanks,
EJ
--nvpublic


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] usb: uas: fix a plug & unplug racing
  2020-01-21 11:29               ` EJ Hsu
@ 2020-01-22  9:52                 ` EJ Hsu
  0 siblings, 0 replies; 12+ messages in thread
From: EJ Hsu @ 2020-01-22  9:52 UTC (permalink / raw)
  To: Oliver Neukum, Alan Stern; +Cc: linux-usb

EJ Hsu wrote:

> Oliver Neukum wrote:
> 
> > > I just checked the code, and the reference count will be incremented
> > > in scsi_scan_host(), precisely speaking, in scsi_autopm_get_host().
> > > So, I still think we need to manually add reference count of interface here.
> > > Please correct me if there is anything wrong.
> >
> > AFAICT the generic power model will not suspend a parent while a child
> > is active. The includes SCSI children of USB parents. So I think we
> > are safe. Could you resubmit your patch without the PM counter
> > manipulation and we can fix any issue that may exist in theory later?
> >
> 
> The child_count of usb device should be incremented in the scsi_scan_host().
> I can simulate this situation by forcibly enable the autosuspend of uas driver
> and add a delay before scsi_scan_host() is called in asynchronous way.
> 
> But for now, as uas driver does not support autosuspend, it is indeed safe to
> remove the PM counter manipulation. If you have concern about it, it's OK to
> me to re-submit it.
> 

By the way, if autosuspend will be enabled in uas driver in the future, 
the PM counter manipulation should be added before scsi_add_host(), 
just as MSC driver did. Otherwise, the child_count of usb device might be
accidentally decremented. (the rpm "idle" message will be propagated
from scsi host to usb interface and usb device)

Thanks,
EJ
--nvpublic

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-01-22  9:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13  3:30 [PATCH] usb: uas: fix a plug & unplug racing EJ Hsu
2020-01-13  9:44 ` Oliver Neukum
2020-01-13 15:21   ` Alan Stern
2020-01-14  3:28   ` EJ Hsu
2020-01-14 14:41     ` Oliver Neukum
2020-01-14 15:04       ` Alan Stern
2020-01-15  9:31         ` Oliver Neukum
2020-01-15 15:17           ` Alan Stern
2020-01-15 15:54           ` EJ Hsu
2020-01-20  9:38             ` Oliver Neukum
2020-01-21 11:29               ` EJ Hsu
2020-01-22  9:52                 ` EJ Hsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).