All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
@ 2014-03-28 17:47 James Bottomley
  2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: James Bottomley @ 2014-03-28 17:47 UTC (permalink / raw)
  To: SCSI development list; +Cc: USB list, Alan Stern

This is a set of three patches we agreed to a while ago to eliminate a
USB deadlock.  I did rewrite the first patch, if it could be reviewed
and tested.

Thanks,

James

---

Alan Stern (1):
  [SCSI] Fix command result state propagation

James Bottomley (2):
  [SCSI] Fix abort state memory problem
  [SCSI] Fix spurious request sense in error handling

 drivers/scsi/scsi_error.c | 19 ++++++++++++++++---
 drivers/scsi/scsi_lib.c   |  1 +
 2 files changed, 17 insertions(+), 3 deletions(-)



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/3] [SCSI] Fix abort state memory problem
  2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
@ 2014-03-28 17:49 ` James Bottomley
  2014-03-31  6:58   ` Hannes Reinecke
  2014-03-28 17:50 ` [PATCH 2/3] [SCSI] Fix spurious request sense in error handling James Bottomley
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: James Bottomley @ 2014-03-28 17:49 UTC (permalink / raw)
  To: SCSI development list; +Cc: USB list, Alan Stern

The abort state is being stored persistently across commands, meaning if the
command times out a second time, the error handler thinks an abort is still
pending.  Fix this by properly resetting the abort flag after the abort is
finished.

Signed-off-by: James Bottomley <JBottomley@Parallels.com>
---
 drivers/scsi/scsi_error.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..b9f3b07 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -869,10 +869,13 @@ static int scsi_try_bus_device_reset(struct scsi_cmnd *scmd)
 
 static int scsi_try_to_abort_cmd(struct scsi_host_template *hostt, struct scsi_cmnd *scmd)
 {
-	if (!hostt->eh_abort_handler)
-		return FAILED;
+	int retval = FAILED;
+
+	if (hostt->eh_abort_handler)
+		retval = hostt->eh_abort_handler(scmd);
 
-	return hostt->eh_abort_handler(scmd);
+	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
+	return retval;
 }
 
 static void scsi_abort_eh_cmnd(struct scsi_cmnd *scmd)
-- 
1.9.1




^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/3] [SCSI] Fix spurious request sense in error handling
  2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
  2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
@ 2014-03-28 17:50 ` James Bottomley
  2014-03-31  6:59   ` Hannes Reinecke
  2014-03-28 17:51 ` [PATCH 3/3] [SCSI] Fix command result state propagation James Bottomley
  2014-03-28 19:29 ` [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Alan Stern
  3 siblings, 1 reply; 31+ messages in thread
From: James Bottomley @ 2014-03-28 17:50 UTC (permalink / raw)
  To: SCSI development list; +Cc: USB list, Alan Stern

We unconditionally execute scsi_eh_get_sense() to make sure all failed
commands that should have sense attached, do.  However, the routine forgets
that some commands, because of the way they fail, will not have any sense code
... we should not bother them with a REQUEST_SENSE command.  Fix this by
testing to see if we actually got a CHECK_CONDITION return and skip asking for
sense if we don't.

Tested-by: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>
Signed-off-by: James Bottomley <JBottomley-MU7nAjRaF3makBO8gow8eQ@public.gmane.org>
---
 drivers/scsi/scsi_error.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index b9f3b07..221a5bc 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1160,6 +1160,15 @@ int scsi_eh_get_sense(struct list_head *work_q,
 					     __func__));
 			break;
 		}
+		if (status_byte(scmd->result) != CHECK_CONDITION)
+			/*
+			 * don't request sense if there's no check condition
+			 * status because the error we're processing isn't one
+			 * that has a sense code (and some devices get
+			 * confused by sense requests out of the blue)
+			 */
+			continue;
+
 		SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
 						  "%s: requesting sense\n",
 						  current->comm));
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/3] [SCSI] Fix command result state propagation
  2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
  2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
  2014-03-28 17:50 ` [PATCH 2/3] [SCSI] Fix spurious request sense in error handling James Bottomley
@ 2014-03-28 17:51 ` James Bottomley
  2014-03-31  7:00   ` Hannes Reinecke
  2014-03-28 19:29 ` [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Alan Stern
  3 siblings, 1 reply; 31+ messages in thread
From: James Bottomley @ 2014-03-28 17:51 UTC (permalink / raw)
  To: SCSI development list; +Cc: USB list, Alan Stern

From: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>

We're seeing a case where the contents of scmd->result isn't being reset after
a SCSI command encounters an error, is resubmitted, times out and then gets
handled.  The error handler acts on the stale result of the previous error
instead of the timeout.  Fix this by properly zeroing the scmd->status before
the command is resubmitted.

Signed-off-by: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>
Signed-off-by: James Bottomley <JBottomley-MU7nAjRaF3makBO8gow8eQ@public.gmane.org>
---
 drivers/scsi/scsi_error.c | 1 +
 drivers/scsi/scsi_lib.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 221a5bc..335eaf4 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -927,6 +927,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
 	memset(scmd->cmnd, 0, BLK_MAX_CDB);
 	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
 	scmd->request->next_rq = NULL;
+	scmd->result = 0;
 
 	if (sense_bytes) {
 		scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5681c05..b1f4867 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -137,6 +137,7 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	 * lock such that the kblockd_schedule_work() call happens
 	 * before blk_cleanup_queue() finishes.
 	 */
+	cmd->result = 0;
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	kblockd_schedule_work(q, &device->requeue_work);
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
                   ` (2 preceding siblings ...)
  2014-03-28 17:51 ` [PATCH 3/3] [SCSI] Fix command result state propagation James Bottomley
@ 2014-03-28 19:29 ` Alan Stern
  2014-03-31  7:22   ` Hannes Reinecke
  3 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-03-28 19:29 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI development list, USB list

On Fri, 28 Mar 2014, James Bottomley wrote:

> This is a set of three patches we agreed to a while ago to eliminate a
> USB deadlock.  I did rewrite the first patch, if it could be reviewed
> and tested.

I tested all three patches under the same conditions as before, and 
they all worked correctly.

In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
entirely clear.  This boils down to two questions, which I don't 
know the answers to:

	What should happen in scmd_eh_abort_handler() if
	scsi_host_eh_past_deadline() returns true and thereby
	prevents scsi_try_to_abort_cmd() from being called?
	The flag wouldn't get cleared then.

	What should happen if some other pathway manages to call
	scsi_try_to_abort_cmd() while scmd->abort_work is still
	sitting on the work queue?  The command would be aborted
	and the flag would be cleared, but the queued work would
	remain.  Can this ever happen?

Maybe scmd_eh_abort_handler() should check the flag before doing
anything.  Is there any sort of sychronization to prevent the same
incarnation of a command from being aborted twice (or by two different
threads at the same time)?  If there is, it isn't obvious.

(Also, what's going on at the start of scsi_abort_command()?  Contrary
to what one might expect, the first part of the function _cancels_ a
scheduled abort.  And it does so without clearing the
SCSI_EH_ABORT_SCHEDULED flag.)

Maybe in all these scenarios, the command is doomed to fail anyway so 
these questions don't make any real difference.  I don't know...

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] [SCSI] Fix abort state memory problem
  2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
@ 2014-03-31  6:58   ` Hannes Reinecke
  2014-03-31 13:06     ` Alan Stern
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-03-31  6:58 UTC (permalink / raw)
  To: James Bottomley, SCSI development list; +Cc: USB list, Alan Stern

On 03/28/2014 06:49 PM, James Bottomley wrote:
> The abort state is being stored persistently across commands, meaning if the
> command times out a second time, the error handler thinks an abort is still
> pending.  Fix this by properly resetting the abort flag after the abort is
> finished.
> 
> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
> ---
>  drivers/scsi/scsi_error.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..b9f3b07 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -869,10 +869,13 @@ static int scsi_try_bus_device_reset(struct scsi_cmnd *scmd)
>  
>  static int scsi_try_to_abort_cmd(struct scsi_host_template *hostt, struct scsi_cmnd *scmd)
>  {
> -	if (!hostt->eh_abort_handler)
> -		return FAILED;
> +	int retval = FAILED;
> +
> +	if (hostt->eh_abort_handler)
> +		retval = hostt->eh_abort_handler(scmd);
>  
> -	return hostt->eh_abort_handler(scmd);
> +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> +	return retval;
>  }
>  
>  static void scsi_abort_eh_cmnd(struct scsi_cmnd *scmd)
> 
Hmm. No, I don't think this is correct.

SCSI_EH_ABORT_SCHEDULED signifies whether scmd_eh_abort_handler()
needs to run. As such it needs to be reset when the workqueue item
is triggered.

Can you try with this instead?

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..9694803 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -120,6 +120,8 @@ scmd_eh_abort_handler(struct work_struct *work)
        struct scsi_device *sdev = scmd->device;
        int rtn;

+       scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
+
        if (scsi_host_eh_past_deadline(sdev->host)) {
                SCSI_LOG_ERROR_RECOVERY(3,
                        scmd_printk(KERN_INFO, scmd,


Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] [SCSI] Fix spurious request sense in error handling
  2014-03-28 17:50 ` [PATCH 2/3] [SCSI] Fix spurious request sense in error handling James Bottomley
@ 2014-03-31  6:59   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-03-31  6:59 UTC (permalink / raw)
  To: James Bottomley, SCSI development list; +Cc: USB list, Alan Stern

On 03/28/2014 06:50 PM, James Bottomley wrote:
> We unconditionally execute scsi_eh_get_sense() to make sure all failed
> commands that should have sense attached, do.  However, the routine forgets
> that some commands, because of the way they fail, will not have any sense code
> ... we should not bother them with a REQUEST_SENSE command.  Fix this by
> testing to see if we actually got a CHECK_CONDITION return and skip asking for
> sense if we don't.
> 
> Tested-by: Alan Stern <stern@rowland.harvard.edu>
> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Acked-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] [SCSI] Fix command result state propagation
  2014-03-28 17:51 ` [PATCH 3/3] [SCSI] Fix command result state propagation James Bottomley
@ 2014-03-31  7:00   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-03-31  7:00 UTC (permalink / raw)
  To: James Bottomley, SCSI development list; +Cc: USB list, Alan Stern

On 03/28/2014 06:51 PM, James Bottomley wrote:
> From: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>
> 
> We're seeing a case where the contents of scmd->result isn't being reset after
> a SCSI command encounters an error, is resubmitted, times out and then gets
> handled.  The error handler acts on the stale result of the previous error
> instead of the timeout.  Fix this by properly zeroing the scmd->status before
> the command is resubmitted.
> 
> Signed-off-by: Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org>
> Signed-off-by: James Bottomley <JBottomley-MU7nAjRaF3makBO8gow8eQ@public.gmane.org>

<grin>

Acked-by: Hannes Reinecke <hare-l3A5Bk7waGM@public.gmane.org>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-28 19:29 ` [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Alan Stern
@ 2014-03-31  7:22   ` Hannes Reinecke
  2014-03-31 13:33     ` Alan Stern
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-03-31  7:22 UTC (permalink / raw)
  To: Alan Stern, James Bottomley; +Cc: SCSI development list, USB list

On 03/28/2014 08:29 PM, Alan Stern wrote:
> On Fri, 28 Mar 2014, James Bottomley wrote:
> 
>> This is a set of three patches we agreed to a while ago to eliminate a
>> USB deadlock.  I did rewrite the first patch, if it could be reviewed
>> and tested.
> 
> I tested all three patches under the same conditions as before, and 
> they all worked correctly.
> 
> In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
> entirely clear.  This boils down to two questions, which I don't 
> know the answers to:
> 
> 	What should happen in scmd_eh_abort_handler() if
> 	scsi_host_eh_past_deadline() returns true and thereby
> 	prevents scsi_try_to_abort_cmd() from being called?
> 	The flag wouldn't get cleared then.
> 
Ah. Correct. But that's due to the first patch being incorrect.
Cf my response to the original first patch.

> 	What should happen if some other pathway manages to call
> 	scsi_try_to_abort_cmd() while scmd->abort_work is still
> 	sitting on the work queue?  The command would be aborted
> 	and the flag would be cleared, but the queued work would
> 	remain.  Can this ever happen?
> 
Not that I could see.
A command abort is only ever triggered by the request timeout from
the block layer. And the timer is _not_ rearmed once the timeout
function (here: scsi_times_out()) is called.
Hence I fail to see how it can be called concurrently.

> Maybe scmd_eh_abort_handler() should check the flag before doing
> anything.  Is there any sort of sychronization to prevent the same
> incarnation of a command from being aborted twice (or by two different
> threads at the same time)?  If there is, it isn't obvious.
> 
See above. scsi_times_out() will only ever called once.
What can happen, though, is that _theoretically_ the LLDD might
decide to call ->done() on a timed out command when
scsi_eh_abort_handler() is still pending.


> (Also, what's going on at the start of scsi_abort_command()?  Contrary
> to what one might expect, the first part of the function _cancels_ a
> scheduled abort.  And it does so without clearing the
> SCSI_EH_ABORT_SCHEDULED flag.)
> 
The original idea was this:

SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
Point is, any command abort is only ever send for a timed-out
command. And the main problem for a timed-out command is that we
simply _do not_ know what happened for that command. So _if_ a
command timed out, _and_ we've send an abort, _and_ the command
times out _again_ we'll be running into an endless loop between
timeout and aborting, and never returning the command at all.

So to prevent this we should set a marker on that command telling it
to _not_ try to abort the command again.
Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:

- A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
- abort _succeeds_
- The same command times out a second time, notifies
  that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
  scsi_eh_abort_command() but rather escalates directly.

_However_, I do feel that we've gotten the wrong track with all of
this. In you original mail you stated:

> Basically, usb-storage deadlocks when the SCSI error handler
> invokes the eh_device_reset_handler callback while a command
> is running.  The command has timed out and will never complete
> normally, because the device's firmware has crashed.  But
> usb-storage's device-reset routine waits for the current command
> to finish, which brings everything to a standstill.

Question now to you as the USB god:

A command abort is only _ever_ send after a command timeout.
And the main idea of the command abort is to remove any context
the LLDD or the target might have. So by the time the command
abort returns SUCCESS we _expect_ the firmware to have cleared that
command. If for some reason the firmware isn't capable of doing so,
it should return FAILED.

So:
- Has the command timeout fired?
- If so, why hasn't it returned FAILED?
- If it had returned SUCCESS, why did the device_reset_handler
  wait for the command to complete?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] [SCSI] Fix abort state memory problem
  2014-03-31  6:58   ` Hannes Reinecke
@ 2014-03-31 13:06     ` Alan Stern
  0 siblings, 0 replies; 31+ messages in thread
From: Alan Stern @ 2014-03-31 13:06 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, SCSI development list, USB list

On Mon, 31 Mar 2014, Hannes Reinecke wrote:

> On 03/28/2014 06:49 PM, James Bottomley wrote:
> > The abort state is being stored persistently across commands, meaning if the
> > command times out a second time, the error handler thinks an abort is still
> > pending.  Fix this by properly resetting the abort flag after the abort is
> > finished.
> > 
> > Signed-off-by: James Bottomley <JBottomley@Parallels.com>
> > ---
> >  drivers/scsi/scsi_error.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 771c16b..b9f3b07 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -869,10 +869,13 @@ static int scsi_try_bus_device_reset(struct scsi_cmnd *scmd)
> >  
> >  static int scsi_try_to_abort_cmd(struct scsi_host_template *hostt, struct scsi_cmnd *scmd)
> >  {
> > -	if (!hostt->eh_abort_handler)
> > -		return FAILED;
> > +	int retval = FAILED;
> > +
> > +	if (hostt->eh_abort_handler)
> > +		retval = hostt->eh_abort_handler(scmd);
> >  
> > -	return hostt->eh_abort_handler(scmd);
> > +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> > +	return retval;
> >  }
> >  
> >  static void scsi_abort_eh_cmnd(struct scsi_cmnd *scmd)
> > 
> Hmm. No, I don't think this is correct.
> 
> SCSI_EH_ABORT_SCHEDULED signifies whether scmd_eh_abort_handler()
> needs to run. As such it needs to be reset when the workqueue item
> is triggered.
> 
> Can you try with this instead?
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..9694803 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -120,6 +120,8 @@ scmd_eh_abort_handler(struct work_struct *work)
>         struct scsi_device *sdev = scmd->device;
>         int rtn;
> 
> +       scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> +
>         if (scsi_host_eh_past_deadline(sdev->host)) {
>                 SCSI_LOG_ERROR_RECOVERY(3,
>                         scmd_printk(KERN_INFO, scmd,

This doesn't seem like a good idea either, because the flag is tested 
later on in scsi_decide_disposition().

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-31  7:22   ` Hannes Reinecke
@ 2014-03-31 13:33     ` Alan Stern
  2014-03-31 14:37       ` Hannes Reinecke
  0 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-03-31 13:33 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, SCSI development list, USB list

On Mon, 31 Mar 2014, Hannes Reinecke wrote:

> On 03/28/2014 08:29 PM, Alan Stern wrote:
> > On Fri, 28 Mar 2014, James Bottomley wrote:
> > 
> >> This is a set of three patches we agreed to a while ago to eliminate a
> >> USB deadlock.  I did rewrite the first patch, if it could be reviewed
> >> and tested.
> > 
> > I tested all three patches under the same conditions as before, and 
> > they all worked correctly.
> > 
> > In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
> > entirely clear.  This boils down to two questions, which I don't 
> > know the answers to:
> > 
> > 	What should happen in scmd_eh_abort_handler() if
> > 	scsi_host_eh_past_deadline() returns true and thereby
> > 	prevents scsi_try_to_abort_cmd() from being called?
> > 	The flag wouldn't get cleared then.
> > 
> Ah. Correct. But that's due to the first patch being incorrect.
> Cf my response to the original first patch.

See my response to your response.  :-)

> > 	What should happen if some other pathway manages to call
> > 	scsi_try_to_abort_cmd() while scmd->abort_work is still
> > 	sitting on the work queue?  The command would be aborted
> > 	and the flag would be cleared, but the queued work would
> > 	remain.  Can this ever happen?
> > 
> Not that I could see.
> A command abort is only ever triggered by the request timeout from
> the block layer. And the timer is _not_ rearmed once the timeout
> function (here: scsi_times_out()) is called.
> Hence I fail to see how it can be called concurrently.

scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
command sent by the error handler itself times out.  I haven't traced 
through all the different paths to make sure none of them can run 
concurrently.  But I'm willing to take your word for it.

> > Maybe scmd_eh_abort_handler() should check the flag before doing
> > anything.  Is there any sort of sychronization to prevent the same
> > incarnation of a command from being aborted twice (or by two different
> > threads at the same time)?  If there is, it isn't obvious.
> > 
> See above. scsi_times_out() will only ever called once.
> What can happen, though, is that _theoretically_ the LLDD might
> decide to call ->done() on a timed out command when
> scsi_eh_abort_handler() is still pending.

That's okay.  We can expect the LLDD to have sufficient locking to
handle that sort of thing without confusion (usb-storage does, for
example).

> > (Also, what's going on at the start of scsi_abort_command()?  Contrary
> > to what one might expect, the first part of the function _cancels_ a
> > scheduled abort.  And it does so without clearing the
> > SCSI_EH_ABORT_SCHEDULED flag.)
> > 
> The original idea was this:
> 
> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
> Point is, any command abort is only ever send for a timed-out
> command. And the main problem for a timed-out command is that we
> simply _do not_ know what happened for that command. So _if_ a
> command timed out, _and_ we've send an abort, _and_ the command
> times out _again_ we'll be running into an endless loop between
> timeout and aborting, and never returning the command at all.
> 
> So to prevent this we should set a marker on that command telling it
> to _not_ try to abort the command again.

I disagree.  We _have_ to abort the command again -- how else can we
stop a running command?  To prevent the loop you described, we should
avoid _retrying_ the command after it is aborted the second time.

> Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:
> 
> - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
> - abort _succeeds_
> - The same command times out a second time, notifies
>   that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
>   scsi_eh_abort_command() but rather escalates directly.

The proper time to escalate is after the command is aborted again, not
while the command is still running.  The only situation where you might
want to escalate while a command is still running would be if you were
unable to abort the command.

(Hmmm, maybe that's not true for SCSI devices in general.  It is true 
for USB mass-storage, however.  Perhaps the reset handlers in 
usb-storage should be changed so that they will automatically abort a 
running command before starting the reset.)

> _However_, I do feel that we've gotten the wrong track with all of
> this. In you original mail you stated:
> 
> > Basically, usb-storage deadlocks when the SCSI error handler
> > invokes the eh_device_reset_handler callback while a command
> > is running.  The command has timed out and will never complete
> > normally, because the device's firmware has crashed.  But
> > usb-storage's device-reset routine waits for the current command
> > to finish, which brings everything to a standstill.
> 
> Question now to you as the USB god:
> 
> A command abort is only _ever_ send after a command timeout.
> And the main idea of the command abort is to remove any context
> the LLDD or the target might have. So by the time the command
> abort returns SUCCESS we _expect_ the firmware to have cleared that
> command. If for some reason the firmware isn't capable of doing so,
> it should return FAILED.

You're asking about the error scenario in my original bug description, 
right?

> So:
> - Has the command timeout fired?

Yes.  In fact, the same command's timeout fired twice: Once when the 
command was first issued, and then later when the command was retried.

> - If so, why hasn't it returned FAILED?

The abort routine returned SUCCESS because it succeeded in aborting the
command.  All the state relevant to that command was removed from the
LLDD and the device.

> - If it had returned SUCCESS, why did the device_reset_handler
>   wait for the command to complete?

We are talking about two different things.

     1. The command is issued and it times out.

     2. The abort handler is called.  It successfully aborts the
	command and clears the state.  (This is what I meant above.)

     3. The command is retried, and again it times out.

     4. The abort handler is not called a second time (so it doesn't 
	return either SUCCESS or FAILED).  As you pointed out, the SCSI 
	core escalates instead.  The device reset handler is called
	while the retried command is still running, because the retry
	never got aborted.

I hope this clarifies the situation.

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-31 13:33     ` Alan Stern
@ 2014-03-31 14:37       ` Hannes Reinecke
       [not found]         ` <53397DAE.9010801-l3A5Bk7waGM@public.gmane.org>
  2014-03-31 22:29         ` Alan Stern
  0 siblings, 2 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-03-31 14:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: James Bottomley, SCSI development list, USB list

On 03/31/2014 03:33 PM, Alan Stern wrote:
> On Mon, 31 Mar 2014, Hannes Reinecke wrote:
> 
>> On 03/28/2014 08:29 PM, Alan Stern wrote:
>>> On Fri, 28 Mar 2014, James Bottomley wrote:
>>>
>>>> This is a set of three patches we agreed to a while ago to eliminate a
>>>> USB deadlock.  I did rewrite the first patch, if it could be reviewed
>>>> and tested.
>>>
>>> I tested all three patches under the same conditions as before, and 
>>> they all worked correctly.
>>>
>>> In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
>>> entirely clear.  This boils down to two questions, which I don't 
>>> know the answers to:
>>>
>>> 	What should happen in scmd_eh_abort_handler() if
>>> 	scsi_host_eh_past_deadline() returns true and thereby
>>> 	prevents scsi_try_to_abort_cmd() from being called?
>>> 	The flag wouldn't get cleared then.
>>>
>> Ah. Correct. But that's due to the first patch being incorrect.
>> Cf my response to the original first patch.
> 
> See my response to your response.  :-)
> 
Okay, So I probably should refrain from issueing a response to
your response to my response lest infinite recursion happens :-)

>>> 	What should happen if some other pathway manages to call
>>> 	scsi_try_to_abort_cmd() while scmd->abort_work is still
>>> 	sitting on the work queue?  The command would be aborted
>>> 	and the flag would be cleared, but the queued work would
>>> 	remain.  Can this ever happen?
>>>
>> Not that I could see.
>> A command abort is only ever triggered by the request timeout from
>> the block layer. And the timer is _not_ rearmed once the timeout
>> function (here: scsi_times_out()) is called.
>> Hence I fail to see how it can be called concurrently.
> 
> scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
> command sent by the error handler itself times out.  I haven't traced 
> through all the different paths to make sure none of them can run 
> concurrently.  But I'm willing to take your word for it.
> 
Yes, but that's not calling scsi_abort_command(), but rather invokes
scsi_abort_eh_cmnd().

>>> Maybe scmd_eh_abort_handler() should check the flag before doing
>>> anything.  Is there any sort of sychronization to prevent the same
>>> incarnation of a command from being aborted twice (or by two different
>>> threads at the same time)?  If there is, it isn't obvious.
>>>
>> See above. scsi_times_out() will only ever called once.
>> What can happen, though, is that _theoretically_ the LLDD might
>> decide to call ->done() on a timed out command when
>> scsi_eh_abort_handler() is still pending.
> 
> That's okay.  We can expect the LLDD to have sufficient locking to
> handle that sort of thing without confusion (usb-storage does, for
> example).
> 
>>> (Also, what's going on at the start of scsi_abort_command()?  Contrary
>>> to what one might expect, the first part of the function _cancels_ a
>>> scheduled abort.  And it does so without clearing the
>>> SCSI_EH_ABORT_SCHEDULED flag.)
>>>
>> The original idea was this:
>>
>> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
>> Point is, any command abort is only ever send for a timed-out
>> command. And the main problem for a timed-out command is that we
>> simply _do not_ know what happened for that command. So _if_ a
>> command timed out, _and_ we've send an abort, _and_ the command
>> times out _again_ we'll be running into an endless loop between
>> timeout and aborting, and never returning the command at all.
>>
>> So to prevent this we should set a marker on that command telling it
>> to _not_ try to abort the command again.
> 
> I disagree.  We _have_ to abort the command again -- how else can we
> stop a running command?  To prevent the loop you described, we should
> avoid _retrying_ the command after it is aborted the second time.
> 
The actual question is whether it's worth aborting the same command
a second time.
In principle any reset (like LUN reset etc) should clear the
command, too.
And the EH abort functionality is geared around this.
If, for some reason, the transport layer / device driver
requires a command abort to be send then sure, we need
to accommodate for that.

>> Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:
>>
>> - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
>> - abort _succeeds_
>> - The same command times out a second time, notifies
>>   that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
>>   scsi_eh_abort_command() but rather escalates directly.
> 
> The proper time to escalate is after the command is aborted again, not
> while the command is still running.  The only situation where you might
> want to escalate while a command is still running would be if you were
> unable to abort the command.
> 
> (Hmmm, maybe that's not true for SCSI devices in general.  It is true 
> for USB mass-storage, however.  Perhaps the reset handlers in 
> usb-storage should be changed so that they will automatically abort a 
> running command before starting the reset.)
> 
As said, yes, in principle you are right. We should be aborting the
command a second time, _and then_ starting the escalation.

So if the above reasoning is okay then this patch should be doing
the trick:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..0e72374 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
                /*
                 * Retry after abort failed, escalate to next level.
                 */
+               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
                SCSI_LOG_ERROR_RECOVERY(3,
                        scmd_printk(KERN_INFO, scmd,
                                    "scmd %p previous abort
failed\n", scmd));

(Beware of line
breaks)

Can you test with it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]         ` <53397DAE.9010801-l3A5Bk7waGM@public.gmane.org>
@ 2014-03-31 15:03           ` James Bottomley
  2014-03-31 22:41             ` Alan Stern
       [not found]             ` <1396278224.3152.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  0 siblings, 2 replies; 31+ messages in thread
From: James Bottomley @ 2014-03-31 15:03 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Alan Stern, SCSI development list, USB list

[lets split the thread]
On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote:
> On 03/31/2014 03:33 PM, Alan Stern wrote:
> > On Mon, 31 Mar 2014, Hannes Reinecke wrote:
> >> On 03/28/2014 08:29 PM, Alan Stern wrote:
> >>> On Fri, 28 Mar 2014, James Bottomley wrote:
> >>> Maybe scmd_eh_abort_handler() should check the flag before doing
> >>> anything.  Is there any sort of sychronization to prevent the same
> >>> incarnation of a command from being aborted twice (or by two different
> >>> threads at the same time)?  If there is, it isn't obvious.
> >>>
> >> See above. scsi_times_out() will only ever called once.
> >> What can happen, though, is that _theoretically_ the LLDD might
> >> decide to call ->done() on a timed out command when
> >> scsi_eh_abort_handler() is still pending.
> > 
> > That's okay.  We can expect the LLDD to have sufficient locking to
> > handle that sort of thing without confusion (usb-storage does, for
> > example).
> > 
> >>> (Also, what's going on at the start of scsi_abort_command()?  Contrary
> >>> to what one might expect, the first part of the function _cancels_ a
> >>> scheduled abort.  And it does so without clearing the
> >>> SCSI_EH_ABORT_SCHEDULED flag.)
> >>>
> >> The original idea was this:
> >>
> >> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
> >> Point is, any command abort is only ever send for a timed-out
> >> command. And the main problem for a timed-out command is that we
> >> simply _do not_ know what happened for that command. So _if_ a
> >> command timed out, _and_ we've send an abort, _and_ the command
> >> times out _again_ we'll be running into an endless loop between
> >> timeout and aborting, and never returning the command at all.
> >>
> >> So to prevent this we should set a marker on that command telling it
> >> to _not_ try to abort the command again.
> > 
> > I disagree.  We _have_ to abort the command again -- how else can we
> > stop a running command?  To prevent the loop you described, we should
> > avoid _retrying_ the command after it is aborted the second time.
> > 
> The actual question is whether it's worth aborting the same command
> a second time.
> In principle any reset (like LUN reset etc) should clear the
> command, too.
> And the EH abort functionality is geared around this.
> If, for some reason, the transport layer / device driver
> requires a command abort to be send then sure, we need
> to accommodate for that.

We already discussed this (that was my first response too).  USB needs
all outstanding commands aborted before proceeding to reset.  I'm
starting to think the actual way to fix this is to reset the abort
scheduled only if we send something else, so this might be a better fix.

It doesn't matter if we finish a command with abort scheduled because
the command then gets freed and the flags cleaned.

We can take our time with this because the other two patches, which I
can send separately fix the current deadlock (we no longer send an
unaborted request sense before the reset), and it can go via rc fixes.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..3cfd2bf 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 	const unsigned long stall_for = msecs_to_jiffies(100);
 	int rtn;
 
+	/*
+	 * We're sending another command we'll need to abort, so reset the
+	 * abort scheduled flag on the real command before we save its state
+	 */
+	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
+
 retry:
 	scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes);
 	shost->eh_action = &done;


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-31 14:37       ` Hannes Reinecke
       [not found]         ` <53397DAE.9010801-l3A5Bk7waGM@public.gmane.org>
@ 2014-03-31 22:29         ` Alan Stern
  2014-04-01 15:37           ` Hannes Reinecke
  1 sibling, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-03-31 22:29 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, SCSI development list, USB list

On Mon, 31 Mar 2014, Hannes Reinecke wrote:

> >> Ah. Correct. But that's due to the first patch being incorrect.
> >> Cf my response to the original first patch.
> > 
> > See my response to your response.  :-)
> > 
> Okay, So I probably should refrain from issueing a response to
> your response to my response lest infinite recursion happens :-)

Indeed.

> >>> 	What should happen if some other pathway manages to call
> >>> 	scsi_try_to_abort_cmd() while scmd->abort_work is still
> >>> 	sitting on the work queue?  The command would be aborted
> >>> 	and the flag would be cleared, but the queued work would
> >>> 	remain.  Can this ever happen?
> >>>
> >> Not that I could see.
> >> A command abort is only ever triggered by the request timeout from
> >> the block layer. And the timer is _not_ rearmed once the timeout
> >> function (here: scsi_times_out()) is called.
> >> Hence I fail to see how it can be called concurrently.
> > 
> > scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
> > command sent by the error handler itself times out.  I haven't traced 
> > through all the different paths to make sure none of them can run 
> > concurrently.  But I'm willing to take your word for it.
> > 
> Yes, but that's not calling scsi_abort_command(), but rather invokes
> scsi_abort_eh_cmnd().

Sure.  But either way, we end up in scsi_try_to_abort_cmd(), which
calls the LLDD's abort handler.  Thus leading to the possibility of
aborting the same command more than once.

> >> The original idea was this:
> >>
> >> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
> >> Point is, any command abort is only ever send for a timed-out
> >> command. And the main problem for a timed-out command is that we
> >> simply _do not_ know what happened for that command. So _if_ a
> >> command timed out, _and_ we've send an abort, _and_ the command
> >> times out _again_ we'll be running into an endless loop between
> >> timeout and aborting, and never returning the command at all.
> >>
> >> So to prevent this we should set a marker on that command telling it
> >> to _not_ try to abort the command again.
> > 
> > I disagree.  We _have_ to abort the command again -- how else can we
> > stop a running command?  To prevent the loop you described, we should
> > avoid _retrying_ the command after it is aborted the second time.
> > 
> The actual question is whether it's worth aborting the same command
> a second time.
> In principle any reset (like LUN reset etc) should clear the
> command, too.
> And the EH abort functionality is geared around this.
> If, for some reason, the transport layer / device driver
> requires a command abort to be send then sure, we need
> to accommodate for that.

As James mentioned, with USB a reset does not abort a running command.  
Instead it waits for the command to finish.  (However, this could be
changed in usb-storage, if required.)

> As said, yes, in principle you are right. We should be aborting the
> command a second time, _and then_ starting the escalation.
> 
> So if the above reasoning is okay then this patch should be doing
> the trick:
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..0e72374 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>                 /*
>                  * Retry after abort failed, escalate to next level.
>                  */
> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>                 SCSI_LOG_ERROR_RECOVERY(3,
>                         scmd_printk(KERN_INFO, scmd,
>                                     "scmd %p previous abort
> failed\n", scmd));
> 
> (Beware of line
> breaks)
> 
> Can you test with it?

I don't understand.  This doesn't solve the fundamental problem (namely 
that you escalate before aborting a running command).  All it does is 
clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.

What's needed is something like a 2-bit abort counter.  
scsi_try_to_abort_cmd() should increment the counter each time it runs, 
and if scmd_eh_abort_handler() sees that the counter is too high, it 
should avoid its retry pathway.  _Then_ you can move on to something 
more drastic.

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-31 15:03           ` James Bottomley
@ 2014-03-31 22:41             ` Alan Stern
       [not found]             ` <1396278224.3152.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  1 sibling, 0 replies; 31+ messages in thread
From: Alan Stern @ 2014-03-31 22:41 UTC (permalink / raw)
  To: James Bottomley; +Cc: Hannes Reinecke, SCSI development list, USB list

On Mon, 31 Mar 2014, James Bottomley wrote:

> > The actual question is whether it's worth aborting the same command
> > a second time.
> > In principle any reset (like LUN reset etc) should clear the
> > command, too.
> > And the EH abort functionality is geared around this.
> > If, for some reason, the transport layer / device driver
> > requires a command abort to be send then sure, we need
> > to accommodate for that.
> 
> We already discussed this (that was my first response too).  USB needs
> all outstanding commands aborted before proceeding to reset.  I'm
> starting to think the actual way to fix this is to reset the abort
> scheduled only if we send something else, so this might be a better fix.
> 
> It doesn't matter if we finish a command with abort scheduled because
> the command then gets freed and the flags cleaned.
> 
> We can take our time with this because the other two patches, which I
> can send separately fix the current deadlock (we no longer send an
> unaborted request sense before the reset), and it can go via rc fixes.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..3cfd2bf 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
>  	const unsigned long stall_for = msecs_to_jiffies(100);
>  	int rtn;
>  
> +	/*
> +	 * We're sending another command we'll need to abort, so reset the
> +	 * abort scheduled flag on the real command before we save its state
> +	 */
> +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> +
>  retry:
>  	scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes);
>  	shost->eh_action = &done;

I don't know if Hannes will like this, but I don't think it is right.  
This is not the retry pathway that's causing problems; that pathway 
begins where scmd_eh_abort_handler() calls scsi_queue_insert().

Now, that call is already guarded by this test:

		if (... && (++scmd->retries <= scmd->allowed)) {

which would seem to prevent the infinite loop that Hannes was concerned 
about.  So maybe the 1/3 patch posted a couple of days ago is the best 
solution after all.

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]             ` <1396278224.3152.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
@ 2014-04-01  6:14               ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-01  6:14 UTC (permalink / raw)
  To: James Bottomley; +Cc: Alan Stern, SCSI development list, USB list

On 03/31/2014 05:03 PM, James Bottomley wrote:
> [lets split the thread]
> On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote:
>> On 03/31/2014 03:33 PM, Alan Stern wrote:
>>> On Mon, 31 Mar 2014, Hannes Reinecke wrote:
>>>> On 03/28/2014 08:29 PM, Alan Stern wrote:
>>>>> On Fri, 28 Mar 2014, James Bottomley wrote:
>>>>> Maybe scmd_eh_abort_handler() should check the flag before doing
>>>>> anything.  Is there any sort of sychronization to prevent the same
>>>>> incarnation of a command from being aborted twice (or by two different
>>>>> threads at the same time)?  If there is, it isn't obvious.
>>>>>
>>>> See above. scsi_times_out() will only ever called once.
>>>> What can happen, though, is that _theoretically_ the LLDD might
>>>> decide to call ->done() on a timed out command when
>>>> scsi_eh_abort_handler() is still pending.
>>>
>>> That's okay.  We can expect the LLDD to have sufficient locking to
>>> handle that sort of thing without confusion (usb-storage does, for
>>> example).
>>>
>>>>> (Also, what's going on at the start of scsi_abort_command()?  Contrary
>>>>> to what one might expect, the first part of the function _cancels_ a
>>>>> scheduled abort.  And it does so without clearing the
>>>>> SCSI_EH_ABORT_SCHEDULED flag.)
>>>>>
>>>> The original idea was this:
>>>>
>>>> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
>>>> Point is, any command abort is only ever send for a timed-out
>>>> command. And the main problem for a timed-out command is that we
>>>> simply _do not_ know what happened for that command. So _if_ a
>>>> command timed out, _and_ we've send an abort, _and_ the command
>>>> times out _again_ we'll be running into an endless loop between
>>>> timeout and aborting, and never returning the command at all.
>>>>
>>>> So to prevent this we should set a marker on that command telling it
>>>> to _not_ try to abort the command again.
>>>
>>> I disagree.  We _have_ to abort the command again -- how else can we
>>> stop a running command?  To prevent the loop you described, we should
>>> avoid _retrying_ the command after it is aborted the second time.
>>>
>> The actual question is whether it's worth aborting the same command
>> a second time.
>> In principle any reset (like LUN reset etc) should clear the
>> command, too.
>> And the EH abort functionality is geared around this.
>> If, for some reason, the transport layer / device driver
>> requires a command abort to be send then sure, we need
>> to accommodate for that.
> 
> We already discussed this (that was my first response too).  USB needs
> all outstanding commands aborted before proceeding to reset.  I'm
> starting to think the actual way to fix this is to reset the abort
> scheduled only if we send something else, so this might be a better fix.
> 
> It doesn't matter if we finish a command with abort scheduled because
> the command then gets freed and the flags cleaned.
> 
> We can take our time with this because the other two patches, which I
> can send separately fix the current deadlock (we no longer send an
> unaborted request sense before the reset), and it can go via rc fixes.
> 
Yes, agreed.

The USB case seems to be a bit more tricky, and at least I need some
more time to fully understand the details and implications of this.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-03-31 22:29         ` Alan Stern
@ 2014-04-01 15:37           ` Hannes Reinecke
       [not found]             ` <533ADD26.1030300-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-01 15:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: James Bottomley, SCSI development list, USB list

On 04/01/2014 12:29 AM, Alan Stern wrote:
> On Mon, 31 Mar 2014, Hannes Reinecke wrote:
> 
>>>> Ah. Correct. But that's due to the first patch being incorrect.
>>>> Cf my response to the original first patch.
>>>
>>> See my response to your response.  :-)
>>>
>> Okay, So I probably should refrain from issueing a response to
>> your response to my response lest infinite recursion happens :-)
> 
> Indeed.
> 
>>>>> 	What should happen if some other pathway manages to call
>>>>> 	scsi_try_to_abort_cmd() while scmd->abort_work is still
>>>>> 	sitting on the work queue?  The command would be aborted
>>>>> 	and the flag would be cleared, but the queued work would
>>>>> 	remain.  Can this ever happen?
>>>>>
>>>> Not that I could see.
>>>> A command abort is only ever triggered by the request timeout from
>>>> the block layer. And the timer is _not_ rearmed once the timeout
>>>> function (here: scsi_times_out()) is called.
>>>> Hence I fail to see how it can be called concurrently.
>>>
>>> scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
>>> command sent by the error handler itself times out.  I haven't traced 
>>> through all the different paths to make sure none of them can run 
>>> concurrently.  But I'm willing to take your word for it.
>>>
>> Yes, but that's not calling scsi_abort_command(), but rather invokes
>> scsi_abort_eh_cmnd().
> 
> Sure.  But either way, we end up in scsi_try_to_abort_cmd(), which
> calls the LLDD's abort handler.  Thus leading to the possibility of
> aborting the same command more than once.
> 
>>>> The original idea was this:
>>>>
>>>> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
>>>> Point is, any command abort is only ever send for a timed-out
>>>> command. And the main problem for a timed-out command is that we
>>>> simply _do not_ know what happened for that command. So _if_ a
>>>> command timed out, _and_ we've send an abort, _and_ the command
>>>> times out _again_ we'll be running into an endless loop between
>>>> timeout and aborting, and never returning the command at all.
>>>>
>>>> So to prevent this we should set a marker on that command telling it
>>>> to _not_ try to abort the command again.
>>>
>>> I disagree.  We _have_ to abort the command again -- how else can we
>>> stop a running command?  To prevent the loop you described, we should
>>> avoid _retrying_ the command after it is aborted the second time.
>>>
>> The actual question is whether it's worth aborting the same command
>> a second time.
>> In principle any reset (like LUN reset etc) should clear the
>> command, too.
>> And the EH abort functionality is geared around this.
>> If, for some reason, the transport layer / device driver
>> requires a command abort to be send then sure, we need
>> to accommodate for that.
> 
> As James mentioned, with USB a reset does not abort a running command.  
> Instead it waits for the command to finish.  (However, this could be
> changed in usb-storage, if required.)
> 
>> As said, yes, in principle you are right. We should be aborting the
>> command a second time, _and then_ starting the escalation.
>>
>> So if the above reasoning is okay then this patch should be doing
>> the trick:
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index 771c16b..0e72374 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>>                 /*
>>                  * Retry after abort failed, escalate to next level.
>>                  */
>> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>>                 SCSI_LOG_ERROR_RECOVERY(3,
>>                         scmd_printk(KERN_INFO, scmd,
>>                                     "scmd %p previous abort
>> failed\n", scmd));
>>
>> (Beware of line
>> breaks)
>>
>> Can you test with it?
> 
> I don't understand.  This doesn't solve the fundamental problem (namely 
> that you escalate before aborting a running command).  All it does is 
> clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
> 
Which was precisely the point :-)

Hmm. The comment might've been clearer.

What this patch is _supposed_ to be doing is that it'll clear the
SCSI_EH_ABORT_SCHEDULED flag it it has been set.
Which means this will be the second time scsi_abort_command() has
been called for the same command.
IE the first abort went out, did its thing, but now the same command
has timed out again.

So this flag gets cleared, and scsi_abort_command() returns FAILED,
and _no_ asynchronous abort is being scheduled.
scsi_times_out() will then proceed to call scsi_eh_scmd_add().
But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
the SCSI_EH_CANCEL_CMD flag will continue to be set,
and the command will be aborted with the main SCSI EH routine.

It looks to me as if it should do what you desire, namely abort the
command asynchronously the first time, and invoking the SCSI EH the
second time.

Am I wrong?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]             ` <533ADD26.1030300-l3A5Bk7waGM@public.gmane.org>
@ 2014-04-01 21:28               ` Alan Stern
       [not found]                 ` <Pine.LNX.4.44L0.1404011718350.7652-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-04-01 21:28 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, SCSI development list, USB list

On Tue, 1 Apr 2014, Hannes Reinecke wrote:

> >> So if the above reasoning is okay then this patch should be doing
> >> the trick:
> >>
> >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> >> index 771c16b..0e72374 100644
> >> --- a/drivers/scsi/scsi_error.c
> >> +++ b/drivers/scsi/scsi_error.c
> >> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
> >>                 /*
> >>                  * Retry after abort failed, escalate to next level.
> >>                  */
> >> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> >>                 SCSI_LOG_ERROR_RECOVERY(3,
> >>                         scmd_printk(KERN_INFO, scmd,
> >>                                     "scmd %p previous abort
> >> failed\n", scmd));
> >>
> >> (Beware of line
> >> breaks)
> >>
> >> Can you test with it?
> > 
> > I don't understand.  This doesn't solve the fundamental problem (namely 
> > that you escalate before aborting a running command).  All it does is 
> > clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
> > 
> Which was precisely the point :-)
> 
> Hmm. The comment might've been clearer.
> 
> What this patch is _supposed_ to be doing is that it'll clear the
> SCSI_EH_ABORT_SCHEDULED flag it it has been set.
> Which means this will be the second time scsi_abort_command() has
> been called for the same command.
> IE the first abort went out, did its thing, but now the same command
> has timed out again.
> 
> So this flag gets cleared, and scsi_abort_command() returns FAILED,
> and _no_ asynchronous abort is being scheduled.
> scsi_times_out() will then proceed to call scsi_eh_scmd_add().
> But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
> the SCSI_EH_CANCEL_CMD flag will continue to be set,
> and the command will be aborted with the main SCSI EH routine.
> 
> It looks to me as if it should do what you desire, namely abort the
> command asynchronously the first time, and invoking the SCSI EH the
> second time.
> 
> Am I wrong?

I don't know -- I'll have to try it out.  Currently I'm busy with a 
bunch of other stuff, so it will take some time.

Looking through the code, I have to wonder why scsi_times_out()  
modifies scmd->result.  Won't this value get overwritten by the LLDD
when the command eventually terminates?

Even worse, what happens in the event of a race where the command 
terminates normally just before scsi_times_out() changes scmd->result?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]                 ` <Pine.LNX.4.44L0.1404011718350.7652-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-04-02  6:03                   ` Hannes Reinecke
  2014-04-07 15:26                     ` Alan Stern
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-02  6:03 UTC (permalink / raw)
  To: Alan Stern; +Cc: James Bottomley, SCSI development list, USB list

On 04/01/2014 11:28 PM, Alan Stern wrote:
> On Tue, 1 Apr 2014, Hannes Reinecke wrote:
> 
>>>> So if the above reasoning is okay then this patch should be doing
>>>> the trick:
>>>>
>>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>>> index 771c16b..0e72374 100644
>>>> --- a/drivers/scsi/scsi_error.c
>>>> +++ b/drivers/scsi/scsi_error.c
>>>> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>>>>                 /*
>>>>                  * Retry after abort failed, escalate to next level.
>>>>                  */
>>>> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>>>>                 SCSI_LOG_ERROR_RECOVERY(3,
>>>>                         scmd_printk(KERN_INFO, scmd,
>>>>                                     "scmd %p previous abort
>>>> failed\n", scmd));
>>>>
>>>> (Beware of line
>>>> breaks)
>>>>
>>>> Can you test with it?
>>>
>>> I don't understand.  This doesn't solve the fundamental problem (namely 
>>> that you escalate before aborting a running command).  All it does is 
>>> clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
>>>
>> Which was precisely the point :-)
>>
>> Hmm. The comment might've been clearer.
>>
>> What this patch is _supposed_ to be doing is that it'll clear the
>> SCSI_EH_ABORT_SCHEDULED flag it it has been set.
>> Which means this will be the second time scsi_abort_command() has
>> been called for the same command.
>> IE the first abort went out, did its thing, but now the same command
>> has timed out again.
>>
>> So this flag gets cleared, and scsi_abort_command() returns FAILED,
>> and _no_ asynchronous abort is being scheduled.
>> scsi_times_out() will then proceed to call scsi_eh_scmd_add().
>> But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
>> the SCSI_EH_CANCEL_CMD flag will continue to be set,
>> and the command will be aborted with the main SCSI EH routine.
>>
>> It looks to me as if it should do what you desire, namely abort the
>> command asynchronously the first time, and invoking the SCSI EH the
>> second time.
>>
>> Am I wrong?
> 
> I don't know -- I'll have to try it out.  Currently I'm busy with a 
> bunch of other stuff, so it will take some time.
> 
> Looking through the code, I have to wonder why scsi_times_out()  
> modifies scmd->result.  Won't this value get overwritten by the LLDD
> when the command eventually terminates?
> 
Yes. However, the 'DID_TIME_OUT' is just a marker that the internal
timeout code triggered.
If the LLDD overwrites it with a 'real' error code everything's fine.

> Even worse, what happens in the event of a race where the command 
> terminates normally just before scsi_times_out() changes scmd->result?
> 
_That_ is the least of our worries.
_If_ the LLDD completes the command while scsi_times_out() is
running the whole thing is going down the drain anyway, as the
command will be terminated by the LLDD, and we can only hope that we
didn't mess up our reference counting. Otherwise we'd have EH
running on a stale command, which is going to be a fun to watch.

But looking closer, it might be that the line modifying the result
in scsi_times_out() is indeed pointless, seeing that it's being set
in scsi_eh_abort_handler(), too.
I'll be checking if we can simply remove that line.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-02  6:03                   ` Hannes Reinecke
@ 2014-04-07 15:26                     ` Alan Stern
       [not found]                       ` <Pine.LNX.4.44L0.1404071125220.20747-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-04-07 15:26 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, SCSI development list, USB list

On Wed, 2 Apr 2014, Hannes Reinecke wrote:

> On 04/01/2014 11:28 PM, Alan Stern wrote:
> > On Tue, 1 Apr 2014, Hannes Reinecke wrote:
> > 
> >>>> So if the above reasoning is okay then this patch should be doing
> >>>> the trick:
> >>>>
> >>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> >>>> index 771c16b..0e72374 100644
> >>>> --- a/drivers/scsi/scsi_error.c
> >>>> +++ b/drivers/scsi/scsi_error.c
> >>>> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
> >>>>                 /*
> >>>>                  * Retry after abort failed, escalate to next level.
> >>>>                  */
> >>>> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> >>>>                 SCSI_LOG_ERROR_RECOVERY(3,
> >>>>                         scmd_printk(KERN_INFO, scmd,
> >>>>                                     "scmd %p previous abort
> >>>> failed\n", scmd));
> >>>>
> >>>> (Beware of line
> >>>> breaks)
> >>>>
> >>>> Can you test with it?
> >>>
> >>> I don't understand.  This doesn't solve the fundamental problem (namely 
> >>> that you escalate before aborting a running command).  All it does is 
> >>> clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
> >>>
> >> Which was precisely the point :-)
> >>
> >> Hmm. The comment might've been clearer.
> >>
> >> What this patch is _supposed_ to be doing is that it'll clear the
> >> SCSI_EH_ABORT_SCHEDULED flag it it has been set.
> >> Which means this will be the second time scsi_abort_command() has
> >> been called for the same command.
> >> IE the first abort went out, did its thing, but now the same command
> >> has timed out again.
> >>
> >> So this flag gets cleared, and scsi_abort_command() returns FAILED,
> >> and _no_ asynchronous abort is being scheduled.
> >> scsi_times_out() will then proceed to call scsi_eh_scmd_add().
> >> But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
> >> the SCSI_EH_CANCEL_CMD flag will continue to be set,
> >> and the command will be aborted with the main SCSI EH routine.
> >>
> >> It looks to me as if it should do what you desire, namely abort the
> >> command asynchronously the first time, and invoking the SCSI EH the
> >> second time.
> >>
> >> Am I wrong?
> > 
> > I don't know -- I'll have to try it out.  Currently I'm busy with a 
> > bunch of other stuff, so it will take some time.

I finally got a chance to try it out.  It does seem to do what we want.  
I didn't track the flow of control in complete detail, but the command 
definitely got aborted both times it was issued.

Alan Stern


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]                       ` <Pine.LNX.4.44L0.1404071125220.20747-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-04-09 17:42                         ` Hannes Reinecke
       [not found]                           ` <53458680.4090500-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-09 17:42 UTC (permalink / raw)
  To: James Bottomley; +Cc: Alan Stern, SCSI development list, USB list

On 04/07/2014 05:26 PM, Alan Stern wrote:
> On Wed, 2 Apr 2014, Hannes Reinecke wrote:
>
>> On 04/01/2014 11:28 PM, Alan Stern wrote:
>>> On Tue, 1 Apr 2014, Hannes Reinecke wrote:
>>>
>>>>>> So if the above reasoning is okay then this patch should be doing
>>>>>> the trick:
>>>>>>
>>>>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>>>>> index 771c16b..0e72374 100644
>>>>>> --- a/drivers/scsi/scsi_error.c
>>>>>> +++ b/drivers/scsi/scsi_error.c
>>>>>> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>>>>>>                  /*
>>>>>>                   * Retry after abort failed, escalate to next level.
>>>>>>                   */
>>>>>> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>>>>>>                  SCSI_LOG_ERROR_RECOVERY(3,
>>>>>>                          scmd_printk(KERN_INFO, scmd,
>>>>>>                                      "scmd %p previous abort
>>>>>> failed\n", scmd));
>>>>>>
>>>>>> (Beware of line
>>>>>> breaks)
>>>>>>
>>>>>> Can you test with it?
>>>>>
>>>>> I don't understand.  This doesn't solve the fundamental problem (namely
>>>>> that you escalate before aborting a running command).  All it does is
>>>>> clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
>>>>>
>>>> Which was precisely the point :-)
>>>>
>>>> Hmm. The comment might've been clearer.
>>>>
>>>> What this patch is _supposed_ to be doing is that it'll clear the
>>>> SCSI_EH_ABORT_SCHEDULED flag it it has been set.
>>>> Which means this will be the second time scsi_abort_command() has
>>>> been called for the same command.
>>>> IE the first abort went out, did its thing, but now the same command
>>>> has timed out again.
>>>>
>>>> So this flag gets cleared, and scsi_abort_command() returns FAILED,
>>>> and _no_ asynchronous abort is being scheduled.
>>>> scsi_times_out() will then proceed to call scsi_eh_scmd_add().
>>>> But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
>>>> the SCSI_EH_CANCEL_CMD flag will continue to be set,
>>>> and the command will be aborted with the main SCSI EH routine.
>>>>
>>>> It looks to me as if it should do what you desire, namely abort the
>>>> command asynchronously the first time, and invoking the SCSI EH the
>>>> second time.
>>>>
>>>> Am I wrong?
>>>
>>> I don't know -- I'll have to try it out.  Currently I'm busy with a
>>> bunch of other stuff, so it will take some time.
>
> I finally got a chance to try it out.  It does seem to do what we want.
> I didn't track the flow of control in complete detail, but the command
> definitely got aborted both times it was issued.
>
Good, so it is as I thought. James, can we include this patch instead of 
your prior solution?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]                           ` <53458680.4090500-l3A5Bk7waGM@public.gmane.org>
@ 2014-04-09 18:02                             ` Alan Stern
  2014-04-10 10:58                               ` Andreas Reis
  0 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-04-09 18:02 UTC (permalink / raw)
  To: Hannes Reinecke, Andreas Reis
  Cc: James Bottomley, SCSI development list, USB list

On Wed, 9 Apr 2014, Hannes Reinecke wrote:

> > I finally got a chance to try it out.  It does seem to do what we want.
> > I didn't track the flow of control in complete detail, but the command
> > definitely got aborted both times it was issued.
> >
> Good, so it is as I thought. James, can we include this patch instead of 
> your prior solution?

First, we should have the original bug reporter try it out.

Andreas, the patch in question can be found here:

	http://marc.info/?l=linux-usb&m=139627666606597&w=2

Can you try this in place of the 1/3 patch posted by James?  It should 
have the same effect, of preventing your system from crashing when the 
READ command fails.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-09 18:02                             ` Alan Stern
@ 2014-04-10 10:58                               ` Andreas Reis
  2014-04-10 11:37                                 ` Hannes Reinecke
  0 siblings, 1 reply; 31+ messages in thread
From: Andreas Reis @ 2014-04-10 10:58 UTC (permalink / raw)
  To: Alan Stern, Hannes Reinecke
  Cc: James Bottomley, SCSI development list, USB list

That patch appears to work in preventing the crashes, judged on one 
repeated appearance of the bug.

dmesg had the usual
[  215.229903] usb 4-2: usb_disable_lpm called, do nothing
[  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd
[  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep ffff880427b829c0
[  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep ffff880427b82a08
[  215.350621] usb 4-2: usb_enable_lpm called, do nothing

repeated five times, followed by one
[  282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery

and then as often as something tried to read from it:
[  295.585472] sd 8:0:0:0: rejecting I/O to offline device

The stick could then be properly un- and remounted (the latter if it had 
been physically replugged) without issue — for the bug to reoccur after 
one to three minutes. I tried this three times, no dmesg difference 
except the ep addresses varied on two of that.

Andreas Reis

On 09.04.2014 20:02, Alan Stern wrote:
> On Wed, 9 Apr 2014, Hannes Reinecke wrote:
>
>>> I finally got a chance to try it out.  It does seem to do what we want.
>>> I didn't track the flow of control in complete detail, but the command
>>> definitely got aborted both times it was issued.
>>>
>> Good, so it is as I thought. James, can we include this patch instead of
>> your prior solution?
>
> First, we should have the original bug reporter try it out.
>
> Andreas, the patch in question can be found here:
>
> 	http://marc.info/?l=linux-usb&m=139627666606597&w=2
>
> Can you try this in place of the 1/3 patch posted by James?  It should
> have the same effect, of preventing your system from crashing when the
> READ command fails.
>
> Alan Stern
>

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 10:58                               ` Andreas Reis
@ 2014-04-10 11:37                                 ` Hannes Reinecke
  2014-04-10 12:26                                   ` Andreas Reis
                                                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-10 11:37 UTC (permalink / raw)
  To: Andreas Reis, Alan Stern; +Cc: James Bottomley, SCSI development list, USB list

On 04/10/2014 12:58 PM, Andreas Reis wrote:
> That patch appears to work in preventing the crashes, judged on one
> repeated appearance of the bug.
> 
> dmesg had the usual
> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> xhci_hcd
> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> with disabled ep ffff880427b829c0
> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> with disabled ep ffff880427b82a08
> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> 
> repeated five times, followed by one
> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> recovery
> 
> and then as often as something tried to read from it:
> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> 
> The stick could then be properly un- and remounted (the latter if it
> had been physically replugged) without issue — for the bug to
> reoccur after one to three minutes. I tried this three times, no
> dmesg difference except the ep addresses varied on two of that.
> 
Was this just that patch you've tested with or the entire patch series?

If the latter, Alan, is this the expected outcome?
I would've thought the error recover should _not_ run into
offlining devices here, but rather the device should be recovered
eventually.

Andreas, can you test with the entire patch series and enable
'scsi_logging_level -s -E 5' prior to running the tests?

THX.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 11:37                                 ` Hannes Reinecke
@ 2014-04-10 12:26                                   ` Andreas Reis
  2014-04-10 12:29                                     ` Hannes Reinecke
  2014-04-10 15:31                                   ` Alan Stern
       [not found]                                   ` <53468297.1040909-l3A5Bk7waGM@public.gmane.org>
  2 siblings, 1 reply; 31+ messages in thread
From: Andreas Reis @ 2014-04-10 12:26 UTC (permalink / raw)
  To: Hannes Reinecke, Alan Stern
  Cc: James Bottomley, SCSI development list, USB list

Only your 0/3 patch to which Alan linked, along with two other patches 
by Mathias Nyman ("disable usb3 on intel hosts" and "disable all lpm 
related control transfers", one of which is the source of the "do 
nothing"s).

I'll revert the latter two and apply the rest of the set. Which I'm 
guessing currently consists of said 0/3 patch —
http://www.spinics.net/lists/linux-scsi/msg73502.html
— plus 2/3 and 3/3?

Or should I just omit 0/3 and try whichever of the two in 1/3 "works 
best"? Rather confusing ATM.

Anyway, for whatever reason the bug is happening rather frequently now. 
I've spotted the following occurring after the "Device offlined" line 
two times now:

[  206.901385] sd 11:0:0:0: [sdg] Unhandled error code
[  206.901394] sd 11:0:0:0: [sdg]
[  206.901397] Result: hostbyte=0x01 driverbyte=0x00
[  206.901400] sd 11:0:0:0: [sdg] CDB:
[  206.901403] cdb[0]=0x2a: 2a 00 02 25 1b 50 00 00 08 00
[  206.901419] end_request: I/O error, dev sdg, sector 35986256

The second time had "sd 12:0:0:0", "cdb[0]=0x28: 28 00 03 94 77 20 00 00 
08 00" and a different sector.

Andreas Reis

On 10.04.2014 13:37, Hannes Reinecke wrote:
> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>> That patch appears to work in preventing the crashes, judged on one
>> repeated appearance of the bug.
>>
>> dmesg had the usual
>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>> xhci_hcd
>> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>> with disabled ep ffff880427b829c0
>> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>> with disabled ep ffff880427b82a08
>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>
>> repeated five times, followed by one
>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>> recovery
>>
>> and then as often as something tried to read from it:
>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>
>> The stick could then be properly un- and remounted (the latter if it
>> had been physically replugged) without issue — for the bug to
>> reoccur after one to three minutes. I tried this three times, no
>> dmesg difference except the ep addresses varied on two of that.
>>
> Was this just that patch you've tested with or the entire patch series?
>
> If the latter, Alan, is this the expected outcome?
> I would've thought the error recover should _not_ run into
> offlining devices here, but rather the device should be recovered
> eventually.
>
> Andreas, can you test with the entire patch series and enable
> 'scsi_logging_level -s -E 5' prior to running the tests?
>
> THX.
>
> Cheers,
>
> Hannes
>

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 12:26                                   ` Andreas Reis
@ 2014-04-10 12:29                                     ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-10 12:29 UTC (permalink / raw)
  To: Andreas Reis, Alan Stern; +Cc: James Bottomley, SCSI development list, USB list

On 04/10/2014 02:26 PM, Andreas Reis wrote:
> Only your 0/3 patch to which Alan linked, along with two other
> patches by Mathias Nyman ("disable usb3 on intel hosts" and "disable
> all lpm related control transfers", one of which is the source of
> the "do nothing"s).
> 
> I'll revert the latter two and apply the rest of the set. Which I'm
> guessing currently consists of said 0/3 patch —
> http://www.spinics.net/lists/linux-scsi/msg73502.html
> — plus 2/3 and 3/3?
> 
Yes, that is correct.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 11:37                                 ` Hannes Reinecke
  2014-04-10 12:26                                   ` Andreas Reis
@ 2014-04-10 15:31                                   ` Alan Stern
  2014-04-10 17:52                                     ` Hannes Reinecke
       [not found]                                   ` <53468297.1040909-l3A5Bk7waGM@public.gmane.org>
  2 siblings, 1 reply; 31+ messages in thread
From: Alan Stern @ 2014-04-10 15:31 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Andreas Reis, James Bottomley, SCSI development list, USB list

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=UTF-8, Size: 1995 bytes --]

On Thu, 10 Apr 2014, Hannes Reinecke wrote:

> On 04/10/2014 12:58 PM, Andreas Reis wrote:
> > That patch appears to work in preventing the crashes, judged on one
> > repeated appearance of the bug.
> > 
> > dmesg had the usual
> > [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> > [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> > xhci_hcd
> > [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> > with disabled ep ffff880427b829c0
> > [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> > with disabled ep ffff880427b82a08
> > [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> > 
> > repeated five times, followed by one
> > [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> > recovery
> > 
> > and then as often as something tried to read from it:
> > [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> > 
> > The stick could then be properly un- and remounted (the latter if it
> > had been physically replugged) without issue — for the bug to
> > reoccur after one to three minutes. I tried this three times, no
> > dmesg difference except the ep addresses varied on two of that.
> > 
> Was this just that patch you've tested with or the entire patch series?
> 
> If the latter, Alan, is this the expected outcome?

Yes, it is.  The same thing should happen with the entire patch series.

> I would've thought the error recover should _not_ run into
> offlining devices here, but rather the device should be recovered
> eventually.

The command times out, it is aborted, and the command is retried.  The
same thing happens, and we repeat five times.  Eventually the SCSI core
gives up and declares the device to be offline.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 15:31                                   ` Alan Stern
@ 2014-04-10 17:52                                     ` Hannes Reinecke
       [not found]                                       ` <5346DA43.4010603-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-10 17:52 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, James Bottomley, SCSI development list, USB list

On 04/10/2014 05:31 PM, Alan Stern wrote:
> On Thu, 10 Apr 2014, Hannes Reinecke wrote:
>
>> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>>> That patch appears to work in preventing the crashes, judged on one
>>> repeated appearance of the bug.
>>>
>>> dmesg had the usual
>>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>>> xhci_hcd
>>> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>>> with disabled ep ffff880427b829c0
>>> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>>> with disabled ep ffff880427b82a08
>>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>>
>>> repeated five times, followed by one
>>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>>> recovery
>>>
>>> and then as often as something tried to read from it:
>>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>>
>>> The stick could then be properly un- and remounted (the latter if it
>>> had been physically replugged) without issue � for the bug to
>>> reoccur after one to three minutes. I tried this three times, no
>>> dmesg difference except the ep addresses varied on two of that.
>>>
>> Was this just that patch you've tested with or the entire patch series?
>>
>> If the latter, Alan, is this the expected outcome?
>
> Yes, it is.  The same thing should happen with the entire patch series.
>
>> I would've thought the error recover should _not_ run into
>> offlining devices here, but rather the device should be recovered
>> eventually.
>
> The command times out, it is aborted, and the command is retried.  The
> same thing happens, and we repeat five times.  Eventually the SCSI core
> gives up and declares the device to be offline.
>
Hmm. Ok. If you are fine with it who am I to argue here.
James, shall I resent the patch series?

Cheers,

Hannes

-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]                                       ` <5346DA43.4010603-l3A5Bk7waGM@public.gmane.org>
@ 2014-04-10 20:36                                         ` James Bottomley
  2014-04-11  5:52                                           ` Hannes Reinecke
  0 siblings, 1 reply; 31+ messages in thread
From: James Bottomley @ 2014-04-10 20:36 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Alan Stern, Andreas Reis, SCSI development list, USB list

On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
> On 04/10/2014 05:31 PM, Alan Stern wrote:
> > On Thu, 10 Apr 2014, Hannes Reinecke wrote:
> >
> >> On 04/10/2014 12:58 PM, Andreas Reis wrote:
> >>> That patch appears to work in preventing the crashes, judged on one
> >>> repeated appearance of the bug.
> >>>
> >>> dmesg had the usual
> >>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> >>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> >>> xhci_hcd
> >>> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep ffff880427b829c0
> >>> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep ffff880427b82a08
> >>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> >>>
> >>> repeated five times, followed by one
> >>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> >>> recovery
> >>>
> >>> and then as often as something tried to read from it:
> >>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> >>>
> >>> The stick could then be properly un- and remounted (the latter if it
> >>> had been physically replugged) without issue � for the bug to
> >>> reoccur after one to three minutes. I tried this three times, no
> >>> dmesg difference except the ep addresses varied on two of that.
> >>>
> >> Was this just that patch you've tested with or the entire patch series?
> >>
> >> If the latter, Alan, is this the expected outcome?
> >
> > Yes, it is.  The same thing should happen with the entire patch series.
> >
> >> I would've thought the error recover should _not_ run into
> >> offlining devices here, but rather the device should be recovered
> >> eventually.
> >
> > The command times out, it is aborted, and the command is retried.  The
> > same thing happens, and we repeat five times.  Eventually the SCSI core
> > gives up and declares the device to be offline.
> >
> Hmm. Ok. If you are fine with it who am I to argue here.
> James, shall I resent the patch series?

You mean the one patch?  No, it's OK, I have it.

It's still not complete, though, as I've said a couple of times.  The
problem is that we have abort memory on any eh command as well, which
this doesn't fix.

The scenario is abort command, set flag, abort completes, send TUR, TUR
doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
will skip the abort because the flag is set and move straight to reset.

The fix is this, I can just add it as well.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..7516e2c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
 	ses->prot_op = scmd->prot_op;
 
 	scmd->prot_op = SCSI_PROT_NORMAL;
+	scmd->eh_eflags = 0;
 	scmd->cmnd = ses->eh_cmnd;
 	memset(scmd->cmnd, 0, BLK_MAX_CDB);
 	memset(&scmd->sdb, 0, sizeof(scmd->sdb));


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
  2014-04-10 20:36                                         ` James Bottomley
@ 2014-04-11  5:52                                           ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2014-04-11  5:52 UTC (permalink / raw)
  To: James Bottomley; +Cc: Alan Stern, Andreas Reis, SCSI development list, USB list

On 04/10/2014 10:36 PM, James Bottomley wrote:
> On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
>> On 04/10/2014 05:31 PM, Alan Stern wrote:
>>> On Thu, 10 Apr 2014, Hannes Reinecke wrote:
>>>
>>>> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>>>>> That patch appears to work in preventing the crashes, judged on one
>>>>> repeated appearance of the bug.
>>>>>
>>>>> dmesg had the usual
>>>>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>>>>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>>>>> xhci_hcd
>>>>> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>>>>> with disabled ep ffff880427b829c0
>>>>> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
>>>>> with disabled ep ffff880427b82a08
>>>>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>>>>
>>>>> repeated five times, followed by one
>>>>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>>>>> recovery
>>>>>
>>>>> and then as often as something tried to read from it:
>>>>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>>>>
>>>>> The stick could then be properly un- and remounted (the latter if it
>>>>> had been physically replugged) without issue � for the bug to
>>>>> reoccur after one to three minutes. I tried this three times, no
>>>>> dmesg difference except the ep addresses varied on two of that.
>>>>>
>>>> Was this just that patch you've tested with or the entire patch series?
>>>>
>>>> If the latter, Alan, is this the expected outcome?
>>>
>>> Yes, it is.  The same thing should happen with the entire patch series.
>>>
>>>> I would've thought the error recover should _not_ run into
>>>> offlining devices here, but rather the device should be recovered
>>>> eventually.
>>>
>>> The command times out, it is aborted, and the command is retried.  The
>>> same thing happens, and we repeat five times.  Eventually the SCSI core
>>> gives up and declares the device to be offline.
>>>
>> Hmm. Ok. If you are fine with it who am I to argue here.
>> James, shall I resent the patch series?
> 
> You mean the one patch?  No, it's OK, I have it.
> 
> It's still not complete, though, as I've said a couple of times.  The
> problem is that we have abort memory on any eh command as well, which
> this doesn't fix.
> 
> The scenario is abort command, set flag, abort completes, send TUR, TUR
> doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
> will skip the abort because the flag is set and move straight to reset.
> 
> The fix is this, I can just add it as well.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..7516e2c 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
>  	ses->prot_op = scmd->prot_op;
>  
>  	scmd->prot_op = SCSI_PROT_NORMAL;
> +	scmd->eh_eflags = 0;
>  	scmd->cmnd = ses->eh_cmnd;
>  	memset(scmd->cmnd, 0, BLK_MAX_CDB);
>  	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
> 
> 
Oh yes, that is correct.

Acked-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
       [not found]                                   ` <53468297.1040909-l3A5Bk7waGM@public.gmane.org>
@ 2014-04-11 19:08                                     ` Andreas Reis
  0 siblings, 0 replies; 31+ messages in thread
From: Andreas Reis @ 2014-04-11 19:08 UTC (permalink / raw)
  To: Hannes Reinecke, Alan Stern
  Cc: James Bottomley, SCSI development list, USB list

I did set the scsi_logging_level as you wrote, but it still resulted 
into exactly the same dmesg (resp. sudo journalctl -b 0) output with the 
entire patchset (including the "scmd->eh_eflags = 0;"). Except for the 
reverted "do nothing"s, for course.

However, I did miss to mention the following which is printed once with 
or without the scsi_logging_level change:

[16254.735372] sd 9:0:0:0: Device offlined - not ready after error recovery
[16254.735393] sd 9:0:0:0: rejecting I/O to offline device
[16254.735397] sd 9:0:0:0: [sdf] killing request
[16254.735488] sd 9:0:0:0: [sdf] Unhandled error code
[16254.735489] sd 9:0:0:0: [sdf]
[16254.735490] Result: hostbyte=0x01 driverbyte=0x00
[16254.735492] sd 9:0:0:0: [sdf] CDB:
[16254.735493] cdb[0]=0x2a: 2a 00 03 1c 10 f8 00 00 f0 00
[16254.735498] end_request: I/O error, dev sdf, sector 52171000

Another:

[19389.907015] sd 11:0:0:0: Device offlined - not ready after error recovery
[19389.907037] sd 11:0:0:0: [sdf] Unhandled error code
[19389.907044] sd 11:0:0:0: [sdf]
[19389.907049] Result: hostbyte=0x05 driverbyte=0x00
[19389.907055] sd 11:0:0:0: [sdf] CDB:
[19389.907060] cdb[0]=0x28: 28 00 03 13 0d 28 00 00 08 00
[19389.907083] end_request: I/O error, dev sdf, sector 51580200
[19389.907121] sd 11:0:0:0: rejecting I/O to offline device
[19389.907126] sd 11:0:0:0: killing request

In case it matters, the stick's partition stays readable with regard to 
any data on it that has been cached.

On 10.04.2014 13:37, Hannes Reinecke wrote:
> Andreas, can you test with the entire patch series and enable
> 'scsi_logging_level -s -E 5' prior to running the tests?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2014-04-11 19:08 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
2014-03-31  6:58   ` Hannes Reinecke
2014-03-31 13:06     ` Alan Stern
2014-03-28 17:50 ` [PATCH 2/3] [SCSI] Fix spurious request sense in error handling James Bottomley
2014-03-31  6:59   ` Hannes Reinecke
2014-03-28 17:51 ` [PATCH 3/3] [SCSI] Fix command result state propagation James Bottomley
2014-03-31  7:00   ` Hannes Reinecke
2014-03-28 19:29 ` [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Alan Stern
2014-03-31  7:22   ` Hannes Reinecke
2014-03-31 13:33     ` Alan Stern
2014-03-31 14:37       ` Hannes Reinecke
     [not found]         ` <53397DAE.9010801-l3A5Bk7waGM@public.gmane.org>
2014-03-31 15:03           ` James Bottomley
2014-03-31 22:41             ` Alan Stern
     [not found]             ` <1396278224.3152.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-04-01  6:14               ` Hannes Reinecke
2014-03-31 22:29         ` Alan Stern
2014-04-01 15:37           ` Hannes Reinecke
     [not found]             ` <533ADD26.1030300-l3A5Bk7waGM@public.gmane.org>
2014-04-01 21:28               ` Alan Stern
     [not found]                 ` <Pine.LNX.4.44L0.1404011718350.7652-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-04-02  6:03                   ` Hannes Reinecke
2014-04-07 15:26                     ` Alan Stern
     [not found]                       ` <Pine.LNX.4.44L0.1404071125220.20747-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-04-09 17:42                         ` Hannes Reinecke
     [not found]                           ` <53458680.4090500-l3A5Bk7waGM@public.gmane.org>
2014-04-09 18:02                             ` Alan Stern
2014-04-10 10:58                               ` Andreas Reis
2014-04-10 11:37                                 ` Hannes Reinecke
2014-04-10 12:26                                   ` Andreas Reis
2014-04-10 12:29                                     ` Hannes Reinecke
2014-04-10 15:31                                   ` Alan Stern
2014-04-10 17:52                                     ` Hannes Reinecke
     [not found]                                       ` <5346DA43.4010603-l3A5Bk7waGM@public.gmane.org>
2014-04-10 20:36                                         ` James Bottomley
2014-04-11  5:52                                           ` Hannes Reinecke
     [not found]                                   ` <53468297.1040909-l3A5Bk7waGM@public.gmane.org>
2014-04-11 19:08                                     ` Andreas Reis

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.