All of lore.kernel.org
 help / color / mirror / Atom feed
* Deadlock in usb-storage error handling
       [not found] ` <53298181.9020206-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-03-19 20:31   ` Alan Stern
       [not found]     ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2014-03-19 20:31 UTC (permalink / raw)
  To: Andreas Reis, James Bottomley; +Cc: USB list, SCSI development list

On Wed, 19 Mar 2014, Andreas Reis wrote:

> I've uploaded a dmesg with the new debugging patch to bugzilla:
> https://bugzilla.kernel.org/attachment.cgi?id=130041

Thanks.  I have now managed to reproduce many of the features of this
problem on my own computer.

James, I will need your help (or help from somebody who understands the 
SCSI error handler) to figure out how this problem should be fixed.

Basically, usb-storage deadlocks when the SCSI error handler invokes
the eh_device_reset_handler callback while a command is running.  The
command has timed out and will never complete normally, because the
device's firmware has crashed.  But usb-storage's device-reset routine
waits for the current command to finish, which brings everything to a
standstill.

Is this design wrong?  That is, should the device-reset routine wait 
for currently executing commands to finish, or should it abort them, or 
what?

Or should the SCSI error handler abort the running command before 
invoking the eh_device_reset_handler callback?

For the record, and in case anyone is curious, here's the detailed
sequence of events during my test:

	sd issues a READ(10) command.  For whatever reason, the device
	goes nuts and the command times out.

	scsi_times_out() calls scsi_abort_command(), which queues an
	abort request.

	scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
	succeeds in aborting the READ.

	The READ command is retried (I didn't trace through the details
	of this).  The retry fails with a Unit Attention (SK=6, 
	ASC=0x29, Reset or Bus Device Reset Occurred).

	The READ command is retried a second time, and it times out 
	again.

	This time around, scsi_times_out() calls scsi_abort_command()
	unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
	still set).

	As a result, scsi_error_handler() calls scsi_unjam_host(), 
	which calls scsi_eh_get_sense().

	That routine calls scsi_request_sense(), which goes into
	scsi_send_eh_cmnd().

	The calls to shost->hostt->queuecommand() all fail, because the
	READ command is still running and usb-storage has a queue
	depth of 1.  The error messages produced by these failures are
	disconcerting but not dangerous.

	Since the REQUEST SENSE command was never issued, 
	scsi_eh_get_sense() returns 0.

	scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
	does essentially nothing because the SCSI_EH_CANCEL_CMD flag
	for the only command on work_q is clear.  
	scsi_eh_test_devices() returns 0 because check_list is empty
	and work_q isn't.

	scsi_unjam_host() then calls scsi_eh_ready_devs().  This
	routine ends up calling scsi_eh_bus_device_reset(), at which 
	point usb-storage deadlocks as described above.

(On Andreas's system, the first READ retry times out as opposed to the
second retry as on my computer.  I doubt this makes any difference.)

I can't tell if this is all working as intended or if it went off the 
tracks somewhere.

Thanks for any guidance.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]     ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-03-19 20:54       ` Dan Williams
  2014-03-19 21:35       ` James Bottomley
  1 sibling, 0 replies; 16+ messages in thread
From: Dan Williams @ 2014-03-19 20:54 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, James Bottomley, USB list, SCSI development list

On Wed, Mar 19, 2014 at 1:31 PM, Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org> wrote:
> On Wed, 19 Mar 2014, Andreas Reis wrote:
>
>> I've uploaded a dmesg with the new debugging patch to bugzilla:
>> https://bugzilla.kernel.org/attachment.cgi?id=130041
>
> Thanks.  I have now managed to reproduce many of the features of this
> problem on my own computer.
>
> James, I will need your help (or help from somebody who understands the
> SCSI error handler) to figure out how this problem should be fixed.
>
> Basically, usb-storage deadlocks when the SCSI error handler invokes
> the eh_device_reset_handler callback while a command is running.  The
> command has timed out and will never complete normally, because the
> device's firmware has crashed.  But usb-storage's device-reset routine
> waits for the current command to finish, which brings everything to a
> standstill.
>
> Is this design wrong?  That is, should the device-reset routine wait
> for currently executing commands to finish, or should it abort them, or
> what?
>
> Or should the SCSI error handler abort the running command before
> invoking the eh_device_reset_handler callback?
>
> For the record, and in case anyone is curious, here's the detailed
> sequence of events during my test:
>
>         sd issues a READ(10) command.  For whatever reason, the device
>         goes nuts and the command times out.
>
>         scsi_times_out() calls scsi_abort_command(), which queues an
>         abort request.
>
>         scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
>         succeeds in aborting the READ.
>
>         The READ command is retried (I didn't trace through the details
>         of this).  The retry fails with a Unit Attention (SK=6,
>         ASC=0x29, Reset or Bus Device Reset Occurred).
>
>         The READ command is retried a second time, and it times out
>         again.
>
>         This time around, scsi_times_out() calls scsi_abort_command()
>         unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
>         still set).
>
>         As a result, scsi_error_handler() calls scsi_unjam_host(),
>         which calls scsi_eh_get_sense().
>
>         That routine calls scsi_request_sense(), which goes into
>         scsi_send_eh_cmnd().
>
>         The calls to shost->hostt->queuecommand() all fail, because the
>         READ command is still running and usb-storage has a queue
>         depth of 1.  The error messages produced by these failures are
>         disconcerting but not dangerous.
>
>         Since the REQUEST SENSE command was never issued,
>         scsi_eh_get_sense() returns 0.
>
>         scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
>         does essentially nothing because the SCSI_EH_CANCEL_CMD flag
>         for the only command on work_q is clear.
>         scsi_eh_test_devices() returns 0 because check_list is empty
>         and work_q isn't.
>
>         scsi_unjam_host() then calls scsi_eh_ready_devs().  This
>         routine ends up calling scsi_eh_bus_device_reset(), at which
>         point usb-storage deadlocks as described above.
>
> (On Andreas's system, the first READ retry times out as opposed to the
> second retry as on my computer.  I doubt this makes any difference.)
>
> I can't tell if this is all working as intended or if it went off the
> tracks somewhere.
>
> Thanks for any guidance.
>

It seems to me we need an ->eh_strategy_handler() that understands the
usb transport and will escalate to device reset around the time
scsi_abort_command() fails.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]     ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  2014-03-19 20:54       ` Dan Williams
@ 2014-03-19 21:35       ` James Bottomley
       [not found]         ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: James Bottomley @ 2014-03-19 21:35 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Wed, 2014-03-19 at 16:31 -0400, Alan Stern wrote:
> On Wed, 19 Mar 2014, Andreas Reis wrote:
> 
> > I've uploaded a dmesg with the new debugging patch to bugzilla:
> > https://bugzilla.kernel.org/attachment.cgi?id=130041
> 
> Thanks.  I have now managed to reproduce many of the features of this
> problem on my own computer.
> 
> James, I will need your help (or help from somebody who understands the 
> SCSI error handler) to figure out how this problem should be fixed.
> 
> Basically, usb-storage deadlocks when the SCSI error handler invokes
> the eh_device_reset_handler callback while a command is running.  The
> command has timed out and will never complete normally, because the
> device's firmware has crashed.  But usb-storage's device-reset routine
> waits for the current command to finish, which brings everything to a
> standstill.
> 
> Is this design wrong?  That is, should the device-reset routine wait 
> for currently executing commands to finish, or should it abort them, or 
> what?

In some sense, yes, but not necessarily from the Point of View of USB.
What we assume in SCSI is that commands are forgettable, meaning there's
always some operation we can perform (be it abort or reset) that causes
the device to forget about all outstanding commands and reset its
internal state machine to a known good state.

The cardinal SCSI assumption is that after we've successfully done an
abort or reset on a command that it will never come back to us from the
device.

> Or should the SCSI error handler abort the running command before 
> invoking the eh_device_reset_handler callback?

So this is rooted in the "Abort can be a Problem" issue:  Abort
sometimes works well (and it's not very disruptive) but sometimes if the
device is having a problem in its command state machine, adding another
command (which is what the abort is) doesn't actually do anything, so we
need error escalation to reset.  We can't wait for the abort or other
commands to complete because they never will.  The reset is expected to
clear everything from the device (including the pending aborts).

> For the record, and in case anyone is curious, here's the detailed
> sequence of events during my test:
> 
> 	sd issues a READ(10) command.  For whatever reason, the device
> 	goes nuts and the command times out.
> 
> 	scsi_times_out() calls scsi_abort_command(), which queues an
> 	abort request.
> 
> 	scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
> 	succeeds in aborting the READ.
> 
> 	The READ command is retried (I didn't trace through the details
> 	of this).  The retry fails with a Unit Attention (SK=6, 
> 	ASC=0x29, Reset or Bus Device Reset Occurred).
> 
> 	The READ command is retried a second time, and it times out 
> 	again.
> 
> 	This time around, scsi_times_out() calls scsi_abort_command()
> 	unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
> 	still set).

>From the first time we sent the abort?  That sounds like a problem in
our state tracking.

> 	As a result, scsi_error_handler() calls scsi_unjam_host(), 
> 	which calls scsi_eh_get_sense().
> 
> 	That routine calls scsi_request_sense(), which goes into
> 	scsi_send_eh_cmnd().

I thought USB was autosense, so when it reports check condition, we
should already have sense ... or are we calling request_sense without
being sent a check condition status?

> 	The calls to shost->hostt->queuecommand() all fail, because the
> 	READ command is still running and usb-storage has a queue
> 	depth of 1.  The error messages produced by these failures are
> 	disconcerting but not dangerous.
> 
> 	Since the REQUEST SENSE command was never issued, 
> 	scsi_eh_get_sense() returns 0.
> 
> 	scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
> 	does essentially nothing because the SCSI_EH_CANCEL_CMD flag
> 	for the only command on work_q is clear.  
> 	scsi_eh_test_devices() returns 0 because check_list is empty
> 	and work_q isn't.
> 
> 	scsi_unjam_host() then calls scsi_eh_ready_devs().  This
> 	routine ends up calling scsi_eh_bus_device_reset(), at which 
> 	point usb-storage deadlocks as described above.

OK, so in the case where the command can never complete (because the fw
has crashed), what should be the process for resetting the device so it
can again function?

James

> (On Andreas's system, the first READ retry times out as opposed to the
> second retry as on my computer.  I doubt this makes any difference.)
> 
> I can't tell if this is all working as intended or if it went off the 
> tracks somewhere.
> 
> Thanks for any guidance.
> 
> Alan Stern
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]         ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
@ 2014-03-20 15:36           ` Alan Stern
  2014-03-20 16:09             ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2014-03-20 15:36 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Wed, 19 Mar 2014, James Bottomley wrote:

> > Basically, usb-storage deadlocks when the SCSI error handler invokes
> > the eh_device_reset_handler callback while a command is running.  The
> > command has timed out and will never complete normally, because the
> > device's firmware has crashed.  But usb-storage's device-reset routine
> > waits for the current command to finish, which brings everything to a
> > standstill.
> > 
> > Is this design wrong?  That is, should the device-reset routine wait 
> > for currently executing commands to finish, or should it abort them, or 
> > what?
> 
> In some sense, yes, but not necessarily from the Point of View of USB.
> What we assume in SCSI is that commands are forgettable, meaning there's
> always some operation we can perform (be it abort or reset) that causes
> the device to forget about all outstanding commands and reset its
> internal state machine to a known good state.
> 
> The cardinal SCSI assumption is that after we've successfully done an
> abort or reset on a command that it will never come back to us from the
> device.

The same assumption is present in usb-storage.

The difference appears to be that SCSI assumes a reset can be issued at 
any time, even while a command is running.  In usb-storage, that's true 
for a _bus_ reset but it's not true for a _device_ reset.

Perhaps this restriction on device resets could be lifted, but in real
life it probably won't make much difference.  The fact is, almost no
USB mass-storage devices implement device reset correctly, whereas most
of them do implement bus reset.  (Possibly related factoid: MS Windows
uses bus resets but doesn't use device resets in its USB mass-storage
driver.)

> > Or should the SCSI error handler abort the running command before 
> > invoking the eh_device_reset_handler callback?
> 
> So this is rooted in the "Abort can be a Problem" issue:  Abort
> sometimes works well (and it's not very disruptive) but sometimes if the
> device is having a problem in its command state machine, adding another
> command (which is what the abort is) doesn't actually do anything, so we
> need error escalation to reset.  We can't wait for the abort or other
> commands to complete because they never will.  The reset is expected to
> clear everything from the device (including the pending aborts).

With usb-storage, aborts usually work pretty well.  But sometimes they 
don't cure the underlying cause, so when the offending command is 
retried, it hangs up again.  Something like that seems to be happening 
here.

> > For the record, and in case anyone is curious, here's the detailed
> > sequence of events during my test:
> > 
> > 	sd issues a READ(10) command.  For whatever reason, the device
> > 	goes nuts and the command times out.
> > 
> > 	scsi_times_out() calls scsi_abort_command(), which queues an
> > 	abort request.
> > 
> > 	scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
> > 	succeeds in aborting the READ.
> > 
> > 	The READ command is retried (I didn't trace through the details
> > 	of this).  The retry fails with a Unit Attention (SK=6, 
> > 	ASC=0x29, Reset or Bus Device Reset Occurred).
> > 
> > 	The READ command is retried a second time, and it times out 
> > 	again.
> > 
> > 	This time around, scsi_times_out() calls scsi_abort_command()
> > 	unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
> > 	still set).
> 
> From the first time we sent the abort?

Yes.  As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED
gets cleared is in scsi_eh_finish_cmd(), which never got called.  
After the successful abort, scmd_eh_abort_handler() went directly into
its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" /
scsi_queue_insert() code paths.

>  That sounds like a problem in
> our state tracking.

Could well be.  How should I track it down further?  Or can you suggest
a patch just from this much information?

> > 	As a result, scsi_error_handler() calls scsi_unjam_host(), 
> > 	which calls scsi_eh_get_sense().
> > 
> > 	That routine calls scsi_request_sense(), which goes into
> > 	scsi_send_eh_cmnd().
> 
> I thought USB was autosense, so when it reports check condition, we
> should already have sense ... or are we calling request_sense without
> being sent a check condition status?

usb-storage does autosense.  As far as I can tell, the REQUEST SENSE is
issued because there is no sense data, which is because the READ
command is still running.  Is this another state-tracking bug?

> > 	The calls to shost->hostt->queuecommand() all fail, because the
> > 	READ command is still running and usb-storage has a queue
> > 	depth of 1.  The error messages produced by these failures are
> > 	disconcerting but not dangerous.
> > 
> > 	Since the REQUEST SENSE command was never issued, 
> > 	scsi_eh_get_sense() returns 0.
> > 
> > 	scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
> > 	does essentially nothing because the SCSI_EH_CANCEL_CMD flag
> > 	for the only command on work_q is clear.  
> > 	scsi_eh_test_devices() returns 0 because check_list is empty
> > 	and work_q isn't.
> > 
> > 	scsi_unjam_host() then calls scsi_eh_ready_devs().  This
> > 	routine ends up calling scsi_eh_bus_device_reset(), at which 
> > 	point usb-storage deadlocks as described above.
> 
> OK, so in the case where the command can never complete (because the fw
> has crashed), what should be the process for resetting the device so it
> can again function?

Simply abort the command.  usb-storage will automatically carry out a 
bus reset under those conditions, and that's the best we can do.

As you can see, the sequence above started out doing exactly this.  
The problems began when the command was retried.  That time it didn't 
get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag.  
Things got worse from there.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 15:36           ` Alan Stern
@ 2014-03-20 16:09             ` James Bottomley
  2014-03-20 16:34               ` Alan Stern
       [not found]               ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  0 siblings, 2 replies; 16+ messages in thread
From: James Bottomley @ 2014-03-20 16:09 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 11:36 -0400, Alan Stern wrote:
> On Wed, 19 Mar 2014, James Bottomley wrote:
> 
> > > Basically, usb-storage deadlocks when the SCSI error handler invokes
> > > the eh_device_reset_handler callback while a command is running.  The
> > > command has timed out and will never complete normally, because the
> > > device's firmware has crashed.  But usb-storage's device-reset routine
> > > waits for the current command to finish, which brings everything to a
> > > standstill.
> > > 
> > > Is this design wrong?  That is, should the device-reset routine wait 
> > > for currently executing commands to finish, or should it abort them, or 
> > > what?
> > 
> > In some sense, yes, but not necessarily from the Point of View of USB.
> > What we assume in SCSI is that commands are forgettable, meaning there's
> > always some operation we can perform (be it abort or reset) that causes
> > the device to forget about all outstanding commands and reset its
> > internal state machine to a known good state.
> > 
> > The cardinal SCSI assumption is that after we've successfully done an
> > abort or reset on a command that it will never come back to us from the
> > device.
> 
> The same assumption is present in usb-storage.
> 
> The difference appears to be that SCSI assumes a reset can be issued at 
> any time, even while a command is running.  In usb-storage, that's true 
> for a _bus_ reset but it's not true for a _device_ reset.
> 
> Perhaps this restriction on device resets could be lifted, but in real
> life it probably won't make much difference.  The fact is, almost no
> USB mass-storage devices implement device reset correctly, whereas most
> of them do implement bus reset.  (Possibly related factoid: MS Windows
> uses bus resets but doesn't use device resets in its USB mass-storage
> driver.)

Actually, you don't have to lift the restriction.  Just fail the device
reset if you can't issue it (so if there's any outstanding commands).
SCSI will move on to a bus reset which you can accept.

> > > Or should the SCSI error handler abort the running command before 
> > > invoking the eh_device_reset_handler callback?
> > 
> > So this is rooted in the "Abort can be a Problem" issue:  Abort
> > sometimes works well (and it's not very disruptive) but sometimes if the
> > device is having a problem in its command state machine, adding another
> > command (which is what the abort is) doesn't actually do anything, so we
> > need error escalation to reset.  We can't wait for the abort or other
> > commands to complete because they never will.  The reset is expected to
> > clear everything from the device (including the pending aborts).
> 
> With usb-storage, aborts usually work pretty well.  But sometimes they 
> don't cure the underlying cause, so when the offending command is 
> retried, it hangs up again.  Something like that seems to be happening 
> here.
> 
> > > For the record, and in case anyone is curious, here's the detailed
> > > sequence of events during my test:
> > > 
> > > 	sd issues a READ(10) command.  For whatever reason, the device
> > > 	goes nuts and the command times out.
> > > 
> > > 	scsi_times_out() calls scsi_abort_command(), which queues an
> > > 	abort request.
> > > 
> > > 	scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
> > > 	succeeds in aborting the READ.
> > > 
> > > 	The READ command is retried (I didn't trace through the details
> > > 	of this).  The retry fails with a Unit Attention (SK=6, 
> > > 	ASC=0x29, Reset or Bus Device Reset Occurred).
> > > 
> > > 	The READ command is retried a second time, and it times out 
> > > 	again.
> > > 
> > > 	This time around, scsi_times_out() calls scsi_abort_command()
> > > 	unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
> > > 	still set).
> > 
> > From the first time we sent the abort?
> 
> Yes.  As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED
> gets cleared is in scsi_eh_finish_cmd(), which never got called.  
> After the successful abort, scmd_eh_abort_handler() went directly into
> its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" /
> scsi_queue_insert() code paths.

OK, that's a bug ... I'll see if I can find it.

> >  That sounds like a problem in
> > our state tracking.
> 
> Could well be.  How should I track it down further?  Or can you suggest
> a patch just from this much information?
> 
> > > 	As a result, scsi_error_handler() calls scsi_unjam_host(), 
> > > 	which calls scsi_eh_get_sense().
> > > 
> > > 	That routine calls scsi_request_sense(), which goes into
> > > 	scsi_send_eh_cmnd().
> > 
> > I thought USB was autosense, so when it reports check condition, we
> > should already have sense ... or are we calling request_sense without
> > being sent a check condition status?
> 
> usb-storage does autosense.  As far as I can tell, the REQUEST SENSE is
> issued because there is no sense data, which is because the READ
> command is still running.  Is this another state-tracking bug?

Yes ... we should only issue a request sense if we got a check condition
return.  If we're doing it without that, something is wrong somewhere.

> > > 	The calls to shost->hostt->queuecommand() all fail, because the
> > > 	READ command is still running and usb-storage has a queue
> > > 	depth of 1.  The error messages produced by these failures are
> > > 	disconcerting but not dangerous.
> > > 
> > > 	Since the REQUEST SENSE command was never issued, 
> > > 	scsi_eh_get_sense() returns 0.
> > > 
> > > 	scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
> > > 	does essentially nothing because the SCSI_EH_CANCEL_CMD flag
> > > 	for the only command on work_q is clear.  
> > > 	scsi_eh_test_devices() returns 0 because check_list is empty
> > > 	and work_q isn't.
> > > 
> > > 	scsi_unjam_host() then calls scsi_eh_ready_devs().  This
> > > 	routine ends up calling scsi_eh_bus_device_reset(), at which 
> > > 	point usb-storage deadlocks as described above.
> > 
> > OK, so in the case where the command can never complete (because the fw
> > has crashed), what should be the process for resetting the device so it
> > can again function?
> 
> Simply abort the command.  usb-storage will automatically carry out a 
> bus reset under those conditions, and that's the best we can do.
> 
> As you can see, the sequence above started out doing exactly this.  
> The problems began when the command was retried.  That time it didn't 
> get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag.  
> Things got worse from there.

OK, so I think we have three things to do

     1. Investigate SCSI and fix it's abort state problem that's causing
        it not to send the abort second time around
     2. Fix usb-storage to fail a reset it can't do (i.e. device reset
        with outstanding commands)
     3. Find out why we're sending a spurious request sense.

I can look at 1 and 3 if you want to take 2.

James




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 16:09             ` James Bottomley
@ 2014-03-20 16:34               ` Alan Stern
       [not found]                 ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  2014-03-20 18:22                 ` James Bottomley
       [not found]               ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  1 sibling, 2 replies; 16+ messages in thread
From: Alan Stern @ 2014-03-20 16:34 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> OK, so I think we have three things to do
> 
>      1. Investigate SCSI and fix it's abort state problem that's causing
>         it not to send the abort second time around
>      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
>         with outstanding commands)
>      3. Find out why we're sending a spurious request sense.
> 
> I can look at 1 and 3 if you want to take 2.

It's a deal!  Thanks for your help.

Alan Stern


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]                 ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-03-20 17:57                   ` James Bottomley
  2014-03-20 19:49                     ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2014-03-20 17:57 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > OK, so I think we have three things to do
> > 
> >      1. Investigate SCSI and fix it's abort state problem that's causing
> >         it not to send the abort second time around
> >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> >         with outstanding commands)
> >      3. Find out why we're sending a spurious request sense.
> > 
> > I can look at 1 and 3 if you want to take 2.
> 
> It's a deal!  Thanks for your help.

OK, I think this is the fix for 1, if you could try it out.

Thanks,

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..c52bfb2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work)
 						    "scmd %p retry "
 						    "aborted command\n", scmd));
 				scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
-				return;
+				goto out;
 			} else {
 				SCSI_LOG_ERROR_RECOVERY(3,
 					scmd_printk(KERN_WARNING, scmd,
 						    "scmd %p finish "
 						    "aborted command\n", scmd));
 				scsi_finish_command(scmd);
-				return;
+				goto out;
 			}
 		} else {
 			SCSI_LOG_ERROR_RECOVERY(3,
@@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work)
 		}
 	}
 
+	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
+
 	if (!scsi_eh_scmd_add(scmd, 0)) {
 		SCSI_LOG_ERROR_RECOVERY(3,
 			scmd_printk(KERN_WARNING, scmd,
@@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work)
 		scmd->result |= DID_TIME_OUT << 16;
 		scsi_finish_command(scmd);
 	}
+	return;
+ out:
+	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
+	return;
 }
 
 /**



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 16:34               ` Alan Stern
       [not found]                 ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-03-20 18:22                 ` James Bottomley
  2014-03-20 19:48                   ` Alan Stern
  1 sibling, 1 reply; 16+ messages in thread
From: James Bottomley @ 2014-03-20 18:22 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > OK, so I think we have three things to do
> > 
> >      1. Investigate SCSI and fix it's abort state problem that's causing
> >         it not to send the abort second time around
> >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> >         with outstanding commands)
> >      3. Find out why we're sending a spurious request sense.
> > 
> > I can look at 1 and 3 if you want to take 2.
> 
> It's a deal!  Thanks for your help.

And this looks to be 3: a bug in the way we attach sense data to
commands (we shouldn't look for attached sense if the device error code
didn't imply there would be any).

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..d020149 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q,
 					     __func__));
 			break;
 		}
+		if (status_byte(scmd->result) != CHECK_CONDITION)
+			/*
+			 * don't request sense if there's no check condition
+			 * status because the error we're processing isn't one
+			 * that has a sense code (and some devices get
+			 * confused by sense requests out of the blue)
+			 */
+			continue;
+
 		SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
 						  "%s: requesting sense\n",
 						  current->comm));




^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 18:22                 ` James Bottomley
@ 2014-03-20 19:48                   ` Alan Stern
       [not found]                     ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2014-03-20 19:48 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> > On Thu, 20 Mar 2014, James Bottomley wrote:
> > 
> > > OK, so I think we have three things to do
> > > 
> > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > >         it not to send the abort second time around
> > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > >         with outstanding commands)
> > >      3. Find out why we're sending a spurious request sense.
> > > 
> > > I can look at 1 and 3 if you want to take 2.
> > 
> > It's a deal!  Thanks for your help.
> 
> And this looks to be 3: a bug in the way we attach sense data to
> commands (we shouldn't look for attached sense if the device error code
> didn't imply there would be any).
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..d020149 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q,
>  					     __func__));
>  			break;
>  		}
> +		if (status_byte(scmd->result) != CHECK_CONDITION)
> +			/*
> +			 * don't request sense if there's no check condition
> +			 * status because the error we're processing isn't one
> +			 * that has a sense code (and some devices get
> +			 * confused by sense requests out of the blue)
> +			 */
> +			continue;
> +
>  		SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
>  						  "%s: requesting sense\n",
>  						  current->comm));

I tried this patch first, because fixing the earlier bug would mask
this one.

The patch sort of worked.  But the first time I tried it, it failed in
a rather amusing way.  While the second retry was running and hung,
scmd->result _was_ equal to CHECK_CONDITION -- because that was the
result from the _first_ retry, and it had never gotten cleared!

scmd->result needs to be set to 0 before the queuecommand callback is
invoked.  I ended up adding this to your patch, and then it worked
perfectly:


Index: usb-3.14/drivers/scsi/scsi_error.c
===================================================================
--- usb-3.14.orig/drivers/scsi/scsi_error.c
+++ usb-3.14/drivers/scsi/scsi_error.c
@@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd
 	memset(scmd->cmnd, 0, BLK_MAX_CDB);
 	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
 	scmd->request->next_rq = NULL;
+	scmd->result = 0;
 
 	if (sense_bytes) {
 		scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE,
Index: usb-3.14/drivers/scsi/scsi_lib.c
===================================================================
--- usb-3.14.orig/drivers/scsi/scsi_lib.c
+++ usb-3.14/drivers/scsi/scsi_lib.c
@@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s
 	 * lock such that the kblockd_schedule_work() call happens
 	 * before blk_cleanup_queue() finishes.
 	 */
+	cmd->result = 0;
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	kblockd_schedule_work(q, &device->requeue_work);


Maybe only the second one is necessary, but it seemed best to be
consistent.

Alan Stern


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 17:57                   ` James Bottomley
@ 2014-03-20 19:49                     ` Alan Stern
  2014-03-28 16:52                       ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2014-03-20 19:49 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> > On Thu, 20 Mar 2014, James Bottomley wrote:
> > 
> > > OK, so I think we have three things to do
> > > 
> > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > >         it not to send the abort second time around
> > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > >         with outstanding commands)
> > >      3. Find out why we're sending a spurious request sense.
> > > 
> > > I can look at 1 and 3 if you want to take 2.
> > 
> > It's a deal!  Thanks for your help.
> 
> OK, I think this is the fix for 1, if you could try it out.
> 
> Thanks,
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..c52bfb2 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work)
>  						    "scmd %p retry "
>  						    "aborted command\n", scmd));
>  				scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
> -				return;
> +				goto out;
>  			} else {
>  				SCSI_LOG_ERROR_RECOVERY(3,
>  					scmd_printk(KERN_WARNING, scmd,
>  						    "scmd %p finish "
>  						    "aborted command\n", scmd));
>  				scsi_finish_command(scmd);
> -				return;
> +				goto out;
>  			}
>  		} else {
>  			SCSI_LOG_ERROR_RECOVERY(3,
> @@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work)
>  		}
>  	}
>  
> +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> +
>  	if (!scsi_eh_scmd_add(scmd, 0)) {
>  		SCSI_LOG_ERROR_RECOVERY(3,
>  			scmd_printk(KERN_WARNING, scmd,
> @@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work)
>  		scmd->result |= DID_TIME_OUT << 16;
>  		scsi_finish_command(scmd);
>  	}
> +	return;
> + out:
> +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> +	return;
>  }
>  
>  /**

This worked the first time.  :-)

But I wonder, is it safe to access scmd after calling 
scsi_finish_command()?

Alan Stern


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]               ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
@ 2014-03-20 19:59                 ` Alan Stern
  2014-03-20 20:27                   ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2014-03-20 19:59 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> OK, so I think we have three things to do
> 
>      1. Investigate SCSI and fix it's abort state problem that's causing
>         it not to send the abort second time around
>      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
>         with outstanding commands)
>      3. Find out why we're sending a spurious request sense.
> 
> I can look at 1 and 3 if you want to take 2.

I wrote a patch for 2.  It turned out not to help much, because I was
wrong -- while a command is running, usb-storage won't do a bus reset
either.  The lock occurs in a separate part of the code, where I wasn't
looking earlier.  When I tested the patch, the device reset failed
immediately (as desired) and then the attempted bus reset hung.  
Overall, not much of an improvement...

In fact, this restriction on bus resets is important and necessary.  
USB devices can be composite, meaning they can contain several
functions all packed into a single device.  If a driver for one of the
other functions needs to perform a bus reset, we don't want it to
interrupt an ongoing mass-storage command.  Thus, it is important for
the reset routine to wait for an outstanding command to complete.

Anyway, this looks to be a moot point.  With your fix for 1, neither 
sort of reset was necessary.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]                     ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2014-03-20 20:26                       ` James Bottomley
       [not found]                         ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2014-03-20 20:26 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 15:48 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> > > On Thu, 20 Mar 2014, James Bottomley wrote:
> > > 
> > > > OK, so I think we have three things to do
> > > > 
> > > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > > >         it not to send the abort second time around
> > > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > > >         with outstanding commands)
> > > >      3. Find out why we're sending a spurious request sense.
> > > > 
> > > > I can look at 1 and 3 if you want to take 2.
> > > 
> > > It's a deal!  Thanks for your help.
> > 
> > And this looks to be 3: a bug in the way we attach sense data to
> > commands (we shouldn't look for attached sense if the device error code
> > didn't imply there would be any).
> > 
> > James
> > 
> > ---
> > 
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 771c16b..d020149 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q,
> >  					     __func__));
> >  			break;
> >  		}
> > +		if (status_byte(scmd->result) != CHECK_CONDITION)
> > +			/*
> > +			 * don't request sense if there's no check condition
> > +			 * status because the error we're processing isn't one
> > +			 * that has a sense code (and some devices get
> > +			 * confused by sense requests out of the blue)
> > +			 */
> > +			continue;
> > +
> >  		SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
> >  						  "%s: requesting sense\n",
> >  						  current->comm));
> 
> I tried this patch first, because fixing the earlier bug would mask
> this one.
> 
> The patch sort of worked.  But the first time I tried it, it failed in
> a rather amusing way.  While the second retry was running and hung,
> scmd->result _was_ equal to CHECK_CONDITION -- because that was the
> result from the _first_ retry, and it had never gotten cleared!
> 
> scmd->result needs to be set to 0 before the queuecommand callback is
> invoked.  I ended up adding this to your patch, and then it worked
> perfectly:

Wow, the stale data bugs are just crawling out of the code.  Thanks for
checking.

> 
> Index: usb-3.14/drivers/scsi/scsi_error.c
> ===================================================================
> --- usb-3.14.orig/drivers/scsi/scsi_error.c
> +++ usb-3.14/drivers/scsi/scsi_error.c
> @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd
>  	memset(scmd->cmnd, 0, BLK_MAX_CDB);
>  	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
>  	scmd->request->next_rq = NULL;
> +	scmd->result = 0;
>  
>  	if (sense_bytes) {
>  		scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE,
> Index: usb-3.14/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-3.14.orig/drivers/scsi/scsi_lib.c
> +++ usb-3.14/drivers/scsi/scsi_lib.c
> @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s
>  	 * lock such that the kblockd_schedule_work() call happens
>  	 * before blk_cleanup_queue() finishes.
>  	 */
> +	cmd->result = 0;
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	blk_requeue_request(q, cmd->request);
>  	kblockd_schedule_work(q, &device->requeue_work);
> 
> 
> Maybe only the second one is necessary, but it seemed best to be
> consistent.

Thanks, I'll add this one to the list as well and see if we can get it
into the merge window.  I take it you'd like a cc to stable on these
three?

James



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 19:59                 ` Alan Stern
@ 2014-03-20 20:27                   ` James Bottomley
  2014-03-20 20:43                     ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: James Bottomley @ 2014-03-20 20:27 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 15:59 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > OK, so I think we have three things to do
> > 
> >      1. Investigate SCSI and fix it's abort state problem that's causing
> >         it not to send the abort second time around
> >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> >         with outstanding commands)
> >      3. Find out why we're sending a spurious request sense.
> > 
> > I can look at 1 and 3 if you want to take 2.
> 
> I wrote a patch for 2.  It turned out not to help much, because I was
> wrong -- while a command is running, usb-storage won't do a bus reset
> either.  The lock occurs in a separate part of the code, where I wasn't
> looking earlier.  When I tested the patch, the device reset failed
> immediately (as desired) and then the attempted bus reset hung.  
> Overall, not much of an improvement...

Hm, yes, sorry, after the bus reset fails, we'll just take the device
offline.

> In fact, this restriction on bus resets is important and necessary.  
> USB devices can be composite, meaning they can contain several
> functions all packed into a single device.  If a driver for one of the
> other functions needs to perform a bus reset, we don't want it to
> interrupt an ongoing mass-storage command.  Thus, it is important for
> the reset routine to wait for an outstanding command to complete.
> 
> Anyway, this looks to be a moot point.  With your fix for 1, neither 
> sort of reset was necessary.

Well as long as the tiny hammer works ... but there are going to be
devices that need the bigger one one day.

James




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
       [not found]                         ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
@ 2014-03-20 20:42                           ` Alan Stern
  0 siblings, 0 replies; 16+ messages in thread
From: Alan Stern @ 2014-03-20 20:42 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> > I tried this patch first, because fixing the earlier bug would mask
> > this one.
> > 
> > The patch sort of worked.  But the first time I tried it, it failed in
> > a rather amusing way.  While the second retry was running and hung,
> > scmd->result _was_ equal to CHECK_CONDITION -- because that was the
> > result from the _first_ retry, and it had never gotten cleared!
> > 
> > scmd->result needs to be set to 0 before the queuecommand callback is
> > invoked.  I ended up adding this to your patch, and then it worked
> > perfectly:
> 
> Wow, the stale data bugs are just crawling out of the code.  Thanks for
> checking.
> 
> > 
> > Index: usb-3.14/drivers/scsi/scsi_error.c
> > ===================================================================
> > --- usb-3.14.orig/drivers/scsi/scsi_error.c
> > +++ usb-3.14/drivers/scsi/scsi_error.c
> > @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd
> >  	memset(scmd->cmnd, 0, BLK_MAX_CDB);
> >  	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
> >  	scmd->request->next_rq = NULL;
> > +	scmd->result = 0;
> >  
> >  	if (sense_bytes) {
> >  		scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE,
> > Index: usb-3.14/drivers/scsi/scsi_lib.c
> > ===================================================================
> > --- usb-3.14.orig/drivers/scsi/scsi_lib.c
> > +++ usb-3.14/drivers/scsi/scsi_lib.c
> > @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s
> >  	 * lock such that the kblockd_schedule_work() call happens
> >  	 * before blk_cleanup_queue() finishes.
> >  	 */
> > +	cmd->result = 0;
> >  	spin_lock_irqsave(q->queue_lock, flags);
> >  	blk_requeue_request(q, cmd->request);
> >  	kblockd_schedule_work(q, &device->requeue_work);
> > 
> > 
> > Maybe only the second one is necessary, but it seemed best to be
> > consistent.
> 
> Thanks, I'll add this one to the list as well and see if we can get it
> into the merge window.  I take it you'd like a cc to stable on these
> three?

Yes, please.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 20:27                   ` James Bottomley
@ 2014-03-20 20:43                     ` Alan Stern
  0 siblings, 0 replies; 16+ messages in thread
From: Alan Stern @ 2014-03-20 20:43 UTC (permalink / raw)
  To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 20 Mar 2014, James Bottomley wrote:

> On Thu, 2014-03-20 at 15:59 -0400, Alan Stern wrote:
> > On Thu, 20 Mar 2014, James Bottomley wrote:
> > 
> > > OK, so I think we have three things to do
> > > 
> > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > >         it not to send the abort second time around
> > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > >         with outstanding commands)
> > >      3. Find out why we're sending a spurious request sense.
> > > 
> > > I can look at 1 and 3 if you want to take 2.
> > 
> > I wrote a patch for 2.  It turned out not to help much, because I was
> > wrong -- while a command is running, usb-storage won't do a bus reset
> > either.  The lock occurs in a separate part of the code, where I wasn't
> > looking earlier.  When I tested the patch, the device reset failed
> > immediately (as desired) and then the attempted bus reset hung.  
> > Overall, not much of an improvement...
> 
> Hm, yes, sorry, after the bus reset fails, we'll just take the device
> offline.
> 
> > In fact, this restriction on bus resets is important and necessary.  
> > USB devices can be composite, meaning they can contain several
> > functions all packed into a single device.  If a driver for one of the
> > other functions needs to perform a bus reset, we don't want it to
> > interrupt an ongoing mass-storage command.  Thus, it is important for
> > the reset routine to wait for an outstanding command to complete.
> > 
> > Anyway, this looks to be a moot point.  With your fix for 1, neither 
> > sort of reset was necessary.
> 
> Well as long as the tiny hammer works ... but there are going to be
> devices that need the bigger one one day.

Maybe so.  I'll worry about them when they appear.  :-)

Alan Stern


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Deadlock in usb-storage error handling
  2014-03-20 19:49                     ` Alan Stern
@ 2014-03-28 16:52                       ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2014-03-28 16:52 UTC (permalink / raw)
  To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list

On Thu, 2014-03-20 at 15:49 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> > > On Thu, 20 Mar 2014, James Bottomley wrote:
> > > 
> > > > OK, so I think we have three things to do
> > > > 
> > > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > > >         it not to send the abort second time around
> > > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > > >         with outstanding commands)
> > > >      3. Find out why we're sending a spurious request sense.
> > > > 
> > > > I can look at 1 and 3 if you want to take 2.
> > > 
> > > It's a deal!  Thanks for your help.
> > 
> > OK, I think this is the fix for 1, if you could try it out.
> > 
> > Thanks,
> > 
> > James
> > 
> > ---
> > 
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 771c16b..c52bfb2 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work)
> >  						    "scmd %p retry "
> >  						    "aborted command\n", scmd));
> >  				scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
> > -				return;
> > +				goto out;
> >  			} else {
> >  				SCSI_LOG_ERROR_RECOVERY(3,
> >  					scmd_printk(KERN_WARNING, scmd,
> >  						    "scmd %p finish "
> >  						    "aborted command\n", scmd));
> >  				scsi_finish_command(scmd);
> > -				return;
> > +				goto out;
> >  			}
> >  		} else {
> >  			SCSI_LOG_ERROR_RECOVERY(3,
> > @@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work)
> >  		}
> >  	}
> >  
> > +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> > +
> >  	if (!scsi_eh_scmd_add(scmd, 0)) {
> >  		SCSI_LOG_ERROR_RECOVERY(3,
> >  			scmd_printk(KERN_WARNING, scmd,
> > @@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work)
> >  		scmd->result |= DID_TIME_OUT << 16;
> >  		scsi_finish_command(scmd);
> >  	}
> > +	return;
> > + out:
> > +	scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
> > +	return;
> >  }
> >  
> >  /**
> 
> This worked the first time.  :-)
> 
> But I wonder, is it safe to access scmd after calling 
> scsi_finish_command()?

Agree, I've redone the patch integrated into the try_to_abort call
instead.  Will post shortly.

James




^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-03-28 16:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <53298181.9020206@gmail.com>
     [not found] ` <53298181.9020206-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-03-19 20:31   ` Deadlock in usb-storage error handling Alan Stern
     [not found]     ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-19 20:54       ` Dan Williams
2014-03-19 21:35       ` James Bottomley
     [not found]         ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 15:36           ` Alan Stern
2014-03-20 16:09             ` James Bottomley
2014-03-20 16:34               ` Alan Stern
     [not found]                 ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-20 17:57                   ` James Bottomley
2014-03-20 19:49                     ` Alan Stern
2014-03-28 16:52                       ` James Bottomley
2014-03-20 18:22                 ` James Bottomley
2014-03-20 19:48                   ` Alan Stern
     [not found]                     ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-20 20:26                       ` James Bottomley
     [not found]                         ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 20:42                           ` Alan Stern
     [not found]               ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 19:59                 ` Alan Stern
2014-03-20 20:27                   ` James Bottomley
2014-03-20 20:43                     ` Alan Stern

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.