All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
	SCSI development list <linux-scsi@vger.kernel.org>,
	USB list <linux-usb@vger.kernel.org>
Subject: Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
Date: Tue, 01 Apr 2014 17:37:10 +0200	[thread overview]
Message-ID: <533ADD26.1030300@suse.de> (raw)
In-Reply-To: <Pine.LNX.4.44L0.1403311816400.5031-100000@netrider.rowland.org>

On 04/01/2014 12:29 AM, Alan Stern wrote:
> On Mon, 31 Mar 2014, Hannes Reinecke wrote:
> 
>>>> Ah. Correct. But that's due to the first patch being incorrect.
>>>> Cf my response to the original first patch.
>>>
>>> See my response to your response.  :-)
>>>
>> Okay, So I probably should refrain from issueing a response to
>> your response to my response lest infinite recursion happens :-)
> 
> Indeed.
> 
>>>>> 	What should happen if some other pathway manages to call
>>>>> 	scsi_try_to_abort_cmd() while scmd->abort_work is still
>>>>> 	sitting on the work queue?  The command would be aborted
>>>>> 	and the flag would be cleared, but the queued work would
>>>>> 	remain.  Can this ever happen?
>>>>>
>>>> Not that I could see.
>>>> A command abort is only ever triggered by the request timeout from
>>>> the block layer. And the timer is _not_ rearmed once the timeout
>>>> function (here: scsi_times_out()) is called.
>>>> Hence I fail to see how it can be called concurrently.
>>>
>>> scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
>>> command sent by the error handler itself times out.  I haven't traced 
>>> through all the different paths to make sure none of them can run 
>>> concurrently.  But I'm willing to take your word for it.
>>>
>> Yes, but that's not calling scsi_abort_command(), but rather invokes
>> scsi_abort_eh_cmnd().
> 
> Sure.  But either way, we end up in scsi_try_to_abort_cmd(), which
> calls the LLDD's abort handler.  Thus leading to the possibility of
> aborting the same command more than once.
> 
>>>> The original idea was this:
>>>>
>>>> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
>>>> Point is, any command abort is only ever send for a timed-out
>>>> command. And the main problem for a timed-out command is that we
>>>> simply _do not_ know what happened for that command. So _if_ a
>>>> command timed out, _and_ we've send an abort, _and_ the command
>>>> times out _again_ we'll be running into an endless loop between
>>>> timeout and aborting, and never returning the command at all.
>>>>
>>>> So to prevent this we should set a marker on that command telling it
>>>> to _not_ try to abort the command again.
>>>
>>> I disagree.  We _have_ to abort the command again -- how else can we
>>> stop a running command?  To prevent the loop you described, we should
>>> avoid _retrying_ the command after it is aborted the second time.
>>>
>> The actual question is whether it's worth aborting the same command
>> a second time.
>> In principle any reset (like LUN reset etc) should clear the
>> command, too.
>> And the EH abort functionality is geared around this.
>> If, for some reason, the transport layer / device driver
>> requires a command abort to be send then sure, we need
>> to accommodate for that.
> 
> As James mentioned, with USB a reset does not abort a running command.  
> Instead it waits for the command to finish.  (However, this could be
> changed in usb-storage, if required.)
> 
>> As said, yes, in principle you are right. We should be aborting the
>> command a second time, _and then_ starting the escalation.
>>
>> So if the above reasoning is okay then this patch should be doing
>> the trick:
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index 771c16b..0e72374 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>>                 /*
>>                  * Retry after abort failed, escalate to next level.
>>                  */
>> +               scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>>                 SCSI_LOG_ERROR_RECOVERY(3,
>>                         scmd_printk(KERN_INFO, scmd,
>>                                     "scmd %p previous abort
>> failed\n", scmd));
>>
>> (Beware of line
>> breaks)
>>
>> Can you test with it?
> 
> I don't understand.  This doesn't solve the fundamental problem (namely 
> that you escalate before aborting a running command).  All it does is 
> clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
> 
Which was precisely the point :-)

Hmm. The comment might've been clearer.

What this patch is _supposed_ to be doing is that it'll clear the
SCSI_EH_ABORT_SCHEDULED flag it it has been set.
Which means this will be the second time scsi_abort_command() has
been called for the same command.
IE the first abort went out, did its thing, but now the same command
has timed out again.

So this flag gets cleared, and scsi_abort_command() returns FAILED,
and _no_ asynchronous abort is being scheduled.
scsi_times_out() will then proceed to call scsi_eh_scmd_add().
But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
the SCSI_EH_CANCEL_CMD flag will continue to be set,
and the command will be aborted with the main SCSI EH routine.

It looks to me as if it should do what you desire, namely abort the
command asynchronously the first time, and invoking the SCSI EH the
second time.

Am I wrong?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2014-04-01 15:37 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-28 17:47 [PATCH 0/3] Fix USB deadlock caused by SCSI error handling James Bottomley
2014-03-28 17:49 ` [PATCH 1/3] [SCSI] Fix abort state memory problem James Bottomley
2014-03-31  6:58   ` Hannes Reinecke
2014-03-31 13:06     ` Alan Stern
2014-03-28 17:50 ` [PATCH 2/3] [SCSI] Fix spurious request sense in error handling James Bottomley
2014-03-31  6:59   ` Hannes Reinecke
2014-03-28 17:51 ` [PATCH 3/3] [SCSI] Fix command result state propagation James Bottomley
2014-03-31  7:00   ` Hannes Reinecke
2014-03-28 19:29 ` [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Alan Stern
2014-03-31  7:22   ` Hannes Reinecke
2014-03-31 13:33     ` Alan Stern
2014-03-31 14:37       ` Hannes Reinecke
     [not found]         ` <53397DAE.9010801-l3A5Bk7waGM@public.gmane.org>
2014-03-31 15:03           ` James Bottomley
2014-03-31 22:41             ` Alan Stern
     [not found]             ` <1396278224.3152.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-04-01  6:14               ` Hannes Reinecke
2014-03-31 22:29         ` Alan Stern
2014-04-01 15:37           ` Hannes Reinecke [this message]
     [not found]             ` <533ADD26.1030300-l3A5Bk7waGM@public.gmane.org>
2014-04-01 21:28               ` Alan Stern
     [not found]                 ` <Pine.LNX.4.44L0.1404011718350.7652-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-04-02  6:03                   ` Hannes Reinecke
2014-04-07 15:26                     ` Alan Stern
     [not found]                       ` <Pine.LNX.4.44L0.1404071125220.20747-100000-pYrvlCTfrz9XsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-04-09 17:42                         ` Hannes Reinecke
     [not found]                           ` <53458680.4090500-l3A5Bk7waGM@public.gmane.org>
2014-04-09 18:02                             ` Alan Stern
2014-04-10 10:58                               ` Andreas Reis
2014-04-10 11:37                                 ` Hannes Reinecke
2014-04-10 12:26                                   ` Andreas Reis
2014-04-10 12:29                                     ` Hannes Reinecke
2014-04-10 15:31                                   ` Alan Stern
2014-04-10 17:52                                     ` Hannes Reinecke
     [not found]                                       ` <5346DA43.4010603-l3A5Bk7waGM@public.gmane.org>
2014-04-10 20:36                                         ` James Bottomley
2014-04-11  5:52                                           ` Hannes Reinecke
     [not found]                                   ` <53468297.1040909-l3A5Bk7waGM@public.gmane.org>
2014-04-11 19:08                                     ` Andreas Reis

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=533ADD26.1030300@suse.de \
    --to=hare@suse.de \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linux-usb@vger.kernel.org \
    --cc=stern@rowland.harvard.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.