From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hannes Reinecke Subject: Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling Date: Mon, 31 Mar 2014 09:22:31 +0200 Message-ID: <533917B7.4010808@suse.de> References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from cantor2.suse.de ([195.135.220.15]:42212 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751039AbaCaHWc (ORCPT ); Mon, 31 Mar 2014 03:22:32 -0400 In-Reply-To: Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Alan Stern , James Bottomley Cc: SCSI development list , USB list On 03/28/2014 08:29 PM, Alan Stern wrote: > On Fri, 28 Mar 2014, James Bottomley wrote: >=20 >> This is a set of three patches we agreed to a while ago to eliminate= a >> USB deadlock. I did rewrite the first patch, if it could be reviewe= d >> and tested. >=20 > I tested all three patches under the same conditions as before, and=20 > they all worked correctly. >=20 > In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't > entirely clear. This boils down to two questions, which I don't=20 > know the answers to: >=20 > What should happen in scmd_eh_abort_handler() if > scsi_host_eh_past_deadline() returns true and thereby > prevents scsi_try_to_abort_cmd() from being called? > The flag wouldn't get cleared then. >=20 Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. > What should happen if some other pathway manages to call > scsi_try_to_abort_cmd() while scmd->abort_work is still > sitting on the work queue? The command would be aborted > and the flag would be cleared, but the queued work would > remain. Can this ever happen? >=20 Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. > Maybe scmd_eh_abort_handler() should check the flag before doing > anything. Is there any sort of sychronization to prevent the same > incarnation of a command from being aborted twice (or by two differen= t > threads at the same time)? If there is, it isn't obvious. >=20 See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call ->done() on a timed out command when scsi_eh_abort_handler() is still pending. > (Also, what's going on at the start of scsi_abort_command()? Contrar= y > to what one might expect, the first part of the function _cancels_ a > scheduled abort. And it does so without clearing the > SCSI_EH_ABORT_SCHEDULED flag.) >=20 The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for: - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED' - abort _succeeds_ - The same command times out a second time, notifies that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call scsi_eh_abort_command() but rather escalates directly. _However_, I do feel that we've gotten the wrong track with all of this. In you original mail you stated: > Basically, usb-storage deadlocks when the SCSI error handler > invokes the eh_device_reset_handler callback while a command > is running. The command has timed out and will never complete > normally, because the device's firmware has crashed. But > usb-storage's device-reset routine waits for the current command > to finish, which brings everything to a standstill. Question now to you as the USB god: A command abort is only _ever_ send after a command timeout. And the main idea of the command abort is to remove any context the LLDD or the target might have. So by the time the command abort returns SUCCESS we _expect_ the firmware to have cleared that command. If for some reason the firmware isn't capable of doing so, it should return FAILED. So: - Has the command timeout fired? - If so, why hasn't it returned FAILED? - If it had returned SUCCESS, why did the device_reset_handler wait for the command to complete? Cheers, Hannes --=20 Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N=FCrnberg GF: J. Hawn, J. Guild, F. Imend=F6rffer, HRB 16746 (AG N=FCrnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html