From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Linton
Subject: Re: [PATCH] scsi: Allow error handling timeout to be specified
Date: Mon, 13 May 2013 09:40:14 -0500
Message-ID: <5190FB4E.4000900@tributary.com>
References: <1368189791.3319.31.camel@localhost.localdomain>
 <1368194460.3319.40.camel@localhost.localdomain>
 <518D55FA.4080302@suse.de>
 <51907E45.7010409@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from relay.ihostexchange.net ([66.46.182.57]:37450 "EHLO
 relay.ihostexchange.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1750992Ab3EMOkX (ORCPT );
 Mon, 13 May 2013 10:40:23 -0400
In-Reply-To: <51907E45.7010409@suse.de>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Hannes Reinecke
Cc: Baruch Even, emilne, "Martin K. Petersen", linux-scsi, michaelc

On 5/13/2013 12:46 AM, Hannes Reinecke wrote:
> True. But and the end of the day, we _do_ want to recover the failed LUN.
> If we were to disable that faulty LUN and continue running with the others
> we won't have a chance of _ever_ recovering that one LUN.

I don't buy this. Especially for FC devices, the vast majority of errors
I see are related to zoning, SFP and cabling problems. Once one of those
happens you tend to get a lot of shotgun debugging, which injects all
kinds of further errors. None of these errors are fixed by the Linux
error recovery paths.

That said, if the admin fixes something, for FC/SAS (and potentially
others) you _WILL_ get notification that the device is online again.

> SET when the link is down). So we basically _have_ to escalate it to the
> next level. Even though that will mean to stop I/O to other, hitherto
> unaffected instances.

And a single failure turns into performance bubbles and further errors
on other devices.
Particularly if the functional devices are stateful, and the error
recovery mechanism isn't sufficiently intelligent about that state (see
tape drives). Think about what happens when a marginal SFP on a target
causes a device to repeatedly drop off and reappear at some random point
in the future.

Anyway, it is possible to make a determination about the topology and
make decisions about the likelihood of any given portion being at fault.
For example, if one LUN on a target has failed and the remainder
continue to work, then it's unlikely that, if abort and LUN reset fail,
anything higher up in the stack is going to succeed.

I feel pretty strongly that at that point you're better off providing
good diagnostics about the failure and expecting user interaction rather
than muddying the waters by causing other device interruptions. If the
user tries everything and determines that an HBA reset is the right
choice, provide that option; don't do it for them.

If every device attached to the HBA fails, then resetting the HBA is a
valid choice, not before. Same for I_T.
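
To make the topology argument concrete, here is a rough sketch of the
policy in Python. It is not kernel code and does not reflect how the
SCSI midlayer actually escalates; the names (Level, highest_level, the
topology dict) are made up for illustration. It only assumes we know
which (target, lun) pairs sit behind one HBA and which of them are
currently failing.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical recovery levels, ordered from least to most disruptive."""
    ABORT = 1
    LUN_RESET = 2
    TARGET_RESET = 3  # I_T nexus level
    HOST_RESET = 4    # HBA reset


def highest_level(device, failed, topology):
    """Return the highest level worth escalating to automatically.

    device:   (target, lun) tuple for the device being recovered
    failed:   set of (target, lun) tuples currently failing on this HBA
    topology: dict mapping target -> set of LUNs behind it on this HBA
              (assumed known; how it is discovered is out of scope)
    """
    target, _lun = device
    all_devices = {(t, l) for t, luns in topology.items() for l in luns}

    # Every device behind the HBA has failed: an HBA reset is plausible.
    if all_devices <= failed:
        return Level.HOST_RESET

    # Every LUN behind this target has failed: an I_T level reset is plausible.
    if {(target, l) for l in topology[target]} <= failed:
        return Level.TARGET_RESET

    # Sibling LUNs still work: stop at LUN reset, report, and leave
    # anything more disruptive to the admin.
    return Level.LUN_RESET


if __name__ == "__main__":
    topology = {0: {0, 1, 2}, 1: {0}}
    # One LUN down while its siblings still work.
    print(highest_level((0, 1), {(0, 1)}, topology).name)
```

Note that a target exposing a single LUN falls straight into the I_T
case under this rule, which may or may not be what you want.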