* virtio-scsi issues duplicate tags when async_abort is enabled
From: Venkatesh Srinivas @ 2014-06-13 17:58 UTC
To: hare, hch, Paolo Bonzini, JBottomley, linux-scsi, stable

Hi,

In Linux 3.14+, SCSI timeouts are handled first without invoking EH;
this behavior is on by default but can be disabled with the
per-shost-template no_async_abort flag.

When a SCSI target is attached to a virtio-scsi HBA and is under I/O
stress (lots of concurrent I/O + some I/O running slowly), we see
Linux issue commands with duplicate tags, sometimes with tags matching
commands which are in the process of being aborted; we see this
readily in the Google Compute Engine hypervisor.

This behaviour is not seen on Linux <= 3.13 and is not seen if 3.14's
virtio_scsi driver has no_async_abort set to 1.

An ordering we have seen, from the device perspective:
t0: I/O with tag 18446612135224154432 issued
t1: TMF ABORT for tag 18446612135224154432
t2: Another I/O with the same tag, 18446612135224154432, issued; same
    offset/size as at t0
[neither the t0 I/O nor the TMF ABORT have yet returned!]

Another ordering we have seen, from the device perspective:
t0: I/O with tag 18446612135454768576 issued
t1: TMF ABORT for tag 18446612135454768576
t2: I/O 18446612135454768576 completes with appropriate cancelled status
t3: TMF ABORT completes with OK status
t4: New I/O with tag 18446612135454768576, matching size/offset as t0
t5...: [Some other I/Os issued to the same SCSI target]
t6...: [TMF ABORT for one of the new I/Os; proper return sequence]
t7...: New I/O with tag 18446612135454768576.
[Tag 18446612135454768576 has neither completed nor has it been
aborted by Linux.]

CC-ing stable as 3.14 and 3.15 are affected; a conservative fix is to
enable no_async_abort until the problem is better-understood.

Thanks,
-- vs;

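The two orderings above can be checked mechanically. What follows is a minimal sketch of a device-side checker, not code from any hypervisor or from the kernel: it replays (label, kind, tag) events and flags every issue whose tag is still in flight. The event kinds and the name find_duplicate_issues are hypothetical, and the unnamed I/Os at t5 and t6 are left out because the report gives no tags for them.

```python
def find_duplicate_issues(events):
    """Return the labels of 'issue' events whose tag is still in flight."""
    inflight = set()
    flagged = []
    for label, kind, tag in events:
        if kind == "issue":
            if tag in inflight:
                flagged.append(label)
            inflight.add(tag)
        elif kind == "complete":
            inflight.discard(tag)
        # "abort" and "abort_done" events do not retire a tag by themselves;
        # only the completion of the I/O does.
    return flagged


TAG_A = 18446612135224154432  # tag from the first reported ordering
TAG_B = 18446612135454768576  # tag from the second reported ordering

first_ordering = [
    ("t0", "issue", TAG_A),
    ("t1", "abort", TAG_A),
    ("t2", "issue", TAG_A),
]

second_ordering = [
    ("t0", "issue", TAG_B),
    ("t1", "abort", TAG_B),
    ("t2", "complete", TAG_B),
    ("t3", "abort_done", TAG_B),
    ("t4", "issue", TAG_B),
    ("t7", "issue", TAG_B),
]

print(find_duplicate_issues(first_ordering))
print(find_duplicate_issues(second_ordering))
```

In this model the reuse at t4 of the second ordering is legitimate, because the earlier I/O had already completed; only the later reissue overlaps an outstanding command.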
* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Hannes Reinecke @ 2014-06-13 18:37 UTC
To: Venkatesh Srinivas, hch, Paolo Bonzini, JBottomley, linux-scsi, stable

On 06/13/2014 07:58 PM, Venkatesh Srinivas wrote:
> When a SCSI target is attached to a virtio-scsi HBA and is under I/O
> stress (lots of concurrent I/O + some I/O running slowly), we see
> Linux issue commands with duplicate tags, sometimes with tags matching
> commands which are in the process of being aborted; we see this
> readily in the Google Compute Engine hypervisor.
...
> CC-ing stable as 3.14 and 3.15 are affected; a conservative fix is to
> enable no_async_abort until the problem is better-understood.

Paolo, you had some fixes for virtio_scsi which should solve this, right?

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Venkatesh Srinivas @ 2014-06-13 18:43 UTC
To: Hannes Reinecke
Cc: hch, Paolo Bonzini, JBottomley, linux-scsi, stable

On Fri, Jun 13, 2014 at 11:37 AM, Hannes Reinecke <hare@suse.de> wrote:
> Paolo, you had some fixes for virtio_scsi which should solve this, right?

The outstanding patches for virtio_scsi would not explain this bug,
its dependence on Linux 3.14+, or that it does not repro with
no_async_abort=1.

Thanks,
-- vs;

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Christoph Hellwig @ 2014-06-13 18:52 UTC
To: Venkatesh Srinivas
Cc: hare, hch, Paolo Bonzini, JBottomley, linux-scsi, stable

On Fri, Jun 13, 2014 at 10:58:22AM -0700, Venkatesh Srinivas wrote:
> CC-ing stable as 3.14 and 3.15 are affected; a conservative fix is to
> enable no_async_abort until the problem is better-understood.

No patch attached.  Never mind that this is not a conservative fix, but
a band-aid.  The proper fix is to figure out what's actually going on
here.

From your trace above it very much looks like a double completion of
some sort.

I've looked at the virtio-scsi code a bit, and one odd thing it does
that comes in handy here is that it doesn't really use a traditional
tag, but rather the address of the scsi command.  Knowing that, your
traces above mean that we are resending a command to the HBA/storage
that has a pending abort going on.

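To illustrate the observation that the tag is really the address of the SCSI command: the decimal tags from the report read as pointers once printed in hexadecimal. This is a small sketch, assuming an x86-64 guest with 4-level paging, where kernel virtual addresses have bits 47..63 all set; that assumption and the helper name are not from the thread.

```python
# Tags copied verbatim from the original report.
TAGS = [18446612135224154432, 18446612135454768576]


def looks_like_x86_64_kernel_pointer(value):
    """True if bits 47..63 are all ones (upper canonical half on x86-64).

    Illustrative heuristic only; it assumes 4-level paging.
    """
    return 0 <= value < 2**64 and (value >> 47) == 0x1FFFF


for tag in TAGS:
    # #018x prints "0x" followed by 16 zero-padded hex digits.
    print(f"{tag} = {tag:#018x}  kernel-pointer-like: "
          f"{looks_like_x86_64_kernel_pointer(tag)}")
```

Both values fall in the upper canonical half of the address space, which fits the reading that the driver hands the device a pointer-sized identifier rather than a small per-queue tag number.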
* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: James Bottomley @ 2014-06-13 19:09 UTC
To: Christoph Hellwig
Cc: Venkatesh Srinivas, hare, Paolo Bonzini, linux-scsi, stable

On Fri, 2014-06-13 at 11:52 -0700, Christoph Hellwig wrote:
> From your trace above it very much looks like a double completion of
> some sort.
>
> I've looked at the virtio-scsi code a bit, and one odd thing it does
> that comes in handy here is that it doesn't really use a traditional
> tag, but rather the address of the scsi command.

What kernel version?  This is the exact signature of the original USB
REQUEST_SENSE problem.

James

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Venkatesh Srinivas @ 2014-06-13 19:15 UTC
To: James Bottomley
Cc: Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On Fri, Jun 13, 2014 at 12:09 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> What kernel version?  This is the exact signature of the original USB
> REQUEST_SENSE problem.

Mix of kernels, all 3.14-based. Debian 3.14-0.bpo, Gentoo's 3.14,
upstream from git as of a few days ago.

Distribution 3.13 and earlier kernels (Debian's 3.2.0-4, Debian
3.13-0.bpo.1, Gentoo 3.13.6) do not hit this issue with the same
workload.

-- vs;

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: James Bottomley @ 2014-06-13 19:18 UTC
To: Venkatesh Srinivas
Cc: Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On Fri, 2014-06-13 at 12:15 -0700, Venkatesh Srinivas wrote:
> Mix of kernels, all 3.14-based. Debian 3.14-0.bpo, Gentoo's 3.14,
> upstream from git as of a few days ago.
>
> Distribution 3.13 and earlier kernels (Debian's 3.2.0-4, Debian
> 3.13-0.bpo.1, Gentoo 3.13.6) do not hit this issue with the same
> workload.

OK, I've no idea what's in distro kernels, so you're looking for this
fix:

commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
Author: James Bottomley <JBottomley@Parallels.com>
Date:   Fri Mar 28 10:50:17 2014 -0700

    [SCSI] Fix spurious request sense in error handling

It went into v3.15-rc3.  It looks like it wasn't backported to stable.

James

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Greg KH @ 2014-06-13 19:31 UTC
To: James Bottomley
Cc: Venkatesh Srinivas, Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On Fri, Jun 13, 2014 at 12:18:25PM -0700, James Bottomley wrote:
> OK, I've no idea what's in distro kernels, so you're looking for this
> fix:
>
> commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
> Author: James Bottomley <JBottomley@Parallels.com>
> Date:   Fri Mar 28 10:50:17 2014 -0700
>
>     [SCSI] Fix spurious request sense in error handling
>
> It went into v3.15-rc3.  It looks like it wasn't backported to stable.

That would be because no one asked for it to be backported; why didn't
you tag it as such?

greg k-h

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: James Bottomley @ 2014-06-13 19:29 UTC
To: Greg KH
Cc: Venkatesh Srinivas, Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On Fri, 2014-06-13 at 12:31 -0700, Greg KH wrote:
> On Fri, Jun 13, 2014 at 12:18:25PM -0700, James Bottomley wrote:
> > It went into v3.15-rc3.  It looks like it wasn't backported to stable.
>
> That would be because no one asked it to be backported, why didn't you
> tag it as such?

Because it was a fix for a recently introduced problem with async aborts
at the time.  The fact that it may fix other problems is only just
becoming clear.

James

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: Venkatesh Srinivas @ 2014-06-14 2:50 UTC
To: James Bottomley
Cc: Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On 6/13/14, James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
...
> OK, I've no idea what's in distro kernels, so you're looking for this
> fix:
>
> commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
> Author: James Bottomley <JBottomley@Parallels.com>
> Date:   Fri Mar 28 10:50:17 2014 -0700
>
>     [SCSI] Fix spurious request sense in error handling
>
> It went into v3.15-rc3.  It looks like it wasn't backported to stable.

Backporting this fix to 3.14 appears to resolve this problem. Thank you!

0) I don't understand how a command would be re-issued while its
original is inflight without this patch. If we send down a spurious
REQUEST SENSE and get NO SENSE, scsi_decide_disposition() in
scsi_eh_get_sense() will return FAILED and not resubmit the command.
Do you know the trace that results in the duplicate command being sent
down?

1) virtio-scsi uses the address of the scsi_cmnd as a tag; if
scsi_eh_get_sense() is invoked for a timed-out command before the
command returned, the REQUEST SENSE task from scsi_send_eh_cmnd() will
have the same tag as the unreturned prior task. Scary.

2) We would like to see this patch sent to the stable kernel tree; do
you plan to send it out to stable@? Are there any further tests you
plan to run on that particular fix before sending it that way?

Thanks,
-- vs;

* Re: virtio-scsi issues duplicate tags when async_abort is enabled
From: James Bottomley @ 2014-06-14 15:10 UTC
To: Venkatesh Srinivas
Cc: Christoph Hellwig, Hannes Reinecke, Paolo Bonzini, linux-scsi, stable

On Fri, 2014-06-13 at 19:50 -0700, Venkatesh Srinivas wrote:
> Backporting this fix to 3.14 appears to resolve this problem. Thank you!

OK, I've sent the patch in to Stable.

> 0) I don't understand how a command would be re-issued while its
> original is inflight without this patch. If we send down a spurious
> REQUEST SENSE and get NO SENSE, scsi_decide_disposition() in
> scsi_eh_get_sense() will return FAILED and not resubmit the command.
> Do you know the trace that results in the duplicate command being sent
> down?

The original is still at the LLD while we reuse the scsi_command to send
the request sense.  We can easily get double completions from that if
the original happens to come back.  It violates one of the requirements
of error handling: we may not reuse the command until we've forced the
LLD to relinquish it.

> 1) virtio-scsi uses the address of the scsi_cmnd as a tag; if
> scsi_eh_get_sense() is invoked for a timed-out command before the
> command returned, the REQUEST SENSE task from scsi_send_eh_cmnd() will
> have the same tag as the unreturned prior task. Scary.
>
> 2) We would like to see this patch sent to the stable kernel tree; do
> you plan to send it out to stable@? Are there any further tests you
> plan to run on that particular fix before sending it that way?

It's already gone to stable.

James

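The mechanism described in this exchange can be shown with a toy model. The sketch below is not kernel code and does not reproduce the actual error-handling path; ToyDevice, ToyCommand, tag_of and run are hypothetical names. It only assumes what the thread states: the tag is derived from the command structure's address, and the same structure is reused for REQUEST SENSE while the original may still be held by the low-level driver.

```python
class ToyDevice:
    """Hypothetical device model that tracks which tags are in flight."""

    def __init__(self):
        self.inflight = set()
        self.duplicates = []

    def submit(self, tag, what):
        # Record any submission whose tag collides with an outstanding one.
        if tag in self.inflight:
            self.duplicates.append((tag, what))
        self.inflight.add(tag)

    def complete(self, tag):
        self.inflight.discard(tag)


class ToyCommand:
    """Stands in for a command structure; its address doubles as the tag."""

    def __init__(self, address):
        self.address = address
        self.payload = "READ"


def tag_of(cmd):
    # Assumption from the thread: the tag is the address of the command.
    return cmd.address


def run(wait_for_original):
    dev = ToyDevice()
    cmd = ToyCommand(0xFFFF8800AD721940)  # first tag from the report, in hex
    dev.submit(tag_of(cmd), cmd.payload)  # original I/O goes out

    # The command times out and the error path wants sense data.
    if wait_for_original:
        dev.complete(tag_of(cmd))  # original has been returned first

    cmd.payload = "REQUEST SENSE"  # same structure reused, hence same tag
    dev.submit(tag_of(cmd), cmd.payload)
    return dev.duplicates


print(run(wait_for_original=False))
print(run(wait_for_original=True))
```

In the model, a duplicate tag shows up only when the structure is reused before the original has come back, which matches the stated requirement that a command may not be reused until the LLD has relinquished it.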