linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
@ 2008-10-26  9:46 Tejun Heo
  2008-10-26 15:15 ` James Bottomley
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2008-10-26  9:46 UTC (permalink / raw)
  To: Jens Axboe, linux-scsi, IDE/ATA development list, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 8624 bytes --]

Hello, Jens.

Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
regression for libata.  The second timeout gives puts different
pointer from the issued command onto eh_cmd_q breaking libata EH
command matching which triggers WARN_ON() in ata_eh_finish() and hangs
command processing or causes oops later depending on circumstances.

Here are logs with induced timeouts (patch attached).  In commit
242f9dcb8, the XXX messages for the second timeout shows different
scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
ata_scsi_qc_new() during command translation.

COMMIT 608aeef17a91747d6303de4df5e2c2e6899a95e8

[   60.878067] ata_piix 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[   60.892956] ata_piix 0000:00:1f.2: MAP [ P0 P2 P1 P3 ]
[   60.906557] scsi0 : ata_piix
[   60.920296] scsi1 : ata_piix
[   60.935038] ata1: SATA max UDMA/133 cmd 0xa000 ctl 0x9c00 bmdma 0x9480 irq 19
[   60.949498] ata2: SATA max UDMA/133 cmd 0x9880 ctl 0x9800 bmdma 0x9488 irq 19
[   60.963803] ata1: XXX eh_cmd_q:
[   60.967773] ata1: XXX qc0: cmd=0 flags=00000000 scmd=0000000000000000
[   61.314020] ata1: SATA link down (SStatus 0 SControl 300)
[   61.326661] ata2: XXX eh_cmd_q:
[   61.330362] ata2: XXX qc0: cmd=0 flags=00000000 scmd=0000000000000000
[   61.835369] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   61.860329] ata2.00: ATA-7: WDC WD1500ADFD-00NLR1, 20.07P20, max UDMA/133
[   61.874173] ata2.00: 293046768 sectors, multi 16: LBA48 NCQ (depth 0/32)
[   61.904307] ata2.00: configured for UDMA/133
[   61.931835] scsi 1:0:0:0: Direct-Access     ATA      WDC WD1500ADFD-0 20.0 PQ: 0 ANSI: 5
[   61.954781] sd 1:0:0:0: [sda] 293046768 512-byte hardware sectors (150040 MB)
[   61.974311] sd 1:0:0:0: [sda] Write Protect is off
[   61.986953] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   62.004210] sd 1:0:0:0: [sda] 293046768 512-byte hardware sectors (150040 MB)
[   62.018928] sd 1:0:0:0: [sda] Write Protect is off
[   62.031649] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   62.048168]  sda: unknown partition table
[   62.067140] sd 1:0:0:0: [sda] Attached SCSI disk
[   62.084652] sd 1:0:0:0: Attached scsi generic sg0 type 0
[   69.731222] ata2.00: XXX skipping completion
[  129.726468] ata2: XXX eh_cmd_q: ffff88003df1a450
[  129.730425] ata2: XXX qc0: cmd=ca flags=0000000b scmd=ffff88003df1a450
[  129.752854] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  129.767447] ata2.00: cmd ca/00:20:00:00:00/00:00:00:00:00/e0 tag 0 dma 16384 out
[  129.767448]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  129.797623] ata2.00: status: { DRDY }
[  129.797634] ata2: hard resetting link
[  130.286510] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  130.331427] ata2.00: configured for UDMA/133
[  130.343836] ata2: EH complete
[  130.354794] ata2.00: XXX skipping completion
[  190.361697] ata2: XXX eh_cmd_q: ffff88003df1a450
[  190.365662] ata2: XXX qc0: cmd=ca flags=0000000b scmd=ffff88003df1a450
[  190.389080] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  190.404280] ata2.00: cmd ca/00:20:00:00:00/00:00:00:00:00/e0 tag 0 dma 16384 out
[  190.404281]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  190.435828] ata2.00: status: { DRDY }
[  190.447786] ata2: hard resetting link
[  190.933748] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  190.978649] ata2.00: configured for UDMA/133
[  190.991373] ata2: EH complete
[  191.002663] ata2.00: XXX skipping completion

COMMIT 242f9dcb8ba6f68fcd217a119a7648a4f69290e9

[  107.180869] ata_piix 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[  107.195909] ata_piix 0000:00:1f.2: MAP [ P0 P2 P1 P3 ]
[  107.209595] scsi0 : ata_piix
[  107.220667] scsi1 : ata_piix
[  107.237009] ata1: SATA max UDMA/133 cmd 0xa000 ctl 0x9c00 bmdma 0x9480 irq 19
[  107.251375] ata2: SATA max UDMA/133 cmd 0x9880 ctl 0x9800 bmdma 0x9488 irq 19
[  107.265641] ata1: XXX eh_cmd_q:
[  107.269613] ata1: XXX qc0: cmd=0 flags=00000000 scmd=0000000000000000
[  107.617455] ata1: SATA link down (SStatus 0 SControl 300)
[  107.630171] ata2: XXX eh_cmd_q:
[  107.634166] ata2: XXX qc0: cmd=0 flags=00000000 scmd=0000000000000000
[  108.126787] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  108.147729] ata2.00: ATA-7: WDC WD1500ADFD-00NLR1, 20.07P20, max UDMA/133
[  108.161843] ata2.00: 293046768 sectors, multi 16: LBA48 NCQ (depth 0/32)
[  108.191731] ata2.00: configured for UDMA/133
[  108.219236] scsi 1:0:0:0: Direct-Access     ATA      WDC WD1500ADFD-0 20.0 PQ: 0 ANSI: 5
[  108.237140] sd 1:0:0:0: [sda] 293046768 512-byte hardware sectors (150040 MB)
[  108.253474] sd 1:0:0:0: [sda] Write Protect is off
[  108.267280] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  108.284851] sd 1:0:0:0: [sda] 293046768 512-byte hardware sectors (150040 MB)
[  108.299687] sd 1:0:0:0: [sda] Write Protect is off
[  108.312502] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  108.329320]  sda: unknown partition table
[  108.348303] sd 1:0:0:0: [sda] Attached SCSI disk
[  108.361898] sd 1:0:0:0: Attached scsi generic sg0 type 0
[  116.468909] ata2.00: XXX skipping completion
[  176.589876] ata2: XXX eh_cmd_q: ffff88003bd67cb8
[  176.593827] ata2: XXX qc0: cmd=ca flags=0000000b scmd=ffff88003bd67cb8
[  176.616842] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  176.631953] ata2.00: cmd ca/00:20:00:00:00/00:00:00:00:00/e0 tag 0 dma 16384 out
[  176.631954]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  176.662845] ata2.00: status: { DRDY }
[  176.674218] ata2: hard resetting link
[  177.157912] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  177.194824] ata2.00: configured for UDMA/133
[  177.206926] ata2: EH complete
[  177.218269] ata2.00: XXX skipping completion
[  207.637488] ata2: XXX eh_cmd_q: ffff88003bd679b8
[  207.641444] ata2: XXX qc0: cmd=ca flags=0000000b scmd=ffff88003bd67cb8
[  207.664891] ------------[ cut here ]------------
[  207.677609] WARNING: at drivers/ata/libata-eh.c:2944 ata_eh_finish+0xd2/0xe0 [libata]()
[  207.694344] Modules linked in: ata_piix libata
[  207.694350] Pid: 3974, comm: scsi_eh_1 Not tainted 2.6.27-rc9-work #562
[  207.694352] 
[  207.694352] Call Trace:
[  207.694360]  [<ffffffff802388bf>] warn_on_slowpath+0x5f/0x90
[  207.694374]  [<ffffffffa00142a0>] ? ata_sff_softreset+0x0/0x170 [libata]
[  207.694382]  [<ffffffffa000e5e8>] ? ata_eh_link_autopsy+0x118/0x870 [libata]
[  207.694391]  [<ffffffffa0002d90>] ? sata_std_hardreset+0x0/0x40 [libata]
[  207.694398]  [<ffffffffa000f15d>] ? ata_eh_recover+0x21d/0xcd0 [libata]
[  207.694406]  [<ffffffffa00120f0>] ? ata_sff_postreset+0x0/0x80 [libata]
[  207.694414]  [<ffffffffa0002d90>] ? sata_std_hardreset+0x0/0x40 [libata]
[  207.694421]  [<ffffffffa00142a0>] ? ata_sff_softreset+0x0/0x170 [libata]
[  207.694429]  [<ffffffffa0014410>] ? ata_sff_prereset+0x0/0xc0 [libata]
[  207.694437]  [<ffffffffa000fdc2>] ata_eh_finish+0xd2/0xe0 [libata]
[  207.694444]  [<ffffffffa0002d90>] ? sata_std_hardreset+0x0/0x40 [libata]
[  207.694451]  [<ffffffffa0002d90>] ? sata_std_hardreset+0x0/0x40 [libata]
[  207.694459]  [<ffffffffa00142a0>] ? ata_sff_softreset+0x0/0x170 [libata]
[  207.694467]  [<ffffffffa000fe7d>] ata_do_eh+0xad/0xc0 [libata]
[  207.694474]  [<ffffffffa0002d90>] ? sata_std_hardreset+0x0/0x40 [libata]
[  207.694481]  [<ffffffffa00142a0>] ? ata_sff_softreset+0x0/0x170 [libata]
[  207.694489]  [<ffffffffa0011fba>] ata_sff_error_handler+0xea/0x220 [libata]
[  207.694497]  [<ffffffffa0010873>] ata_scsi_error+0x2b3/0x870 [libata]
[  207.694501]  [<ffffffff8052d89f>] scsi_error_handler+0x12f/0x5d0
[  207.694505]  [<ffffffff80724cf2>] ? _spin_unlock_irqrestore+0x42/0x80
[  207.694507]  [<ffffffff8052d770>] ? scsi_error_handler+0x0/0x5d0
[  207.694510]  [<ffffffff802504a9>] kthread+0x49/0x90
[  207.694513]  [<ffffffff8020d529>] child_rip+0xa/0x11
[  207.694515]  [<ffffffff8020cb33>] ? restore_args+0x0/0x30
[  207.694517]  [<ffffffff80250460>] ? kthread+0x0/0x90
[  207.694519]  [<ffffffff8020d51f>] ? child_rip+0x0/0x11
[  207.694520] 
[  207.694522] ---[ end trace 27f6013c0d17023b ]---
[  207.694596] sd 1:0:0:0: [sda] 293046768 512-byte hardware sectors (150040 MB)
[  207.694621] sd 1:0:0:0: [sda] Write Protect is off
[  207.694663] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

-- 
tejun

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 1328 bytes --]

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 79e3a8e..24179b8 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4642,6 +4642,13 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
 
+	if (qc->tf.command == ATA_CMD_FPDMA_WRITE ||
+	    qc->tf.command == ATA_CMD_WRITE ||
+	    qc->tf.command == ATA_CMD_WRITE_EXT) {
+		ata_dev_printk(qc->dev, KERN_INFO, "XXX skipping completion\n");
+		return;
+	}
+
 	/* XXX: New EH and old EH use different mechanisms to
 	 * synchronize EH with regular execution path.
 	 *
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index c1db2f2..617c188 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -531,6 +531,19 @@ void ata_scsi_error(struct Scsi_Host *host)
 
 		spin_lock_irqsave(ap->lock, flags);
 
+		{
+			struct ata_queued_cmd *qc;
+
+			ata_port_printk(ap, KERN_INFO, "XXX eh_cmd_q:");
+			list_for_each_entry(scmd, &host->eh_cmd_q, eh_entry)
+				printk(" %p", scmd);
+			printk("\n");
+
+			qc = __ata_qc_from_tag(ap, 0);
+			ata_port_printk(ap, KERN_INFO, "XXX qc%d: cmd=%x flags=%08lx scmd=%p\n",
+					qc->tag, qc->tf.command, qc->flags, qc->scsicmd);
+		}
+
 		list_for_each_entry_safe(scmd, tmp, &host->eh_cmd_q, eh_entry) {
 			struct ata_queued_cmd *qc;
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-26  9:46 Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 Tejun Heo
@ 2008-10-26 15:15 ` James Bottomley
  2008-10-27  1:51   ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: James Bottomley @ 2008-10-26 15:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-scsi, IDE/ATA development list, Linux Kernel

On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> Hello, Jens.
> 
> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> regression for libata.  The second timeout gives puts different
> pointer from the issued command onto eh_cmd_q breaking libata EH
> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> command processing or causes oops later depending on circumstances.
> 
> Here are logs with induced timeouts (patch attached).  In commit
> 242f9dcb8, the XXX messages for the second timeout shows different
> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> ata_scsi_qc_new() during command translation.

I can't see a way we could be getting a different command passed in from
the actual one, since the only way to lose the command from the request
is to go through the command completion routines which free it (and end
the request).

However, since the WARN_ON is specifically comparing the command with
the one found by the active tag, could this actually be a problem caused
by block tags?  I note that libata still uses its own array of
outstanding tags (ap->qcmd[tag]) instead of finding them using
blk_queue_find_tag() (or scsi_find_tag()).

James



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-26 15:15 ` James Bottomley
@ 2008-10-27  1:51   ` Tejun Heo
  2008-10-27 15:03     ` James Bottomley
  2008-10-27 21:30     ` Mike Anderson
  0 siblings, 2 replies; 10+ messages in thread
From: Tejun Heo @ 2008-10-27  1:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, linux-scsi, IDE/ATA development list, Linux Kernel

James Bottomley wrote:
> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
>> Hello, Jens.
>>
>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
>> regression for libata.  The second timeout gives puts different
>> pointer from the issued command onto eh_cmd_q breaking libata EH
>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
>> command processing or causes oops later depending on circumstances.
>>
>> Here are logs with induced timeouts (patch attached).  In commit
>> 242f9dcb8, the XXX messages for the second timeout shows different
>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
>> ata_scsi_qc_new() during command translation.
> 
> I can't see a way we could be getting a different command passed in from
> the actual one, since the only way to lose the command from the request
> is to go through the command completion routines which free it (and end
> the request).

I have no idea either.  It's something in the timeout logic because on
the issue path the scmd pointer is identical but on tiemout pointer
for another scmd is queued on eh_cmd_q, which doesn't make much sense.

> However, since the WARN_ON is specifically comparing the command with
> the one found by the active tag, could this actually be a problem caused
> by block tags?  I note that libata still uses its own array of
> outstanding tags (ap->qcmd[tag]) instead of finding them using
> blk_queue_find_tag() (or scsi_find_tag()).

Nope, the tested commits are before the block queue tag transition and
I tested two consecutive commits.  242f9dcb^ is okay.  242f9dcb is
not.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-27  1:51   ` Tejun Heo
@ 2008-10-27 15:03     ` James Bottomley
  2008-10-27 21:30     ` Mike Anderson
  1 sibling, 0 replies; 10+ messages in thread
From: James Bottomley @ 2008-10-27 15:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-scsi, IDE/ATA development list, Linux Kernel

On Mon, 2008-10-27 at 10:51 +0900, Tejun Heo wrote:
> James Bottomley wrote:
> > On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> >> Hello, Jens.
> >>
> >> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> >> regression for libata.  The second timeout gives puts different
> >> pointer from the issued command onto eh_cmd_q breaking libata EH
> >> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> >> command processing or causes oops later depending on circumstances.
> >>
> >> Here are logs with induced timeouts (patch attached).  In commit
> >> 242f9dcb8, the XXX messages for the second timeout shows different
> >> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> >> ata_scsi_qc_new() during command translation.
> > 
> > I can't see a way we could be getting a different command passed in from
> > the actual one, since the only way to lose the command from the request
> > is to go through the command completion routines which free it (and end
> > the request).
> 
> I have no idea either.  It's something in the timeout logic because on
> the issue path the scmd pointer is identical but on tiemout pointer
> for another scmd is queued on eh_cmd_q, which doesn't make much sense.

OK, so if we take the first trace as definitive, since it shows the same
command going around twice (being freed and reallocated from the slab in
between), it does tend to imply the one on eh_cmd_q is bogus.

Could you print out all scsi commands going into ata_qc_issue so we can
see if there's a clue in where this one is coming from?

> > However, since the WARN_ON is specifically comparing the command with
> > the one found by the active tag, could this actually be a problem caused
> > by block tags?  I note that libata still uses its own array of
> > outstanding tags (ap->qcmd[tag]) instead of finding them using
> > blk_queue_find_tag() (or scsi_find_tag()).
> 
> Nope, the tested commits are before the block queue tag transition and
> I tested two consecutive commits.  242f9dcb^ is okay.  242f9dcb is
> not.

Well, it was worth a shot.

James



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-27  1:51   ` Tejun Heo
  2008-10-27 15:03     ` James Bottomley
@ 2008-10-27 21:30     ` Mike Anderson
  2008-10-28  1:11       ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Anderson @ 2008-10-27 21:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Jens Axboe, linux-scsi,
	IDE/ATA development list, Linux Kernel

Tejun Heo <tj@kernel.org> wrote:
> James Bottomley wrote:
> > On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> >> Hello, Jens.
> >>
> >> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> >> regression for libata.  The second timeout gives puts different
> >> pointer from the issued command onto eh_cmd_q breaking libata EH
> >> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> >> command processing or causes oops later depending on circumstances.
> >>
> >> Here are logs with induced timeouts (patch attached).  In commit
> >> 242f9dcb8, the XXX messages for the second timeout shows different
> >> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> >> ata_scsi_qc_new() during command translation.
> > 
> > I can't see a way we could be getting a different command passed in from
> > the actual one, since the only way to lose the command from the request
> > is to go through the command completion routines which free it (and end
> > the request).
> 
> I have no idea either.  It's something in the timeout logic because on
> the issue path the scmd pointer is identical but on tiemout pointer
> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
> 

I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
Currently I am not able to see this error on timeout recovery using this
setup. Does IO load (or other factors) effect the error being seen?

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-27 21:30     ` Mike Anderson
@ 2008-10-28  1:11       ` Tejun Heo
  2008-10-29 13:26         ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2008-10-28  1:11 UTC (permalink / raw)
  To: Mike Anderson
  Cc: James Bottomley, Jens Axboe, linux-scsi,
	IDE/ATA development list, Linux Kernel

Mike Anderson wrote:
> Tejun Heo <tj@kernel.org> wrote:
>> James Bottomley wrote:
>>> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
>>>> Hello, Jens.
>>>>
>>>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
>>>> regression for libata.  The second timeout gives puts different
>>>> pointer from the issued command onto eh_cmd_q breaking libata EH
>>>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
>>>> command processing or causes oops later depending on circumstances.
>>>>
>>>> Here are logs with induced timeouts (patch attached).  In commit
>>>> 242f9dcb8, the XXX messages for the second timeout shows different
>>>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
>>>> ata_scsi_qc_new() during command translation.
>>> I can't see a way we could be getting a different command passed in from
>>> the actual one, since the only way to lose the command from the request
>>> is to go through the command completion routines which free it (and end
>>> the request).
>> I have no idea either.  It's something in the timeout logic because on
>> the issue path the scmd pointer is identical but on tiemout pointer
>> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
>>
> 
> I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
> Currently I am not able to see this error on timeout recovery using this
> setup. Does IO load (or other factors) effect the error being seen?

Not at all.  That's the only write command I issued.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-28  1:11       ` Tejun Heo
@ 2008-10-29 13:26         ` Jens Axboe
  2008-10-29 14:02           ` James Bottomley
  2008-10-30  0:31           ` Tejun Heo
  0 siblings, 2 replies; 10+ messages in thread
From: Jens Axboe @ 2008-10-29 13:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Anderson, James Bottomley, linux-scsi,
	IDE/ATA development list, Linux Kernel

On Tue, Oct 28 2008, Tejun Heo wrote:
> Mike Anderson wrote:
> > Tejun Heo <tj@kernel.org> wrote:
> >> James Bottomley wrote:
> >>> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> >>>> Hello, Jens.
> >>>>
> >>>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> >>>> regression for libata.  The second timeout gives puts different
> >>>> pointer from the issued command onto eh_cmd_q breaking libata EH
> >>>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> >>>> command processing or causes oops later depending on circumstances.
> >>>>
> >>>> Here are logs with induced timeouts (patch attached).  In commit
> >>>> 242f9dcb8, the XXX messages for the second timeout shows different
> >>>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> >>>> ata_scsi_qc_new() during command translation.
> >>> I can't see a way we could be getting a different command passed in from
> >>> the actual one, since the only way to lose the command from the request
> >>> is to go through the command completion routines which free it (and end
> >>> the request).
> >> I have no idea either.  It's something in the timeout logic because on
> >> the issue path the scmd pointer is identical but on tiemout pointer
> >> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
> >>
> > 
> > I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
> > Currently I am not able to see this error on timeout recovery using this
> > setup. Does IO load (or other factors) effect the error being seen?
> 
> Not at all.  That's the only write command I issued.

It's all extremely puzzling. Any chance I could talk you into stuffing
some debug printks in there to see what the hell is going on?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-29 13:26         ` Jens Axboe
@ 2008-10-29 14:02           ` James Bottomley
  2008-10-29 18:51             ` Mike Anderson
  2008-10-30  0:31           ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: James Bottomley @ 2008-10-29 14:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Mike Anderson, linux-scsi, IDE/ATA development list,
	Linux Kernel

On Wed, 2008-10-29 at 14:26 +0100, Jens Axboe wrote:
> On Tue, Oct 28 2008, Tejun Heo wrote:
> > Mike Anderson wrote:
> > > Tejun Heo <tj@kernel.org> wrote:
> > >> James Bottomley wrote:
> > >>> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> > >>>> Hello, Jens.
> > >>>>
> > >>>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> > >>>> regression for libata.  The second timeout gives puts different
> > >>>> pointer from the issued command onto eh_cmd_q breaking libata EH
> > >>>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> > >>>> command processing or causes oops later depending on circumstances.
> > >>>>
> > >>>> Here are logs with induced timeouts (patch attached).  In commit
> > >>>> 242f9dcb8, the XXX messages for the second timeout shows different
> > >>>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> > >>>> ata_scsi_qc_new() during command translation.
> > >>> I can't see a way we could be getting a different command passed in from
> > >>> the actual one, since the only way to lose the command from the request
> > >>> is to go through the command completion routines which free it (and end
> > >>> the request).
> > >> I have no idea either.  It's something in the timeout logic because on
> > >> the issue path the scmd pointer is identical but on tiemout pointer
> > >> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
> > >>
> > > 
> > > I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
> > > Currently I am not able to see this error on timeout recovery using this
> > > setup. Does IO load (or other factors) effect the error being seen?
> > 
> > Not at all.  That's the only write command I issued.
> 
> It's all extremely puzzling. Any chance I could talk you into stuffing
> some debug printks in there to see what the hell is going on?

Right ... me too.  The number one thing I want to see is what SCSI
commands are going to what controllers ... that might tell us which one
is bogus and where it's coming from.

James



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-29 14:02           ` James Bottomley
@ 2008-10-29 18:51             ` Mike Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Anderson @ 2008-10-29 18:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Tejun Heo, linux-scsi, IDE/ATA development list,
	Linux Kernel

James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Wed, 2008-10-29 at 14:26 +0100, Jens Axboe wrote:
> > On Tue, Oct 28 2008, Tejun Heo wrote:
> > > Mike Anderson wrote:
> > > > Tejun Heo <tj@kernel.org> wrote:
> > > >> James Bottomley wrote:
> > > >>> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
> > > >>>> Hello, Jens.
> > > >>>>
> > > >>>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
> > > >>>> regression for libata.  The second timeout gives puts different
> > > >>>> pointer from the issued command onto eh_cmd_q breaking libata EH
> > > >>>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
> > > >>>> command processing or causes oops later depending on circumstances.
> > > >>>>
> > > >>>> Here are logs with induced timeouts (patch attached).  In commit
> > > >>>> 242f9dcb8, the XXX messages for the second timeout shows different
> > > >>>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
> > > >>>> ata_scsi_qc_new() during command translation.
> > > >>> I can't see a way we could be getting a different command passed in from
> > > >>> the actual one, since the only way to lose the command from the request
> > > >>> is to go through the command completion routines which free it (and end
> > > >>> the request).
> > > >> I have no idea either.  It's something in the timeout logic because on
> > > >> the issue path the scmd pointer is identical but on tiemout pointer
> > > >> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
> > > >>
> > > > 
> > > > I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
> > > > Currently I am not able to see this error on timeout recovery using this
> > > > setup. Does IO load (or other factors) effect the error being seen?
> > > 
> > > Not at all.  That's the only write command I issued.
> > 
> > It's all extremely puzzling. Any chance I could talk you into stuffing
> > some debug printks in there to see what the hell is going on?
> 
> Right ... me too.  The number one thing I want to see is what SCSI
> commands are going to what controllers ... that might tell us which one
> is bogus and where it's coming from.

We may also want to update the debug output in ata_scsi_error.

I modified my debug output on my patched version of ata_ram plus the
logging in ata_scsi_error to print out more that tag zero. It appears as I
change my tests I switch from tag 0 to tag 7. Tejun it would be good to
see if your failing case is using another tag value also.

I provided a sample of my dmesg output where that failing cmd was tag 0 and then
switch to tag 7. The output was generated with a patched version of
ata_ram that will drop reads and writes based on a debugfs attribute. I
also have "scsi_logging_level --mlcomplete 1 --error 4 -s"

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com

Oct 29 11:12:47 lab_system kernel: [ 1725.986250] ata_dbg: ata_ram_device_thread dropped cmd=c8 (ed 0) 
Oct 29 11:12:47 lab_system kernel: [ 1725.986359] sd 3:0:0:0: [sdj] 1048576 512-byte hardware sectors: (536 MB/512 MiB)
Oct 29 11:12:47 lab_system kernel: [ 1725.986404] sd 3:0:0:0: [sdj] Write Protect is off
Oct 29 11:12:47 lab_system kernel: [ 1725.986406] sd 3:0:0:0: [sdj] Mode Sense: 00 3a 00 00
Oct 29 11:12:47 lab_system kernel: [ 1725.986480] sd 3:0:0:0: [sdj] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Oct 29 11:12:57 lab_system kernel: [ 1735.986307] sd 3:0:0:0: [sdj] Done: TIMEOUT
Oct 29 11:12:57 lab_system kernel: [ 1735.986311] sd 3:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Oct 29 11:12:57 lab_system kernel: [ 1735.986316] sd 3:0:0:0: [sdj] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Oct 29 11:12:57 lab_system kernel: [ 1735.986325] sd 3:0:0:0: [sdj] Unrecognized sense data (in hex):
Oct 29 11:12:57 lab_system kernel: [ 1735.986329]         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Oct 29 11:12:57 lab_system kernel: [ 1735.986340]         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Oct 29 11:12:57 lab_system kernel: [ 1735.986351] Sense Key : No Sense [current] 
Oct 29 11:12:57 lab_system kernel: [ 1735.986354] sd 3:0:0:0: [sdj] Add. Sense: No additional sense information
Oct 29 11:12:57 lab_system kernel: [ 1735.986359] ata1: ata_dbg: ata_scsi_timed_out Enter
Oct 29 11:12:57 lab_system kernel: [ 1735.986361] ata1: ata_dbg: ata_scsi_timed_out ops->error_handler set
Oct 29 11:12:57 lab_system kernel: [ 1735.986364] ata1: ata_dbg: ata_scsi_timed_out active_tag 0, scmd=ffff88007d82d7c0
Oct 29 11:12:57 lab_system kernel: [ 1735.986366] ata1: ata_dbg: ata_scsi_timed_out EXIT ret=0
Oct 29 11:12:57 lab_system kernel: [ 1735.986368] ata_dbg: scsi_times_out eh_timed_out rtn = 0
Oct 29 11:12:57 lab_system kernel: [ 1735.986414] Error handler scsi_eh_3 waking up
Oct 29 11:12:57 lab_system kernel: [ 1735.986419] ata1: ata_dbg: eh_cmd_q: ffff88007d82d7c0
Oct 29 11:12:57 lab_system kernel: [ 1735.986422] ata1: ata_dbg: tag 0, qc 0, cmd c8, flags 0000000b, scmd ffff88007d82d7c0
Oct 29 11:12:57 lab_system kernel: [ 1735.986425] ata1: ata_dbg: tag 1, qc -84148995, cmd 60, flags 00000000, scmd ffff88007d82dcc0
Oct 29 11:12:57 lab_system kernel: [ 1735.986427] ata1: ata_dbg: tag 2, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986430] ata1: ata_dbg: tag 3, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986433] ata1: ata_dbg: tag 4, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986435] ata1: ata_dbg: tag 5, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986438] ata1: ata_dbg: tag 6, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986441] ata1: ata_dbg: tag 7, qc -84148995, cmd ca, flags 00000000, scmd ffff880037856b80
Oct 29 11:12:57 lab_system kernel: [ 1735.986444] ata1: ata_dbg: tag 8, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986446] ata1: ata_dbg: tag 9, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986449] ata1: ata_dbg: tag 10, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986452] ata1: ata_dbg: tag 11, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986454] ata1: ata_dbg: tag 12, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986457] ata1: ata_dbg: tag 13, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986460] ata1: ata_dbg: tag 14, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986462] ata1: ata_dbg: tag 15, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986465] ata1: ata_dbg: tag 16, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986468] ata1: ata_dbg: tag 17, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986471] ata1: ata_dbg: tag 18, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986473] ata1: ata_dbg: tag 19, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986476] ata1: ata_dbg: tag 20, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986479] ata1: ata_dbg: tag 21, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986481] ata1: ata_dbg: tag 22, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986484] ata1: ata_dbg: tag 23, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986487] ata1: ata_dbg: tag 24, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986489] ata1: ata_dbg: tag 25, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986492] ata1: ata_dbg: tag 26, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986495] ata1: ata_dbg: tag 27, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986497] ata1: ata_dbg: tag 28, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986500] ata1: ata_dbg: tag 29, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986503] ata1: ata_dbg: tag 30, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986506] ata1: ata_dbg: tag 31, qc -84148995, cmd ec, flags 00000000, scmd 0000000000000000
Oct 29 11:12:57 lab_system kernel: [ 1735.986516] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 29 11:12:57 lab_system kernel: [ 1735.986522] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Oct 29 11:12:57 lab_system kernel: [ 1735.986524]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 29 11:12:57 lab_system kernel: [ 1735.986526] ata1.00: status: { DRDY }
Oct 29 11:12:57 lab_system kernel: [ 1735.986534] ata1: hard resetting link
Oct 29 11:12:57 lab_system kernel: [ 1735.986558] ata_dbg: ata_ram_execute_command completing tag 31, cmd ec
Oct 29 11:12:57 lab_system kernel: [ 1735.986601] ata_dbg: ata_ram_execute_command completing tag 31, cmd ef
Oct 29 11:12:57 lab_system kernel: [ 1735.986620] ata_dbg: ata_ram_execute_command completing tag 31, cmd ec
Oct 29 11:12:57 lab_system kernel: [ 1735.986642] ata1.00: configured for UDMA/33
Oct 29 11:12:57 lab_system kernel: [ 1735.986650] scsi_eh_3: flush retry cmd: ffff88007d82d7c0
Oct 29 11:12:57 lab_system kernel: [ 1735.986663] ata1: EH complete
Oct 29 11:12:57 lab_system kernel: [ 1735.986670] scsi_restart_operations: waking up host to restart
Oct 29 11:12:57 lab_system kernel: [ 1735.986688] Error handler scsi_eh_3 sleeping
Oct 29 11:12:57 lab_system kernel: [ 1735.986698] ata_dbg: ata_ram_execute_command completing tag 0, cmd c8
Oct 29 11:12:57 lab_system kernel: [ 1735.986826] ata_dbg: ata_ram_execute_command completing tag 7, cmd ca
Oct 29 11:12:57 lab_system kernel: [ 1735.986923] sd 3:0:0:0: [sdj] 1048576 512-byte hardware sectors: (536 MB/512 MiB)
Oct 29 11:12:57 lab_system kernel: [ 1735.986974] sd 3:0:0:0: [sdj] Write Protect is off
Oct 29 11:12:57 lab_system kernel: [ 1735.986977] sd 3:0:0:0: [sdj] Mode Sense: 00 3a 00 00
Oct 29 11:12:57 lab_system kernel: [ 1735.987064] sd 3:0:0:0: [sdj] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA


Oct 29 11:16:52 lab_system root[5410]: ATA_RAM Fail Writei, Two Writers, bs=64k (Count 2, Drop Count 8)
Oct 29 11:17:18 lab_system kernel: [ 1996.270770] ata_dbg: ata_ram_device_thread dropped cmd=ca (ed 7) 
Oct 29 11:17:28 lab_system kernel: [ 2007.002206] sd 3:0:0:0: [sdj] Done: TIMEOUT
Oct 29 11:17:28 lab_system kernel: [ 2007.002212] sd 3:0:0:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
Oct 29 11:17:28 lab_system kernel: [ 2007.002217] sd 3:0:0:0: [sdj] CDB: Write(10): 2a 00 00 00 00 00 00 01 00 00
Oct 29 11:17:28 lab_system kernel: [ 2007.002229] ata1: ata_dbg: ata_scsi_timed_out Enter
Oct 29 11:17:28 lab_system kernel: [ 2007.002232] ata1: ata_dbg: ata_scsi_timed_out ops->error_handler set
Oct 29 11:17:28 lab_system kernel: [ 2007.002235] ata1: ata_dbg: ata_scsi_timed_out active_tag 7, scmd=ffff880037856380
Oct 29 11:17:28 lab_system kernel: [ 2007.002237] ata1: ata_dbg: ata_scsi_timed_out EXIT ret=0
Oct 29 11:17:28 lab_system kernel: [ 2007.002239] ata_dbg: scsi_times_out eh_timed_out rtn = 0
Oct 29 11:17:28 lab_system kernel: [ 2007.002264] Error handler scsi_eh_3 waking up
Oct 29 11:17:28 lab_system kernel: [ 2007.002271] ata1: ata_dbg: eh_cmd_q: ffff880037856380
Oct 29 11:17:28 lab_system kernel: [ 2007.002274] ata1: ata_dbg: tag 0, qc -84148995, cmd c8, flags 00000000, scmd ffff88007d82d7c0
Oct 29 11:17:28 lab_system kernel: [ 2007.002277] ata1: ata_dbg: tag 1, qc -84148995, cmd 60, flags 00000000, scmd ffff88007d82dcc0
Oct 29 11:17:28 lab_system kernel: [ 2007.002280] ata1: ata_dbg: tag 2, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002283] ata1: ata_dbg: tag 3, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002286] ata1: ata_dbg: tag 4, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002288] ata1: ata_dbg: tag 5, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002291] ata1: ata_dbg: tag 6, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002294] ata1: ata_dbg: tag 7, qc 7, cmd ca, flags 0000000b, scmd ffff880037856380
Oct 29 11:17:28 lab_system kernel: [ 2007.002296] ata1: ata_dbg: tag 8, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002299] ata1: ata_dbg: tag 9, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002302] ata1: ata_dbg: tag 10, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002304] ata1: ata_dbg: tag 11, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002307] ata1: ata_dbg: tag 12, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002310] ata1: ata_dbg: tag 13, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002312] ata1: ata_dbg: tag 14, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002315] ata1: ata_dbg: tag 15, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002318] ata1: ata_dbg: tag 16, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002321] ata1: ata_dbg: tag 17, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002323] ata1: ata_dbg: tag 18, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002326] ata1: ata_dbg: tag 19, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002329] ata1: ata_dbg: tag 20, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002331] ata1: ata_dbg: tag 21, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002334] ata1: ata_dbg: tag 22, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002337] ata1: ata_dbg: tag 23, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002339] ata1: ata_dbg: tag 24, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002342] ata1: ata_dbg: tag 25, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002345] ata1: ata_dbg: tag 26, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002347] ata1: ata_dbg: tag 27, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002350] ata1: ata_dbg: tag 28, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002353] ata1: ata_dbg: tag 29, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002355] ata1: ata_dbg: tag 30, qc 0, cmd 0, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002358] ata1: ata_dbg: tag 31, qc -84148995, cmd ec, flags 00000000, scmd 0000000000000000
Oct 29 11:17:28 lab_system kernel: [ 2007.002371] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 29 11:17:28 lab_system kernel: [ 2007.002379] ata1.00: cmd ca/00:00:00:00:00/00:00:00:00:00/e0 tag 7 dma 131072 out
Oct 29 11:17:28 lab_system kernel: [ 2007.002380]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 29 11:17:28 lab_system kernel: [ 2007.002383] ata1.00: status: { DRDY }
Oct 29 11:17:28 lab_system kernel: [ 2007.002391] ata1: hard resetting link
Oct 29 11:17:28 lab_system kernel: [ 2007.002421] ata_dbg: ata_ram_execute_command completing tag 31, cmd ec
Oct 29 11:17:28 lab_system kernel: [ 2007.002477] ata_dbg: ata_ram_execute_command completing tag 31, cmd ef
Oct 29 11:17:28 lab_system kernel: [ 2007.002499] ata_dbg: ata_ram_execute_command completing tag 31, cmd ec
Oct 29 11:17:28 lab_system kernel: [ 2007.002526] ata1.00: configured for UDMA/33
Oct 29 11:17:28 lab_system kernel: [ 2007.002536] scsi_eh_3: flush retry cmd: ffff880037856380
Oct 29 11:17:28 lab_system kernel: [ 2007.002555] ata1: EH complete

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
  2008-10-29 13:26         ` Jens Axboe
  2008-10-29 14:02           ` James Bottomley
@ 2008-10-30  0:31           ` Tejun Heo
  1 sibling, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2008-10-30  0:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Anderson, James Bottomley, linux-scsi,
	IDE/ATA development list, Linux Kernel

Jens Axboe wrote:
> On Tue, Oct 28 2008, Tejun Heo wrote:
>> Mike Anderson wrote:
>>> Tejun Heo <tj@kernel.org> wrote:
>>>> James Bottomley wrote:
>>>>> On Sun, 2008-10-26 at 18:46 +0900, Tejun Heo wrote:
>>>>>> Hello, Jens.
>>>>>>
>>>>>> Commit 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 introduces a strange
>>>>>> regression for libata.  The second timeout gives puts different
>>>>>> pointer from the issued command onto eh_cmd_q breaking libata EH
>>>>>> command matching which triggers WARN_ON() in ata_eh_finish() and hangs
>>>>>> command processing or causes oops later depending on circumstances.
>>>>>>
>>>>>> Here are logs with induced timeouts (patch attached).  In commit
>>>>>> 242f9dcb8, the XXX messages for the second timeout shows different
>>>>>> scsi_cmd pointers for eh_cmd_q and qc->scmd which is initialized by
>>>>>> ata_scsi_qc_new() during command translation.
>>>>> I can't see a way we could be getting a different command passed in from
>>>>> the actual one, since the only way to lose the command from the request
>>>>> is to go through the command completion routines which free it (and end
>>>>> the request).
>>>> I have no idea either.  It's something in the timeout logic because on
>>>> the issue path the scmd pointer is identical but on tiemout pointer
>>>> for another scmd is queued on eh_cmd_q, which doesn't make much sense.
>>>>
>>> I was trying to recreate this error using ata_ram wth v2.6.28-rc2.
>>> Currently I am not able to see this error on timeout recovery using this
>>> setup. Does IO load (or other factors) effect the error being seen?
>> Not at all.  That's the only write command I issued.
> 
> It's all extremely puzzling. Any chance I could talk you into stuffing
> some debug printks in there to see what the hell is going on?

I got it pinned down.  I'll post the fix after some more testing but it
looks like we'll need more extensive change to get it clean.  The
problem is when the command is passed to driver - at elv_next_request()
or blkdev_dequeue_request().  SCSI thinks it's blkdev_dequeue_request(),
block layer thinks it's elv_next_request() for some purposes while
blkdev_dequeue_request() for others.  I really think we should change
the interface to something like blk_peek_request() and
blk_fetch_request() and don't allow finishing a request which are not
fetched.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-10-30  0:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-26  9:46 Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9 Tejun Heo
2008-10-26 15:15 ` James Bottomley
2008-10-27  1:51   ` Tejun Heo
2008-10-27 15:03     ` James Bottomley
2008-10-27 21:30     ` Mike Anderson
2008-10-28  1:11       ` Tejun Heo
2008-10-29 13:26         ` Jens Axboe
2008-10-29 14:02           ` James Bottomley
2008-10-29 18:51             ` Mike Anderson
2008-10-30  0:31           ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).